docs(gitlab): add comprehensive troubleshooting guide #67

Open
celanthe wants to merge 5 commits into master from gitlab-troubleshooting-guide

@celanthe celanthe commented Feb 4, 2026

Summary

  • Add comprehensive troubleshooting guide for GitLab Custom Executor
  • Update README.md with link to troubleshooting guide
  • Covers authentication, VM deployment, SSH connectivity, environment variables, and network issues

Related

Test plan

  • Verified all CLI commands work against live Orka cluster (v3.5.2 client, v3.4.0 API)
  • Confirmed variable names match actual script implementations
  • Tested connectivity verification commands
  • Fixed non-existent /api/v1/health endpoint references

🤖 Generated with Claude Code

celanthe and others added 2 commits February 3, 2026 15:29
Add troubleshooting.md covering common issues and solutions:

- Authentication issues (token expiry, endpoint config)
- VM deployment failures (config validation, resource availability)
- SSH connection issues (key setup, timeouts, permissions)
- Environment variable configuration
- Network and connectivity troubleshooting
- Job execution problems
- Orphaned VM cleanup procedures

Also updates README.md to link to the new troubleshooting guide.

Addresses DI-342 requirement for troubleshooting documentation.

Verified:
- All variable names match scripts exactly
- Error messages match actual script output
- CLI commands verified against Orka3 CLI docs
- All reference links validated

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Replace non-existent /api/v1/health endpoint references with
working CLI-based verification methods (orka3 version).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@celanthe celanthe self-assigned this Feb 4, 2026
@celanthe celanthe added the documentation Improvements or additions to documentation label Feb 4, 2026
@celanthe celanthe requested a review from a team February 4, 2026 15:35

## Quick diagnostics

Before diving into specific issues, run these checks:
Contributor

All of these are part of the Docker image.
If any of them fails to install, the image build fails.

Furthermore, if any of the conditions below is not met, the integration already shows a meaningful error such as:
orka3 cannot be found
permission denied, etc.

So technically, these checks are not needed.


**Causes:**
- `ORKA_TOKEN` is invalid, expired, or not set
- `ORKA_ENDPOINT` is incorrect
Contributor

If the endpoint is incorrect, you are most likely going to get timeout errors, as you cannot connect to something that does not exist.


1. Verify the token is set correctly:
```bash
echo "$ORKA_TOKEN" | head -c 20
```
Contributor

You cannot verify this in the container.
The token is passed through the GitLab job and is present only during the job run.

So here you only need to make sure you are passing the correct token, or create another token and use it instead (as you suggest below).
Nothing more than that.

```bash
# For user authentication
orka3 login
orka3 user get-token
```
Contributor

This should never be used in an integration: these tokens expire after an hour.

3. Verify the endpoint is reachable:
```bash
# This shows both client and server versions if connected
orka3 version
```
Contributor

I would not rely on the CLI here, but rather ask the user to run something like curl http://<endpoint>/api/v1/cluster-info.
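
A minimal sketch of what such a curl-based check could look like, in case it helps settle on one approach. The function names are made up for illustration, and the /api/v1/cluster-info path is only the example route from this comment, not a verified one. The exit codes 6, 7 and 28 are curl's documented codes for DNS failure, connection failure and timeout.

```bash
# Map a curl exit status to a short diagnosis.
classify_curl_exit() {
  case "$1" in
    0)  echo "reachable" ;;
    6)  echo "dns-failure" ;;
    7)  echo "connect-failed" ;;
    28) echo "timeout" ;;
    *)  echo "other-error($1)" ;;
  esac
}

# Probe the endpoint once with bounded timeouts and print the diagnosis.
# The URL path is an assumption taken from the review comment.
check_endpoint() {
  endpoint="$1"   # e.g. the value of ORKA_ENDPOINT
  curl --silent --show-error --output /dev/null \
       --connect-timeout 5 --max-time 10 \
       "${endpoint%/}/api/v1/cluster-info"
  classify_curl_exit "$?"
}
```

Calling `check_endpoint "$ORKA_ENDPOINT"` prints one word. Because `--fail` is not used, any HTTP response (even a 404) counts as "reachable", which is what a pure connectivity check needs.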


3. Test SSH connectivity manually:
```bash
ssh -i ~/.ssh/orka_deployment_key -p <PORT> admin@<VM_IP> "echo ok"
```
Contributor

The runner deletes the VMs that fail.
So here we need to suggest deploying a VM manually and trying to connect to it.
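
For the manual connection attempt, a small retry helper may be useful, since a freshly deployed VM can take a while before SSH answers. This is only a sketch: `retry` is a hypothetical helper, not part of the runner scripts, and the attempt count and delay are arbitrary.

```bash
# Run a command up to $1 times, sleeping $2 seconds between attempts.
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
retry() {
  attempts="$1"; delay="$2"; shift 2
  n=1
  while ! "$@"; do
    [ "$n" -ge "$attempts" ] && return 1
    n=$((n + 1))
    sleep "$delay"
  done
}

# Usage against a manually deployed VM (port and IP are placeholders):
# retry 5 10 ssh -i ~/.ssh/orka_deployment_key -p <PORT> \
#   -o ConnectTimeout=10 admin@<VM_IP> "echo ok"
```

The `ConnectTimeout` option keeps each attempt bounded, so a wrong IP or port shows up as a quick failure rather than a hang.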


**Solutions:**

1. Verify all required variables are set in your GitLab CI/CD settings or `.gitlab-ci.yml`:
Contributor

Or in the runner container as I mentioned above.


1. Verify network connectivity:
```bash
ping -c 3 $(echo "$ORKA_ENDPOINT" | sed 's|http://||')
```
Contributor

This is the third place we suggest this, and the third different way of doing it.
Let's have a unified approach.
Also, ping is not a good idea, as it can be disabled.

1. Manually delete orphaned VMs:
```bash
# List VMs with runner prefix
orka3 vm list | grep "gl-runner"
```
Contributor

Why list the VMs?
The logic here implies we already know there is an orphaned VM and we know its name.

```bash
orka3 vm list -o json | jq -r '.[].name' | grep "gl-runner" | xargs -I {} orka3 vm delete {}
```

2. Consider setting up a periodic cleanup job to remove stale VMs.
Contributor

This is going to be hard to implement.
How do we know which VMs are stale?

Changes based on ispasov's review:
- Remove redundant diagnostics section (Docker image validates these)
- Remove orka3 login/user get-token (CI/CD must use service accounts)
- Remove token verification steps (not available in container context)
- Remove grep piping (use CLI's built-in argument filtering)
- Remove export suggestions (vars must be in container/GitLab context)
- Consolidate network connectivity to single curl approach
- Remove "verify exists" steps (trust CLI error messages)
- Add guidance to deploy VMs manually for SSH troubleshooting
- Simplify cleanup section (remove stale VM detection complexity)
- Remove duplicate ping-based connectivity checks

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@celanthe celanthe requested a review from ispasov February 9, 2026 15:21
celanthe and others added 2 commits February 17, 2026 09:04
An incorrect endpoint produces timeout errors, not 401s.
Endpoint troubleshooting is already covered in the connectivity section.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Reference the network section instead of repeating the curl check.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

```bash
orka3 serviceaccount token <service-account-name>
```

2. Update the token in GitLab CI/CD settings:
Contributor

As mentioned, there are two ways of passing these variables:

  1. Through the GitLab UI
  2. On container startup (docker run...)

My suggestion would be to extract a section that tells people how variables are set, and in every place where we say "Make sure this variable is correctly set", point them to that section so they know exactly how to set the variable.

WDYT?
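
To make the second option concrete, here is a rough sketch of what passing the variables at container startup could look like. It only prints the command rather than running it. The function name and the image name are placeholders, and picking `ORKA_ENDPOINT` and `ORKA_TOKEN` as the variables to pass this way is an assumption based on this thread, not on the runner's documentation.

```bash
# Print (not run) a docker command that forwards Orka settings to the
# runner container. The image name is supplied by the caller.
print_runner_command() {
  image="$1"
  printf 'docker run -d'
  for name in ORKA_ENDPOINT ORKA_TOKEN; do
    # "-e NAME" without a value tells docker to copy NAME from the
    # calling shell, so the secret never appears on the command line.
    printf ' -e %s' "$name"
  done
  printf ' %s\n' "$image"
}
```

For example, `print_runner_command example/runner:latest` prints `docker run -d -e ORKA_ENDPOINT -e ORKA_TOKEN example/runner:latest`. The variables must be exported in the shell that eventually runs the command.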

- Go to Settings > CI/CD > Variables
- Update `ORKA_TOKEN` with the new token

**Note:** Service account tokens are valid for 1 year by default. For custom duration, use `--duration` flag.
Contributor

This depends on the control plane. EKS does not allow such long-lived tokens, so there you would have to use the no-expiration flag.


2. Check available node resources:
```bash
orka3 node list -o wide
```
Contributor

You do not need the wide flag here to see the available resources.

3. Increase deployment attempts by setting the environment variable in `.gitlab-ci.yml`:
```yaml
variables:
  VM_DEPLOYMENT_ATTEMPTS: "3"
```
Contributor

Why in the YAML file and not in the GitLab UI? I would suggest a unified approach to setting variables. As I mentioned above, we can have a section that explains how variables are set, so users can pick whatever works for them.

- VM deployment returned unexpected JSON format
- VM is in a failed state

**Solutions:**
Contributor

There isn't really a solution for this as it is just a symptom. So one needs to be able to find the root cause and fix it.
Usually if the VM information cannot be extracted, this means there are bigger issues.
It is a good suggestion to ask people to try to deploy a VM manually, but we do not give them anything actionable that can fix the issue.


2. Get connection details:
```bash
orka3 vm list test-debug
```
Contributor

You get the connection details from the output of the deploy command. There is no need for a separate command.

```bash
orka3 vm list test-debug
```

3. Connect via Screen Sharing (VNC) to check:
Contributor

Screen Sharing and VNC are two different things, so maybe "Screen Sharing or VNC".

**Causes:**
- SSH key has a passphrase (not supported)
- Wrong SSH user
- SSH key not in VM's authorized_keys
Contributor

Or the key is wrong
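
Related to the key itself: OpenSSH also ignores a private key whose file mode lets other users read it, which can look like a wrong key from the outside. A quick check could be sketched as below. The function name is hypothetical, and the check assumes the path points at a regular file rather than a symlink.

```bash
# Succeed only if the private key is accessible by its owner alone.
key_perms_ok() {
  # The first 10 characters of `ls -ld` are the file type and mode bits.
  mode=$(ls -ld "$1" | cut -c1-10)
  case "$mode" in
    -rw-------|-r--------) return 0 ;;
    *) echo "unexpected mode on $1: $mode" >&2; return 1 ;;
  esac
}
```

If `key_perms_ok ~/.ssh/orka_deployment_key` fails, `chmod 600` on the file is the usual fix.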

| `ORKA_VM_NAME_PREFIX` | No | VM name prefix (default: `gl-runner`) |
| `VM_DEPLOYMENT_ATTEMPTS` | No | Retry count (default: `1`) |

For sensitive variables like `ORKA_TOKEN` and `ORKA_SSH_KEY_FILE`, enable the "Masked" option.
Contributor

I would not recommend passing the whole file via the GitLab UI, but rather mounting the file inside the container.
