docs(gitlab): add comprehensive troubleshooting guide #67
Add troubleshooting.md covering common issues and solutions:
- Authentication issues (token expiry, endpoint config)
- VM deployment failures (config validation, resource availability)
- SSH connection issues (key setup, timeouts, permissions)
- Environment variable configuration
- Network and connectivity troubleshooting
- Job execution problems
- Orphaned VM cleanup procedures

Also updates README.md to link to the new troubleshooting guide.

Addresses DI-342 requirement for troubleshooting documentation.

Verified:
- All variable names match scripts exactly
- Error messages match actual script output
- CLI commands verified against Orka3 CLI docs
- All reference links validated

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Replace non-existent /api/v1/health endpoint references with working CLI-based verification methods (orka3 version).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
GitLab/troubleshooting.md
Outdated
| ## Quick diagnostics | ||
| Before diving into specific issues, run these checks: |
All of these are part of the Docker image.
If they fail to install, the image creation is going to fail.
Furthermore, if any of the conditions below are not met, the integration is going to show a meaningful error such as:
- orka3 cannot be found
- permission denied, etc.
So technically, these checks are not needed.
GitLab/troubleshooting.md
Outdated
| **Causes:** | ||
| - `ORKA_TOKEN` is invalid, expired, or not set | ||
| - `ORKA_ENDPOINT` is incorrect |
If the endpoint is incorrect, you are most likely going to get timeout errors, as you cannot connect to something that does not exist.
GitLab/troubleshooting.md
Outdated
| 1. Verify the token is set correctly: | ||
| ```bash | ||
| echo "$ORKA_TOKEN" | head -c 20 |
You cannot verify this in the container.
The token is passed through the GitLab job and is present only during the job run.
So here you need to make sure you are passing the correct token, or create another token and use it instead (as you suggest below).
Nothing more than that.
GitLab/troubleshooting.md
Outdated
| ```bash | ||
| # For user authentication | ||
| orka3 login | ||
| orka3 user get-token |
This should never be used in an integration. These expire after an hour.
GitLab/troubleshooting.md
Outdated
| 3. Verify the endpoint is reachable: | ||
| ```bash | ||
| # This shows both client and server versions if connected | ||
| orka3 version |
I would not rely on the CLI here, but rather ask the user to run `curl http://<endpoint>/api/v1/cluster-info`, for example.
| 3. Test SSH connectivity manually: | ||
| ```bash | ||
| ssh -i ~/.ssh/orka_deployment_key -p <PORT> admin@<VM_IP> "echo ok" |
The runner deletes the VMs that fail.
So here we need to suggest deploying a VM manually and trying to connect to it.
GitLab/troubleshooting.md
Outdated
| **Solutions:** | ||
| 1. Verify all required variables are set in your GitLab CI/CD settings or `.gitlab-ci.yml`: |
Or in the runner container as I mentioned above.
GitLab/troubleshooting.md
Outdated
| 1. Verify network connectivity: | ||
| ```bash | ||
| ping -c 3 $(echo "$ORKA_ENDPOINT" | sed 's|http://||') |
This is the third place we suggest this, and this is the third unique approach to doing it.
Let's have a unified approach.
Also, ping is not a good idea: it can be disabled.
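A unified check along these lines might look like the sketch below. It assumes the `/api/v1/cluster-info` path suggested earlier in this review and that `ORKA_ENDPOINT` holds the base URL including the scheme; the function names are hypothetical and not part of the integration scripts.

```shell
#!/bin/sh
# Hypothetical helpers: one connectivity check that every section of the guide can reference.

# Build the URL to probe from the base endpoint.
# The /api/v1/cluster-info path is the one suggested in this review.
orka_probe_url() {
  # ${1%/} drops a single trailing slash, if present.
  printf '%s/api/v1/cluster-info\n' "${1%/}"
}

# Return 0 if the endpoint sends any HTTP response within the timeout.
# $1 = base endpoint, $2 = timeout in seconds (defaults to 10).
check_orka_endpoint() {
  # --max-time bounds the whole request, so an unreachable endpoint fails fast.
  # --fail is deliberately not used: any HTTP status proves the host is reachable.
  curl --silent --show-error --max-time "${2:-10}" --output /dev/null \
    "$(orka_probe_url "$1")"
}
```

Usage would then be a single line wherever the guide needs it, for example `check_orka_endpoint "$ORKA_ENDPOINT" || echo "endpoint unreachable"`.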
GitLab/troubleshooting.md
Outdated
| 1. Manually delete orphaned VMs: | ||
| ```bash | ||
| # List VMs with runner prefix | ||
| orka3 vm list | grep "gl-runner" |
Why list the VMs?
The logic here implies we know there is an orphaned VM and we know its name.
GitLab/troubleshooting.md
Outdated
| orka3 vm list -o json | jq -r '.[].name' | grep "gl-runner" | xargs -I {} orka3 vm delete {} | ||
| ``` | ||
| 2. Consider setting up a periodic cleanup job to remove stale VMs. |
This is going to be hard to implement.
How do we know which VMs are stale?
Changes based on ispasov's review:
- Remove redundant diagnostics section (Docker image validates these)
- Remove orka3 login/user get-token (CI/CD must use service accounts)
- Remove token verification steps (not available in container context)
- Remove grep piping (use CLI's built-in argument filtering)
- Remove export suggestions (vars must be in container/GitLab context)
- Consolidate network connectivity to single curl approach
- Remove "verify exists" steps (trust CLI error messages)
- Add guidance to deploy VMs manually for SSH troubleshooting
- Simplify cleanup section (remove stale VM detection complexity)
- Remove duplicate ping-based connectivity checks

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

An incorrect endpoint produces timeout errors, not 401s. Endpoint troubleshooting is already covered in the connectivity section.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Reference the network section instead of repeating the curl check.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
| orka3 serviceaccount token <service-account-name> | ||
| ``` | ||
| 2. Update the token in GitLab CI/CD settings: |
As mentioned, there are two ways of passing these variables:
- Either through the GitLab UI
- Or on container startup (docker run...)
My suggestion would be to extract a section that tells people how variables are set. Then, in all the places where we say "Make sure this variable is correctly set", we can point them to that section so they know exactly how to set the variable.
WDYT?
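For the container-startup option, such a section could include something like the sketch below. It only prints the `docker run` command rather than executing it; the `<runner-image>` placeholder and the function names are hypothetical, and which variables to pass is up to the user.

```shell
#!/bin/sh
# Sketch only: prints a docker run command instead of executing it.
# RUNNER_IMAGE is a placeholder, not the real image name.
RUNNER_IMAGE="${RUNNER_IMAGE:-<runner-image>}"

# Emit " -e NAME" for each variable name given.
# With "-e NAME" (no "=value"), docker copies the value from the calling
# shell's environment, so secrets do not appear on the command line.
env_flags() {
  for name in "$@"; do
    printf ' -e %s' "$name"
  done
}

# Print the full command for the given variable names.
print_runner_command() {
  printf 'docker run%s %s\n' "$(env_flags "$@")" "$RUNNER_IMAGE"
}

print_runner_command ORKA_ENDPOINT ORKA_TOKEN
```

The GitLab UI option needs no command: the same variable names go under Settings > CI/CD > Variables.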
| - Go to Settings > CI/CD > Variables | ||
| - Update `ORKA_TOKEN` with the new token | ||
| **Note:** Service account tokens are valid for 1 year by default. For custom duration, use `--duration` flag. |
It depends on the control plane. EKS does not allow tokens that long, so there you would have to use the no-expiration flag.
| 2. Check available node resources: | ||
| ```bash | ||
| orka3 node list -o wide |
You do not need the wide flag here to see the available resources.
| 3. Increase deployment attempts by setting the environment variable in `.gitlab-ci.yml`: | ||
| ```yaml | ||
| variables: | ||
| VM_DEPLOYMENT_ATTEMPTS: "3" |
Why in the yml file and not in the GitLab UI? I would suggest a unified approach to setting variables. As I mentioned above, we can have a section that explains how variables are set and let users pick whatever works for them.
| - VM deployment returned unexpected JSON format | ||
| - VM is in a failed state | ||
| **Solutions:** |
There isn't really a solution for this as it is just a symptom. So one needs to be able to find the root cause and fix it.
Usually if the VM information cannot be extracted, this means there are bigger issues.
It is a good suggestion to ask people to try to deploy a VM manually, but we do not give them anything actionable that can fix the issue.
| 2. Get connection details: | ||
| ```bash | ||
| orka3 vm list test-debug |
You get the connection details from the output of the deploy command. There is no need for a separate command.
| orka3 vm list test-debug | ||
| ``` | ||
| 3. Connect via Screen Sharing (VNC) to check: |
Screen Sharing and VNC are two different things.
So maybe "Screen Sharing or VNC".
| **Causes:** | ||
| - SSH key has a passphrase (not supported) | ||
| - Wrong SSH user | ||
| - SSH key not in VM's authorized_keys |
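For the first cause, one possible way to tell whether a key is passphrase-protected is sketched below. It assumes OpenSSH's `ssh-keygen` is available; the function name is hypothetical and not part of the integration scripts.

```shell
#!/bin/sh
# Return 0 if the private key at $1 can be loaded with an empty passphrase.
# A non-zero status means the key needs a passphrase or could not be read at all
# (for example a missing file or overly open permissions).
key_has_no_passphrase() {
  # -y prints the public key for a private key file; -P "" supplies an empty
  # passphrase, so ssh-keygen fails instead of prompting when one is required.
  ssh-keygen -y -P "" -f "$1" >/dev/null 2>&1 </dev/null
}
```

For example: `key_has_no_passphrase ~/.ssh/orka_deployment_key || echo "key needs a passphrase or is unreadable"`.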
| | `ORKA_VM_NAME_PREFIX` | No | VM name prefix (default: `gl-runner`) | | ||
| | `VM_DEPLOYMENT_ATTEMPTS` | No | Retry count (default: `1`) | | ||
| For sensitive variables like `ORKA_TOKEN` and `ORKA_SSH_KEY_FILE`, enable the "Masked" option. |
I would not recommend passing the whole file via the GitLab UI, but rather mounting the file inside the container.
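Mounting the key could look roughly like the sketch below, which prints the command instead of running it. The in-container path and `<runner-image>` are placeholders, and treating `ORKA_SSH_KEY_FILE` as a path inside the container is an assumption to check against the scripts.

```shell
#!/bin/sh
# Sketch only: prints the command instead of running it.
# Both paths and the image name are placeholders.
HOST_KEY="${HOST_KEY:-$HOME/.ssh/orka_deployment_key}"
CONTAINER_KEY="${CONTAINER_KEY:-/root/.ssh/orka_deployment_key}"
RUNNER_IMAGE="${RUNNER_IMAGE:-<runner-image>}"

print_mount_command() {
  # ":ro" mounts the key read-only inside the container.
  # ORKA_SSH_KEY_FILE is assumed to hold the in-container path of the key.
  printf 'docker run -v %s:%s:ro -e ORKA_SSH_KEY_FILE=%s %s\n' \
    "$HOST_KEY" "$CONTAINER_KEY" "$CONTAINER_KEY" "$RUNNER_IMAGE"
}

print_mount_command
```

This keeps the key material out of the GitLab variables entirely, so only the token needs the "Masked" option.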