docs(gitlab): add comprehensive troubleshooting guide by celanthe · Pull Request #67 · macstadium/orka-integrations

celanthe · 2026-02-04T15:32:10Z

Summary

Add comprehensive troubleshooting guide for GitLab Custom Executor
Update README.md with link to troubleshooting guide
Covers authentication, VM deployment, SSH connectivity, environment variables, and network issues

Test plan

Verified all CLI commands work against live Orka cluster (v3.5.2 client, v3.4.0 API)
Confirmed variable names match actual script implementations
Tested connectivity verification commands
Fixed non-existent /api/v1/health endpoint references

🤖 Generated with Claude Code

Add troubleshooting.md covering common issues and solutions:

- Authentication issues (token expiry, endpoint config)
- VM deployment failures (config validation, resource availability)
- SSH connection issues (key setup, timeouts, permissions)
- Environment variable configuration
- Network and connectivity troubleshooting
- Job execution problems
- Orphaned VM cleanup procedures

Also updates README.md to link to the new troubleshooting guide. Addresses DI-342 requirement for troubleshooting documentation.

Verified:

- All variable names match scripts exactly
- Error messages match actual script output
- CLI commands verified against Orka3 CLI docs
- All reference links validated

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Replace non-existent /api/v1/health endpoint references with working CLI-based verification methods (orka3 version).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

ispasov · 2026-02-05T11:56:06Z

GitLab/troubleshooting.md

+
+## Quick diagnostics
+
+Before diving into specific issues, run these checks:


All of these are part of the Docker image.
If they fail to install, the image build fails.

Furthermore, if any of the conditions below is not met, the integration shows a meaningful error, such as:
orka3 cannot be found
permission denied, etc.

So technically, these checks are not needed.

ispasov · 2026-02-05T11:56:54Z

GitLab/troubleshooting.md

+
+**Causes:**
+- `ORKA_TOKEN` is invalid, expired, or not set
+- `ORKA_ENDPOINT` is incorrect


If the endpoint is incorrect, you are most likely going to get timeout errors, as you cannot connect to something that does not exist.

ispasov · 2026-02-05T11:58:38Z

GitLab/troubleshooting.md

+
+1. Verify the token is set correctly:
+   ```bash
+   echo "$ORKA_TOKEN" | head -c 20


You cannot verify this in the container.
The token is passed through the GitLab job and is present only during the job run.

So here you just need to make sure you are passing the correct token, or create another token and use it instead (as you suggest below).
Nothing more than that.

ispasov · 2026-02-05T11:59:02Z

GitLab/troubleshooting.md

+   ```bash
+   # For user authentication
+   orka3 login
+   orka3 user get-token


This should never be used in an integration. User tokens expire after an hour.

ispasov · 2026-02-05T12:00:36Z

GitLab/troubleshooting.md

+3. Verify the endpoint is reachable:
+   ```bash
+   # This shows both client and server versions if connected
+   orka3 version


I would not rely on the CLI here, but rather ask the user to run, for example, curl http://<endpoint>/api/v1/cluster-info.
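
A minimal sketch of what such a curl-based check could look like. The function names are hypothetical, the `/api/v1/cluster-info` path is simply the example given in this comment, and the 10-second timeout is an arbitrary assumption.

```shell
# Build the cluster-info URL from an endpoint, tolerating a trailing slash.
# The /api/v1/cluster-info path is the example suggested in this review.
orka_cluster_info_url() {
  printf '%s/api/v1/cluster-info\n' "${1%/}"
}

# Probe the endpoint with curl and return curl's exit status.
orka_check_endpoint() {
  # --fail: non-zero exit on HTTP errors; --max-time: give up after 10 seconds.
  curl --silent --show-error --fail --max-time 10 --output /dev/null \
    "$(orka_cluster_info_url "$1")"
}
```

For example, `orka_check_endpoint "$ORKA_ENDPOINT" && echo "endpoint reachable"` could be run from the machine that hosts the runner.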

ispasov · 2026-02-05T12:06:56Z

GitLab/troubleshooting.md

+
+3. Test SSH connectivity manually:
+   ```bash
+   ssh -i ~/.ssh/orka_deployment_key -p <PORT> admin@<VM_IP> "echo ok"


The runner deletes the VMs that fail.
So here we need to suggest deploying a VM manually and trying to connect to it.
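
One way the guide could phrase the manual check, as a sketch only: after deploying a VM by hand, take the IP and port from the deploy output and build the same SSH probe the diff uses. The helper name is hypothetical. The default key path and `admin` user are copied from the diff line above, and the `BatchMode` and `ConnectTimeout` options are added assumptions that make the probe fail fast instead of prompting.

```shell
# Print the ssh command for probing a manually deployed VM.
# $1: VM IP, $2: SSH port (both taken from the deploy output)
# $3: private key path, $4: SSH user (defaults copied from the guide's example)
orka_ssh_probe_cmd() {
  printf 'ssh -i %s -p %s -o BatchMode=yes -o ConnectTimeout=10 %s@%s "echo ok"\n' \
    "${3:-$HOME/.ssh/orka_deployment_key}" "$2" "${4:-admin}" "$1"
}
```

Printing the command rather than running it lets the user review it first. Remember to delete the test VM afterwards.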

ispasov · 2026-02-05T12:07:57Z

GitLab/troubleshooting.md

+
+**Solutions:**
+
+1. Verify all required variables are set in your GitLab CI/CD settings or `.gitlab-ci.yml`:


Or in the runner container as I mentioned above.

ispasov · 2026-02-05T12:09:15Z

GitLab/troubleshooting.md

+
+1. Verify network connectivity:
+   ```bash
+   ping -c 3 $(echo "$ORKA_ENDPOINT" | sed 's|http://||')


This is the third place where we suggest this,
and it is the third different approach to doing it.
Let's have a unified approach.
Also, ping is not a good idea, as it can be disabled.
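
If the unified approach ends up being a single curl call, its exit status already separates the common failure modes, which ping cannot do. A small sketch with a hypothetical helper name: the exit codes (6, 7, 22, 28) are curl's documented ones, while the hint wording is only a suggestion.

```shell
# Translate a curl exit status into a troubleshooting hint.
explain_curl_exit() {
  case "$1" in
    0)  echo "endpoint reachable" ;;
    6)  echo "could not resolve host: check the hostname in ORKA_ENDPOINT" ;;
    7)  echo "failed to connect: check the address, port, and VPN or firewall rules" ;;
    22) echo "server answered with an HTTP error (requires curl --fail)" ;;
    28) echo "timed out: the endpoint is likely wrong or unreachable from the runner" ;;
    *)  echo "curl exited with status $1" ;;
  esac
}
```

It would be called right after the curl check, for example `explain_curl_exit "$?"`.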

ispasov · 2026-02-05T12:10:53Z

GitLab/troubleshooting.md

+1. Manually delete orphaned VMs:
+   ```bash
+   # List VMs with runner prefix
+   orka3 vm list | grep "gl-runner"


Why list the VMs?
The logic here implies we know there is an orphaned VM and we know its name.

ispasov · 2026-02-05T12:11:30Z

GitLab/troubleshooting.md

+   orka3 vm list -o json | jq -r '.[].name' | grep "gl-runner" | xargs -I {} orka3 vm delete {}
+   ```
+
+2. Consider setting up a periodic cleanup job to remove stale VMs.


This is going to be hard to implement.
How do we know which VMs are stale?

Changes based on ispasov's review:

- Remove redundant diagnostics section (Docker image validates these)
- Remove orka3 login/user get-token (CI/CD must use service accounts)
- Remove token verification steps (not available in container context)
- Remove grep piping (use CLI's built-in argument filtering)
- Remove export suggestions (vars must be in container/GitLab context)
- Consolidate network connectivity to single curl approach
- Remove "verify exists" steps (trust CLI error messages)
- Add guidance to deploy VMs manually for SSH troubleshooting
- Simplify cleanup section (remove stale VM detection complexity)
- Remove duplicate ping-based connectivity checks

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

An incorrect endpoint produces timeout errors, not 401s. Endpoint troubleshooting is already covered in the connectivity section.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Reference the network section instead of repeating the curl check.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

ispasov · 2026-02-18T12:56:59Z

GitLab/troubleshooting.md

+   orka3 serviceaccount token <service-account-name>
+   ```
+
+2. Update the token in GitLab CI/CD settings:


As mentioned, there are two ways of passing these variables:

Either through the GitLab UI

Or on container startup (docker run...)

My suggestion would be to extract a section that explains how variables are set and, wherever we say "Make sure this variable is correctly set", to point readers to that section so they know exactly how to set the variable.

WDYT
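
Such a section could also carry a small check that reports which variables are missing, whichever way they were passed. A sketch under assumptions: the helper name is hypothetical, the variable names to check are supplied by the caller, and it only helps in a context where the variables are actually present (for example during a job run, per the earlier comment about the token).

```shell
# Print the name of every listed variable that is unset or empty.
# Arguments must be valid shell variable names.
missing_vars() {
  for name in "$@"; do
    # Indirect lookup in POSIX sh: read the variable whose name is in $name.
    eval "value=\${$name:-}"
    [ -n "$value" ] || printf '%s\n' "$name"
  done
}
```

For example, `missing_vars ORKA_ENDPOINT ORKA_TOKEN` prints nothing when both are set and non-empty.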

ispasov · 2026-02-18T13:00:05Z

GitLab/troubleshooting.md

+   - Go to Settings > CI/CD > Variables
+   - Update `ORKA_TOKEN` with the new token
+
+**Note:** Service account tokens are valid for 1 year by default. For custom duration, use `--duration` flag.


Depends on the control plane. EKS does not allow you to have such long tokens. So there you would have to use the no-expiration flag.

ispasov · 2026-02-18T13:01:14Z

GitLab/troubleshooting.md

+
+2. Check available node resources:
+   ```bash
+   orka3 node list -o wide


You do not need the wide flag here to see the available resources.

ispasov · 2026-02-18T13:02:54Z

GitLab/troubleshooting.md

+3. Increase deployment attempts by setting the environment variable in `.gitlab-ci.yml`:
+   ```yaml
+   variables:
+     VM_DEPLOYMENT_ATTEMPTS: "3"


Why in the yml file and not in the GitLab UI? I would suggest a unified approach to setting variables. As I mentioned above, we can have a section that explains how variables are set, so users can pick whatever works for them.

ispasov · 2026-02-18T13:04:36Z

GitLab/troubleshooting.md

+- VM deployment returned unexpected JSON format
+- VM is in a failed state
+
+**Solutions:**


There isn't really a solution for this as it is just a symptom. So one needs to be able to find the root cause and fix it.
Usually if the VM information cannot be extracted, this means there are bigger issues.
It is a good suggestion to ask people to try to deploy a VM manually, but we do not give them anything actionable that can fix the issue.

ispasov · 2026-02-18T13:06:02Z

GitLab/troubleshooting.md

+
+2. Get connection details:
+   ```bash
+   orka3 vm list test-debug


You get the connection details from the output of the deploy command. There is no need for a separate command.

ispasov · 2026-02-18T13:06:37Z

GitLab/troubleshooting.md

+   orka3 vm list test-debug
+   ```
+
+3. Connect via Screen Sharing (VNC) to check:


Screen Sharing and VNC are two different things.
So maybe "Screen Sharing or VNC".

ispasov · 2026-02-18T13:07:25Z

GitLab/troubleshooting.md

+**Causes:**
+- SSH key has a passphrase (not supported)
+- Wrong SSH user
+- SSH key not in VM's authorized_keys


Or the key is wrong

ispasov · 2026-02-18T13:10:00Z

GitLab/troubleshooting.md

+| `ORKA_VM_NAME_PREFIX` | No | VM name prefix (default: `gl-runner`) |
+| `VM_DEPLOYMENT_ATTEMPTS` | No | Retry count (default: `1`) |
+
+For sensitive variables like `ORKA_TOKEN` and `ORKA_SSH_KEY_FILE`, enable the "Masked" option.


I would not recommend passing the whole file via the GitLab UI, but rather mounting the file inside the container.

celanthe and others added 2 commits February 3, 2026 15:29

Fix connectivity test commands in troubleshooting guide

6d9ffb6

celanthe self-assigned this Feb 4, 2026

celanthe added the documentation Improvements or additions to documentation label Feb 4, 2026

celanthe requested a review from a team February 4, 2026 15:35

ispasov reviewed Feb 5, 2026

celanthe requested a review from ispasov February 9, 2026 15:21

celanthe and others added 2 commits February 17, 2026 09:04

fix: remove incorrect endpoint cause from 401 error section

b007022

fix: deduplicate connectivity check in auth section

d42df70

ispasov reviewed Feb 18, 2026



