feat: Add adb-coding-assistants-cluster module #227
dgokeeffe wants to merge 3 commits into databricks:main
Conversation
Add Terraform module for deploying Claude Code CLI on Databricks clusters with MLflow tracing integration.

Features:
- Claude Code CLI installation with Node.js runtime
- Databricks authentication integration via proxy endpoints
- MLflow tracing for Claude Code sessions
- VS Code/Cursor Remote SSH support
- Token refresh helpers and cron automation
- Databricks skills for common patterns
- Network dependency validation script
- Minimal installation option for constrained environments

The module includes init scripts that:
- Install Claude Code CLI and dependencies
- Configure authentication via DATABRICKS_TOKEN
- Set up bashrc helpers for token management
- Support profile-based Azure authentication
- Disable experimental betas for stability

Co-authored-by: Cursor <cursoragent@cursor.com>
Pull request overview
Adds a new Terraform module + example to provision a Databricks cluster that installs/configures Claude Code CLI (with MLflow tracing) via init scripts, plus helper scripts for Remote SSH setup and network dependency checks.
Changes:
- Introduces the `adb-coding-assistants-cluster` Terraform module (UC volume + init script upload + cluster creation + outputs).
- Adds installation and helper scripts (full + minimal installer, VS Code/Cursor Remote SSH helper, network dependency checker) and accompanying docs.
- Adds an end-to-end example deployment (providers/auth options, tfvars template, outputs, documentation) and indexes it in the repo README.
Reviewed changes
Copilot reviewed 20 out of 21 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| modules/adb-coding-assistants-cluster/versions.tf | Defines Terraform + Databricks provider constraints for the new module. |
| modules/adb-coding-assistants-cluster/variables.tf | Adds module inputs for cluster/volume/tracing configuration. |
| modules/adb-coding-assistants-cluster/main.tf | Creates UC volume, uploads init script, provisions single-user cluster wiring init script. |
| modules/adb-coding-assistants-cluster/outputs.tf | Exposes cluster and init-script/volume details for consumers. |
| modules/adb-coding-assistants-cluster/README.md | Documents module usage, assumptions, and generated TF docs. |
| modules/adb-coding-assistants-cluster/Makefile | Adds terraform-docs generation/check targets for module docs. |
| modules/adb-coding-assistants-cluster/scripts/install-claude.sh | Full online installer + bash helpers for tokens, tracing, and VS Code guidance. |
| modules/adb-coding-assistants-cluster/scripts/install-claude-minimal.sh | Minimal installer with basic PATH + env var setup. |
| modules/adb-coding-assistants-cluster/scripts/vscode-setup.sh | Standalone Remote SSH setup/check/settings generator for IDEs. |
| modules/adb-coding-assistants-cluster/scripts/check-network-deps.sh | Preflight connectivity validator for required external domains. |
| modules/adb-coding-assistants-cluster/scripts/README.md | Documents the scripts and operational guidance. |
| examples/adb-coding-assistants-cluster/versions.tf | Sets example Terraform version constraint. |
| examples/adb-coding-assistants-cluster/variables.tf | Example inputs including auth selection validation. |
| examples/adb-coding-assistants-cluster/providers.tf | Example provider config (profile vs Azure resource-id path). |
| examples/adb-coding-assistants-cluster/main.tf | Wires example variables into the new module. |
| examples/adb-coding-assistants-cluster/outputs.tf | Prints cluster+volume outputs and a user-facing instruction block. |
| examples/adb-coding-assistants-cluster/README.md | Step-by-step example deployment + post-deploy workflow. |
| examples/adb-coding-assistants-cluster/terraform.tfvars.example | Provides an example tfvars template for quick start. |
| examples/adb-coding-assistants-cluster/Makefile | Adds terraform-docs generation/check targets for example docs. |
| README.md | Adds the new example/module to the repository index tables. |
| .gitignore | Ignores terraform.tfvars and *.plan files (keeps tfvars.example). |
```bash
  fi
fi

W="${DATABRICKS_HOST}"
```
install-claude.sh runs with set -u, but setup_bashrc dereferences DATABRICKS_HOST without a default. If DATABRICKS_HOST is not present in the init-script environment (common), this will abort the init script and can fail cluster startup. Use a safe expansion (e.g., ${DATABRICKS_HOST:-}) and/or avoid substituting host at init time (leave resolution to login-time env vars).
Suggested change:

```bash
W="${DATABRICKS_HOST:-}"
```
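As a quick illustration (with a hypothetical `DEMO_HOST` variable standing in for `DATABRICKS_HOST`), `${VAR:-}` keeps a `set -u` script alive when the variable is unset, while a bare `${VAR}` aborts it:

```bash
# Sketch: why ${VAR:-} is safe under `set -u` while a bare ${VAR} is not.
set -u

unset DEMO_HOST 2>/dev/null || true

# Safe: expands to "" when DEMO_HOST is unset, so the script continues.
W="${DEMO_HOST:-}"
echo "safe expansion gave: '${W}'"

# Unsafe: dereferencing an unset variable aborts the shell under set -u.
# Run it in a subshell so the failure can be observed without exiting.
if ( : "${DEMO_HOST}" ) 2>/dev/null; then
  echo "unexpected: unset variable did not fail"
else
  echo "unguarded expansion failed as expected"
fi
```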
```bash
local venv_path
venv_path=$(claude-vscode-env 2>/dev/null)

echo "=== VS Code/Cursor settings.json Configuration ==="
echo ""
echo "Add this to your VS Code/Cursor settings.json:"
echo ""
echo "{"
echo "  \"remote.SSH.defaultExtensions\": ["
echo "    \"ms-Python.python\","
echo "    \"ms-toolsai.jupyter\""
echo "  ]"
if [ $? -eq 0 ] && [ -n "$venv_path" ]; then
  echo ","
  echo "  \"python.defaultInterpreterPath\": \"$venv_path/bin/python\""
fi
echo "}"
echo ""
if [ $? -eq 0 ] && [ -n "$venv_path" ]; then
```
claude-vscode-config checks $? long after venv_path=$(...), but multiple echo calls overwrite $? to 0. This makes the success checks effectively meaningless. Capture the exit status immediately (e.g., rc=$?) or just key off -n "$venv_path" (and/or have claude-vscode-env print nothing on failure) so the conditional reflects the actual detection result.
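A minimal sketch of the suggested pattern (`fake_env` is a hypothetical stand-in for `claude-vscode-env`): capture the status into `rc` immediately after the command substitution, before any `echo` resets `$?`:

```bash
# Hypothetical stand-in for claude-vscode-env failing with status 3.
fake_env() { return 3; }

venv_path=$(fake_env 2>/dev/null)
rc=$?                          # capture immediately, before any other command
echo "this echo resets \$? to 0, but rc still holds the real status"

if [ "$rc" -eq 0 ] && [ -n "$venv_path" ]; then
  echo "detected venv: $venv_path"
else
  echo "no venv detected (rc=$rc)"
fi
```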
```bash
fi

log "Installing Claude Code CLI..."
if curl -fsSL https://claude.ai/install.sh | bash &>>$L; then
```
Piping a remote script directly into bash is a supply-chain risk (no integrity verification, TOCTOU exposure). Prefer downloading to a temporary file, validating integrity (checksum/signature or pinned version), then executing it; at minimum, write the script to disk and log/inspect it before running.
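A hedged sketch of the download-then-verify pattern the comment suggests. `verify_and_run` and the locally generated file are illustrative only; in the real init script the file would come from `curl -fsSL ... -o "$tmp"` and the pinned checksum would be hard-coded:

```bash
set -euo pipefail

# Verify a downloaded script against a pinned SHA-256 before executing it.
verify_and_run() {
  local file="$1" expected="$2" actual
  actual=$(sha256sum "$file" | awk '{print $1}')
  if [ "$actual" != "$expected" ]; then
    echo "checksum mismatch for $file -- refusing to execute" >&2
    return 1
  fi
  bash "$file"
}

# Demo with a locally created script instead of a network download.
tmp=$(mktemp)
printf 'echo installer ran\n' > "$tmp"
pinned=$(sha256sum "$tmp" | awk '{print $1}')   # in practice: a hard-coded value
verify_and_run "$tmp" "$pinned"
rm -f "$tmp"
```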
```bash
# Install Claude Code CLI
if ! command -v claude >/dev/null 2>&1; then
    log "Installing Claude Code CLI..."
    curl -fsSL https://claude.ai/install.sh | bash >> "$LOG_FILE" 2>&1
```
Same issue as the full installer: curl | bash executes unverified remote content. Use a download + verification step (checksum/signature/pinned version) before execution (or mirror the installer internally for controlled environments).
Suggested change:

```bash
npm install -g @anthropic-ai/claude-code >> "$LOG_FILE" 2>&1
```
```bash
# Install MLflow with Databricks support
log "Installing MLflow with Databricks support..."
if pip install --quiet --upgrade "mlflow[databricks]>=3.4" &>>$L; then
```
Using pip directly in init scripts can install into an unexpected interpreter (or fail if pip isn’t on PATH / points to a different Python). Prefer python3 -m pip install ... (and consider explicitly targeting the Databricks runtime env if required) so the installed mlflow CLI matches the Python environment you later invoke.
Suggested change:

```bash
if python3 -m pip install --quiet --upgrade "mlflow[databricks]>=3.4" &>>$L; then
```
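To see why the suggestion matters: `python3 -m pip` is guaranteed to run pip under the interpreter named on the left, whereas a bare `pip` resolves via `PATH` and may belong to a different Python. A small illustrative check:

```bash
# The interpreter that `python3 -m pip` would install into:
py=$(python3 -c 'import sys; print(sys.executable)')
echo "python3 -m pip installs into: $py"

# The interpreter a bare `pip` is bound to, if one is on PATH at all:
if command -v pip >/dev/null 2>&1; then
  pip --version    # prints the Python it is bound to; may differ from $py
else
  echo "no bare pip on PATH -- exactly the failure mode the comment warns about"
fi
```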
```markdown
| Script | Purpose | Network Required |
|--------|---------|------------------|
| `install-claude.sh` | Online installation (default) | ✅ Yes |
```
The scripts overview table lists only install-claude.sh, but this directory also includes install-claude-minimal.sh, vscode-setup.sh, and check-network-deps.sh. Also, the doc claims wget is installed, but the installer installs curl git jq (and minimal installs curl git)—either install wget or update the documentation to match actual behavior.
```markdown
- ✅ **Node.js 20.x** - Required runtime for Claude CLI
- ✅ **Claude Code CLI** - AI coding assistant
- ✅ **MLflow** - For tracing Claude interactions
- ✅ **System tools** - curl, wget, git, jq
```
```
│ Databricks Cluster (on startup)            │
│                                            │
│ 1. Executes init script from volume        │
│ 2. Installs Node.js, OpenCode, Claude CLI  │
```
The module README describes installing/configuring OpenCode (opencode) and generating ~/.opencode/config.json, but the provided init scripts in this PR don’t install OpenCode or add related helpers. Either implement OpenCode installation/configuration in the init script(s), or remove/update these README sections to avoid incorrect guidance.
```
│ • DATABRICKS_TOKEN available from environment  │
│ • Configs auto-generate:                       │
│   - ~/.claude/settings.json                    │
│   - ~/.opencode/config.json                    │
│ • Commands ready: claude, opencode             │
```
```bash
local cron_cmd="[ -n \"\$DATABRICKS_TOKEN\" ] && [ -n \"\$DATABRICKS_HOST\" ] && source \"\$HOME/.bashrc\" && _check_and_refresh_token >/dev/null 2>&1"
local cron_job="0 * * * * $cron_cmd"
```
cron_cmd and cron_job are defined but not used (the function later installs cron_file directly into crontab). Removing unused locals (or actually using cron_job) will reduce confusion and keep the function as a single source of truth for the scheduled command.
Suggested change: remove both unused `local` declarations.
- Add safe expansion for DATABRICKS_HOST to prevent crash under set -u
- Remove unused local variables in claude-setup-token-refresh()
- Fix broken $? check in claude-vscode-config() by capturing exit code
- Use python3 -m pip instead of bare pip for safer execution
- Add supply-chain verification comment to minimal installer
- Fix node_type_id description to match actual default value
- Update scripts README with all available scripts
- Remove wget from system tools list (not installed)
- Remove OpenCode references throughout documentation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@dgokeeffe I'm thinking - do we really need this as a module? We may have only code in
Pull request overview
Copilot reviewed 20 out of 21 changed files in this pull request and generated 18 comments.
```bash
log "Installing Node.js 20.x..."
if curl -fsSL --max-time 300 --retry 3 https://deb.nodesource.com/setup_20.x | sudo -E bash - &>>$L; then
    if sudo apt-get update -qq -y &>>$L && sudo apt-get install -y -qq nodejs &>>$L; then
        if cmd_exists node && cmd_exists npm; then
            log "[OK] Node.js/npm installed successfully ($(node --version))"
            return 0
        fi
```
The NodeSource installer is also executed via curl ... | sudo bash - with no integrity verification. Consider pinning to a specific Node.js package version/repo key and validating the script (or using distro packages) to reduce supply-chain exposure.
Suggested change:

```bash
log "Installing Node.js (distro package)..."
if sudo apt-get update -qq -y &>>$L && sudo apt-get install -y -qq nodejs npm &>>$L; then
    if cmd_exists node && cmd_exists npm; then
        log "[OK] Node.js/npm installed successfully ($(node --version))"
        return 0
```
```bash
set -euo pipefail
export DEBIAN_FRONTEND=noninteractive

LOG_FILE="/tmp/init-script-claude.log"
log() {
    echo "[$(date '+%H:%M:%S')] $1" | tee -a "$LOG_FILE"
}

# Install system dependencies
log "Installing system dependencies..."
sudo apt-get update -qq -y >> "$LOG_FILE" 2>&1
sudo apt-get install -y -qq curl git >> "$LOG_FILE" 2>&1 || log "Warning: Some packages failed to install"
```
The minimal init script runs with set -e, but apt-get update is not guarded. In restricted networks this will cause the init script (and thus cluster startup) to fail immediately. If the goal is a lightweight/best-effort installer, consider handling failures similarly to install-claude.sh (log a warning and continue) or make the fail-fast behavior explicit in docs.
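One way to get the best-effort behavior under `set -e` (names like `best_effort` are illustrative, not from the PR): wrap fallible steps so a failure logs a warning and continues:

```bash
set -euo pipefail

log() { echo "[init] $1"; }

# Run a command; on failure, log a warning and continue instead of aborting.
best_effort() {
  if "$@"; then
    return 0
  else
    log "Warning: '$*' failed (continuing)"
    return 0
  fi
}

best_effort true
best_effort false   # would kill the script under set -e if unguarded
log "still running after a failed best-effort step"
```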
```markdown
| `vscode-setup.sh` | VS Code/Cursor Remote SSH helper | No |
| `check-network-deps.sh` | Network connectivity preflight check | Yes |

> **Note**: For offline/air-gapped installations, use the separate [`adb-coding-assistants-cluster-offline`](../adb-coding-assistants-cluster-offline/README.md) module.
```
This scripts README links to an adb-coding-assistants-cluster-offline module that is not present in the repository, resulting in broken links. Either add the referenced offline module/docs or remove/update these references.
Suggested change:

```markdown
> **Note**: These scripts are intended for clusters with network access as indicated above. Offline or air-gapped installations are not covered by this module.
```
```markdown
## Offline Installation

For air-gapped or restricted network environments, use the separate offline module: [`adb-coding-assistants-cluster-offline`](../../modules/adb-coding-assistants-cluster-offline/README.md). See the [Offline Installation Guide](../../modules/adb-coding-assistants-cluster-offline/scripts/OFFLINE-INSTALLATION.md) for detailed instructions.
```
The example README references an adb-coding-assistants-cluster-offline module and an offline installation guide path that do not exist in this repo. Update or remove these links unless the offline module is added in the same PR.
Suggested change: remove the "Offline Installation" section and its links.
```hcl
validation {
  condition     = var.databricks_profile != null || var.databricks_resource_id != null
  error_message = "Either databricks_profile or databricks_resource_id must be set. Recommended: use databricks_profile for simpler configuration."
}
```
The databricks_resource_id validation only checks that one of the two auth variables is set; it doesn’t validate the Azure resource ID format. Adding a regex-based validation for the expected /subscriptions/.../resourceGroups/.../providers/Microsoft.Databricks/workspaces/... shape would prevent hard-to-debug failures later in providers.tf parsing.
Suggested change:

```hcl
  }

  validation {
    condition = var.databricks_profile != null ||
      var.databricks_resource_id == null ||
      can(regex("^/subscriptions/[^/]+/resourceGroups/[^/]+/providers/Microsoft\\.Databricks/workspaces/[^/]+$", var.databricks_resource_id))
    error_message = "When databricks_profile is not set, databricks_resource_id must be a valid Azure Databricks workspace resource ID of the form /subscriptions/{subscription-id}/resourceGroups/{resource-group}/providers/Microsoft.Databricks/workspaces/{workspace-name}."
  }
}
```
```bash
# Regenerate configs
claude-refresh-token
claude-refresh-token
```
claude-refresh-token is listed twice in a row in this snippet. Removing the duplicate avoids confusion in the troubleshooting instructions.
Suggested change:

```bash
claude-refresh-token
```
```hcl
# Configure the Azure Provider
# When using profile-based auth, subscription_id is not needed (provider will auto-detect if Azure CLI is configured)
# When using Azure resource ID approach, subscription_id is extracted from the resource ID
provider "azurerm" {
  subscription_id = local.subscription_id
  features {}
  skip_provider_registration = local.use_profile_auth

  # Allow provider to work without explicit subscription_id when using profile auth
  # It will attempt to auto-detect from Azure CLI or environment variables
}
```
Even when databricks_profile is used, this example still configures the azurerm provider. In practice, the azurerm provider often requires Azure credentials during plan/apply, which can make the “profile-based (cloud-agnostic)” path fail unexpectedly. Consider splitting this example into two (profile-only without azurerm/external vs Azure-resource-id with azurerm) or clearly documenting that Azure auth may still be required.
```markdown
For standard access mode, you must:
1. Set up an allowlist for init scripts
2. Grant permissions to the volume

See [Allowlist documentation](https://docs.databricks.com/data-governance/unity-catalog/manage-privileges/allowlist).
```
The module hard-codes data_security_mode = "SINGLE_USER" in main.tf, but this README includes a “Standard Access Mode” section that reads like a supported configuration option. Consider either adding a module input to select the access mode, or clarifying in docs that standard access mode is not supported by this module as-written.
Suggested change:

````markdown
> Note: Standard access mode is **not supported by this module as currently implemented**.
>
> The cluster created by this module is always configured with:
>
> ```hcl
> data_security_mode = "SINGLE_USER"
> ```
>
> If you need to run in standard access mode, you must fork or customize this module
> (for example, by updating `main.tf` to use the desired `data_security_mode`)
> and then manage cluster security and permissions yourself.
>
> In particular, for standard access mode you must:
> 1. Set up an allowlist for init scripts
> 2. Grant appropriate permissions to the Unity Catalog volume
>
> For more details, see the Databricks documentation on allowlists:
> [Allowlist documentation](https://docs.databricks.com/data-governance/unity-catalog/manage-privileges/allowlist).
````
```hcl
module "coding_cluster" {
  source = "./modules/coding-assistants-cluster"
```
The module usage examples reference a non-existent path (./modules/coding-assistants-cluster). This should point to ./modules/adb-coding-assistants-cluster (or the correct registry source) so copy/paste deployments work.
```markdown
| <a name="input_min_workers"></a> [min\_workers](#input\_min\_workers) | Minimum number of workers for autoscaling | `number` | `1` | no |
| <a name="input_mlflow_experiment_name"></a> [mlflow\_experiment\_name](#input\_mlflow\_experiment\_name) | MLflow experiment name for Claude Code tracing | `string` | `"/Workspace/Shared/claude-code-tracing"` | no |
| <a name="input_node_type_id"></a> [node\_type\_id](#input\_node\_type\_id) | Node type for the cluster. Default is Standard_D8pds_v6 (modern, premium SSD + local NVMe). If unavailable in your region, consider Standard_DS13_v2 as fallback. | `string` | `"Standard_D8pds_v6"` | no |
| <a name="input_num_workers"></a> [num\_workers](#input\_num\_workers) | Number of worker nodes (null for autoscaling) | `number` | `null` | no |
```
The embedded terraform-docs section appears out of sync with the actual module inputs (e.g., node_type_id default/description differs from variables.tf). Regenerate terraform-docs so the Requirements/Inputs tables reflect the real defaults and descriptions.
Suggested change:

```markdown
| <a name="input_num_workers"></a> [num\_workers](#input\_num\_workers) | Number of worker nodes for fixed-size clusters. Set to `null` when using autoscaling with `min_workers` and `max_workers`. | `number` | `null` | no |
```
@dgokeeffe can you address comments?
@alexott I assumed every example needed a module, but I see that's not the case, will fix up now.
Per review feedback, not every example needs a corresponding module. Inlined all resources from modules/adb-coding-assistants-cluster/ directly into examples/adb-coding-assistants-cluster/ to make it self-contained. Copied scripts into the example directory and updated all references.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
This PR adds a new Terraform module for deploying Claude Code CLI on Databricks clusters with MLflow tracing integration.
Key Features
Module Structure
- `modules/adb-coding-assistants-cluster/` - Main module with full features
- `examples/adb-coding-assistants-cluster/` - Example deployment configuration

Init Scripts
Authentication
The module configures Claude Code to use Databricks as the model provider:
Helper Commands
The init scripts add several helper commands to the cluster:
- `check-claude` - Verify installation status
- `claude-refresh-token` - Regenerate authentication settings
- `claude-token-status` - Check token freshness
- `claude-tracing-enable/disable/status` - Manage MLflow tracing
- `claude-vscode-setup` - Remote SSH setup guide

Test Plan
Made with Cursor