| 1 | +# SkyDeck Development Guidelines |
| 2 | + |
| 3 | +## SkyPilot Integration |
| 4 | + |
| 5 | +**IMPORTANT**: Always use the SkyPilot API for accessing job and cluster information. Never query the SkyPilot SQLite databases directly (e.g., `~/.sky/jobs.db`, `~/.sky/state.db`). |
| 6 | + |
| 7 | +### API Access |
| 8 | + |
| 9 | +- API endpoint: `https://skypilot-api.softmax-research.net` |
| 10 | +- Authentication: OAuth2 cookies stored in `~/.sky/cookies.txt` |
| 11 | +- Configuration: `~/.sky/config.yaml` |
| 12 | + |
| 13 | +### Why Use the API |
| 14 | + |
| 15 | +1. **Centralized data**: The API provides access to managed jobs across all users and clusters |
| 16 | +2. **Up-to-date information**: The API reflects the current state of the jobs controller |
| 17 | +3. **Proper abstractions**: The API provides structured data with proper types |
| 18 | +4. **Security**: Direct database access bypasses authentication and auditing |
| 19 | + |
| 20 | +## Data Model |
| 21 | + |
| 22 | +### Experiment Groups |
| 23 | +- **Created by**: User (via UI) |
| 24 | +- **Purpose**: Organize experiments into logical groups |
| 25 | +- **Contains**: Ordered list of experiments (many-to-many relationship) |
| 26 | +- **Fields**: `id`, `name`, `flags` (columns to display), `order`, `collapsed` |
| 27 | + |
| 28 | +### Experiments |
| 29 | +- **Created by**: User (via "Create" button or "Duplicate") |
| 30 | +- **Purpose**: Configuration template that defines what to run |
| 31 | +- **Key fields**: |
| 32 | + - `id`: Auto-increment integer (internal) |
| 33 | + - `name`: Unique string identifier (user-facing, used for matching jobs/checkpoints) |
| 34 | + - `desired_state`: RUNNING or STOPPED |
| 35 | + - `current_state`: Reflects latest job status |
| 36 | + - `flags`: Configuration key-value pairs |
| 37 | + - `nodes`, `gpus`: Resource requirements |
| 38 | + |
| 39 | +### Jobs |
| 40 | +- **Created by**: Synced from SkyPilot API (poller) |
| 41 | +- **Purpose**: Track actual job executions |
| 42 | +- **Matching**: Jobs display under experiments where `job.experiment_id == experiment.name` |
| 42 | +- **Key fields**: `id`, `experiment_id` (matches `experiment.name`), `status`, `command`, `git_ref`, `nodes`, `gpus` |
| 44 | + |
| 45 | +### Checkpoints |
| 46 | +- **Created by**: Synced from S3 (syncer) |
| 47 | +- **Purpose**: Track model checkpoints and replay files |
| 48 | +- **S3 path**: `s3://softmax-public/policies/{experiment.name}/` |
| 48 | +- **Key fields**: `experiment_id` (references `experiment.id`), `epoch`, `model_path`, `replay_paths`, `policy_version` |
| 50 | + |
| 51 | +### Key Relationships |
| 52 | +- **Jobs** match to experiments by **name**: `job.experiment_id == experiment.name` |
| 53 | +- **Checkpoints** are stored by **id**: `checkpoint.experiment_id == experiment.id` |
| 54 | +- **S3 paths** use experiment **name** (e.g., `s3://.../{experiment.name}/`) |
| 55 | +- If no experiment exists with matching name, jobs appear as "orphaned" |
| 56 | + |
| 57 | +## Database |
| 58 | + |
| 59 | +**Database Location**: SkyDeck uses SQLite for persistent storage. |
| 60 | + |
| 61 | +- **Default location**: `~/.skydeck/skydeck.db` |
| 61 | +- **Configuration**: Override the default with the `--db-path` flag or the `SKYDECK_DB_PATH` environment variable |
| 63 | +- **Schema**: Defined in `skydeck/database.py` with automatic migrations on startup |
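
As a rough sketch of how the database path could be resolved, the helper below checks the flag value, then the environment variable, then the default. The function name `resolve_db_path` is hypothetical, and the precedence (flag over environment variable) is an assumption; the guidelines do not say which wins when both are set.

```python
import os
from pathlib import Path

DEFAULT_DB_PATH = Path.home() / ".skydeck" / "skydeck.db"


def resolve_db_path(cli_value=None, env=None):
    """Return the database path to use.

    Assumed precedence: --db-path value, then SKYDECK_DB_PATH, then the default.
    """
    env = os.environ if env is None else env
    if cli_value:
        return Path(cli_value).expanduser()
    env_value = env.get("SKYDECK_DB_PATH")
    if env_value:
        return Path(env_value).expanduser()
    return DEFAULT_DB_PATH
```

Passing `env` explicitly keeps the function easy to test without touching the real process environment.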
| 64 | + |
| 65 | +### Database Scripts |
| 66 | + |
| 67 | +When working with the database directly: |
| 68 | + |
| 69 | +```bash |
| 70 | +# Backfill checkpoint versions (example) |
| 71 | +uv run python -c " |
| 72 | +import asyncio |
| 73 | +from pathlib import Path |
| 74 | +from skydeck.backfill_versions import backfill_checkpoint_versions |
| 75 | +db_path = str(Path.home() / '.skydeck' / 'skydeck.db') |
| 76 | +asyncio.run(backfill_checkpoint_versions(db_path)) |
| 77 | +" |
| 78 | + |
| 79 | +# Query database directly |
| 80 | +sqlite3 ~/.skydeck/skydeck.db "SELECT COUNT(*) FROM experiments;" |
| 81 | +``` |
| 82 | + |
| 83 | +## Code Style |
| 84 | + |
| 85 | +- Always use `uv` for pip and python operations |
| 86 | +- Imports should go at the top of the file if possible |
| 87 | +- Follow existing patterns in the codebase for consistency |
| 88 | +- **NEVER add fallbacks** - fix the underlying problem instead |
| 88 | +- After backend changes, restart the server yourself; changes do not take effect until SkyDeck is restarted |
| 90 | + |
| 91 | +## Development Workflow |
| 92 | + |
| 93 | +After making backend changes (Python), restart the server to pick up changes: |
| 94 | +```bash |
| 95 | +lsof -ti:8000 | xargs kill -9 2>/dev/null || true |
| 96 | +sleep 2 |
| 97 | +nohup uv run skydeck --port 8000 > /tmp/skydeck.log 2>&1 & |
| 98 | +sleep 3 |
| 99 | +curl -s http://localhost:8000/api/health | head -c 100 # Verify it's running |
| 100 | +``` |
| 101 | + |
| 102 | +After making frontend changes (TypeScript/React): |
| 103 | +```bash |
| 104 | +cd packages/skydeck/frontend |
| 105 | +npm run build # Builds to ../skydeck/static/ |
| 106 | +``` |
| 107 | + |
| 108 | +**Important**: Always restart the backend yourself to test changes. Do not ask the user to restart. |