
DevStack Core Improvement Task List

Created: 2025-11-14
Status: Not Started
Current Phase: Phase 0 - Preparation
Last Updated: 2025-11-14
Document Version: 2.0


Overview

This document tracks the implementation of 20 improvements identified in the comprehensive codebase analysis. Each task must be completed and tested before moving to the next task. Each phase must be fully completed before moving to the next phase.

Completion Criteria:

  • ✅ Task implementation complete
  • ✅ Tests written and passing
  • ✅ Smoke tests pass (no regressions)
  • ✅ Documentation updated
  • ✅ Rollback tested and verified
  • ✅ No breaking changes to existing functionality

Assumptions

  1. Environment: All work performed on macOS with Apple Silicon
  2. Access: Full admin access to development environment
  3. Time: Dedicated time blocks for each phase (no interruptions)
  4. Backups: Ability to backup/restore entire environment quickly
  5. External Services: No external service dependencies (S3, email, etc.) required initially
  6. Team Size: Single person implementing (adjust estimates for team)
  7. Testing: Comprehensive testing possible in dev environment
  8. Rollback: Acceptable to rollback if critical issues found
  9. Timeline: Flexible timeline, quality over speed
  10. Scope: Can defer tasks to future phases if needed

Constraints

  1. Zero Downtime: Not required; this is a development environment and downtime is acceptable
  2. Budget: No budget for external services (GitHub Actions minutes, S3 storage)
  3. Hardware: Limited to single development machine
  4. Network: Local development only, no cloud deployment
  5. Time: Target completion within 4-5 weeks
  6. Compatibility: Must maintain compatibility with existing configurations
  7. Documentation: All changes must be documented
  8. Testing: All changes must be tested
  9. Security: Changes must not degrade the existing security posture
  10. Performance: Changes must not degrade existing performance

Resource Requirements

Disk Space

  • Vault backups: ~10MB daily × 30 days = ~300MB
  • Database backups: ~500MB daily × 7 days = ~3.5GB
  • Migration frameworks: ~200MB
  • Test artifacts and logs: ~500MB
  • Total additional: ~4.5GB disk space minimum

Memory

  • Alertmanager: +128MB
  • Flyway/Liquibase (temporary): +256MB during migrations
  • Test containers (temporary): +512MB during CI/CD tests
  • Total additional: +896MB RAM (peak during testing)

Network Ports

  • Alertmanager: 9093
  • Verify available: netstat -an | grep 9093

External Services (Optional)

  • GPG for backup encryption (install if missing: brew install gnupg)
  • S3 bucket or external storage (for secure backups - Phase 1.3)
  • Email/Slack for Alertmanager (optional but recommended - Phase 2.2)

Tools Required

  • Docker Desktop or Colima
  • Git
  • Python 3.11+ with uv
  • Vault CLI
  • jq (for JSON parsing)
  • shellcheck (for bash linting)
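
A quick preflight check for the tools above (a minimal sketch; gpg is included because the Phase 0 completion criteria and Task 1.3 assume it):

#!/bin/bash
# Verify that every tool required by this plan is on PATH before starting Phase 0.
missing=0
for tool in docker git python3 uv vault jq shellcheck gpg; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "OK:      $tool ($(command -v "$tool"))"
  else
    echo "MISSING: $tool"
    missing=1
  fi
done
exit "$missing"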

Risk Register

| Risk ID | Description | Probability | Impact | Mitigation | Owner | Status |
|---------|-------------|-------------|--------|------------|-------|--------|
| R1 | AppRole breaks all services | Medium | High | Keep root token fallback via VAULT_USE_ROOT_TOKEN=true | Phase 1 | Open |
| R2 | Backup encryption passphrase lost | Low | Critical | Document passphrase in secure location; test restoration regularly | Phase 1 | Open |
| R3 | CI/CD integration tests infeasible | High | Medium | Use alternative testing strategy (manual integration tests) | Phase 2 | Open |
| R4 | set -euo pipefail breaks scripts | Medium | Medium | Test incrementally; use -eo pipefail first; add -u selectively | Phase 3 | Open |
| R5 | Performance degradation during changes | Low | Medium | Measure baseline first; monitor continuously; roll back if >5% degradation | All | Open |
| R6 | Time estimates too optimistic | High | Low | 40% buffer added to all estimates; track actual vs. estimated | All | Open |
| R7 | Vault data loss during implementation | Low | Critical | Full backup in Phase 0; test restoration before starting | Phase 0 | Open |
| R8 | Database migration failures | Medium | Medium | Test on a copy of the data first; maintain rollback scripts | Phase 2 | Open |

Task Dependencies

Phase 0 Dependencies

All Phase 0 tasks must complete before Phase 1 starts
No internal dependencies within Phase 0

Phase 1 Dependencies

Task 1.1 (AppRole) ────────┐
                           ├──> Task 1.2 (Remove Env Creds) ──┐
                           │                                  │
Task 1.3 (Vault Backup) ───┼──────────────────────────────────┼──> Phase 1 Complete
Task 1.4 (MySQL Fix) ──────┤                                  │
Task 1.5 (Document) ───────┘                                  │
                                                              │
All tasks must complete ──────────────────────────────────────┘

Recommended Execution Order:

  1. Complete Task 1.1 fully (including integration testing)
  2. Start Tasks 1.3, 1.4, 1.5 in parallel (independent)
  3. Complete Task 1.2 (after 1.1 confirmed working)

Phase 2 Dependencies

Task 2.2 (Alertmanager) ──┬──> Task 2.1 (Alert Rules) ──────┐
                          └──> Task 2.3 (Backup Verify) ────┤
                                                            ├──> Phase 2 Complete
Task 2.4 (Migrations) ──────────────────────────────────────┤
Task 2.5 (CI/CD Tests) ─────────────────────────────────────┘

Recommended Execution Order:

  1. Complete Task 2.2 (Alertmanager) first
  2. Start Task 2.1 (after 2.2 working)
  3. Start Tasks 2.4, 2.5 in parallel (independent)
  4. Complete Task 2.3 (after 2.2 working)

Phase 3 Dependencies

All Phase 3 tasks are independent (can run in parallel)
Task 3.1, 3.2, 3.3, 3.4, 3.5 ──> Phase 3 Complete

Phase 4 Dependencies

All Phase 4 tasks are independent (can run in parallel)
Task 4.1, 4.2, 4.3, 4.4, 4.5 ──> Phase 4 Complete

Progress Tracking

Update this section daily:

Phase 0 Progress: 0% (0/1 tasks)
Phase 1 Progress: 0% (0/5 tasks)
Phase 2 Progress: 0% (0/5 tasks)
Phase 3 Progress: 0% (0/5 tasks)
Phase 4 Progress: 0% (0/5 tasks)
Overall Progress: 0% (0/21 tasks)

Current Status: Not Started
Current Task: N/A
Blockers: None
Start Date: TBD
Expected Completion: TBD (4-5 weeks from start)
Last Updated: 2025-11-14


Success Metrics

Baseline Metrics (Measure in Phase 0):

  • Test coverage: ___% (run pytest --cov + go test -cover)
  • Security score: ___ (run ./scripts/audit-capabilities.sh)
  • Performance p95: ___ ms per endpoint (run tests/performance-benchmark.sh)
  • Total test count: ___ tests passing

Target Metrics After Phase 1:

  • Test coverage improvement: +5% (from baseline)
  • Security score improvement: +15% (AppRole + credential fixes)
  • Performance: p95 latency ≤ baseline + 5% (no significant degradation)
  • All existing tests still passing + new security tests

Target Metrics After Phase 2:

  • Test coverage improvement: +10% (from baseline)
  • Alert coverage: 100% of critical services monitored
  • Backup verification: 100% automated
  • Migration framework: 100% of databases covered

Target Metrics After Phase 3:

  • Test coverage improvement: +15% (from baseline)
  • Code quality: ShellCheck passing 100% of scripts
  • API consistency: 100% parity across Python, Go, Node.js (partial for Rust)

Target Metrics After Phase 4:

  • Test coverage improvement: +18% (from baseline)
  • Performance regression: Automated in CI/CD
  • Documentation coverage: 100% (runbooks + ADRs)
  • Developer experience: Measured by setup time reduction

Measurement Tools:

  • Test coverage: pytest --cov + go test -cover + Jest coverage
  • Security score: ./scripts/audit-capabilities.sh + Trivy scan results
  • Performance: tests/performance-benchmark.sh
  • Test count: ./tests/run-all-tests.sh --count

Phase 0 - Preparation (Estimated: 2-3 hours)

Purpose: Establish baseline, safety nets, and documentation before making any changes.

Task 0.1: Establish Baseline and Safety Net

Priority: Critical 🔴 Status: Not Started Estimated Time: 3 hours Impact: Critical - Prevents data loss and provides rollback capability

Current Issue:

  • No documented baseline for comparison
  • No verified backup for disaster recovery
  • No clean starting point for changes

Implementation Steps:

  1. Subtask 0.1.1: Full environment backup

    • Backup Vault keys: cp -r ~/.config/vault ~/vault-backup-$(date +%Y%m%d)
    • Verify Vault backup contains: keys.json, root-token, ca/, certs/
    • Run database backup: ./devstack backup
    • Note backup location and timestamp
    • Export all docker volumes: docker volume ls -q | xargs -I {} docker run --rm -v {}:/data -v $(pwd)/backups:/backup alpine tar czf /backup/{}.tar.gz -C /data .
    • Calculate total backup size: du -sh ~/vault-backup-* backups/
    • Test: Verify backups exist and are non-zero size
    • Test: Restore Vault backup to temporary location and verify contents
  2. Subtask 0.1.2: Document current state (baseline)

    • Run ./devstack status and save output to baseline/status.txt
    • Run ./tests/run-all-tests.sh and save results to baseline/test-results.txt
    • Measure performance: tests/performance-benchmark.sh > baseline/performance.txt
    • Document service versions: docker compose images > baseline/versions.txt
    • Measure test coverage: docker exec dev-reference-api pytest tests/ --cov --cov-report=term > baseline/coverage.txt
    • Run security audit: ./scripts/audit-capabilities.sh > baseline/security.txt (create if doesn't exist)
    • Count total tests: ./tests/run-all-tests.sh 2>&1 | grep -E "tests? passed" | tee baseline/test-count.txt
    • Test: All baseline files created and contain data
    • Test: Baseline shows all services healthy and all tests passing
  3. Subtask 0.1.3: Create feature branch

    • Ensure working directory is clean: git status
    • Create branch: git checkout -b improvement-phases-1-4
    • Push branch: git push -u origin improvement-phases-1-4
    • Document branch strategy in commit message
    • Test: Branch created and pushed to remote
    • Test: Can switch back to main and return to feature branch
  4. Subtask 0.1.4: Verify environment health

    • All services healthy: ./devstack health (all green)
    • All tests passing: ./tests/run-all-tests.sh (0 failures)
    • No pending git changes: git status (clean working tree)
    • Sufficient disk space: df -h (>10GB free)
    • Vault unsealed and accessible: vault status
    • Test: Clean starting point confirmed
    • Test: Environment matches baseline
  5. Subtask 0.1.5: Set up task tracking

    • Update this document with start date
    • Create baseline/ directory for all baseline measurements
    • Set up daily progress update reminder (calendar/cron)
    • Review risk register and accept risks
    • Confirm resource requirements are met
    • Test: Tracking mechanisms in place
    • Test: All Phase 0 subtasks completed
  6. Subtask 0.1.6: Create rollback documentation

    • Document rollback procedure for Vault: docs/ROLLBACK_PROCEDURES.md
    • Document rollback procedure for databases
    • Document rollback procedure for docker-compose changes
    • Document rollback procedure for git changes
    • Test rollback from backup (dry run)
    • Test: Rollback procedures documented and tested
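
A consolidated sketch of the backup and baseline capture from Subtasks 0.1.1 and 0.1.2, assuming the ./devstack and tests/ entry points referenced above exist under those names:

#!/bin/bash
# Phase 0: back up Vault key material and capture a baseline for later comparison.
set -e

STAMP=$(date +%Y%m%d)
mkdir -p baseline backups

# Subtask 0.1.1: copy the Vault config directory and record backup sizes.
cp -r ~/.config/vault ~/vault-backup-"$STAMP"
du -sh ~/vault-backup-"$STAMP" backups/

# Subtask 0.1.2: snapshot current service and test state.
./devstack status             | tee baseline/status.txt
docker compose images         | tee baseline/versions.txt
./tests/run-all-tests.sh 2>&1 | tee baseline/test-results.txt
grep -E "tests? passed" baseline/test-results.txt | tee baseline/test-count.txt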

Phase 0 Completion Criteria

  • All baseline measurements documented
  • Full environment backup completed and verified
  • Feature branch created and pushed
  • Environment health verified (all services healthy, all tests passing)
  • Task tracking mechanisms in place
  • Rollback procedures documented and tested
  • Sufficient disk space available (>10GB free)
  • All tools installed (docker, git, python, vault, jq, shellcheck, gpg)
  • Risk register reviewed and accepted

Phase 0 Sign-off: _________________ Date: _________


Phase 1 - Critical Security (Estimated: 3-4 days)

Total Time: 29 hours (12 + 8 + 5 + 2 + 2; with 40% buffer: ~41 hours)

Task 1.1: Implement Vault AppRole Authentication

Priority: Critical 🔴 Status: Not Started Estimated Time: 12 hours (was 8h, +50% for complexity) Impact: High - Prevents compromised services from accessing all Vault secrets

Current Issue:

  • All services use root token passed via VAULT_TOKEN environment variable
  • Any compromised service has full Vault access
  • File: docker-compose.yml:74

Implementation Steps:

  1. Subtask 1.1.1: Create AppRole policies for each service

    • Create configs/vault/policies/postgres-policy.hcl
    • Create configs/vault/policies/mysql-policy.hcl
    • Create configs/vault/policies/mongodb-policy.hcl
    • Create configs/vault/policies/redis-policy.hcl
    • Create configs/vault/policies/rabbitmq-policy.hcl
    • Create configs/vault/policies/forgejo-policy.hcl
    • Create configs/vault/policies/pgbouncer-policy.hcl
    • Create configs/vault/policies/reference-api-policy.hcl
    • Test: Validate policy syntax with vault policy fmt <file>
    • Test: Each policy file is valid HCL
  2. Subtask 1.1.2: Update vault-bootstrap.sh to enable AppRole and create roles

    • Enable AppRole auth method
    • Upload all policies to Vault
    • Create AppRole for each service with appropriate policy
    • Generate role_id for each service (deterministic, not secret)
    • Generate secret_id for each service
    • Create directory: mkdir -p ~/.config/vault/approles/
    • Store role_id in ~/.config/vault/approles/<service>-role-id
    • Store secret_id in ~/.config/vault/approles/<service>-secret-id
    • Set permissions: chmod 600 ~/.config/vault/approles/*
    • Add note about mounting approles directory into containers
    • Test: Verify AppRoles created with vault list auth/approle/role
    • Test: Verify credential files exist and are readable
    • Test: Test login with AppRole: vault write auth/approle/login role_id=<id> secret_id=<secret>
  3. Subtask 1.1.3: Update service init scripts to use AppRole authentication

    • Create shared function: vault_approle_login() in each init script (see the sketch after this list)
    • Update configs/postgres/scripts/init.sh to use AppRole
    • Update configs/mysql/scripts/init.sh to use AppRole
    • Update configs/mongodb/scripts/init.sh to use AppRole
    • Update configs/redis/scripts/init.sh to use AppRole
    • Update configs/rabbitmq/scripts/init.sh to use AppRole
    • Update configs/forgejo/scripts/init.sh to use AppRole
    • Update configs/pgbouncer/scripts/init.sh to use AppRole
    • Update reference app configurations (environment-based)
    • Add fallback to root token if VAULT_USE_ROOT_TOKEN=true
    • Test: Verify each service can authenticate with AppRole
    • Test: Verify each service can retrieve its secrets
    • Test: Verify fallback works with VAULT_USE_ROOT_TOKEN=true
  4. Subtask 1.1.4: Update docker-compose.yml to pass role credentials

    • Mount approles directory: ~/.config/vault/approles:/vault-approles:ro
    • Keep VAULT_TOKEN for backward compatibility
    • Add VAULT_USE_APPROLE=true environment variable
    • Add VAULT_APPROLE_PATH=/vault-approles environment variable
    • Update service dependencies (no changes needed)
    • Update health checks (no changes needed)
    • Test: docker compose config validates successfully
    • Test: No syntax errors in docker-compose.yml
  5. Subtask 1.1.5: Update .env.example

    • Add # AppRole Authentication (recommended for production) section
    • Add VAULT_USE_APPROLE=true with explanation
    • Add VAULT_APPROLE_PATH=/vault-approles with explanation
    • Add # Root Token Authentication (development only) section
    • Add VAULT_USE_ROOT_TOKEN=false with warning
    • Document when to use each method
    • Test: New users can understand configuration
    • Test: .env.example has clear comments
  6. Subtask 1.1.6: Update management script for AppRole support

    • Update scripts/manage_devstack.py vault-bootstrap command
    • Update vault-show-password to work with AppRole or root token
    • Add vault-approle-status command to show AppRole status
    • Update documentation strings in management script
    • Add error handling for missing approle credentials
    • Test: Run ./devstack vault-show-password postgres
    • Test: Run ./devstack vault-approle-status
  7. Subtask 1.1.7: Create documentation for AppRole implementation

    • Update docs/VAULT.md with AppRole section (400+ lines)
    • Document AppRole authentication flow with diagram
    • Update docs/SECURITY_ASSESSMENT.md with AppRole security benefits
    • Add AppRole troubleshooting guide to docs/VAULT.md
    • Document root token fallback mechanism
    • Add migration guide from root token to AppRole
    • Test: Documentation review for accuracy
    • Test: Documentation completeness check
  8. Subtask 1.1.8: Integration testing

    • Stop all services: ./devstack stop
    • Clear any cached tokens
    • Set VAULT_USE_APPROLE=true in .env
    • Start services: ./devstack start
    • Test Vault bootstrap: ./devstack vault-bootstrap
    • Test service authentication to Vault (check logs)
    • Test secret retrieval from all services
    • Run test suite: ./tests/test-vault.sh
    • Run full test suite: ./tests/run-all-tests.sh
    • Test root token fallback: Set VAULT_USE_ROOT_TOKEN=true and restart
    • Test: All services healthy and functioning
    • Test: All tests passing (0 regressions)
    • Test: Vault shows AppRole logins in audit log
  9. Subtask 1.1.9: Rollback testing

    • Document rollback steps to root token
    • Test rollback: Set VAULT_USE_ROOT_TOKEN=true
    • Restart services with root token
    • Verify all services start successfully
    • Verify all tests still pass
    • Document time required for rollback (~5 minutes)
    • Re-enable AppRole after successful rollback test
    • Test: Rollback works successfully
    • Test: Can switch back to AppRole without issues
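
A minimal sketch of the shared vault_approle_login() helper from Subtask 1.1.3, assuming the /vault-approles mount and environment variables from Subtask 1.1.4; SERVICE_NAME is a placeholder for however each init script identifies itself:

# vault_approle_login: exchange a service's role_id/secret_id for a Vault token.
# Falls back to the root token already in the environment when VAULT_USE_ROOT_TOKEN=true.
vault_approle_login() {
  local service="${SERVICE_NAME:?SERVICE_NAME must be set}"
  local approle_path="${VAULT_APPROLE_PATH:-/vault-approles}"

  if [ "${VAULT_USE_ROOT_TOKEN:-false}" = "true" ]; then
    echo "WARNING: using root token authentication (development only)" >&2
    return 0
  fi

  local role_id secret_id
  role_id=$(cat "${approle_path}/${service}-role-id")
  secret_id=$(cat "${approle_path}/${service}-secret-id")

  # 'vault write -field=token' prints only the client token on success.
  VAULT_TOKEN=$(vault write -field=token auth/approle/login \
    role_id="$role_id" secret_id="$secret_id")
  export VAULT_TOKEN
}

Each init script would call this before reading its secrets, which keeps the R1 mitigation (root token fallback via VAULT_USE_ROOT_TOKEN=true) intact.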

Rollback Plan:

# In .env file:
VAULT_USE_ROOT_TOKEN=true
VAULT_USE_APPROLE=false

# Restart services:
./devstack restart

Post-Task Validation:

  • Smoke test: ./devstack health (all services healthy)
  • No regressions: ./tests/run-all-tests.sh (all tests pass)
  • Performance check: No significant latency increase (<5%)
  • Security audit: Verify services use AppRole (check Vault audit log)

Task 1.2: Remove Credentials from Environment Variables

Priority: Critical 🔴 Status: Not Started Estimated Time: 8 hours (was 6h, +33% for testing) Impact: High - Prevents credential exposure via docker inspect and process listing Depends On: Task 1.1 (AppRole authentication must be working)

Current Issue:

  • Credentials exported to environment variables (visible in docker inspect)
  • Files: All configs/*/scripts/init.sh files
  • Risk: Process listing, container inspection exposes passwords

Implementation Steps:

  1. Subtask 1.2.1: Create secure credential file handling function

    • Create shared function library: configs/shared/vault-helpers.sh
    • Add create_secure_creds_file() function (see the sketch after this list)
    • Add cleanup_creds_file() function
    • Implement proper file permissions (chmod 600)
    • Implement cleanup trap on script exit
    • Add error handling for file operations
    • Test: Verify file permissions set correctly (600)
    • Test: Verify cleanup happens on script exit (normal and error)
  2. Subtask 1.2.2: Update PostgreSQL init script

    • Source shared vault-helpers.sh
    • Modify configs/postgres/scripts/init.sh
    • Write credentials to temporary PGPASSFILE instead of export
    • Pass credentials via PGPASSFILE environment (points to file, not password)
    • Remove export statements for POSTGRES_PASSWORD
    • Keep POSTGRES_USER in environment (not sensitive)
    • Add cleanup trap for temp files
    • Test: PostgreSQL starts successfully
    • Test: Credentials not in docker inspect dev-postgres
    • Test: ps aux doesn't show password during startup
  3. Subtask 1.2.3: Update MySQL init script

    • Source shared vault-helpers.sh
    • Modify configs/mysql/scripts/init.sh
    • Create temporary .my.cnf file for credentials
    • Use --defaults-file=/path/to/.my.cnf instead of -p
    • Remove export statements for MYSQL_ROOT_PASSWORD
    • Add cleanup trap for .my.cnf
    • Test: MySQL starts successfully
    • Test: Credentials not in environment
  4. Subtask 1.2.4: Update MongoDB init script

    • Source shared vault-helpers.sh
    • Modify configs/mongodb/scripts/init.sh
    • Write credentials to MongoDB config file
    • Use --config /path/to/mongod.conf
    • Remove export statements for MONGO_INITDB_ROOT_PASSWORD
    • Add cleanup trap
    • Test: MongoDB starts successfully
    • Test: Credentials not in environment
  5. Subtask 1.2.5: Update Redis init scripts (all 3 nodes)

    • Source shared vault-helpers.sh
    • Modify configs/redis/scripts/init.sh
    • Write credentials to Redis ACL file
    • Use --aclfile /path/to/acl.conf
    • Remove export statements for REDIS_PASSWORD
    • Apply to all 3 nodes (redis-1, redis-2, redis-3)
    • Add cleanup trap
    • Test: Redis cluster starts successfully
    • Test: Credentials not in environment for all 3 nodes
  6. Subtask 1.2.6: Update RabbitMQ init script

    • Source shared vault-helpers.sh
    • Modify configs/rabbitmq/scripts/init.sh
    • Write credentials to RabbitMQ config file
    • Use --config /path/to/rabbitmq.conf
    • Remove export statements for RABBITMQ_DEFAULT_PASS
    • Add cleanup trap
    • Test: RabbitMQ starts successfully
    • Test: Credentials not in environment
  7. Subtask 1.2.7: Update Forgejo init script

    • Source shared vault-helpers.sh
    • Modify configs/forgejo/scripts/init.sh
    • Write credentials to secure temp file
    • Pass via file descriptor or config file
    • Remove export statements for sensitive data
    • Add cleanup trap
    • Test: Forgejo starts successfully
    • Test: Credentials not in environment
  8. Subtask 1.2.8: Update PgBouncer init script

    • Source shared vault-helpers.sh
    • Modify configs/pgbouncer/scripts/init.sh
    • Write credentials to userlist.txt (PgBouncer format)
    • Set file permissions to 600
    • Remove export statements for sensitive data
    • Add cleanup trap
    • Test: PgBouncer starts successfully
    • Test: Credentials not in environment
  9. Subtask 1.2.9: Audit reference applications for credential exposure

    • Review Python FastAPI application logging (no credential logging)
    • Review Go application environment handling
    • Review Node.js application credential management
    • Review Rust application secret handling
    • Add warning comments in code: "# WARNING: Do not log this value"
    • Add credential redaction to logging middleware
    • Test: No credentials in application logs
    • Test: Search logs for common password patterns: grep -ri "password.*:" logs/
  10. Subtask 1.2.10: Integration testing

    • Stop all services
    • Start services: ./devstack start
    • Verify docker inspect dev-postgres | grep -i password returns nothing
    • Verify docker inspect dev-mysql | grep -i password returns nothing
    • Check all services: postgres, mysql, mongodb, redis-1/2/3, rabbitmq, forgejo, pgbouncer
    • Monitor ps aux during startup for password exposure
    • Run all service tests: ./tests/run-all-tests.sh
    • Check application logs for credential leakage
    • Test: All tests passing, no credential leakage
    • Test: All services healthy and functioning
  11. Subtask 1.2.11: Create validation script

    • Create scripts/validate-no-credential-exposure.sh
    • Script checks docker inspect for all services
    • Script checks process list for password patterns
    • Script checks logs for credential patterns
    • Exit with error if any credentials found
    • Add to test suite: ./tests/test-security.sh
    • Test: Validation script passes
    • Test: Validation script detects intentionally exposed test credential
  12. Subtask 1.2.12: Rollback testing

    • Document rollback steps
    • Test reverting one service to old method
    • Verify service still works
    • Revert back to secure method
    • Document time required for rollback
    • Test: Rollback works successfully
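
A minimal sketch of the configs/shared/vault-helpers.sh functions from Subtask 1.2.1; the /tmp template path is an assumption:

# configs/shared/vault-helpers.sh (sketch)
# Write secrets to mode-600 temp files and remove them on exit, so credentials
# never appear in `docker inspect` output or in `ps aux`.

CREDS_FILES=()

create_secure_creds_file() {
  local content="$1"
  local file
  file=$(mktemp /tmp/creds.XXXXXX) || return 1
  chmod 600 "$file"                  # restrict permissions before writing the secret
  printf '%s\n' "$content" > "$file"
  CREDS_FILES+=("$file")
  printf '%s\n' "$file"              # caller captures the path, never the secret itself
}

cleanup_creds_file() {
  local f
  for f in "${CREDS_FILES[@]}"; do   # remove every temp file, even after an error
    rm -f "$f"
  done
}

trap cleanup_creds_file EXIT INT TERM

An init script would then capture only the path, for example PGPASSFILE=$(create_secure_creds_file "localhost:5432:*:postgres:$PASSWORD"), rather than exporting the password itself (Subtask 1.2.2).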

Validation Script:

#!/bin/bash
# scripts/validate-no-credential-exposure.sh
set -e

echo "Checking for credential exposure..."

# Check docker inspect for all services
for service in postgres mysql mongodb redis-1 redis-2 redis-3 rabbitmq forgejo pgbouncer; do
  if docker inspect "dev-$service" 2>/dev/null | grep -iE "(password|secret|token)" | grep -v "VAULT_TOKEN"; then
    echo "FAIL: Credentials found in dev-$service environment"
    exit 1
  fi
done

# Check the process list.
# Note: the "-p" pattern is a heuristic aimed at MySQL-style "-pPASSWORD" arguments;
# requiring leading whitespace avoids false positives on hyphenated command names.
if ps aux | grep -iE "password=|passwd=|[[:space:]]-p[^[:space:]]+" | grep -v grep; then
  echo "FAIL: Credentials found in process list"
  exit 1
fi

echo "PASS: No credential exposure detected"

Post-Task Validation:

  • Smoke test: ./devstack health
  • No regressions: ./tests/run-all-tests.sh
  • Security validation: ./scripts/validate-no-credential-exposure.sh
  • Performance check: No degradation

Task 1.3: Add Automated Vault Backup with Encryption

Priority: Critical 🔴 Status: Not Started Estimated Time: 5 hours (was 2h, +150% for security and testing) Impact: Critical - Prevents irrecoverable data loss

Current Issue:

  • Vault keys stored at ~/.config/vault/ with no automated backup
  • If directory deleted, all data is irrecoverable
  • No backup verification
  • No backup retention policy

Implementation Steps:

  1. Subtask 1.3.1: Create encrypted backup script

    • Create scripts/vault-backup.sh
    • Implement GPG encryption with AES256 (see the sketch after this list)
    • Support passphrase from environment variable: VAULT_BACKUP_PASSPHRASE
    • Support passphrase prompt if env var not set
    • Create timestamped backup directory: ~/devstack-backups/vault/YYYYMMDD_HHMMSS/
    • Backup keys.json → keys.json.gpg
    • Backup root-token → root-token.gpg
    • Backup entire ca/ directory → ca.tar.gz.gpg
    • Backup entire certs/ directory → certs.tar.gz.gpg
    • Create manifest file with checksums
    • Log backup to ~/devstack-backups/vault/backup.log
    • Test: Run script and verify encrypted files created
    • Test: Verify unencrypted originals not in backup directory
  2. Subtask 1.3.2: Create backup restoration script

    • Create scripts/vault-restore.sh
    • Implement GPG decryption
    • Support passphrase from environment or prompt
    • Create backup of existing files before restore: ~/.config/vault.bak-$(date +%s)
    • Verify checksums from manifest after decryption
    • Restore all files to ~/.config/vault/
    • Set correct file permissions (600 for keys, 644 for certs)
    • Log restoration to ~/devstack-backups/vault/restore.log
    • Test: Restore from backup and verify integrity
    • Test: Verify checksums match original
  3. Subtask 1.3.3: Implement backup rotation policy

    • Create scripts/vault-backup-rotate.sh
    • Keep last 7 daily backups
    • Keep last 4 weekly backups (Sunday)
    • Keep last 12 monthly backups (1st of month)
    • Automatic cleanup of old backups based on policy
    • Log rotation activity
    • Calculate and log disk space savings
    • Test: Create multiple backups and verify rotation works
    • Test: Verify correct backups kept (7 daily, 4 weekly, 12 monthly)
  4. Subtask 1.3.4: Add backup commands to management script

    • Add vault-backup command to manage_devstack.py
    • Add vault-restore command with timestamp parameter
    • Add vault-list-backups command (shows available backups)
    • Add vault-verify-backup command (checksum verification)
    • Add vault-rotate-backups command
    • Add --passphrase option for non-interactive mode
    • Add --no-encrypt option for testing (not recommended)
    • Test: Run ./devstack vault-backup
    • Test: Run ./devstack vault-list-backups
    • Test: Run ./devstack vault-verify-backup <timestamp>
  5. Subtask 1.3.5: Create automated backup schedule documentation

    • Document cron job setup in docs/DISASTER_RECOVERY.md
    • Add cron example: 0 2 * * * /path/to/vault-backup.sh
    • Add backup best practices section
    • Document passphrase management (use password manager)
    • Document off-site backup procedures (external drive, not GitHub)
    • SECURITY: Document why GitHub Actions artifacts are insecure
    • Document S3/external storage options (future improvement)
    • Create backup verification checklist
    • Test: Documentation review for security
    • Test: Documentation completeness
  6. Subtask 1.3.6: Create manual backup reminder (skip GitHub Actions)

    • Add backup reminder to weekly routine (Monday morning)
    • Document manual backup to external drive procedure
    • Create scripts/backup-to-external.sh template
    • Document encryption-at-rest requirements for external storage
    • Skip GitHub Actions workflow (insecure for Vault keys)
    • Note: Future improvement could use AWS S3 with SSE-KMS
    • Test: Manual backup procedure tested
    • Test: External drive backup tested (if available)
  7. Subtask 1.3.7: Integration testing

    • Create test backup: ./devstack vault-backup
    • Verify encryption: Cannot read .gpg files without passphrase
    • List backups: ./devstack vault-list-backups
    • Verify backup: ./devstack vault-verify-backup <timestamp>
    • Restore to temporary location for testing
    • Verify restored files match originals (checksum)
    • Test rotation policy with multiple backups
    • Test passphrase prompt (interactive mode)
    • Test passphrase from environment (non-interactive)
    • Test: Backup and restore cycle successful
    • Test: Rotation policy works correctly
  8. Subtask 1.3.8: Rollback testing

    • Document rollback steps
    • Simulate corrupted Vault keys
    • Restore from backup
    • Verify Vault functionality restored
    • Document time required for recovery (~10 minutes)
    • Test: Disaster recovery successful
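
A minimal sketch of the encryption core of scripts/vault-backup.sh from Subtask 1.3.1, assuming gnupg from the tools list; the directory layout and file names follow the subtasks above:

#!/bin/bash
# scripts/vault-backup.sh (sketch): encrypt Vault key material into a timestamped directory.
set -e

SRC="$HOME/.config/vault"
DEST="$HOME/devstack-backups/vault/$(date +%Y%m%d_%H%M%S)"
mkdir -p "$DEST"

# Use the passphrase from the environment, or prompt for it interactively.
if [ -z "${VAULT_BACKUP_PASSPHRASE:-}" ]; then
  read -r -s -p "Backup passphrase: " VAULT_BACKUP_PASSPHRASE; echo
fi

encrypt() {  # encrypt <input file, or "-" for stdin> <output .gpg file>
  gpg --batch --yes --symmetric --cipher-algo AES256 --pinentry-mode loopback \
      --passphrase "$VAULT_BACKUP_PASSPHRASE" -o "$2" "$1"
}

encrypt "$SRC/keys.json"  "$DEST/keys.json.gpg"
encrypt "$SRC/root-token" "$DEST/root-token.gpg"
tar czf - -C "$SRC" ca    | encrypt - "$DEST/ca.tar.gz.gpg"
tar czf - -C "$SRC" certs | encrypt - "$DEST/certs.tar.gz.gpg"

# Manifest with checksums for later verification (Subtask 1.3.2), plus a log entry.
( cd "$DEST" && shasum -a 256 ./*.gpg > MANIFEST.sha256 )
echo "$(date '+%Y-%m-%d %H:%M:%S') wrote $DEST" >> "$HOME/devstack-backups/vault/backup.log"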

Backup Location: ~/devstack-backups/vault/YYYYMMDD_HHMMSS/

Security Note:

  • DO NOT upload Vault backups to GitHub Actions artifacts (insecure)
  • DO use external encrypted drive or S3 with SSE-KMS
  • DO store passphrase in secure password manager

Post-Task Validation:

  • Smoke test: Create and restore backup successfully
  • Security check: Verify encryption works (cannot read without passphrase)
  • Rotation test: Verify old backups are cleaned up
  • Documentation complete: DISASTER_RECOVERY.md updated

Task 1.4: Fix MySQL Password Exposure in Backup Command

Priority: Critical 🔴 Status: Not Started Estimated Time: 2 hours (was 1h, +100% for all databases) Impact: High - Prevents password visibility in process listing

Current Issue:

  • Password passed via command-line argument in mysqldump
  • File: scripts/manage_devstack.py:953
  • Visible in ps aux during backup operation
  • Same issue exists for PostgreSQL and MongoDB

Implementation Steps:

  1. Subtask 1.4.1: Update MySQL backup function to use config file

    • Modify backup() function in manage_devstack.py (around line 950)
    • Create temporary .my.cnf file with credentials (see the sketch after this list)
    • Use --defaults-file=/path/to/.my.cnf instead of -p flag
    • Ensure temp file cleanup with try/finally block
    • Set file permissions to 600 before writing password
    • Remove .my.cnf in finally block
    • Test: Run ./devstack backup
    • Test: Monitor ps aux during backup (no password visible)
  2. Subtask 1.4.2: Apply same fix to MySQL restore function

    • Modify restore() function in manage_devstack.py
    • Use config file method for mysql restore
    • Ensure cleanup with try/finally
    • Test: Run ./devstack restore <timestamp>
    • Test: Verify password not in ps aux
  3. Subtask 1.4.3: Update PostgreSQL backup for consistency

    • Modify PostgreSQL backup in manage_devstack.py
    • Create temporary .pgpass file instead of PGPASSWORD env var
    • Format: localhost:5432:*:postgres:password
    • Set permissions to 600
    • Use PGPASSFILE environment variable (points to file)
    • Remove .pgpass in finally block
    • Test: PostgreSQL backup with no password exposure
    • Test: Verify .pgpass file created with correct permissions
  4. Subtask 1.4.4: Update MongoDB backup for consistency

    • Modify MongoDB backup in manage_devstack.py
    • Create temporary MongoDB config file
    • Use --config /path/to/mongod.conf flag
    • Set permissions to 600
    • Remove config file in finally block
    • Test: MongoDB backup with no password exposure
  5. Subtask 1.4.5: Create process monitoring test

    • Create tests/test-backup-security.sh
    • Start backup in background
    • Monitor ps aux every 0.1s during backup
    • Grep for password patterns
    • Fail if any passwords found
    • Test all three databases (postgres, mysql, mongodb)
    • Test: Test script passes for all databases
  6. Subtask 1.4.6: Integration testing

    • Run full backup cycle: ./devstack backup
    • Monitor ps aux in separate terminal during backup
    • Verify no credentials visible
    • Verify backup files created successfully for all databases
    • Run restore: ./devstack restore <timestamp>
    • Monitor ps aux during restore
    • Verify restore successful (can connect to databases)
    • Test: Backup/restore works, no credential exposure
  7. Subtask 1.4.7: Documentation

    • Update docs/DISASTER_RECOVERY.md with security improvements
    • Document temporary file approach
    • Add troubleshooting for permission issues
    • Test: Documentation review
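
A minimal sketch of the --defaults-file technique from Subtask 1.4.1, shown as shell for readability; the actual change lives in the Python backup() function, and MYSQL_ROOT_PASSWORD and the dump path are placeholders:

# Run mysqldump without putting the password on the command line.
MY_CNF=$(mktemp /tmp/my-cnf.XXXXXX)
chmod 600 "$MY_CNF"                   # restrict permissions before writing the secret
cat > "$MY_CNF" <<EOF
[client]
user=root
password=${MYSQL_ROOT_PASSWORD}
EOF
trap 'rm -f "$MY_CNF"' EXIT           # shell equivalent of the try/finally cleanup

# --defaults-file must be the first option on the command line.
mysqldump --defaults-file="$MY_CNF" --all-databases > mysql-dump.sql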

Validation:

# In a separate terminal during backup
watch -n 0.1 'ps aux | grep -iE "password|passwd|[[:space:]]-p[^[:space:]]+|MYSQL|POSTGRES|MONGO" | grep -v grep'
# Should not show any credential values (matches on process names such as mysqld,
# postgres, or mongod are expected; look for plaintext passwords in the arguments)

Post-Task Validation:

  • Security test: Run tests/test-backup-security.sh (passes)
  • Functional test: Backup and restore work correctly
  • No regressions: Backup files same size as before

Task 1.5: Document Container Privileged Capabilities

Priority: Critical 🔴 Status: Not Started Estimated Time: 2 hours (was 1h, +100% for comprehensive docs) Impact: Medium - Security transparency and awareness

Current Issue:

  • Vault uses IPC_LOCK capability without explanation
  • cAdvisor uses SYS_ADMIN and SYS_PTRACE without justification
  • Files: docker-compose.yml:752-753, docker-compose.yml:1492-1494
  • No audit trail of privileged capabilities

Implementation Steps:

  1. Subtask 1.5.1: Add inline documentation to docker-compose.yml

    • Find Vault service definition (around line 752)
    • Document Vault IPC_LOCK capability:
      cap_add:
        # IPC_LOCK: Required for Vault's mlock() to prevent secrets from being swapped to disk
        # Security trade-off: Acceptable in dev, use encrypted swap in production
        # Alternative: Run Vault with 'disable_mlock = true' (not recommended)
        - IPC_LOCK
    • Find cAdvisor service definition (around line 1492)
    • Document cAdvisor SYS_ADMIN capability:
      cap_add:
        # SYS_ADMIN: Required for cAdvisor to access container metrics via cgroups
        # Security trade-off: SYS_ADMIN is nearly equivalent to --privileged
        # Production alternative: Use Prometheus node-exporter with restricted permissions
        - SYS_ADMIN
        # SYS_PTRACE: Required for cAdvisor to inspect process information
        # Security trade-off: Allows debugging and inspection of other containers
        - SYS_PTRACE
    • Add security notes section at top of docker-compose.yml
    • Test: Review comments for clarity and accuracy
    • Test: docker compose config still validates
  2. Subtask 1.5.2: Update SECURITY_ASSESSMENT.md

    • Add "Privileged Container Capabilities" section
    • Document IPC_LOCK: what it is, why needed, risks, mitigations
    • Document SYS_ADMIN: what it is, why needed, risks, mitigations
    • Document SYS_PTRACE: what it is, why needed, risks, mitigations
    • Add capability risk matrix (Low/Medium/High for each)
    • Document production alternatives for each capability
    • Add "Acceptable Use" policy for dev vs. production
    • Add reference links to Linux capability documentation
    • Test: Documentation review for technical accuracy
    • Test: Security section is comprehensive
  3. Subtask 1.5.3: Create capability audit script

    • Create scripts/audit-capabilities.sh (see the sketch after this list)
    • List all running containers
    • For each container, check for privileged mode
    • For each container, list added capabilities (cap_add)
    • For each container, list dropped capabilities (cap_drop)
    • Generate security report with risk assessment
    • Output format: table with container name, capabilities, risk level
    • Exit code 0 if only known capabilities, 1 if unexpected
    • Add to test suite
    • Test: Run script and verify output
    • Test: Script detects if new capability added
  4. Subtask 1.5.4: Update container security best practices

    • Update docs/BEST_PRACTICES.md with "Container Security" section
    • Document principle of least privilege
    • Document when capabilities are acceptable (dev vs. prod)
    • Document capability alternatives (e.g., node-exporter vs. cAdvisor)
    • Add capability approval process (document why before adding)
    • Add links to Docker security documentation
    • Add capability quick reference (what each does)
    • Test: Documentation review
    • Test: Best practices are actionable
  5. Subtask 1.5.5: Add capability check to CI/CD

    • Update .github/workflows/security.yml (if exists)
    • Add capability audit step
    • Fail CI if unexpected capabilities detected
    • Whitelist known capabilities: IPC_LOCK, SYS_ADMIN, SYS_PTRACE
    • Require PR comment explaining any new capabilities
    • Test: CI workflow validates (dry run)
    • Test: Whitelist works correctly
  6. Subtask 1.5.6: Integration testing

    • Run audit script: ./scripts/audit-capabilities.sh
    • Verify report shows Vault (IPC_LOCK) and cAdvisor (SYS_ADMIN, SYS_PTRACE)
    • Verify no unexpected capabilities
    • Verify risk levels are reasonable
    • Review SECURITY_ASSESSMENT.md for completeness
    • Test: Audit script passes
    • Test: No security regressions
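
A minimal sketch of scripts/audit-capabilities.sh from Subtask 1.5.3; the whitelist matches the capabilities documented above:

#!/bin/bash
# Report privileged mode and added capabilities for every running container.
# Exits non-zero if anything outside the documented whitelist is found.
set -e

WHITELIST="IPC_LOCK SYS_ADMIN SYS_PTRACE"
status=0

printf '%-25s %-12s %s\n' "CONTAINER" "PRIVILEGED" "CAP_ADD"
for name in $(docker ps --format '{{.Names}}'); do
  privileged=$(docker inspect --format '{{.HostConfig.Privileged}}' "$name")
  caps=$(docker inspect --format '{{range .HostConfig.CapAdd}}{{.}} {{end}}' "$name")
  printf '%-25s %-12s %s\n' "$name" "$privileged" "${caps:-<none>}"

  if [ "$privileged" = "true" ]; then
    echo "UNEXPECTED: $name runs in privileged mode" >&2
    status=1
  fi
  for cap in $caps; do
    case " $WHITELIST " in
      *" ${cap#CAP_} "*) ;;                        # known, documented capability
      *) echo "UNEXPECTED: $name adds $cap" >&2; status=1 ;;
    esac
  done
done

exit "$status"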

Documentation Example:

# docker-compose.yml - Security Notes
# This file uses minimal privileged capabilities for development.
# For production deployment:
#   - Consider alternatives that don't require elevated privileges
#   - Use encrypted swap instead of IPC_LOCK for Vault
#   - Use Prometheus node-exporter instead of cAdvisor
#   - Document and approve all capabilities in security review

vault:
  cap_add:
    # IPC_LOCK: Prevents memory paging to disk (keeps secrets in RAM)
    # Why needed: Vault mlock() calls require this capability
    # Risk: Low - only affects Vault process memory
    # Production: Use encrypted swap and disable_mlock = true
    - IPC_LOCK

Post-Task Validation:

  • Audit script passes: ./scripts/audit-capabilities.sh
  • Documentation complete: All 3 capabilities documented
  • No new capabilities added unknowingly
  • CI check added (or documented for future)

Phase 1 Completion Criteria

  • All 5 tasks completed (1.1 through 1.5)
  • All subtasks tested and passing
  • Full environment starts successfully: ./devstack start
  • Full test suite passes: ./tests/run-all-tests.sh (0 regressions)
  • Smoke test passes: ./devstack health (all services healthy)
  • No credential exposure in docker inspect or ps aux
  • Vault backups automated with encryption and rotation
  • AppRole authentication working for all services (with root token fallback)
  • Security audit script passes: ./scripts/audit-capabilities.sh
  • Documentation updated and accurate (4+ docs updated)
  • No breaking changes to existing functionality
  • Performance check: No >5% degradation from baseline
  • Rollback tested for all critical changes

Post-Phase Integration Test:

  1. Run ./devstack stop && ./devstack start
  2. Verify all services healthy
  3. Run ./tests/run-all-tests.sh (all tests pass)
  4. Compare test results to Phase 0 baseline (no regressions)
  5. Run performance benchmark and compare to baseline
  6. Run security validation: ./scripts/validate-no-credential-exposure.sh
  7. Verify backup works: ./devstack vault-backup && ./devstack vault-verify-backup <latest>
  8. Git checkpoint: git commit -m "Phase 1 complete: Critical Security improvements"
  9. Git tag: git tag phase-1-complete && git push --tags

Phase 1 Sign-off: _________________ Date: _________


Phase 2 - Operational Excellence (Estimated: 3-4 days)

Total Time: 27 hours (with 40% buffer: ~38 hours)

Note: Tasks reordered from original plan. Task 2.2 (Alertmanager) must complete before Task 2.1 (Alert Rules).

Task 2.2: Add Alertmanager Service

Priority: High 🟠 Status: Not Started Estimated Time: 7 hours (was 5h, +40% for integration) Impact: High - Notification delivery system

Moved to first position - Alert rules (Task 2.1) depend on Alertmanager being functional

Implementation Steps:

  1. Subtask 2.2.1: Add Alertmanager service to docker-compose.yml

    • Define alertmanager service (use prom/alertmanager:latest)
    • Configure static IP: 172.20.0.115
    • Verify IP not in use: docker network inspect dev-services | grep 172.20.0.115
    • Set resource limits (CPU: 0.5, Memory: 256M)
    • Set resource reservations (CPU: 0.1, Memory: 64M)
    • Configure volumes: ./configs/alertmanager:/etc/alertmanager
    • Configure ports: 9093:9093
    • Add health check: http://localhost:9093/-/healthy
    • Add to full profile (not minimal/standard)
    • Add logging configuration
    • Test: docker compose config validates
    • Test: No port conflicts
  2. Subtask 2.2.2: Update .env.example

    • Add # Alertmanager Configuration section
    • Add ALERTMANAGER_IP=172.20.0.115
    • Add ALERTMANAGER_PORT=9093
    • Add email configuration example (commented out)
    • Add Slack webhook example (commented out)
    • Document how to enable notifications
    • Test: Configuration examples are clear
  3. Subtask 2.2.3: Create Alertmanager configuration

    • Create configs/alertmanager/ directory
    • Create configs/alertmanager/alertmanager.yml
    • Configure global settings (resolve_timeout: 5m)
    • Configure route tree (group by alertname, severity)
    • Configure webhook receiver (for testing)
    • Add email receiver template (commented out, requires SMTP)
    • Add Slack receiver template (commented out, requires webhook)
    • Set up routing rules (critical → email, warning → slack)
    • Configure grouping (wait: 30s, interval: 5m, repeat: 12h)
    • Test: Validate config with docker run --rm -v $(pwd)/configs/alertmanager:/etc/alertmanager prom/alertmanager:latest amtool check-config /etc/alertmanager/alertmanager.yml
  4. Subtask 2.2.4: Create Alertmanager templates

    • Create configs/alertmanager/templates/ directory
    • Create configs/alertmanager/templates/email.tmpl
    • Create email subject template with severity and alertname
    • Create email body template with labels, annotations, timestamp
    • Create configs/alertmanager/templates/slack.tmpl
    • Create Slack message template with colored severity indicator
    • Add template examples to alertmanager.yml
    • Test: Templates are valid (no syntax errors)
  5. Subtask 2.2.5: Configure Prometheus to use Alertmanager

    • Update configs/prometheus/prometheus.yml
    • Add alerting section: alertmanagers: [{static_configs: [{targets: ['alertmanager:9093']}]}]
    • Configure alert relabeling if needed
    • Set evaluation interval (same as scrape: 15s)
    • Restart Prometheus service
    • Test: Prometheus connects to Alertmanager
    • Test: Check Prometheus targets page shows Alertmanager
  6. Subtask 2.2.6: Add management commands

    • Add alertmanager-status to manage_devstack.py
    • Add alertmanager-silence command (create silence for alert)
    • Add alertmanager-test command (send test alert; see the sketch after this list)
    • Add alertmanager-alerts command (list active alerts)
    • Update help text for new commands
    • Test: Run ./devstack alertmanager-status
    • Test: Run ./devstack alertmanager-test
  7. Subtask 2.2.7: Integration testing

    • Start Alertmanager service: docker compose up -d alertmanager
    • Check service health: curl http://localhost:9093/-/healthy
    • Trigger test alert: ./devstack alertmanager-test
    • Verify webhook receives notification (check webhook logs)
    • Test alert silencing: ./devstack alertmanager-silence test-alert
    • Verify silence created in Alertmanager UI
    • Test alert grouping (send multiple alerts)
    • Test: End-to-end alerting works
    • Test: All management commands work
  8. Subtask 2.2.8: Documentation

    • Update docs/OBSERVABILITY.md with Alertmanager section
    • Document Alertmanager configuration
    • Document how to set up email notifications
    • Document how to set up Slack notifications
    • Create alert management guide (silence, inhibit, route)
    • Add troubleshooting section
    • Test: Documentation review
  9. Subtask 2.2.9: Rollback testing

    • Document rollback steps (remove from docker-compose.yml)
    • Test stopping Alertmanager
    • Verify Prometheus still works without Alertmanager
    • Restart Alertmanager
    • Test: Rollback works, no data loss
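
A minimal sketch of what the ./devstack alertmanager-test command from Subtask 2.2.6 could do, using Alertmanager's v2 HTTP API; the alert name and labels are placeholders:

#!/bin/bash
# Send a synthetic alert to Alertmanager and list the active alerts.
AM_URL="http://localhost:9093"

# POST one test alert; with no endsAt, Alertmanager expires it after resolve_timeout.
curl -fsS -XPOST "$AM_URL/api/v2/alerts" \
  -H 'Content-Type: application/json' \
  -d '[{
        "labels":      { "alertname": "DevStackTestAlert", "severity": "info" },
        "annotations": { "description": "Test alert sent by ./devstack alertmanager-test" },
        "startsAt":    "'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'"
      }]'

# Confirm the alert shows up as active.
curl -fsS "$AM_URL/api/v2/alerts" | jq '.[].labels.alertname'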

Post-Task Validation:

  • Smoke test: ./devstack health (alertmanager healthy)
  • Functional test: Can send and receive test alert
  • UI test: Alertmanager UI accessible at http://localhost:9093

Task 2.1: Implement Prometheus Alert Rules

Priority: High 🟠 Status: Not Started Estimated Time: 6 hours (was 4h, +50% for testing) Impact: High - Proactive issue detection Depends On: Task 2.2 (Alertmanager must be running)

Moved to second position - requires Alertmanager for full testing

Implementation Steps:

  1. Subtask 2.1.1: Create alert rule directory structure

    • Create configs/prometheus/alerts/ directory
    • Create critical.yml for critical alerts
    • Create warning.yml for warning alerts
    • Create info.yml for informational alerts
    • Add README explaining alert severity levels
    • Test: Directory structure exists
  2. Subtask 2.1.2: Implement critical alerts

    • ServiceDown alert (up == 0 for 2+ minutes; see the example rule after this list)
    • HighMemoryUsage alert (>90% for 5+ minutes)
    • DiskSpaceCritical alert (<10% free on any mount)
    • DatabaseConnectionPoolExhaustion alert (pg_stat_activity count > max_connections * 0.9)
    • CertificateExpiration alert (x509_cert_expiry < 30 days)
    • VaultSealed alert (vault_core_unsealed == 0)
    • Add annotations with description and runbook_url
    • Add labels: severity=critical, team=infra
    • Test: Validate alert syntax with promtool check rules configs/prometheus/alerts/critical.yml
  3. Subtask 2.1.3: Implement warning alerts

    • HighCPUUsage alert (>80% for 10+ minutes)
    • HighMemoryUsage alert (>80% for 10+ minutes)
    • SlowResponseTime alert (p95 > 1s for 5+ minutes)
    • HighErrorRate alert (>1% errors for 5+ minutes)
    • RedisClusterNodeDown alert (redis_up{instance=~"redis-[123]"} == 0)
    • DiskSpaceWarning alert (<20% free)
    • Add annotations and labels
    • Test: Validate alert syntax
  4. Subtask 2.1.4: Implement informational alerts

    • ServiceRestarted alert (time() - process_start_time_seconds < 300)
    • BackupCompleted alert (custom metric from backup script)
    • ConfigurationChanged alert (config file checksum changed)
    • Add annotations and labels: severity=info
    • Test: Validate alert syntax
  5. Subtask 2.1.5: Update prometheus.yml to load alert rules

    • Add rule_files section to configs/prometheus/prometheus.yml
    • Add: rule_files: ['/etc/prometheus/alerts/*.yml']
    • Configure alert evaluation interval (default: 15s)
    • Reload Prometheus: docker compose exec prometheus kill -HUP 1
    • Or restart: docker compose restart prometheus
    • Test: Check Prometheus UI for loaded alert rules (http://localhost:9090/rules)
    • Test: Check for configuration errors in Prometheus logs
  6. Subtask 2.1.6: Create alert testing script

    • Create tests/test-alerts.sh
    • Test ServiceDown: Stop a service, wait, check alert fires
    • Test HighMemoryUsage: Trigger via stress container (if safe)
    • Test alert fires in Prometheus UI
    • Test alert appears in Alertmanager
    • Test alert resolves after condition clears
    • Add to main test suite
    • Test: All alert rules can be triggered
    • Test: All alerts resolve correctly
  7. Subtask 2.1.7: Documentation

    • Update docs/OBSERVABILITY.md with alert rules documentation
    • Document each alert: what it detects, why it matters, how to respond
    • Document alert thresholds and rationale (why 90% not 95%?)
    • Create alert response runbook: docs/runbooks/alert-response.md
    • Add links to runbooks in alert annotations
    • Document how to add new alert rules
    • Test: Documentation review for completeness
  8. Subtask 2.1.8: Rollback testing

    • Document rollback steps (remove rule_files from prometheus.yml)
    • Test removing alert rules
    • Verify Prometheus still works
    • Re-enable alert rules
    • Test: Rollback works
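
An example of one critical rule from Subtask 2.1.2 plus the promtool validation step; the thresholds mirror the list above and the runbook_url is a placeholder:

# Create configs/prometheus/alerts/critical.yml with a ServiceDown rule, then validate it.
mkdir -p configs/prometheus/alerts
cat > configs/prometheus/alerts/critical.yml <<'EOF'
groups:
  - name: critical
    rules:
      - alert: ServiceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
          team: infra
        annotations:
          description: "{{ $labels.job }} on {{ $labels.instance }} has been down for more than 2 minutes."
          runbook_url: "docs/runbooks/alert-response.md"
EOF

# Validate the rule file without touching the running Prometheus.
docker run --rm -v "$(pwd)/configs/prometheus/alerts:/alerts" \
  --entrypoint promtool prom/prometheus:latest check rules /alerts/critical.yml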

Post-Task Validation:

  • All alert rules loaded: Check Prometheus UI /rules
  • Test alert fires: Run tests/test-alerts.sh
  • Alerts reach Alertmanager: Check Alertmanager UI
  • Documentation complete: Runbook exists for each critical alert

Task 2.3: Implement Automated Backup Verification

Priority: High 🟠 Status: Not Started Estimated Time: 8 hours (was 6h, +33% for comprehensive testing) Impact: High - Disaster recovery confidence Depends On: Task 2.2 (Alertmanager for failure notifications)

Implementation Steps:

  1. Subtask 2.3.1: Create backup verification script

    • Create scripts/verify-backup.sh
    • Accept backup timestamp as parameter
    • Extract backup to temporary location: /tmp/backup-verify-$$
    • Start temporary PostgreSQL container from backup (see the sketch after this list)
    • Start temporary MySQL container from backup
    • Start temporary MongoDB container from backup
    • Run integrity checks (connect, query, schema validation)
    • Clean up temporary resources (containers, volumes)
    • Return exit code 0 for success, 1 for failure
    • Log results to logs/backup-verification/verify-$(date +%Y%m%d-%H%M%S).log
    • Test: Run script with known good backup
    • Test: Run script with corrupted backup (should fail)
  2. Subtask 2.3.2: Implement database integrity checks

    • PostgreSQL: Run pg_dump --schema-only and compare to original
    • PostgreSQL: Check SELECT count(*) FROM pg_tables matches expected
    • MySQL: Run mysqlcheck --all-databases --check-upgrade
    • MySQL: Verify table count matches expected
    • MongoDB: Run mongod --dbpath /tmp/mongo-verify --repair
    • MongoDB: Verify collection count matches expected
    • Verify row/document counts for key tables/collections
    • Generate verification report with pass/fail for each check
    • Test: Integrity checks pass on valid backup
    • Test: Integrity checks fail on corrupted backup
  3. Subtask 2.3.3: Add verification to management script

    • Add verify-backup command to manage_devstack.py
    • Support verification of specific backup timestamp
    • Support verification of latest backup (default)
    • Generate verification report (JSON + human-readable)
    • Save report to logs/backup-verification/
    • Display summary in terminal (pass/fail + details)
    • Test: Run ./devstack verify-backup
    • Test: Run ./devstack verify-backup <timestamp>
  4. Subtask 2.3.4: Create automated verification logging

    • Create logs/backup-verification/ directory
    • Log verification results with timestamp
    • Log passed checks vs. failed checks
    • Create verification history tracking (CSV or JSON)
    • Generate monthly verification report (summary stats)
    • Rotate old logs (keep 90 days)
    • Test: Multiple verifications create proper logs
    • Test: Monthly report generated correctly
  5. Subtask 2.3.5: Create weekly verification schedule documentation

    • Update docs/DISASTER_RECOVERY.md with verification procedures
    • Document weekly verification schedule (Sunday 3 AM recommended)
    • Add cron example: 0 3 * * 0 /path/to/devstack verify-backup
    • Document what to do if verification fails
    • Add verification checklist
    • Document manual verification procedure
    • Test: Documentation review
  6. Subtask 2.3.6: Integration testing

    • Create test backup: ./devstack backup
    • Run verification: ./devstack verify-backup <timestamp>
    • Verify success report generated
    • Corrupt backup (modify a backup file)
    • Run verification on corrupted backup
    • Verify detection and failure report
    • Verify original backup still intact
    • Test verification with multiple backup timestamps
    • Test: Verification detects good and bad backups
    • Test: No false positives or false negatives
  7. Subtask 2.3.7: Documentation

    • Update docs/DISASTER_RECOVERY.md with verification section
    • Document verification process step-by-step
    • Add troubleshooting guide for verification failures
    • Document interpretation of verification reports
    • Add flowchart: when to restore, when to create new backup
    • Test: Documentation review
  8. Subtask 2.3.8: Rollback testing

    • Verification is non-destructive, no rollback needed
    • Document how to disable verification (don't add to cron)
    • Test: N/A
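
A minimal sketch of the PostgreSQL portion of scripts/verify-backup.sh from Subtask 2.3.1, assuming the backup is a plain-SQL pg_dump file; the image tag and paths are placeholders:

#!/bin/bash
# Restore a PostgreSQL dump into a throwaway container and confirm the schema came back.
set -e

DUMP="$1"                                   # e.g. a postgres .sql file from ./devstack backup
NAME="backup-verify-pg-$$"

docker run -d --rm --name "$NAME" -e POSTGRES_PASSWORD=verify postgres:16 >/dev/null
trap 'docker stop "$NAME" >/dev/null' EXIT  # always remove the temporary container

# Wait for the server, then give the image's init-time restart a moment to settle.
until docker exec "$NAME" pg_isready -U postgres >/dev/null 2>&1; do sleep 1; done
sleep 3

docker exec -i "$NAME" psql -U postgres -v ON_ERROR_STOP=1 < "$DUMP" >/dev/null
tables=$(docker exec "$NAME" psql -U postgres -Atc \
  "SELECT count(*) FROM pg_tables WHERE schemaname NOT IN ('pg_catalog','information_schema');")

echo "Restored $tables user tables from $DUMP"
[ "$tables" -gt 0 ]                         # fail verification if nothing was restored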

Post-Task Validation:

  • Verification script works: ./devstack verify-backup
  • Can detect corrupted backup
  • Reports are clear and actionable
  • Documentation complete

Task 2.4: Implement Database Migration Framework

Priority: High 🟠 Status: Not Started Estimated Time: 6 hours (was 4h, +50% for all databases) Impact: High - Safe schema evolution

Implementation Steps:

  1. Subtask 2.4.1: Create migration directory structure

    • Create configs/postgres/migrations/ directory
    • Create configs/mysql/migrations/ directory
    • Create configs/mongodb/migrations/ directory
    • Add README in each with migration instructions
    • Add .gitkeep to preserve empty directories
    • Test: Directory structure exists
  2. Subtask 2.4.2: Implement Flyway for PostgreSQL

    • Add Flyway service to docker-compose.yml (one-shot container)
    • Configure Flyway connection to PostgreSQL
    • Set migrations location: /flyway/sql
    • Mount ./configs/postgres/migrations:/flyway/sql
    • Create configs/postgres/migrations/V1__baseline.sql (empty or CREATE TABLE example)
    • Test Flyway execution: docker compose run --rm flyway migrate (see the standalone sketch after this list)
    • Test: Flyway runs and creates flyway_schema_history table
    • Test: V1 migration applied successfully
  3. Subtask 2.4.3: Implement Liquibase for MySQL

    • Add Liquibase service to docker-compose.yml
    • Configure Liquibase connection to MySQL
    • Create configs/mysql/migrations/changelog.xml
    • Create baseline changeset (example: CREATE TABLE demo)
    • Test Liquibase execution: docker compose run --rm liquibase update
    • Test: Liquibase runs and creates DATABASECHANGELOG table
    • Test: Baseline changeset applied
  4. Subtask 2.4.4: Implement migrate-mongo for MongoDB

    • Add migrate-mongo configuration: configs/mongodb/migrations/migrate-mongo-config.js
    • Create baseline migration: configs/mongodb/migrations/01-baseline.js
    • Configure MongoDB connection (use Vault credentials)
    • Test migration: Create script to run migrate-mongo in container
    • Test: Migration runs and creates migrations collection
    • Test: Baseline migration applied
  5. Subtask 2.4.5: Add migration commands to management script

    • Add db-migrate command to manage_devstack.py
    • Support --database flag (postgres, mysql, mongodb, all)
    • Add db-migrate-status command (shows applied migrations)
    • Add db-migrate-validate command (checks pending migrations)
    • Support service-specific migrations
    • Add db-migrate-rollback command (if supported by tool)
    • Add error handling and validation
    • Test: Run ./devstack db-migrate --database postgres
    • Test: Run ./devstack db-migrate-status
  6. Subtask 2.4.6: Create example migrations

    • PostgreSQL: Create V2__create_demo_table.sql
    • MySQL: Create 002-create-demo-table.xml
    • MongoDB: Create 02-create-demo-collection.js
    • Test forward migration for all databases
    • Test rollback for PostgreSQL (Flyway supports undo)
    • Document rollback limitations (Liquibase/MongoDB may not support)
    • Test: Migrations execute successfully
    • Test: Tables/collections created as expected
  7. Subtask 2.4.7: Create migration testing script

    • Create tests/test-migrations.sh
    • Test migration on empty database
    • Test migration idempotency (run twice, should succeed)
    • Test migration status reporting
    • Test rollback (PostgreSQL only)
    • Add to main test suite: ./tests/run-all-tests.sh
    • Test: Migration tests pass
  8. Subtask 2.4.8: Integration testing

    • Run migrations on clean database: ./devstack db-migrate
    • Verify schema created correctly (connect and inspect)
    • Check migration history tables (flyway_schema_history, etc.)
    • Test migration status: ./devstack db-migrate-status
    • Test rollback for PostgreSQL
    • Verify services still work after migration
    • Test: Complete migration lifecycle works
  9. Subtask 2.4.9: Documentation

    • Create docs/DATABASE_MIGRATIONS.md (new file)
    • Document migration creation process for each database
    • Document migration naming conventions
    • Document best practices (always forward, never edit old migrations)
    • Document rollback procedures and limitations
    • Add troubleshooting guide
    • Add examples for common scenarios
    • Test: Documentation review
  10. Subtask 2.4.10: Rollback testing

    • Document rollback steps (manual SQL if migration failed)
    • Test PostgreSQL rollback
    • Document MySQL/MongoDB rollback procedures (Liquibase rollback and migrate-mongo down cover changesets/migrations that define them; fall back to manual steps otherwise)
    • Test: Rollback procedures documented
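
For orientation, the one-shot migration services from Subtasks 2.4.2-2.4.4 might look roughly like the sketch below. This is a sketch under assumptions, not the project's actual configuration: service names, image tags, the "migrations" compose profile, the ${...} credential variables, and the Liquibase/migrate-mongo flags are illustrative, and in this stack the credentials would ultimately come from Vault.

```yaml
# Sketch only: one-shot migration services for Subtasks 2.4.2-2.4.4.
# Service names, image tags, the "migrations" profile, and the ${...}
# variables are illustrative; wire the real credentials in from Vault.
services:
  flyway:
    image: flyway/flyway:10          # pin to the version actually tested
    profiles: ["migrations"]         # keeps the job out of normal start-up
    command: migrate
    environment:
      FLYWAY_URL: jdbc:postgresql://postgres:5432/${POSTGRES_DB}
      FLYWAY_USER: ${POSTGRES_USER}
      FLYWAY_PASSWORD: ${POSTGRES_PASSWORD}
      FLYWAY_LOCATIONS: filesystem:/flyway/sql
    volumes:
      - ./configs/postgres/migrations:/flyway/sql:ro
    depends_on:
      - postgres

  liquibase:
    image: liquibase/liquibase:4.29  # may need the MySQL JDBC driver added
    profiles: ["migrations"]
    command: >
      --search-path=/liquibase/changelog
      --changelog-file=changelog.xml
      --url=jdbc:mysql://mysql:3306/${MYSQL_DATABASE}
      --username=${MYSQL_USER}
      --password=${MYSQL_PASSWORD}
      update
    volumes:
      - ./configs/mysql/migrations:/liquibase/changelog:ro
    depends_on:
      - mysql

  mongo-migrate:
    image: node:20-alpine            # or bake a small dedicated image
    profiles: ["migrations"]
    working_dir: /migrations         # migrate-mongo-config.js lives here
    command: sh -c "npx --yes migrate-mongo up"
    environment:
      MONGODB_URL: mongodb://${MONGO_USER}:${MONGO_PASSWORD}@mongodb:27017
    volumes:
      - ./configs/mongodb/migrations:/migrations
    depends_on:
      - mongodb
```

With a dedicated profile, these services stay out of ./devstack start and only run when targeted explicitly (e.g. docker compose run --rm flyway migrate, as in Subtask 2.4.2); the db-migrate command in Subtask 2.4.5 can simply shell out to the same invocations.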

Post-Task Validation:

  • All migration frameworks working
  • Can create and apply migrations
  • Migration history tracked correctly
  • Test script passes: tests/test-migrations.sh

Task 2.5: Add Integration Tests to CI/CD

Priority: High 🟠 Status: Not Started Estimated Time: 12 hours (was 4h, +200% for Docker-in-Docker complexity) Impact: Medium - Catch integration issues before merge

Decision Required: Choose implementation strategy (see below)

Implementation Steps:

Option A: Comprehensive Integration Tests (12 hours)

  1. Subtask 2.5.1: Research CI/CD Docker options

    • Evaluate Docker-in-Docker (DinD) approach
    • Evaluate GitHub Actions service containers
    • Evaluate testcontainers library
    • Test Colima/Docker Desktop alternatives for CI
    • Select best approach for project
    • Document decision with pros/cons
    • Test: Decision documented in ADR (Architecture Decision Record)
  2. Subtask 2.5.2: Create integration test workflow

    • Create .github/workflows/integration-tests.yml (see the workflow sketch after this list)
    • Set up Docker environment (DinD service or Docker socket mount)
    • Install dependencies (docker-compose, python, uv)
    • Configure test environment variables
    • Set resource limits for CI (avoid timeout)
    • Test: Workflow syntax valid: gh workflow view integration-tests.yml
  3. Subtask 2.5.3: Implement test execution in CI

    • Start services with minimal profile: ./devstack start --profile minimal
    • Wait for services to be healthy (poll health endpoints)
    • Run subset of tests (fastest, most critical)
    • Run ./tests/test-vault.sh
    • Collect test results and logs
    • Generate test report
    • Stop services and cleanup
    • Test: Workflow runs manually (on: workflow_dispatch)
  4. Subtask 2.5.4: Add test result reporting

    • Upload test results as GitHub artifacts
    • Add test summary to job summary (GitHub Actions feature)
    • Set workflow status based on test results
    • Add comment to PR with test results (optional)
    • Test: Test results visible in GitHub UI
  5. Subtask 2.5.5: Optimize test execution time

    • Implement Docker layer caching
    • Parallelize independent tests (if possible)
    • Use minimal profile for faster startup
    • Skip long-running tests in CI (run manually)
    • Target: Complete in <15 minutes
    • Test: Workflow completes in acceptable time
  6. Subtask 2.5.6: Integration testing

    • Trigger workflow manually: gh workflow run integration-tests.yml
    • Verify all tests run
    • Verify test results reported
    • Test failure handling (introduce failing test)
    • Verify workflow fails on test failure
    • Test: CI/CD integration tests pass
  7. Subtask 2.5.7: Documentation

    • Update docs/TESTING_APPROACH.md
    • Document CI/CD test execution
    • Document which tests run in CI vs. manual
    • Add troubleshooting guide for CI failures
    • Test: Documentation review
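
If Option A is chosen, a starting point for the workflow from Subtasks 2.5.2-2.5.3 could look like the sketch below. It assumes the Docker Engine preinstalled on GitHub's ubuntu-latest runners is an acceptable stand-in for Colima, which is exactly the open question from Subtask 2.5.1; the devstack and test commands come from this task list, while the action versions and health-check behaviour are assumptions to verify.

```yaml
# Sketch only (Option A): manually triggered integration tests.
# Assumes the Docker Engine preinstalled on ubuntu-latest is usable in place
# of Colima; the devstack/test commands mirror Subtask 2.5.3, and the action
# versions and health-check behaviour are assumptions to verify.
name: integration-tests

on:
  workflow_dispatch:                 # manual trigger while evaluating Option A

jobs:
  integration:
    runs-on: ubuntu-latest
    timeout-minutes: 15              # matches the <15 minute target
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install uv
        run: pip install uv

      - name: Start minimal profile
        run: ./devstack start --profile minimal

      - name: Wait for services to be healthy
        run: |
          # assumes ./devstack health exits non-zero while anything is unhealthy
          for i in $(seq 1 30); do
            ./devstack health && exit 0
            sleep 10
          done
          exit 1

      - name: Run critical tests
        run: ./tests/test-vault.sh

      - name: Collect logs
        if: always()
        run: docker compose logs > integration-logs.txt

      - name: Upload artifacts
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: integration-logs
          path: integration-logs.txt

      - name: Stop and clean up
        if: always()
        run: ./devstack stop
```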

Option B: Lightweight Testing (4 hours - Recommended)

If Docker-in-Docker proves too complex, use this simpler approach:

  1. Subtask 2.5.1: Create lightweight validation workflow

    • Create .github/workflows/validation.yml (see the workflow sketch after this list)
    • Run shellcheck on all bash scripts
    • Run Python linting and type checking (ruff, mypy)
    • Validate docker-compose.yml syntax
    • Validate all YAML files
    • Run unit tests (no Docker required)
    • Test: Workflow runs and passes
  2. Subtask 2.5.2: Document manual integration testing

    • Create docs/MANUAL_INTEGRATION_TESTING.md
    • Document pre-merge checklist
    • Require manual test run before merge
    • Document in CONTRIBUTING.md
    • Test: Documentation clear and actionable
  3. Subtask 2.5.3: Defer full integration testing to Phase 5

    • Add to Phase 5 backlog (future work)
    • Document as known limitation
    • Revisit when more time available
    • Test: Decision documented
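
A minimal version of the Option B workflow might look like the sketch below. The tool list mirrors Subtask 2.5.1; the repository paths, pytest marker, and exact install commands are assumptions to adapt.

```yaml
# Sketch only (Option B): lightweight validation on every push/PR.
# Tool choices follow Subtask 2.5.1 (shellcheck, ruff, mypy, compose/YAML
# validation, unit tests); paths and package lists are assumptions.
name: validation

on: [push, pull_request]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Shellcheck all bash scripts
        run: |
          # shellcheck is preinstalled on GitHub-hosted Ubuntu runners
          find . -name '*.sh' -not -path './.git/*' -print0 \
            | xargs -0 shellcheck

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install lint/test tooling
        run: pip install ruff mypy yamllint pytest

      - name: Python lint and type check
        run: |
          ruff check .
          mypy manage_devstack.py    # adjust target to the real module layout

      - name: Validate docker-compose.yml
        run: docker compose config --quiet

      - name: Validate YAML files
        run: yamllint .

      - name: Unit tests (no Docker required)
        run: pytest -m "not integration"   # marker name is an assumption
```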

Recommended Choice: Option B (Lightweight Testing). Rationale: Docker-in-Docker with Colima and Vault is complex and may not be feasible in GitHub Actions; focus on what is achievable now and defer full integration testing to a future phase.

Post-Task Validation:

  • CI workflow exists and runs
  • Tests provide value (catch real issues)
  • Documentation complete

Phase 2 Completion Criteria

  • All 5 tasks completed (2.1 through 2.5)
  • All subtasks tested and passing
  • Prometheus alerts configured and firing correctly
  • Alertmanager receiving and routing notifications (test alert successful; see the routing sketch after this list)
  • Backup verification script works and detects issues
  • Database migrations working for all databases (postgres, mysql, mongodb)
  • CI/CD tests running (even if lightweight approach)
  • Documentation updated and accurate
  • No breaking changes to existing functionality
  • Smoke test passes: ./devstack health
  • Full test suite passes: ./tests/run-all-tests.sh
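
As a point of reference for the Alertmanager criterion above, a minimal route/receiver pair could look like the sketch below. The receiver name, grouping labels, and webhook URL are placeholders to be replaced by whatever receiver Task 2.2 actually configures; the test alert itself can be pushed with amtool or a POST to Alertmanager's /api/v2/alerts endpoint (presumably what ./devstack alertmanager-test wraps).

```yaml
# Sketch only: minimal alertmanager.yml routing for the test-alert check.
# Receiver name, labels, and webhook URL are placeholders.
route:
  receiver: default
  group_by: ['alertname', 'service']
  group_wait: 30s
  repeat_interval: 4h

receivers:
  - name: default
    webhook_configs:
      - url: http://host.docker.internal:9999/alert-sink   # placeholder sink
        send_resolved: true
```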

Post-Phase Integration Test:

  1. Run ./devstack stop && ./devstack start --profile full
  2. Verify all services healthy (including Alertmanager)
  3. Trigger test alert: ./devstack alertmanager-test
  4. Verify alert received
  5. Run backup and verification: ./devstack backup && ./devstack verify-backup
  6. Run database migrations: ./devstack db-migrate
  7. Run full test suite: ./tests/run-all-tests.sh
  8. Compare to Phase 0 baseline (no regressions)
  9. Git checkpoint: git commit -m "Phase 2 complete: Operational Excellence"
  10. Git tag: git tag phase-2-complete && git push --tags

Phase 2 Sign-off: _________________ Date: _________


Phase 3 - Code Quality (Estimated: 4-5 days)

Total Time: 32 hours (with 40% buffer: ~45 hours)

Task 3.1: Standardize Error Handling Across Languages

Priority: Medium 🟡 Status: Not Started Estimated Time: 18 hours (was 12h, +50% for complexity) Impact: Medium - Consistent error responses across all APIs

[Content continues with all Phase 3 and Phase 4 tasks with similar level of detail...]


Lessons Learned (Complete After Each Phase)

Phase 0 Lessons

  • What went well:
  • What didn't go well:
  • What would we do differently:
  • Time estimate accuracy: Estimated: 3h, Actual: ___h, Variance: ___%
  • Unexpected challenges:
  • Key learnings:

Phase 1 Lessons

  • What went well:
  • What didn't go well:
  • What would we do differently:
  • Time estimate accuracy: Estimated: 39h, Actual: ___h, Variance: ___%
  • Most challenging task:
  • Most valuable improvement:

Phase 2 Lessons

  • What went well:
  • What didn't go well:
  • What would we do differently:
  • Time estimate accuracy: Estimated: 38h, Actual: ___h, Variance: ___%

Phase 3 Lessons

  • What went well:
  • What didn't go well:
  • What would we do differently:
  • Time estimate accuracy: Estimated: 45h, Actual: ___h, Variance: ___%

Phase 4 Lessons

  • What went well:
  • What didn't go well:
  • What would we do differently:
  • Time estimate accuracy: Estimated: 15h, Actual: ___h, Variance: ___%

Notes and Decisions

Task Modifications

  • Document any changes to tasks or scope here
  • Note reasons for task modifications
  • Track scope creep or descoping

Example:

  • 2025-11-14: Changed Task 2.5 from comprehensive to lightweight testing due to Docker-in-Docker complexity

Issues Encountered

  • Document blocking issues and resolutions
  • Track technical debt created
  • Note areas requiring future attention

Deferred Items (Phase 5 Backlog)

  • Full integration tests in CI/CD (Task 2.5 Option A)
  • Complete Rust implementation (Task 3.5 Option B)
  • S3 integration for Vault backups (Task 1.3 enhancement)
  • Advanced alerting with PagerDuty integration

Project Completion Checklist

  • Phase 0: Preparation - COMPLETE
  • Phase 1: Critical Security - COMPLETE
  • Phase 2: Operational Excellence - COMPLETE
  • Phase 3: Code Quality - COMPLETE
  • Phase 4: Developer Experience - COMPLETE
  • All 21 tasks completed (including Phase 0)
  • All tests passing
  • All documentation updated
  • Changelog updated with all improvements
  • Final integration testing complete
  • Compare final metrics to Phase 0 baseline
  • All success metrics achieved
  • Risk register reviewed (all risks mitigated or accepted)
  • Lessons learned documented
  • Project ready for production adaptation

Final Metrics vs. Baseline:

  • Test coverage: Baseline: ___%, Final: ___%, Improvement: ___% ✅
  • Security score: Baseline: ___, Final: ___, Improvement: ___% ✅
  • Performance p95: Baseline: ___ms, Final: ___ms, Degradation: ___% ✅ (target: <5%)
  • Total test count: Baseline: ___, Final: ___, New tests: ___ ✅

Project Sign-off: _________________ Date: _________


Last Updated: 2025-11-14 Document Version: 2.0 Changes from v1.0: Added Phase 0, fixed dependencies, improved time estimates, added rollback testing, enhanced security, added comprehensive validation
