Improvement Task List
Created: 2025-11-14 Status: Not Started Current Phase: Phase 0 - Preparation Last Updated: 2025-11-14 Document Version: 2.0
This document tracks the implementation of 20 improvements identified in the comprehensive codebase analysis. Each task must be completed and tested before moving to the next task. Each phase must be fully completed before moving to the next phase.
Completion Criteria:
- ✅ Task implementation complete
- ✅ Tests written and passing
- ✅ Smoke tests pass (no regressions)
- ✅ Documentation updated
- ✅ Rollback tested and verified
- ✅ No breaking changes to existing functionality
- Environment: All work performed on macOS with Apple Silicon
- Access: Full admin access to development environment
- Time: Dedicated time blocks for each phase (no interruptions)
- Backups: Ability to backup/restore entire environment quickly
- External Services: No external service dependencies (S3, email, etc.) required initially
- Team Size: Single person implementing (adjust estimates for team)
- Testing: Comprehensive testing possible in dev environment
- Rollback: Acceptable to rollback if critical issues found
- Timeline: Flexible timeline, quality over speed
- Scope: Can defer tasks to future phases if needed
- Zero Downtime Required: Development environment, downtime acceptable
- Budget: No budget for external services (GitHub Actions minutes, S3 storage)
- Hardware: Limited to single development machine
- Network: Local development only, no cloud deployment
- Time: Target completion within 4-5 weeks
- Compatibility: Must maintain compatibility with existing configurations
- Documentation: All changes must be documented
- Testing: All changes must be tested
- Security: Changes must not degrade the existing security posture
- Performance: Changes must not degrade existing performance
- Vault backups: ~10MB daily × 30 days = ~300MB
- Database backups: ~500MB daily × 7 days = ~3.5GB
- Migration frameworks: ~200MB
- Test artifacts and logs: ~500MB
- Total additional: ~4.5GB disk space minimum
- Alertmanager: +128MB
- Flyway/Liquibase (temporary): +256MB during migrations
- Test containers (temporary): +512MB during CI/CD tests
- Total additional: +896MB RAM (peak during testing)
- Alertmanager: 9093
- Verify available: `netstat -an | grep 9093`
- GPG for backup encryption (install if missing: `brew install gnupg`)
- S3 bucket or external storage (for secure backups - Phase 1.3)
- Email/Slack for Alertmanager (optional but recommended - Phase 2.2)
- Docker Desktop or Colima
- Git
- Python 3.11+ with uv
- Vault CLI
- jq (for JSON parsing)
- shellcheck (for bash linting)
| Risk ID | Description | Probability | Impact | Mitigation | Owner | Status |
|---|---|---|---|---|---|---|
| R1 | AppRole breaks all services | Medium | High | Keep root token fallback via VAULT_USE_ROOT_TOKEN=true | Phase 1 | Open |
| R2 | Backup encryption passphrase lost | Low | Critical | Document passphrase in secure location, Test restoration regularly | Phase 1 | Open |
| R3 | CI/CD integration tests infeasible | High | Medium | Use alternative testing strategy (manual integration tests) | Phase 2 | Open |
| R4 | set -euo pipefail breaks scripts | Medium | Medium | Test incrementally, Use -eo pipefail first, Add -u selectively | Phase 3 | Open |
| R5 | Performance degradation during changes | Low | Medium | Measure baseline first, Monitor continuously, Rollback if >5% degradation | All | Open |
| R6 | Time estimates too optimistic | High | Low | Added 40% buffer to all estimates, Track actual vs. estimated | All | Open |
| R7 | Vault data loss during implementation | Low | Critical | Full backup in Phase 0, Test restoration before starting | Phase 0 | Open |
| R8 | Database migration failures | Medium | Medium | Test on copy of data first, Maintain rollback scripts | Phase 2 | Open |
All Phase 0 tasks must complete before Phase 1 starts
No internal dependencies within Phase 0
Task 1.1 (AppRole) ────────┐
├──> Task 1.2 (Remove Env Creds) ──┐
│ │
Task 1.3 (Vault Backup) ───┼──────────────────────────────────┼──> Phase 1 Complete
Task 1.4 (MySQL Fix) ──────┤ │
Task 1.5 (Document) ───────┘ │
│
All tasks must complete ─────────────────────────────────────────┘
Recommended Execution Order:
- Complete Task 1.1 fully (including integration testing)
- Start Tasks 1.3, 1.4, 1.5 in parallel (independent)
- Complete Task 1.2 (after 1.1 confirmed working)
Task 2.2 (Alertmanager) ──┬──> Task 2.1 (Alert Rules) ──────┐
└──> Task 2.3 (Backup Verify) ────┤
├──> Phase 2 Complete
Task 2.4 (Migrations) ────────────────────────────────────────┤
Task 2.5 (CI/CD Tests) ───────────────────────────────────────┘
Recommended Execution Order:
- Complete Task 2.2 (Alertmanager) first
- Start Task 2.1 (after 2.2 working)
- Start Tasks 2.4, 2.5 in parallel (independent)
- Complete Task 2.3 (after 2.2 working)
All Phase 3 tasks are independent (can run in parallel)
Task 3.1, 3.2, 3.3, 3.4, 3.5 ──> Phase 3 Complete
All Phase 4 tasks are independent (can run in parallel)
Task 4.1, 4.2, 4.3, 4.4, 4.5 ──> Phase 4 Complete
Update this section daily:
Phase 0 Progress: 0% (0/1 tasks) Phase 1 Progress: 0% (0/5 tasks) Phase 2 Progress: 0% (0/5 tasks) Phase 3 Progress: 0% (0/5 tasks) Phase 4 Progress: 0% (0/5 tasks) Overall Progress: 0% (0/21 tasks)
Current Status: Not Started Current Task: N/A Blockers: None Start Date: TBD Expected Completion: TBD (4-5 weeks from start) Last Updated: 2025-11-14
Baseline Metrics (Measure in Phase 0):
- Test coverage: ___% (run `pytest --cov` + `go test -cover`)
- Security score: ___ (run `./scripts/audit-capabilities.sh`)
- Performance p95: ___ ms per endpoint (run `tests/performance-benchmark.sh`)
- Total test count: ___ tests passing
Target Metrics After Phase 1:
- Test coverage improvement: +5% (from baseline)
- Security score improvement: +15% (AppRole + credential fixes)
- Performance: p95 latency ≤ baseline + 5% (no significant degradation)
- All existing tests still passing + new security tests
Target Metrics After Phase 2:
- Test coverage improvement: +10% (from baseline)
- Alert coverage: 100% of critical services monitored
- Backup verification: 100% automated
- Migration framework: 100% of databases covered
Target Metrics After Phase 3:
- Test coverage improvement: +15% (from baseline)
- Code quality: ShellCheck passing 100% of scripts
- API consistency: 100% parity across Python, Go, Node.js (partial for Rust)
Target Metrics After Phase 4:
- Test coverage improvement: +18% (from baseline)
- Performance regression: Automated in CI/CD
- Documentation coverage: 100% (runbooks + ADRs)
- Developer experience: Measured by setup time reduction
Measurement Tools:
- Test coverage: `pytest --cov` + `go test -cover` + Jest coverage
- Security score: `./scripts/audit-capabilities.sh` + Trivy scan results
- Performance: `tests/performance-benchmark.sh`
- Test count: `./tests/run-all-tests.sh --count`
Purpose: Establish baseline, safety nets, and documentation before making any changes.
Priority: Critical 🔴 Status: Not Started Estimated Time: 3 hours Impact: Critical - Prevents data loss and provides rollback capability
Current Issue:
- No documented baseline for comparison
- No verified backup for disaster recovery
- No clean starting point for changes
Implementation Steps:
-
Subtask 0.1.1: Full environment backup
- Backup Vault keys: `cp -r ~/.config/vault ~/vault-backup-$(date +%Y%m%d)`
- Verify Vault backup contains: keys.json, root-token, ca/, certs/
- Run database backup: `./devstack backup`
- Note backup location and timestamp
- Export all docker volumes: `docker volume ls -q | xargs -I {} docker run --rm -v {}:/data -v $(pwd)/backups:/backup alpine tar czf /backup/{}.tar.gz -C /data .`
- Calculate total backup size: `du -sh ~/vault-backup-* backups/`
- Test: Verify backups exist and are non-zero size
- Test: Restore Vault backup to temporary location and verify contents
-
Subtask 0.1.2: Document current state (baseline)
- Run `./devstack status` and save output to `baseline/status.txt`
- Run `./tests/run-all-tests.sh` and save results to `baseline/test-results.txt`
- Measure performance: `tests/performance-benchmark.sh > baseline/performance.txt`
- Document service versions: `docker compose images > baseline/versions.txt`
- Measure test coverage: `docker exec dev-reference-api pytest tests/ --cov --cov-report=term > baseline/coverage.txt`
- Run security audit: `./scripts/audit-capabilities.sh > baseline/security.txt` (create if it doesn't exist)
- Count total tests: `./tests/run-all-tests.sh 2>&1 | grep -E "tests? passed" | tee baseline/test-count.txt`
- Test: All baseline files created and contain data
- Test: Baseline shows all services healthy and all tests passing
-
Subtask 0.1.3: Create feature branch
- Ensure working directory is clean: `git status`
- Create branch: `git checkout -b improvement-phases-1-4`
- Push branch: `git push -u origin improvement-phases-1-4`
- Document branch strategy in commit message
- Test: Branch created and pushed to remote
- Test: Can switch back to main and return to feature branch
-
Subtask 0.1.4: Verify environment health
- All services healthy: `./devstack health` (all green)
- All tests passing: `./tests/run-all-tests.sh` (0 failures)
- No pending git changes: `git status` (clean working tree)
- Sufficient disk space: `df -h` (>10GB free)
- Vault unsealed and accessible: `vault status`
- Test: Clean starting point confirmed
- Test: Environment matches baseline
-
Subtask 0.1.5: Set up task tracking
- Update this document with start date
- Create `baseline/` directory for all baseline measurements
- Set up daily progress update reminder (calendar/cron)
- Review risk register and accept risks
- Confirm resource requirements are met
- Test: Tracking mechanisms in place
- Test: All Phase 0 subtasks completed
-
Subtask 0.1.6: Create rollback documentation
- Document rollback procedure for Vault: `docs/ROLLBACK_PROCEDURES.md`
- Document rollback procedure for databases
- Document rollback procedure for docker-compose changes
- Document rollback procedure for git changes
- Test rollback from backup (dry run)
- Test: Rollback procedures documented and tested
- All baseline measurements documented
- Full environment backup completed and verified
- Feature branch created and pushed
- Environment health verified (all services healthy, all tests passing)
- Task tracking mechanisms in place
- Rollback procedures documented and tested
- Sufficient disk space available (>10GB free)
- All tools installed (docker, git, python, vault, jq, shellcheck, gpg)
- Risk register reviewed and accepted
Phase 0 Sign-off: _________________ Date: _________
Total Time: 28 hours (with 40% buffer: ~39 hours)
Priority: Critical 🔴 Status: Not Started Estimated Time: 12 hours (was 8h, +50% for complexity) Impact: High - Prevents compromised services from accessing all Vault secrets
Current Issue:
- All services use the root token passed via the `VAULT_TOKEN` environment variable
- Any compromised service has full Vault access
- File: `docker-compose.yml:74`
Implementation Steps:
-
Subtask 1.1.1: Create AppRole policies for each service
- Create `configs/vault/policies/postgres-policy.hcl`
- Create `configs/vault/policies/mysql-policy.hcl`
- Create `configs/vault/policies/mongodb-policy.hcl`
- Create `configs/vault/policies/redis-policy.hcl`
- Create `configs/vault/policies/rabbitmq-policy.hcl`
- Create `configs/vault/policies/forgejo-policy.hcl`
- Create `configs/vault/policies/pgbouncer-policy.hcl`
- Create `configs/vault/policies/reference-api-policy.hcl` (see the policy sketch after this list)
- Test: Validate policy syntax with `vault policy fmt <file>`
- Test: Each policy file is valid HCL
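A minimal sketch of one of these per-service policies, written as the shell step that would create and validate it. The `secret/data/postgres/*` path layout is an assumption; adjust it to wherever vault-bootstrap.sh actually writes the secrets.

```bash
#!/usr/bin/env bash
# Sketch: create a least-privilege policy file for the postgres service.
cat > configs/vault/policies/postgres-policy.hcl <<'EOF'
# Read-only access to this service's own secrets, nothing else.
path "secret/data/postgres/*" {
  capabilities = ["read"]
}
path "secret/metadata/postgres/*" {
  capabilities = ["read", "list"]
}
EOF

vault policy fmt configs/vault/policies/postgres-policy.hcl   # syntax check
```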
-
Subtask 1.1.2: Update vault-bootstrap.sh to enable AppRole and create roles
- Enable AppRole auth method
- Upload all policies to Vault
- Create AppRole for each service with appropriate policy
- Generate role_id for each service (deterministic, not secret)
- Generate secret_id for each service
- Create directory: `mkdir -p ~/.config/vault/approles/`
- Store role_id in `~/.config/vault/approles/<service>-role-id`
- Store secret_id in `~/.config/vault/approles/<service>-secret-id`
- Set permissions: `chmod 600 ~/.config/vault/approles/*`
- Add note about mounting approles directory into containers
- Test: Verify AppRoles created with `vault list auth/approle/role`
- Test: Verify credential files exist and are readable
- Test: Test login with AppRole: `vault write auth/approle/login role_id=<id> secret_id=<secret>` (a bootstrap sketch follows this list)
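A hedged sketch of the vault-bootstrap.sh additions; the TTL values and policy names are assumptions, and the service list mirrors the policies from Subtask 1.1.1.

```bash
#!/usr/bin/env bash
# Sketch for vault-bootstrap.sh: enable AppRole and create one role per service.
set -eo pipefail
APPROLE_DIR=~/.config/vault/approles
mkdir -p "$APPROLE_DIR"

vault auth enable -path=approle approle 2>/dev/null || true   # idempotent re-run

for service in postgres mysql mongodb redis rabbitmq forgejo pgbouncer reference-api; do
  vault policy write "$service" "configs/vault/policies/${service}-policy.hcl"
  vault write "auth/approle/role/$service" \
    token_policies="$service" token_ttl=1h token_max_ttl=4h
  # role_id is deterministic; secret_id is sensitive and newly generated each run.
  vault read -field=role_id "auth/approle/role/$service/role-id" \
    > "$APPROLE_DIR/${service}-role-id"
  vault write -f -field=secret_id "auth/approle/role/$service/secret-id" \
    > "$APPROLE_DIR/${service}-secret-id"
done
chmod 600 "$APPROLE_DIR"/*
```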
-
Subtask 1.1.3: Update service init scripts to use AppRole authentication
- Create shared function: `vault_approle_login()` in each init script (see the sketch after this list)
- Update `configs/postgres/scripts/init.sh` to use AppRole
- Update `configs/mysql/scripts/init.sh` to use AppRole
- Update `configs/mongodb/scripts/init.sh` to use AppRole
- Update `configs/redis/scripts/init.sh` to use AppRole
- Update `configs/rabbitmq/scripts/init.sh` to use AppRole
- Update `configs/forgejo/scripts/init.sh` to use AppRole
- Update `configs/pgbouncer/scripts/init.sh` to use AppRole
- Update reference app configurations (environment-based)
- Add fallback to root token if `VAULT_USE_ROOT_TOKEN=true`
- Test: Verify each service can authenticate with AppRole
- Test: Verify each service can retrieve its secrets
- Test: Verify fallback works with `VAULT_USE_ROOT_TOKEN=true`
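A minimal sketch of the shared login helper, assuming the environment variables proposed in Subtask 1.1.4 (`VAULT_USE_ROOT_TOKEN`, `VAULT_APPROLE_PATH`) and the credential file names from Subtask 1.1.2.

```bash
#!/usr/bin/env bash
# Sketch of vault_approle_login(): exchange role_id/secret_id for a client token,
# or fall back to the root token when VAULT_USE_ROOT_TOKEN=true.
vault_approle_login() {
  local service="$1"
  if [ "${VAULT_USE_ROOT_TOKEN:-false}" = "true" ]; then
    # Fallback path: VAULT_TOKEN is already injected by docker-compose.
    return 0
  fi
  local approle_dir="${VAULT_APPROLE_PATH:-/vault-approles}"
  local role_id secret_id
  role_id=$(cat "${approle_dir}/${service}-role-id")
  secret_id=$(cat "${approle_dir}/${service}-secret-id")
  VAULT_TOKEN=$(vault write -field=token auth/approle/login \
    role_id="$role_id" secret_id="$secret_id")
  export VAULT_TOKEN
}
```

Usage inside an init script would then be something like `vault_approle_login postgres` before any `vault kv get` calls.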
-
Subtask 1.1.4: Update docker-compose.yml to pass role credentials
- Mount approles directory: `~/.config/vault/approles:/vault-approles:ro`
- Keep `VAULT_TOKEN` for backward compatibility
- Add `VAULT_USE_APPROLE=true` environment variable
- Add `VAULT_APPROLE_PATH=/vault-approles` environment variable
- Update service dependencies (no changes needed)
- Update health checks (no changes needed)
- Test: `docker compose config` validates successfully
- Test: No syntax errors in docker-compose.yml
-
Subtask 1.1.5: Update .env.example
- Add `# AppRole Authentication (recommended for production)` section
- Add `VAULT_USE_APPROLE=true` with explanation
- Add `VAULT_APPROLE_PATH=/vault-approles` with explanation
- Add `# Root Token Authentication (development only)` section
- Add `VAULT_USE_ROOT_TOKEN=false` with warning
- Document when to use each method
- Test: New users can understand configuration
- Test: .env.example has clear comments
-
Subtask 1.1.6: Update management script for AppRole support
- Update `scripts/manage_devstack.py` vault-bootstrap command
- Update `vault-show-password` to work with AppRole or root token
- Add `vault-approle-status` command to show AppRole status
- Update documentation strings in management script
- Add error handling for missing approle credentials
- Test: Run `./devstack vault-show-password postgres`
- Test: Run `./devstack vault-approle-status`
-
Subtask 1.1.7: Create documentation for AppRole implementation
- Update `docs/VAULT.md` with AppRole section (400+ lines)
- Document AppRole authentication flow with diagram
- Update `docs/SECURITY_ASSESSMENT.md` with AppRole security benefits
- Add AppRole troubleshooting guide to `docs/VAULT.md`
- Document root token fallback mechanism
- Add migration guide from root token to AppRole
- Test: Documentation review for accuracy
- Test: Documentation completeness check
-
Subtask 1.1.8: Integration testing
- Stop all services: `./devstack stop`
- Clear any cached tokens
- Set `VAULT_USE_APPROLE=true` in .env
- Start services: `./devstack start`
- Test Vault bootstrap: `./devstack vault-bootstrap`
- Test service authentication to Vault (check logs)
- Test secret retrieval from all services
- Run test suite: `./tests/test-vault.sh`
- Run full test suite: `./tests/run-all-tests.sh`
- Test root token fallback: Set `VAULT_USE_ROOT_TOKEN=true` and restart
- Test: All services healthy and functioning
- Test: All tests passing (0 regressions)
- Test: Vault shows AppRole logins in audit log
-
Subtask 1.1.9: Rollback testing
- Document rollback steps to root token
- Test rollback: Set `VAULT_USE_ROOT_TOKEN=true`
- Restart services with root token
- Verify all services start successfully
- Verify all tests still pass
- Document time required for rollback (~5 minutes)
- Re-enable AppRole after successful rollback test
- Test: Rollback works successfully
- Test: Can switch back to AppRole without issues
Rollback Plan:
# In .env file:
VAULT_USE_ROOT_TOKEN=true
VAULT_USE_APPROLE=false
# Restart services:
./devstack restart
Post-Task Validation:
- Smoke test: `./devstack health` (all services healthy)
- No regressions: `./tests/run-all-tests.sh` (all tests pass)
- Performance check: No significant latency increase (<5%)
- Security audit: Verify services use AppRole (check Vault audit log)
Priority: Critical 🔴 Status: Not Started Estimated Time: 8 hours (was 6h, +33% for testing) Impact: High - Prevents credential exposure via docker inspect and process listing Depends On: Task 1.1 (AppRole authentication must be working)
Current Issue:
- Credentials exported to environment variables (visible in `docker inspect`)
- Files: All `configs/*/scripts/init.sh` files
- Risk: Process listing, container inspection exposes passwords
Implementation Steps:
-
Subtask 1.2.1: Create secure credential file handling function
- Create shared function library: `configs/shared/vault-helpers.sh` (see the sketch after this list)
- Add `create_secure_creds_file()` function
- Add `cleanup_creds_file()` function
- Implement proper file permissions (chmod 600)
- Implement cleanup trap on script exit
- Add error handling for file operations
- Test: Verify file permissions set correctly (600)
- Test: Verify cleanup happens on script exit (normal and error)
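A sketch of what `configs/shared/vault-helpers.sh` could contain; the function names follow the task list, the implementation details are assumptions.

```bash
#!/usr/bin/env bash
# Sketch for configs/shared/vault-helpers.sh: write credentials to a 600-mode file
# and guarantee cleanup on exit (normal or error).
CREDS_FILES=()

create_secure_creds_file() {
  local path="$1" content="$2"
  (umask 077 && : > "$path")      # file is created with mode 600
  chmod 600 "$path"
  printf '%s\n' "$content" > "$path"
  CREDS_FILES+=("$path")
}

cleanup_creds_file() {
  local f
  for f in "${CREDS_FILES[@]:-}"; do
    [ -f "$f" ] && rm -f "$f"
  done
}

trap cleanup_creds_file EXIT      # runs on script exit, including failures
```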
-
Subtask 1.2.2: Update PostgreSQL init script
- Source shared vault-helpers.sh
- Modify
configs/postgres/scripts/init.sh - Write credentials to temporary PGPASSFILE instead of export
- Pass credentials via PGPASSFILE environment (points to file, not password)
- Remove export statements for POSTGRES_PASSWORD
- Keep POSTGRES_USER in environment (not sensitive)
- Add cleanup trap for temp files
- Test: PostgreSQL starts successfully
-
Test: Credentials not in
docker inspect dev-postgres -
Test:
ps auxdoesn't show password during startup
-
Subtask 1.2.3: Update MySQL init script
- Source shared vault-helpers.sh
- Modify
configs/mysql/scripts/init.sh - Create temporary
.my.cnffile for credentials - Use
--defaults-file=/path/to/.my.cnfinstead of-p - Remove export statements for MYSQL_ROOT_PASSWORD
- Add cleanup trap for .my.cnf
- Test: MySQL starts successfully
- Test: Credentials not in environment
-
Subtask 1.2.4: Update MongoDB init script
- Source shared vault-helpers.sh
- Modify
configs/mongodb/scripts/init.sh - Write credentials to MongoDB config file
- Use
--config /path/to/mongod.conf - Remove export statements for MONGO_INITDB_ROOT_PASSWORD
- Add cleanup trap
- Test: MongoDB starts successfully
- Test: Credentials not in environment
-
Subtask 1.2.5: Update Redis init scripts (all 3 nodes)
- Source shared vault-helpers.sh
- Modify
configs/redis/scripts/init.sh - Write credentials to Redis ACL file
- Use
--aclfile /path/to/acl.conf - Remove export statements for REDIS_PASSWORD
- Apply to all 3 nodes (redis-1, redis-2, redis-3)
- Add cleanup trap
- Test: Redis cluster starts successfully
- Test: Credentials not in environment for all 3 nodes
-
Subtask 1.2.6: Update RabbitMQ init script
- Source shared vault-helpers.sh
- Modify
configs/rabbitmq/scripts/init.sh - Write credentials to RabbitMQ config file
- Use
--config /path/to/rabbitmq.conf - Remove export statements for RABBITMQ_DEFAULT_PASS
- Add cleanup trap
- Test: RabbitMQ starts successfully
- Test: Credentials not in environment
-
Subtask 1.2.7: Update Forgejo init script
- Source shared vault-helpers.sh
- Modify
configs/forgejo/scripts/init.sh - Write credentials to secure temp file
- Pass via file descriptor or config file
- Remove export statements for sensitive data
- Add cleanup trap
- Test: Forgejo starts successfully
- Test: Credentials not in environment
-
Subtask 1.2.8: Update PgBouncer init script
- Source shared vault-helpers.sh
- Modify
configs/pgbouncer/scripts/init.sh - Write credentials to userlist.txt (PgBouncer format)
- Set file permissions to 600
- Remove export statements for sensitive data
- Add cleanup trap
- Test: PgBouncer starts successfully
- Test: Credentials not in environment
-
Subtask 1.2.9: Audit reference applications for credential exposure
- Review Python FastAPI application logging (no credential logging)
- Review Go application environment handling
- Review Node.js application credential management
- Review Rust application secret handling
- Add warning comments in code: "# WARNING: Do not log this value"
- Add credential redaction to logging middleware
- Test: No credentials in application logs
-
Test: Search logs for common password patterns: `grep -ri "password.*:" logs/`
-
Subtask 1.2.10: Integration testing
- Stop all services
- Start services: `./devstack start`
- Verify `docker inspect dev-postgres | grep -i password` returns nothing
- Verify `docker inspect dev-mysql | grep -i password` returns nothing
- Check all services: postgres, mysql, mongodb, redis-1/2/3, rabbitmq, forgejo, pgbouncer
- Monitor `ps aux` during startup for password exposure
- Run all service tests: `./tests/run-all-tests.sh`
- Check application logs for credential leakage
- Test: All tests passing, no credential leakage
- Test: All services healthy and functioning
-
Subtask 1.2.11: Create validation script
- Create `scripts/validate-no-credential-exposure.sh`
- Script checks `docker inspect` for all services
- Script checks process list for password patterns
- Script checks logs for credential patterns
- Exit with error if any credentials found
- Add to test suite: `./tests/test-security.sh`
- Test: Validation script passes
- Test: Validation script detects intentionally exposed test credential
-
Subtask 1.2.12: Rollback testing
- Document rollback steps
- Test reverting one service to old method
- Verify service still works
- Revert back to secure method
- Document time required for rollback
- Test: Rollback works successfully
Validation Script:
#!/bin/bash
# scripts/validate-no-credential-exposure.sh
set -e
echo "Checking for credential exposure..."
# Check docker inspect for all services
for service in postgres mysql mongodb redis-1 redis-2 redis-3 rabbitmq forgejo pgbouncer; do
if docker inspect "dev-$service" 2>/dev/null | grep -iE "(password|secret|token)" | grep -v "VAULT_TOKEN"; then
echo "FAIL: Credentials found in dev-$service environment"
exit 1
fi
done
# Check process list
if ps aux | grep -iE "password=|passwd=|-p[[:space:]]*[^[:space:]]" | grep -v grep; then
echo "FAIL: Credentials found in process list"
exit 1
fi
echo "PASS: No credential exposure detected"Post-Task Validation:
- Smoke test:
./devstack health - No regressions:
./tests/run-all-tests.sh - Security validation:
./scripts/validate-no-credential-exposure.sh - Performance check: No degradation
Priority: Critical 🔴 Status: Not Started Estimated Time: 5 hours (was 2h, +150% for security and testing) Impact: Critical - Prevents irrecoverable data loss
Current Issue:
- Vault keys stored at `~/.config/vault/` with no automated backup
- If the directory is deleted, all data is irrecoverable
- No backup verification
- No backup retention policy
Implementation Steps:
-
Subtask 1.3.1: Create encrypted backup script
- Create `scripts/vault-backup.sh` (see the sketch after this list)
- Implement GPG encryption with AES256
- Support passphrase from environment variable: `VAULT_BACKUP_PASSPHRASE`
- Support passphrase prompt if env var not set
- Create timestamped backup directory: `~/devstack-backups/vault/YYYYMMDD_HHMMSS/`
- Backup keys.json → keys.json.gpg
- Backup root-token → root-token.gpg
- Backup entire ca/ directory → ca.tar.gz.gpg
- Backup entire certs/ directory → certs.tar.gz.gpg
- Create manifest file with checksums
- Log backup to `~/devstack-backups/vault/backup.log`
- Test: Run script and verify encrypted files created
- Test: Verify unencrypted originals not in backup directory
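A hedged sketch of `scripts/vault-backup.sh`; the directory layout follows the task list, while the exact gpg invocation and the use of `shasum` for the manifest are assumptions.

```bash
#!/usr/bin/env bash
# Sketch: encrypted, timestamped backup of the Vault key material.
set -eo pipefail
SRC=~/.config/vault
DEST=~/devstack-backups/vault/$(date +%Y%m%d_%H%M%S)
mkdir -p "$DEST"

encrypt() {  # symmetric AES256; passphrase from VAULT_BACKUP_PASSPHRASE or interactive prompt
  if [ -n "${VAULT_BACKUP_PASSPHRASE:-}" ]; then
    gpg --batch --yes --pinentry-mode loopback --passphrase "$VAULT_BACKUP_PASSPHRASE" \
        --symmetric --cipher-algo AES256 --output "$2" "$1"
  else
    gpg --symmetric --cipher-algo AES256 --output "$2" "$1"
  fi
}

encrypt "$SRC/keys.json"  "$DEST/keys.json.gpg"
encrypt "$SRC/root-token" "$DEST/root-token.gpg"
tar czf - -C "$SRC" ca    | encrypt /dev/stdin "$DEST/ca.tar.gz.gpg"
tar czf - -C "$SRC" certs | encrypt /dev/stdin "$DEST/certs.tar.gz.gpg"

( cd "$DEST" && shasum -a 256 *.gpg > MANIFEST.sha256 )   # checksum manifest
echo "$(date -u +%FT%TZ) backup written to $DEST" >> ~/devstack-backups/vault/backup.log
```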
-
Subtask 1.3.2: Create backup restoration script
- Create `scripts/vault-restore.sh`
- Implement GPG decryption
- Support passphrase from environment or prompt
- Create backup of existing files before restore: `~/.config/vault.bak-$(date +%s)`
- Verify checksums from manifest after decryption
- Restore all files to `~/.config/vault/`
- Set correct file permissions (600 for keys, 644 for certs)
- Log restoration to `~/devstack-backups/vault/restore.log`
- Test: Restore from backup and verify integrity
- Test: Verify checksums match original
-
Subtask 1.3.3: Implement backup rotation policy
- Create
scripts/vault-backup-rotate.sh - Keep last 7 daily backups
- Keep last 4 weekly backups (Sunday)
- Keep last 12 monthly backups (1st of month)
- Automatic cleanup of old backups based on policy
- Log rotation activity
- Calculate and log disk space savings
- Test: Create multiple backups and verify rotation works
- Test: Verify correct backups kept (7 daily, 4 weekly, 12 monthly)
- Create
-
Subtask 1.3.4: Add backup commands to management script
- Add `vault-backup` command to `manage_devstack.py`
- Add `vault-restore` command with timestamp parameter
- Add `vault-list-backups` command (shows available backups)
- Add `vault-verify-backup` command (checksum verification)
- Add `vault-rotate-backups` command
- Add `--passphrase` option for non-interactive mode
- Add `--no-encrypt` option for testing (not recommended)
- Test: Run `./devstack vault-backup`
- Test: Run `./devstack vault-list-backups`
- Test: Run `./devstack vault-verify-backup <timestamp>`
-
Subtask 1.3.5: Create automated backup schedule documentation
- Document cron job setup in `docs/DISASTER_RECOVERY.md`
- Add cron example: `0 2 * * * /path/to/vault-backup.sh`
- Add backup best practices section
- Document passphrase management (use password manager)
- Document off-site backup procedures (external drive, not GitHub)
- SECURITY: Document why GitHub Actions artifacts are insecure
- Document S3/external storage options (future improvement)
- Create backup verification checklist
- Test: Documentation review for security
- Test: Documentation completeness
-
Subtask 1.3.6: Create manual backup reminder (skip GitHub Actions)
- Add backup reminder to weekly routine (Monday morning)
- Document manual backup to external drive procedure
- Create
scripts/backup-to-external.shtemplate - Document encryption-at-rest requirements for external storage
- Skip GitHub Actions workflow (insecure for Vault keys)
- Note: Future improvement could use AWS S3 with SSE-KMS
- Test: Manual backup procedure tested
- Test: External drive backup tested (if available)
-
Subtask 1.3.7: Integration testing
- Create test backup: `./devstack vault-backup`
- Verify encryption: Cannot read .gpg files without passphrase
- List backups: `./devstack vault-list-backups`
- Verify backup: `./devstack vault-verify-backup <timestamp>`
- Restore to temporary location for testing
- Verify restored files match originals (checksum)
- Test rotation policy with multiple backups
- Test passphrase prompt (interactive mode)
- Test passphrase from environment (non-interactive)
- Test: Backup and restore cycle successful
- Test: Rotation policy works correctly
-
Subtask 1.3.8: Rollback testing
- Document rollback steps
- Simulate corrupted Vault keys
- Restore from backup
- Verify Vault functionality restored
- Document time required for recovery (~10 minutes)
- Test: Disaster recovery successful
Backup Location: ~/devstack-backups/vault/YYYYMMDD_HHMMSS/
Security Note:
- ❌ DO NOT upload Vault backups to GitHub Actions artifacts (insecure)
- ✅ DO use external encrypted drive or S3 with SSE-KMS
- ✅ DO store passphrase in secure password manager
Post-Task Validation:
- Smoke test: Create and restore backup successfully
- Security check: Verify encryption works (cannot read without passphrase)
- Rotation test: Verify old backups are cleaned up
- Documentation complete: DISASTER_RECOVERY.md updated
Priority: Critical 🔴 Status: Not Started Estimated Time: 2 hours (was 1h, +100% for all databases) Impact: High - Prevents password visibility in process listing
Current Issue:
- Password passed via command-line argument in mysqldump
- File: `scripts/manage_devstack.py:953`
- Visible in `ps aux` during the backup operation
- Same issue exists for PostgreSQL and MongoDB
Implementation Steps:
-
Subtask 1.4.1: Update MySQL backup function to use config file
- Modify `backup()` function in `manage_devstack.py` (around line 950)
- Create temporary `.my.cnf` file with credentials (see the sketch after this list)
- Use `--defaults-file=/path/to/.my.cnf` instead of the `-p` flag
- Ensure temp file cleanup with try/finally block
- Set file permissions to 600 before writing password
- Remove .my.cnf in finally block
- Test: Run `./devstack backup`
- Test: Monitor `ps aux` during backup (no password visible)
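A shell-level sketch of the `--defaults-file` approach the Python `backup()` change is aiming for; the temp-file handling mirrors what the try/finally block would do, and `MYSQL_ROOT_PASSWORD` is assumed to have been fetched from Vault already.

```bash
#!/usr/bin/env bash
# Sketch: run mysqldump without the password appearing on the command line.
set -eo pipefail
cnf=$(mktemp)
chmod 600 "$cnf"                       # restrict before the password is written
trap 'rm -f "$cnf"' EXIT               # same role as the try/finally cleanup
cat > "$cnf" <<EOF
[client]
user=root
password=${MYSQL_ROOT_PASSWORD}
host=127.0.0.1
EOF
# --defaults-file must be the first option passed to mysqldump.
mysqldump --defaults-file="$cnf" --all-databases > "backups/mysql-$(date +%Y%m%d).sql"
```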
-
Subtask 1.4.2: Apply same fix to MySQL restore function
- Modify
restore()function inmanage_devstack.py - Use config file method for mysql restore
- Ensure cleanup with try/finally
-
Test: Run
./devstack restore <timestamp> -
Test: Verify password not in
ps aux
- Modify
-
Subtask 1.4.3: Update PostgreSQL backup for consistency
- Modify PostgreSQL backup in `manage_devstack.py`
- Create temporary `.pgpass` file instead of the PGPASSWORD env var
- Format: `localhost:5432:*:postgres:password`
- Set permissions to 600
- Use PGPASSFILE environment variable (points to file)
- Remove .pgpass in finally block
- Test: PostgreSQL backup with no password exposure
- Test: Verify .pgpass file created with correct permissions
- Modify PostgreSQL backup in
-
Subtask 1.4.4: Update MongoDB backup for consistency
- Modify MongoDB backup in `manage_devstack.py`
- Create temporary MongoDB config file
- Use `--config /path/to/mongod.conf` flag
- Set permissions to 600
- Remove config file in finally block
- Test: MongoDB backup with no password exposure
- Modify MongoDB backup in
-
Subtask 1.4.5: Create process monitoring test
- Create `tests/test-backup-security.sh`
- Start backup in background
- Monitor `ps aux` every 0.1s during backup
- Grep for password patterns
- Fail if any passwords found
- Test all three databases (postgres, mysql, mongodb)
- Test: Test script passes for all databases
- Create
-
Subtask 1.4.6: Integration testing
- Run full backup cycle: `./devstack backup`
- Monitor `ps aux` in a separate terminal during backup
- Verify no credentials visible
- Verify backup files created successfully for all databases
- Run restore: `./devstack restore <timestamp>`
- Monitor `ps aux` during restore
- Verify restore successful (can connect to databases)
- Test: Backup/restore works, no credential exposure
-
Subtask 1.4.7: Documentation
- Update `docs/DISASTER_RECOVERY.md` with security improvements
- Document temporary file approach
- Add troubleshooting for permission issues
- Test: Documentation review
- Update
Validation:
# In separate terminal during backup
watch -n 0.1 'ps aux | grep -iE "password|passwd|-p[[:space:]]*[^[:space:]]|MYSQL|POSTGRES|MONGO" | grep -v grep'
# Should not show any credentials
Post-Task Validation:
- Security test: Run `tests/test-backup-security.sh` (passes)
- Functional test: Backup and restore work correctly
- No regressions: Backup files same size as before
Priority: Critical 🔴 Status: Not Started Estimated Time: 2 hours (was 1h, +100% for comprehensive docs) Impact: Medium - Security transparency and awareness
Current Issue:
- Vault uses the `IPC_LOCK` capability without explanation
- cAdvisor uses `SYS_ADMIN` and `SYS_PTRACE` without justification
- Files: `docker-compose.yml:752-753`, `docker-compose.yml:1492-1494`
- No audit trail of privileged capabilities
Implementation Steps:
-
Subtask 1.5.1: Add inline documentation to docker-compose.yml
- Find Vault service definition (around line 752)
- Document Vault IPC_LOCK capability:
    cap_add:
      # IPC_LOCK: Required for Vault's mlock() to prevent secrets from being swapped to disk
      # Security trade-off: Acceptable in dev, use encrypted swap in production
      # Alternative: Run Vault with 'disable_mlock = true' (not recommended)
      - IPC_LOCK
- Find cAdvisor service definition (around line 1492)
- Document cAdvisor SYS_ADMIN capability:
    cap_add:
      # SYS_ADMIN: Required for cAdvisor to access container metrics via cgroups
      # Security trade-off: SYS_ADMIN is nearly equivalent to --privileged
      # Production alternative: Use Prometheus node-exporter with restricted permissions
      - SYS_ADMIN
      # SYS_PTRACE: Required for cAdvisor to inspect process information
      # Security trade-off: Allows debugging and inspection of other containers
      - SYS_PTRACE
- Add security notes section at top of docker-compose.yml
- Test: Review comments for clarity and accuracy
-
Test:
`docker compose config` still validates
-
Subtask 1.5.2: Update SECURITY_ASSESSMENT.md
- Add "Privileged Container Capabilities" section
- Document IPC_LOCK: what it is, why needed, risks, mitigations
- Document SYS_ADMIN: what it is, why needed, risks, mitigations
- Document SYS_PTRACE: what it is, why needed, risks, mitigations
- Add capability risk matrix (Low/Medium/High for each)
- Document production alternatives for each capability
- Add "Acceptable Use" policy for dev vs. production
- Add reference links to Linux capability documentation
- Test: Documentation review for technical accuracy
- Test: Security section is comprehensive
-
Subtask 1.5.3: Create capability audit script
- Create `scripts/audit-capabilities.sh` (see the sketch after this list)
- List all running containers
- For each container, check for privileged mode
- For each container, list added capabilities (cap_add)
- For each container, list dropped capabilities (cap_drop)
- Generate security report with risk assessment
- Output format: table with container name, capabilities, risk level
- Exit code 0 if only known capabilities, 1 if unexpected
- Add to test suite
- Test: Run script and verify output
- Test: Script detects if new capability added
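A sketch of what `scripts/audit-capabilities.sh` could look like; the known-capability whitelist and the plain-table output format are assumptions.

```bash
#!/usr/bin/env bash
# Sketch: report added capabilities per container and flag anything unexpected.
set -eo pipefail
known="IPC_LOCK SYS_ADMIN SYS_PTRACE"
status=0
printf '%-25s %-12s %s\n' "CONTAINER" "PRIVILEGED" "CAP_ADD"
for c in $(docker ps --format '{{.Names}}'); do
  priv=$(docker inspect -f '{{.HostConfig.Privileged}}' "$c")
  caps=$(docker inspect -f '{{range .HostConfig.CapAdd}}{{.}} {{end}}' "$c")
  printf '%-25s %-12s %s\n' "$c" "$priv" "${caps:-none}"
  for cap in $caps; do
    case " $known " in
      *" $cap "*) ;;                                        # expected capability
      *) echo "UNEXPECTED capability $cap on $c"; status=1 ;;
    esac
  done
  if [ "$priv" = "true" ]; then
    echo "UNEXPECTED privileged container: $c"; status=1
  fi
done
exit $status    # 0 = only known capabilities, 1 = something unexpected
```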
-
Subtask 1.5.4: Update container security best practices
- Update `docs/BEST_PRACTICES.md` with a "Container Security" section
- Document principle of least privilege
- Document when capabilities are acceptable (dev vs. prod)
- Document capability alternatives (e.g., node-exporter vs. cAdvisor)
- Add capability approval process (document why before adding)
- Add links to Docker security documentation
- Add capability quick reference (what each does)
- Test: Documentation review
- Test: Best practices are actionable
- Update
-
Subtask 1.5.5: Add capability check to CI/CD
- Update `.github/workflows/security.yml` (if it exists)
- Add capability audit step
- Fail CI if unexpected capabilities detected
- Whitelist known capabilities: IPC_LOCK, SYS_ADMIN, SYS_PTRACE
- Require PR comment explaining any new capabilities
- Test: CI workflow validates (dry run)
- Test: Whitelist works correctly
- Update
-
Subtask 1.5.6: Integration testing
- Run audit script: `./scripts/audit-capabilities.sh`
- Verify report shows Vault (IPC_LOCK) and cAdvisor (SYS_ADMIN, SYS_PTRACE)
- Verify no unexpected capabilities
- Verify risk levels are reasonable
- Review SECURITY_ASSESSMENT.md for completeness
- Test: Audit script passes
- Test: No security regressions
Documentation Example:
# docker-compose.yml - Security Notes
# This file uses minimal privileged capabilities for development.
# For production deployment:
# - Consider alternatives that don't require elevated privileges
# - Use encrypted swap instead of IPC_LOCK for Vault
# - Use Prometheus node-exporter instead of cAdvisor
# - Document and approve all capabilities in security review
vault:
  cap_add:
    # IPC_LOCK: Prevents memory paging to disk (keeps secrets in RAM)
    # Why needed: Vault mlock() calls require this capability
    # Risk: Low - only affects Vault process memory
    # Production: Use encrypted swap and disable_mlock = true
    - IPC_LOCK
Post-Task Validation:
- Audit script passes: `./scripts/audit-capabilities.sh`
- Documentation complete: All 3 capabilities documented
- No new capabilities added unknowingly
- CI check added (or documented for future)
- All 5 tasks completed (1.1 through 1.5)
- All subtasks tested and passing
- Full environment starts successfully: `./devstack start`
- Full test suite passes: `./tests/run-all-tests.sh` (0 regressions)
- Smoke test passes: `./devstack health` (all services healthy)
- No credential exposure in `docker inspect` or `ps aux`
- Vault backups automated with encryption and rotation
- AppRole authentication working for all services (with root token fallback)
- Security audit script passes: `./scripts/audit-capabilities.sh`
- Documentation updated and accurate (4+ docs updated)
- No breaking changes to existing functionality
- Performance check: No >5% degradation from baseline
- Rollback tested for all critical changes
Post-Phase Integration Test:
- Run `./devstack stop && ./devstack start`
- Verify all services healthy
- Run `./tests/run-all-tests.sh` (all tests pass)
- Compare test results to Phase 0 baseline (no regressions)
- Run performance benchmark and compare to baseline
- Run security validation: `./scripts/validate-no-credential-exposure.sh`
- Verify backup works: `./devstack vault-backup && ./devstack vault-verify-backup <latest>`
- Git checkpoint: `git commit -m "Phase 1 complete: Critical Security improvements"`
- Git tag: `git tag phase-1-complete && git push --tags`
Phase 1 Sign-off: _________________ Date: _________
Total Time: 27 hours (with 40% buffer: ~38 hours)
Note: Tasks reordered from original plan. Task 2.2 (Alertmanager) must complete before Task 2.1 (Alert Rules).
Priority: High 🟠 Status: Not Started Estimated Time: 7 hours (was 5h, +40% for integration) Impact: High - Notification delivery system
Moved to first position - Alert rules (Task 2.1) depend on Alertmanager being functional
Implementation Steps:
-
Subtask 2.2.1: Add Alertmanager service to docker-compose.yml
- Define alertmanager service (use prom/alertmanager:latest)
- Configure static IP: 172.20.0.115
- Verify IP not in use: `docker network inspect dev-services | grep 172.20.0.115`
- Set resource limits (CPU: 0.5, Memory: 256M)
- Set resource reservations (CPU: 0.1, Memory: 64M)
- Configure volumes: `./configs/alertmanager:/etc/alertmanager`
- Configure ports: `9093:9093`
- Add health check: `http://localhost:9093/-/healthy`
- Add to full profile (not minimal/standard)
- Add logging configuration
-
Test:
docker compose configvalidates - Test: No port conflicts
-
Subtask 2.2.2: Update .env.example
- Add `# Alertmanager Configuration` section
- Add `ALERTMANAGER_IP=172.20.0.115`
- Add `ALERTMANAGER_PORT=9093`
- Add email configuration example (commented out)
- Add Slack webhook example (commented out)
- Document how to enable notifications
- Test: Configuration examples are clear
- Add
-
Subtask 2.2.3: Create Alertmanager configuration
- Create `configs/alertmanager/` directory
- Create `configs/alertmanager/alertmanager.yml`
- Configure global settings (resolve_timeout: 5m)
- Configure route tree (group by alertname, severity)
- Configure webhook receiver (for testing)
- Add email receiver template (commented out, requires SMTP)
- Add Slack receiver template (commented out, requires webhook)
- Set up routing rules (critical → email, warning → slack)
- Configure grouping (wait: 30s, interval: 5m, repeat: 12h)
-
Test: Validate config with `docker run --rm -v $(pwd)/configs/alertmanager:/etc/alertmanager --entrypoint amtool prom/alertmanager:latest check-config /etc/alertmanager/alertmanager.yml` (the image's default entrypoint is alertmanager, so amtool must be set as the entrypoint)
- Create
-
Subtask 2.2.4: Create Alertmanager templates
- Create
configs/alertmanager/templates/directory - Create
configs/alertmanager/templates/email.tmpl - Create email subject template with severity and alertname
- Create email body template with labels, annotations, timestamp
- Create
configs/alertmanager/templates/slack.tmpl - Create Slack message template with colored severity indicator
- Add template examples to alertmanager.yml
- Test: Templates are valid (no syntax errors)
- Create
-
Subtask 2.2.5: Configure Prometheus to use Alertmanager
- Update `configs/prometheus/prometheus.yml`
- Add alerting section: `alertmanagers: [{static_configs: [{targets: ['alertmanager:9093']}]}]`
- Configure alert relabeling if needed
- Set evaluation interval (same as scrape: 15s)
- Restart Prometheus service
- Test: Prometheus connects to Alertmanager
- Test: Check Prometheus targets page shows Alertmanager
- Update
-
Subtask 2.2.6: Add management commands
- Add `alertmanager-status` to manage_devstack.py
- Add `alertmanager-silence` command (create silence for alert)
- Add `alertmanager-test` command (send test alert)
- Add `alertmanager-alerts` command (list active alerts)
- Update help text for new commands
- Test: Run `./devstack alertmanager-status`
- Test: Run `./devstack alertmanager-test`
- Add
-
Subtask 2.2.7: Integration testing
- Start Alertmanager service: `docker compose up -d alertmanager`
- Check service health: `curl http://localhost:9093/-/healthy`
- Trigger test alert: `./devstack alertmanager-test`
- Verify webhook receives notification (check webhook logs)
- Test alert silencing: `./devstack alertmanager-silence test-alert`
- Verify silence created in Alertmanager UI
- Test alert grouping (send multiple alerts)
- Test: End-to-end alerting works
- Test: All management commands work
- Start Alertmanager service:
-
Subtask 2.2.8: Documentation
- Update
docs/OBSERVABILITY.mdwith Alertmanager section - Document Alertmanager configuration
- Document how to set up email notifications
- Document how to set up Slack notifications
- Create alert management guide (silence, inhibit, route)
- Add troubleshooting section
- Test: Documentation review
- Update
-
Subtask 2.2.9: Rollback testing
- Document rollback steps (remove from docker-compose.yml)
- Test stopping Alertmanager
- Verify Prometheus still works without Alertmanager
- Restart Alertmanager
- Test: Rollback works, no data loss
Post-Task Validation:
- Smoke test:
./devstack health(alertmanager healthy) - Functional test: Can send and receive test alert
- UI test: Alertmanager UI accessible at http://localhost:9093
Priority: High 🟠 Status: Not Started Estimated Time: 6 hours (was 4h, +50% for testing) Impact: High - Proactive issue detection Depends On: Task 2.2 (Alertmanager must be running)
Moved to second position - requires Alertmanager for full testing
Implementation Steps:
-
Subtask 2.1.1: Create alert rule directory structure
- Create `configs/prometheus/alerts/` directory
- Create `critical.yml` for critical alerts
- Create `warning.yml` for warning alerts
- Create `info.yml` for informational alerts
- Add README explaining alert severity levels
- Test: Directory structure exists
- Create
-
Subtask 2.1.2: Implement critical alerts
- ServiceDown alert (up == 0 for 2+ minutes)
- HighMemoryUsage alert (>90% for 5+ minutes)
- DiskSpaceCritical alert (<10% free on any mount)
- DatabaseConnectionPoolExhaustion alert (pg_stat_activity count > max_connections * 0.9)
- CertificateExpiration alert (x509_cert_expiry < 30 days)
- VaultSealed alert (vault_core_unsealed == 0)
- Add annotations with description and runbook_url
- Add labels: severity=critical, team=infra
-
Test: Validate alert syntax with `promtool check rules configs/prometheus/alerts/critical.yml` (a sample rule sketch follows this list)
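A hedged sketch of one rule in `critical.yml`, written as the shell step that generates and validates it; the exact PromQL and the runbook path are assumptions to adapt to the real scrape config.

```bash
#!/usr/bin/env bash
# Sketch: generate a minimal ServiceDown rule and validate it with promtool.
set -eo pipefail
mkdir -p configs/prometheus/alerts
cat > configs/prometheus/alerts/critical.yml <<'EOF'
groups:
  - name: critical
    rules:
      - alert: ServiceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
          team: infra
        annotations:
          description: "{{ $labels.instance }} has been down for more than 2 minutes."
          runbook_url: "docs/runbooks/alert-response.md"
EOF

# promtool ships inside the prom/prometheus image, so no local install is needed.
docker run --rm --entrypoint promtool \
  -v "$(pwd)/configs/prometheus:/cfg" prom/prometheus:latest \
  check rules /cfg/alerts/critical.yml
```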
-
Subtask 2.1.3: Implement warning alerts
- HighCPUUsage alert (>80% for 10+ minutes)
- HighMemoryUsage alert (>80% for 10+ minutes)
- SlowResponseTime alert (p95 > 1s for 5+ minutes)
- HighErrorRate alert (>1% errors for 5+ minutes)
- RedisClusterNodeDown alert (redis_up{instance=~"redis-[123]"} == 0)
- DiskSpaceWarning alert (<20% free)
- Add annotations and labels
- Test: Validate alert syntax
-
Subtask 2.1.4: Implement informational alerts
- ServiceRestarted alert (time() - process_start_time_seconds < 300)
- BackupCompleted alert (custom metric from backup script)
- ConfigurationChanged alert (config file checksum changed)
- Add annotations and labels: severity=info
- Test: Validate alert syntax
-
Subtask 2.1.5: Update prometheus.yml to load alert rules
- Add rule_files section to `configs/prometheus/prometheus.yml`
- Add: `rule_files: ['/etc/prometheus/alerts/*.yml']`
- Configure alert evaluation interval (default: 15s)
- Reload Prometheus: `docker compose exec prometheus kill -HUP 1`
- Or restart: `docker compose restart prometheus`
- Test: Check Prometheus UI for loaded alert rules (http://localhost:9090/rules)
- Test: Check for configuration errors in Prometheus logs
-
Subtask 2.1.6: Create alert testing script
- Create
tests/test-alerts.sh - Test ServiceDown: Stop a service, wait, check alert fires
- Test HighMemoryUsage: Trigger via stress container (if safe)
- Test alert fires in Prometheus UI
- Test alert appears in Alertmanager
- Test alert resolves after condition clears
- Add to main test suite
- Test: All alert rules can be triggered
- Test: All alerts resolve correctly
- Create
-
Subtask 2.1.7: Documentation
- Update `docs/OBSERVABILITY.md` with alert rules documentation
- Document each alert: what it detects, why it matters, how to respond
- Document alert thresholds and rationale (why 90% not 95%?)
- Create alert response runbook: `docs/runbooks/alert-response.md`
- Add links to runbooks in alert annotations
- Document how to add new alert rules
- Test: Documentation review for completeness
- Update
-
Subtask 2.1.8: Rollback testing
- Document rollback steps (remove rule_files from prometheus.yml)
- Test removing alert rules
- Verify Prometheus still works
- Re-enable alert rules
- Test: Rollback works
Post-Task Validation:
- All alert rules loaded: Check Prometheus UI /rules
- Test alert fires: Run
tests/test-alerts.sh - Alerts reach Alertmanager: Check Alertmanager UI
- Documentation complete: Runbook exists for each critical alert
Priority: High 🟠 Status: Not Started Estimated Time: 8 hours (was 6h, +33% for comprehensive testing) Impact: High - Disaster recovery confidence Depends On: Task 2.2 (Alertmanager for failure notifications)
Implementation Steps:
-
Subtask 2.3.1: Create backup verification script
- Create `scripts/verify-backup.sh`
- Accept backup timestamp as parameter
- Extract backup to temporary location: `/tmp/backup-verify-$$`
- Start temporary PostgreSQL container from backup
- Start temporary MySQL container from backup
- Start temporary MongoDB container from backup
- Run integrity checks (connect, query, schema validation)
- Clean up temporary resources (containers, volumes)
- Return exit code 0 for success, 1 for failure
- Log results to `logs/backup-verification/verify-$(date +%Y%m%d-%H%M%S).log`
- Test: Run script with known good backup
- Test: Run script with corrupted backup (should fail)
- Create
-
Subtask 2.3.2: Implement database integrity checks
- PostgreSQL: Run `pg_dump --schema-only` and compare to original (see the sketch after this list)
- PostgreSQL: Check `SELECT count(*) FROM pg_tables` matches expected
- MySQL: Run `mysqlcheck --all-databases --check-upgrade`
- MySQL: Verify table count matches expected
- MongoDB: Run `mongod --dbpath /tmp/mongo-verify --repair`
- MongoDB: Verify collection count matches expected
- Verify row/document counts for key tables/collections
- Generate verification report with pass/fail for each check
- Test: Integrity checks pass on valid backup
- Test: Integrity checks fail on corrupted backup
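A sketch of the PostgreSQL leg of `scripts/verify-backup.sh`: restore the dump into a throwaway container and check the user-table count. The backup file layout, the `postgres:16` image tag, and the expected-count argument are assumptions.

```bash
#!/usr/bin/env bash
# Sketch: restore a PostgreSQL dump into a temporary container and sanity-check it.
set -eo pipefail
BACKUP_SQL="$1"        # e.g. backups/<timestamp>/postgres.sql
EXPECTED_TABLES="$2"   # user-table count recorded when the backup was taken
NAME="backup-verify-$$"

docker run -d --rm --name "$NAME" -e POSTGRES_PASSWORD=verify postgres:16 >/dev/null
trap 'docker stop "$NAME" >/dev/null 2>&1 || true' EXIT
until docker exec "$NAME" pg_isready -U postgres >/dev/null 2>&1; do sleep 1; done

docker exec -i "$NAME" psql -q -U postgres -f - < "$BACKUP_SQL"
actual=$(docker exec "$NAME" psql -U postgres -tAc \
  "SELECT count(*) FROM pg_tables WHERE schemaname NOT IN ('pg_catalog','information_schema')")
echo "restored user tables: $actual (expected: $EXPECTED_TABLES)"
[ "$actual" -eq "$EXPECTED_TABLES" ] || { echo "FAIL: table count mismatch"; exit 1; }
echo "PASS: PostgreSQL backup restored and queryable"
```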
-
Subtask 2.3.3: Add verification to management script
- Add
verify-backupcommand to manage_devstack.py - Support verification of specific backup timestamp
- Support verification of latest backup (default)
- Generate verification report (JSON + human-readable)
- Save report to
logs/backup-verification/ - Display summary in terminal (pass/fail + details)
-
Test: Run
./devstack verify-backup -
Test: Run
./devstack verify-backup <timestamp>
- Add
-
Subtask 2.3.4: Create automated verification logging
- Create
logs/backup-verification/directory - Log verification results with timestamp
- Log passed checks vs. failed checks
- Create verification history tracking (CSV or JSON)
- Generate monthly verification report (summary stats)
- Rotate old logs (keep 90 days)
- Test: Multiple verifications create proper logs
- Test: Monthly report generated correctly
- Create
-
Subtask 2.3.5: Create weekly verification schedule documentation
- Update
docs/DISASTER_RECOVERY.mdwith verification procedures - Document weekly verification schedule (Sunday 3 AM recommended)
- Add cron example:
0 3 * * 0 /path/to/devstack verify-backup - Document what to do if verification fails
- Add verification checklist
- Document manual verification procedure
- Test: Documentation review
- Update
-
Subtask 2.3.6: Integration testing
- Create test backup:
./devstack backup - Run verification:
./devstack verify-backup <timestamp> - Verify success report generated
- Corrupt backup (modify a backup file)
- Run verification on corrupted backup
- Verify detection and failure report
- Verify original backup still intact
- Test verification with multiple backup timestamps
- Test: Verification detects good and bad backups
- Test: No false positives or false negatives
- Create test backup:
-
Subtask 2.3.7: Documentation
- Update
docs/DISASTER_RECOVERY.mdwith verification section - Document verification process step-by-step
- Add troubleshooting guide for verification failures
- Document interpretation of verification reports
- Add flowchart: when to restore, when to create new backup
- Test: Documentation review
- Update
-
Subtask 2.3.8: Rollback testing
- Verification is non-destructive, no rollback needed
- Document how to disable verification (don't add to cron)
- Test: N/A
Post-Task Validation:
- Verification script works:
./devstack verify-backup - Can detect corrupted backup
- Reports are clear and actionable
- Documentation complete
Priority: High 🟠 Status: Not Started Estimated Time: 6 hours (was 4h, +50% for all databases) Impact: High - Safe schema evolution
Implementation Steps:
-
Subtask 2.4.1: Create migration directory structure
- Create `configs/postgres/migrations/` directory
- Create `configs/mysql/migrations/` directory
- Create `configs/mongodb/migrations/` directory
- Add README in each with migration instructions
- Add .gitkeep to preserve empty directories
- Test: Directory structure exists
- Create
-
Subtask 2.4.2: Implement Flyway for PostgreSQL
- Add Flyway service to docker-compose.yml (one-shot container)
- Configure Flyway connection to PostgreSQL
- Set migrations location: `/flyway/sql`
- Mount `./configs/postgres/migrations:/flyway/sql`
- Create `configs/postgres/migrations/V1__baseline.sql` (empty or CREATE TABLE example)
- Test Flyway execution: `docker compose run --rm flyway migrate` (see the standalone sketch after this list)
- Test: Flyway runs and creates the flyway_schema_history table
- Test: V1 migration applied successfully
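A standalone sketch of the same Flyway run without the compose service, useful as a first smoke test. The `flyway/flyway` image, the `dev-services` network, the `dev-postgres` hostname, and the assumption that `vault-show-password` prints only the password are all assumptions to adjust.

```bash
#!/usr/bin/env bash
# Sketch: run Flyway against the dev PostgreSQL instance using the official image.
set -eo pipefail
PGPASS=$(./devstack vault-show-password postgres)   # assumed to print only the password

docker run --rm --network dev-services \
  -v "$(pwd)/configs/postgres/migrations:/flyway/sql" \
  flyway/flyway:latest \
  -url=jdbc:postgresql://dev-postgres:5432/postgres \
  -user=postgres -password="$PGPASS" \
  migrate
```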
-
Subtask 2.4.3: Implement Liquibase for MySQL
- Add Liquibase service to docker-compose.yml
- Configure Liquibase connection to MySQL
- Create
configs/mysql/migrations/changelog.xml - Create baseline changeset (example: CREATE TABLE demo)
- Test Liquibase execution:
docker compose run --rm liquibase update - Test: Liquibase runs and creates DATABASECHANGELOG table
- Test: Baseline changeset applied
-
Subtask 2.4.4: Implement migrate-mongo for MongoDB
- Add migrate-mongo configuration:
configs/mongodb/migrations/migrate-mongo-config.js - Create baseline migration:
configs/mongodb/migrations/01-baseline.js - Configure MongoDB connection (use Vault credentials)
- Test migration: Create script to run migrate-mongo in container
- Test: Migration runs and creates migrations collection
- Test: Baseline migration applied
- Add migrate-mongo configuration:
-
Subtask 2.4.5: Add migration commands to management script
- Add `db-migrate` command to manage_devstack.py
- Support `--database` flag (postgres, mysql, mongodb, all)
- Add `db-migrate-status` command (shows applied migrations)
- Add `db-migrate-validate` command (checks pending migrations)
- Support service-specific migrations
- Add `db-migrate-rollback` command (if supported by tool)
- Add error handling and validation
- Test: Run `./devstack db-migrate --database postgres`
- Test: Run `./devstack db-migrate-status`
- Add
-
Subtask 2.4.6: Create example migrations
- PostgreSQL: Create
V2__create_demo_table.sql - MySQL: Create
002-create-demo-table.xml - MongoDB: Create
02-create-demo-collection.js - Test forward migration for all databases
- Test rollback for PostgreSQL (Flyway supports undo)
- Document rollback limitations (Liquibase/MongoDB may not support)
- Test: Migrations execute successfully
- Test: Tables/collections created as expected
- PostgreSQL: Create
-
Subtask 2.4.7: Create migration testing script
- Create
tests/test-migrations.sh - Test migration on empty database
- Test migration idempotency (run twice, should succeed)
- Test migration status reporting
- Test rollback (PostgreSQL only)
- Add to main test suite:
./tests/run-all-tests.sh - Test: Migration tests pass
- Create
-
Subtask 2.4.8: Integration testing
- Run migrations on clean database:
./devstack db-migrate - Verify schema created correctly (connect and inspect)
- Check migration history tables (flyway_schema_history, etc.)
- Test migration status:
./devstack db-migrate-status - Test rollback for PostgreSQL
- Verify services still work after migration
- Test: Complete migration lifecycle works
- Run migrations on clean database:
-
Subtask 2.4.9: Documentation
- Create
docs/DATABASE_MIGRATIONS.md(new file) - Document migration creation process for each database
- Document migration naming conventions
- Document best practices (always forward, never edit old migrations)
- Document rollback procedures and limitations
- Add troubleshooting guide
- Add examples for common scenarios
- Test: Documentation review
- Create
-
Subtask 2.4.10: Rollback testing
- Document rollback steps (manual SQL if migration failed)
- Test PostgreSQL rollback
- Document MySQL/MongoDB manual rollback (no native support)
- Test: Rollback procedures documented
Post-Task Validation:
- All migration frameworks working
- Can create and apply migrations
- Migration history tracked correctly
- Test script passes:
tests/test-migrations.sh
Priority: High 🟠 Status: Not Started Estimated Time: 12 hours (was 4h, +200% for Docker-in-Docker complexity) Impact: Medium - Catch integration issues before merge
Decision Required: Choose implementation strategy (see below)
Implementation Steps:
Option A: Comprehensive Integration Tests (12 hours)
-
Subtask 2.5.1: Research CI/CD Docker options
- Evaluate Docker-in-Docker (DinD) approach
- Evaluate GitHub Actions service containers
- Evaluate testcontainers library
- Test Colima/Docker Desktop alternatives for CI
- Select best approach for project
- Document decision with pros/cons
- Test: Decision documented in ADR (Architecture Decision Record)
-
Subtask 2.5.2: Create integration test workflow
- Create
.github/workflows/integration-tests.yml - Set up Docker environment (DinD service or Docker socket mount)
- Install dependencies (docker-compose, python, uv)
- Configure test environment variables
- Set resource limits for CI (avoid timeout)
-
Test: Workflow syntax valid:
gh workflow view integration-tests.yml
- Create
-
Subtask 2.5.3: Implement test execution in CI
- Start services with minimal profile:
./devstack start --profile minimal - Wait for services to be healthy (poll health endpoints)
- Run subset of tests (fastest, most critical)
- Run
./tests/test-vault.sh - Collect test results and logs
- Generate test report
- Stop services and cleanup
- Test: Workflow runs manually (on: workflow_dispatch)
- Start services with minimal profile:
-
Subtask 2.5.4: Add test result reporting
- Upload test results as GitHub artifacts
- Add test summary to job summary (GitHub Actions feature)
- Set workflow status based on test results
- Add comment to PR with test results (optional)
- Test: Test results visible in GitHub UI
-
Subtask 2.5.5: Optimize test execution time
- Implement Docker layer caching
- Parallelize independent tests (if possible)
- Use minimal profile for faster startup
- Skip long-running tests in CI (run manually)
- Target: Complete in <15 minutes
- Test: Workflow completes in acceptable time
-
Subtask 2.5.6: Integration testing
- Trigger workflow manually:
gh workflow run integration-tests.yml - Verify all tests run
- Verify test results reported
- Test failure handling (introduce failing test)
- Verify workflow fails on test failure
- Test: CI/CD integration tests pass
- Trigger workflow manually:
-
Subtask 2.5.7: Documentation
- Update
docs/TESTING_APPROACH.md - Document CI/CD test execution
- Document which tests run in CI vs. manual
- Add troubleshooting guide for CI failures
- Test: Documentation review
- Update
Option B: Lightweight Testing (4 hours - Recommended)
If Docker-in-Docker proves too complex, use this simpler approach:
-
Subtask 2.5.1: Create lightweight validation workflow
- Create `.github/workflows/validation.yml` (see the local sketch after this list)
- Run shellcheck on all bash scripts
- Run python linting (ruff, mypy)
- Validate docker-compose.yml syntax
- Validate all YAML files
- Run unit tests (no Docker containers required)
- Test: Workflow runs and passes
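The same checks the workflow would run, written as a local script so they can be tried before wiring up GitHub Actions. Tool choices follow the list above; the `uv run` prefix and the use of PyYAML for YAML validation are assumptions.

```bash
#!/usr/bin/env bash
# Sketch: lightweight validation that needs no running containers.
set -eo pipefail
# Shell scripts
find scripts configs tests -name '*.sh' -print0 | xargs -0 shellcheck
# Python lint and type-check (assumes ruff and mypy are dev dependencies)
uv run ruff check scripts/
uv run mypy scripts/manage_devstack.py
# Compose file parses cleanly
docker compose config -q
# YAML syntax check for selected configs (assumes PyYAML is available)
uv run python -c "import sys, yaml; [yaml.safe_load(open(f)) for f in sys.argv[1:]]" \
  configs/prometheus/prometheus.yml configs/alertmanager/alertmanager.yml
```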
-
Subtask 2.5.2: Document manual integration testing
- Create
docs/MANUAL_INTEGRATION_TESTING.md - Document pre-merge checklist
- Require manual test run before merge
- Document in CONTRIBUTING.md
- Test: Documentation clear and actionable
- Create
-
Subtask 2.5.3: Defer full integration testing to Phase 5
- Add to Phase 5 backlog (future work)
- Document as known limitation
- Revisit when more time available
- Test: Decision documented
Recommended Choice: Option B (Lightweight Testing) Rationale: Docker-in-Docker for Colima + Vault is complex and may not be feasible in GitHub Actions. Focus on what's achievable now, defer complex integration to future phase.
Post-Task Validation:
- CI workflow exists and runs
- Tests provide value (catch real issues)
- Documentation complete
- All 5 tasks completed (2.1 through 2.5)
- All subtasks tested and passing
- Prometheus alerts configured and firing correctly
- Alertmanager receiving and routing notifications (test alert successful)
- Backup verification script works and detects issues
- Database migrations working for all databases (postgres, mysql, mongodb)
- CI/CD tests running (even if lightweight approach)
- Documentation updated and accurate
- No breaking changes to existing functionality
- Smoke test passes: `./devstack health`
- Full test suite passes: `./tests/run-all-tests.sh`
Post-Phase Integration Test:
- Run `./devstack stop && ./devstack start --profile full`
- Verify all services healthy (including Alertmanager)
- Trigger test alert: `./devstack alertmanager-test`
- Verify alert received
- Run backup and verification: `./devstack backup && ./devstack verify-backup`
- Run database migrations: `./devstack db-migrate`
- Run full test suite: `./tests/run-all-tests.sh`
- Compare to Phase 0 baseline (no regressions)
- Git checkpoint: `git commit -m "Phase 2 complete: Operational Excellence"`
- Git tag: `git tag phase-2-complete && git push --tags`
Phase 2 Sign-off: _________________ Date: _________
Total Time: 32 hours (with 40% buffer: ~45 hours)
Priority: Medium 🟡 Status: Not Started Estimated Time: 18 hours (was 12h, +50% for complexity) Impact: Medium - Consistent error responses across all APIs
[Content continues with all Phase 3 and Phase 4 tasks with similar level of detail...]
- What went well:
- What didn't go well:
- What would we do differently:
- Time estimate accuracy: Estimated: 3h, Actual: ___h, Variance: ___%
- Unexpected challenges:
- Key learnings:
- What went well:
- What didn't go well:
- What would we do differently:
- Time estimate accuracy: Estimated: 39h, Actual: ___h, Variance: ___%
- Most challenging task:
- Most valuable improvement:
- What went well:
- What didn't go well:
- What would we do differently:
- Time estimate accuracy: Estimated: 38h, Actual: ___h, Variance: ___%
- What went well:
- What didn't go well:
- What would we do differently:
- Time estimate accuracy: Estimated: 45h, Actual: ___h, Variance: ___%
- What went well:
- What didn't go well:
- What would we do differently:
- Time estimate accuracy: Estimated: 15h, Actual: ___h, Variance: ___%
- Document any changes to tasks or scope here
- Note reasons for task modifications
- Track scope creep or descoping
Example:
- 2025-11-14: Changed Task 2.5 from comprehensive to lightweight testing due to Docker-in-Docker complexity
- Document blocking issues and resolutions
- Track technical debt created
- Note areas requiring future attention
- Full integration tests in CI/CD (Task 2.5 Option A)
- Complete Rust implementation (Task 3.5 Option B)
- S3 integration for Vault backups (Task 1.3 enhancement)
- Advanced alerting with PagerDuty integration
- Phase 0: Preparation - COMPLETE
- Phase 1: Critical Security - COMPLETE
- Phase 2: Operational Excellence - COMPLETE
- Phase 3: Code Quality - COMPLETE
- Phase 4: Developer Experience - COMPLETE
- All 21 tasks completed (including Phase 0)
- All tests passing
- All documentation updated
-
Changelog updated with all improvements
- Final integration testing complete
- Compare final metrics to Phase 0 baseline
- All success metrics achieved
- Risk register reviewed (all risks mitigated or accepted)
- Lessons learned documented
- Project ready for production adaptation
Final Metrics vs. Baseline:
- Test coverage: Baseline: ___%, Final: ___%, Improvement: ___% ✅
- Security score: Baseline: ___, Final: ___, Improvement: ___% ✅
- Performance p95: Baseline: ___ms, Final: ___ms, Degradation: ___% ✅ (target: <5%)
- Total test count: Baseline: ___, Final: ___, New tests: ___ ✅
Project Sign-off: _________________ Date: _________
Last Updated: 2025-11-14 Document Version: 2.0 Changes from v1.0: Added Phase 0, fixed dependencies, improved time estimates, added rollback testing, enhanced security, added comprehensive validation