# Architecture Deep Dive

Complete architectural documentation for the DevStack Core infrastructure project.

---

## Table of Contents

1. [Overview](#overview)
2. [Architecture Philosophy](#architecture-philosophy)
3. [System Components](#system-components)
4. [Network Architecture](#network-architecture)
5. [Security Architecture](#security-architecture)
6. [Data Flow](#data-flow)
7. [Observability Architecture](#observability-architecture)
8. [Service Dependencies](#service-dependencies)
9. [Deployment Architecture](#deployment-architecture)
10. [Scaling Considerations](#scaling-considerations)

---

## Overview

DevStack Core is a **container-native, infrastructure-as-code** project providing a complete local development environment optimized for Apple Silicon Macs.

### Core Characteristics

- **Container Runtime:** Colima (Lima + containerd/Docker)
- **Orchestration:** Docker Compose
- **Service Count:** 24 containerized services (see the [Service Catalog](#service-catalog))
- **Network Model:** Bridge networks with static IP assignments
- **Security Model:** Vault-managed credentials with optional TLS
- **Target Environment:** Local development (NOT production)

### Design Goals

1. **Completeness** - All services needed for modern development
2. **Security** - Vault-managed secrets, TLS support
3. **Observability** - Full metrics, logs, and visualization stack
4. **Educational** - Multiple reference implementations demonstrating patterns
5. **Reproducibility** - Infrastructure as code, Docker Compose

---

## Architecture Philosophy

### Infrastructure-First Approach

Services are defined declaratively in `docker-compose.yml` with:

- Explicit dependencies
- Health checks for all services
- Static IP assignments for predictability
- Volume persistence for stateful services
- Environment-based configuration

A minimal sketch of this declarative style appears below.
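The fragment below illustrates the properties listed above. It is a hypothetical sketch (service name, image, and values are illustrative), not an excerpt from the project's actual `docker-compose.yml`:

```bash
# Write a sketch of a single declaratively-defined service.
cat > compose.sketch.yml <<'EOF'
services:
  example-db:                        # hypothetical service
    image: postgres:16
    depends_on:
      vault:
        condition: service_healthy   # explicit dependency
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 10s
      timeout: 5s
      retries: 5
    networks:
      data-network:
        ipv4_address: 172.20.2.99    # static IP for predictability
    volumes:
      - example_db_data:/var/lib/postgresql/data  # volume persistence
    environment:
      VAULT_ADDR: http://vault:8200  # environment-based configuration
EOF
```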
### Security by Design

- **No hardcoded credentials** - All passwords in Vault
- **AppRole authentication** - Core services (7/16) use least-privilege AppRole auth
- **TLS optional but supported** - PKI infrastructure via Vault
- **Secrets at runtime** - Services fetch credentials on startup
- **Network isolation** - 4-tier network segmentation (vault/data/app/observability)
- **Secret scanning** - Pre-commit hooks and CI/CD

**AppRole Adoption Status (as of November 2025):**

- ✅ **Core Data Tier (7 services):** PostgreSQL, MySQL, MongoDB, Redis (x3), RabbitMQ, Forgejo, FastAPI
- ⚠️ **Infrastructure (9 services):** PGBouncer, additional reference apps, exporters, Vector - use VAULT_TOKEN
- 🎯 **Target:** 95%+ adoption via Phase 4 migration

### Container-Native

- All services run in containers
- No host dependencies (except Colima/Docker)
- Portable across developers
- Easy to reset/rebuild

---

## System Components

### Component Hierarchy

```mermaid
graph TB
    subgraph Colima["Colima VM (Lima)"]
        Docker["Docker Daemon"]
        subgraph Network["Docker Compose Network<br/>(dev-services: 172.20.0.0/16)"]
            Vault["Secrets Management<br/>- Vault (PKI + KV)"]
            subgraph DataServices["Data Services"]
                PostgreSQL["PostgreSQL"]
                MySQL["MySQL"]
                MongoDB["MongoDB"]
                RedisCluster["Redis Cluster (3 nodes)"]
                RabbitMQ["RabbitMQ"]
            end
            subgraph AppServices["Application Services"]
                Forgejo["Forgejo (Git)"]
                PgBouncer["PgBouncer"]
                APIs["5 Reference APIs"]
            end
            subgraph Observability["Observability Stack"]
                Prometheus["Prometheus (metrics)"]
                Grafana["Grafana (visualization)"]
                Loki["Loki (logs)"]
                Vector["Vector (pipeline)"]
                Promtail["Promtail (log collector)"]
                cAdvisor["cAdvisor (container)"]
                Exporters["Exporters (3x Redis)"]
            end
        end
    end
    Docker --> Network
    Vault -->|all services depend| DataServices
    DataServices --> AppServices
    AppServices --> Observability
```

### Service Catalog

| Service | Type | Port(s) | Purpose |
|---------|------|---------|---------|
| **vault** | Secrets | 8200 | Secrets management & PKI |
| **postgres** | Database | 5432 | PostgreSQL (Forgejo backend) |
| **pgbouncer** | Proxy | 6432 | PostgreSQL connection pooler |
| **mysql** | Database | 3306 | MySQL database |
| **mongodb** | Database | 27017 | MongoDB NoSQL |
| **redis-1/2/3** | Cache | 6379+ | Redis cluster (3 masters) |
| **rabbitmq** | Queue | 5672, 15672 | Message queue + mgmt UI |
| **forgejo** | Git | 3000, 2222 | Git server |
| **reference-api** | App | 8000, 8443 | FastAPI code-first |
| **api-first** | App | 8001, 8444 | FastAPI API-first |
| **golang-api** | App | 8002, 8445 | Go reference |
| **nodejs-api** | App | 8003, 8446 | Node.js/Express reference |
| **rust-api** | App | 8004, 8447 | Rust/Actix-web reference |
| **prometheus** | Metrics | 9090 | Metrics collection |
| **grafana** | Viz | 3001 | Dashboard & visualization |
| **loki** | Logs | 3100 | Log aggregation |
| **promtail** | Logs | - | Log shipping (internal) |
| **vector** | Pipeline | 8686 | Unified observability |
| **cadvisor** | Metrics | 8080 | Container metrics |
| **redis-exporter-1/2/3** | Metrics | 9121+ | Redis metrics (per node) |

### Reference API Implementations

This project includes **5 reference API implementations** demonstrating identical functionality across different technology stacks. Each implementation showcases best practices for integrating with the infrastructure services.

#### Purpose and Philosophy

**Why 5 Implementations?**

1. **Educational** - Demonstrate patterns across different languages and frameworks
2. **Comparison** - Enable performance and architecture comparisons
3. **Best Practices** - Show idiomatic approaches in each ecosystem
4. **Pattern Library** - Reference implementations for common integration patterns

**Shared Functionality:**

- HashiCorp Vault integration for secrets management
- PostgreSQL, MySQL, and MongoDB database connections
- Redis cluster integration with connection pooling
- RabbitMQ message queue integration
- Comprehensive health checks for all services
- Structured logging with security best practices
- Prometheus metrics exposition
- Optional TLS/SSL support
- RESTful API design

Because all five implementations expose the same surface area, they can be exercised identically; see the sketch after this list.
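Since the five APIs share the same functionality, a single loop can exercise all of them. The ports come from the Service Catalog above; the `/health` path is an assumption, as this document does not pin down the exact endpoint:

```bash
# Poll each implementation's health endpoint (assumed path: /health).
for port in 8000 8001 8002 8003 8004; do
  echo "--- API on port ${port} ---"
  curl -fsS "http://localhost:${port}/health" || echo "unreachable or unhealthy"
done
```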
#### Implementation Details

##### 1. Python FastAPI (Code-First) - Port 8000/8443

**Location:** `reference-apps/fastapi/`

**Technology Stack:**

- **Framework:** FastAPI (async/await, Pydantic validation)
- **Language:** Python 3.11+
- **Approach:** Code-first (define routes in code, generate OpenAPI)
- **Key Libraries:** asyncpg, motor (MongoDB), redis-py, aio-pika

**Characteristics:**

- Fully asynchronous architecture
- Type hints and Pydantic models for validation
- Auto-generated OpenAPI/Swagger documentation
- Comprehensive endpoint coverage (all services)
- Production-ready logging and error handling

**Use Cases:**

- Rapid prototyping and development
- ML/AI integration scenarios
- Data-heavy applications
- Teams familiar with the Python ecosystem

##### 2. Python FastAPI (API-First) - Port 8001/8444

**Location:** `reference-apps/fastapi-api-first/`

**Technology Stack:**

- **Framework:** FastAPI (async/await, Pydantic validation)
- **Language:** Python 3.11+
- **Approach:** API-first (OpenAPI spec → code generation)
- **Key Libraries:** asyncpg, motor, redis-py, aio-pika

**Characteristics:**

- OpenAPI specification drives implementation
- Contract-first design methodology
- Identical runtime behavior to code-first
- Demonstrates spec-driven development workflow
- Scaffolded structure for code generation

**Use Cases:**

- Contract-first API development
- Multi-team coordination (API contracts)
- Client SDK generation scenarios
- Governance and compliance requirements

##### 3. Go (Gin Framework) - Port 8002/8445

**Location:** `reference-apps/golang/`

**Technology Stack:**

- **Framework:** Gin (HTTP router and middleware)
- **Language:** Go 1.23+
- **Approach:** Code-first with strong typing
- **Key Libraries:** pgx (PostgreSQL), mongo-go-driver, go-redis, amqp091-go

**Characteristics:**

- Compiled binary for fast startup
- Strong static typing and compile-time checks
- Excellent concurrency with goroutines
- Low memory footprint
- Structured logging with logrus

**Use Cases:**

- High-performance requirements
- Microservices architectures
- Cloud-native deployments
- Teams with a systems programming background

##### 4. Node.js (Express) - Port 8003/8446

**Location:** `reference-apps/nodejs/`

**Technology Stack:**

- **Framework:** Express (minimalist web framework)
- **Language:** Node.js (JavaScript/TypeScript)
- **Approach:** Code-first with async/await
- **Key Libraries:** pg, mongodb, ioredis, amqplib

**Characteristics:**

- Event-driven, non-blocking I/O
- Large ecosystem (npm)
- Async/await for clean asynchronous code
- Full infrastructure integration
- JSON-native processing

**Use Cases:**

- JavaScript/TypeScript-centric teams
- Real-time applications (WebSockets)
- Rapid iteration and prototyping
- Microservices with the npm ecosystem
##### 5. Rust (Actix-web) - Port 8004/8447

**Location:** `reference-apps/rust/`

**Technology Stack:**

- **Framework:** Actix-web (async actor framework)
- **Language:** Rust (memory-safe systems language)
- **Approach:** Partial implementation (~40% complete) with comprehensive testing
- **Key Libraries:** tokio, serde, reqwest, actix-cors

**Characteristics:**

- Zero-cost abstractions and memory safety
- Exceptional performance and low latency
- Compile-time guarantees (memory and thread safety enforced before the program runs)
- Comprehensive test coverage (5 unit + 11 integration tests)
- High-performance async runtime (Tokio)
- Production-ready patterns (CORS, logging, environment config)

**Use Cases:**

- Ultra-high-performance requirements
- Safety-critical applications
- Resource-constrained environments
- Teams prioritizing performance and safety

#### API Parity and Testing

**Parity Tests:** `tests/api-parity-tests.sh`

- Validates identical behavior across implementations
- Tests all common endpoints
- Ensures consistent responses and error handling

**Performance Benchmarks:** `tests/performance-benchmark.sh`

- Compares throughput and latency
- Measures resource utilization
- Identifies performance characteristics per stack

Both scripts can be run directly from the repository root, as shown below.
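The scripts are named above; no flags are documented here, so the bare invocations below are the assumed usage:

```bash
# Validate identical behavior across the five implementations,
# then compare their performance characteristics.
./tests/api-parity-tests.sh
./tests/performance-benchmark.sh
```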
#### Comparison Matrix

| Feature | FastAPI (Code) | FastAPI (API) | Go/Gin | Node.js | Rust |
|---------|----------------|---------------|--------|---------|------|
| **Startup Time** | Medium (~2s) | Medium (~2s) | Fast (<1s) | Fast (~1s) | Fast (<1s) |
| **Memory Footprint** | Medium (~80MB) | Medium (~80MB) | Low (~20MB) | Medium (~60MB) | Very Low (~10MB) |
| **Development Speed** | Fast | Medium | Medium | Fast | Slow |
| **Type Safety** | Runtime | Runtime | Compile-time | Runtime* | Compile-time |
| **Concurrency Model** | async/await | async/await | Goroutines | Event loop | async/await |
| **Ecosystem Size** | Large (PyPI) | Large (PyPI) | Medium | Very Large (npm) | Growing |
| **Learning Curve** | Low | Medium | Medium | Low | High |
| **Production Maturity** | High | High | Very High | Very High | High |

*Runtime by default; TypeScript adds compile-time checks

---

## Network Architecture

### Network Topology

DevStack Core uses **4-tier network segmentation** for security isolation and logical service grouping:

```mermaid
graph TB
    subgraph VaultNet["vault-network (172.20.1.0/24)"]
        Vault1["Vault<br/>172.20.1.10"]
        PG_Auth["PostgreSQL AppRole"]
        MySQL_Auth["MySQL AppRole"]
        Redis_Auth["Redis AppRole"]
    end
    subgraph DataNet["data-network (172.20.2.0/24)"]
        PG["PostgreSQL<br/>172.20.2.10"]
        PGBOUNCER["PgBouncer<br/>172.20.2.11"]
        MySQL["MySQL<br/>172.20.2.12"]
        Redis1["Redis-1<br/>172.20.2.13"]
        Redis2["Redis-2<br/>172.20.2.16"]
        Redis3["Redis-3<br/>172.20.2.17"]
        RabbitMQ["RabbitMQ<br/>172.20.2.14"]
        MongoDB["MongoDB<br/>172.20.2.15"]
    end
    subgraph AppNet["app-network (172.20.3.0/24)"]
        Forgejo["Forgejo<br/>172.20.3.10"]
        RefAPI["Reference APIs<br/>172.20.3.20-24"]
    end
    subgraph ObsNet["observability-network (172.20.4.0/24)"]
        Prometheus["Prometheus<br/>172.20.4.10"]
        Grafana["Grafana<br/>172.20.4.11"]
        Loki["Loki<br/>172.20.4.12"]
        Vector["Vector<br/>172.20.4.13"]
    end
    Vault1 -.->|AppRole Auth| PG
    Vault1 -.->|AppRole Auth| MySQL
    Vault1 -.->|AppRole Auth| Redis1
    Vault1 -.->|AppRole Auth| Redis2
    Vault1 -.->|AppRole Auth| Redis3
    AppNet -.->|Query| DataNet
    Forgejo -->|Metadata| PG
    RefAPI -.->|Connect| DataNet
    ObsNet -.->|Scrape Metrics| DataNet
    ObsNet -.->|Scrape Metrics| AppNet
    style VaultNet fill:#ffa726,stroke:#f57c00,stroke-width:3px
    style DataNet fill:#66bb6a,stroke:#388e3c,stroke-width:3px
    style AppNet fill:#42a5f5,stroke:#1976d2,stroke-width:3px
    style ObsNet fill:#ab47bc,stroke:#7b1fa2,stroke-width:3px
```

**Network Isolation:**

- **vault-network (172.20.1.0/24):** Isolated for secrets management and AppRole authentication
- **data-network (172.20.2.0/24):** Database, cache, and message queue services
- **app-network (172.20.3.0/24):** Application services (Forgejo, reference APIs)
- **observability-network (172.20.4.0/24):** Monitoring and logging infrastructure

### Static IP Assignments

```
Vault Network (172.20.1.0/24):
  172.20.1.10 - vault

Data Network (172.20.2.0/24):
  172.20.2.10 - postgres
  172.20.2.11 - pgbouncer
  172.20.2.12 - mysql
  172.20.2.13 - redis-1
  172.20.2.14 - rabbitmq
  172.20.2.15 - mongodb
  172.20.2.16 - redis-2
  172.20.2.17 - redis-3

Application Network (172.20.3.0/24):
  172.20.3.10 - forgejo
  172.20.3.20 - reference-api (FastAPI code-first)
  172.20.3.21 - api-first (FastAPI API-first)
  172.20.3.22 - golang-api (Go reference)
  172.20.3.23 - nodejs-api (Node.js/Express reference)
  172.20.3.24 - rust-api (Rust/Actix-web reference)

Observability Network (172.20.4.0/24):
  172.20.4.10 - prometheus
  172.20.4.11 - grafana
  172.20.4.12 - loki
  172.20.4.13 - vector
  172.20.4.14 - promtail
  172.20.4.15 - cadvisor
  172.20.4.16 - redis-exporter-1
  172.20.4.17 - redis-exporter-2
  172.20.4.18 - redis-exporter-3
```

These assignments can be verified against a running stack, as shown below.
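One way to confirm the assignments at runtime is to inspect the network. The network may carry a Compose project prefix (e.g. `devstack_data-network`), which is an assumption about your project name:

```bash
# List each container on the data network with its assigned IP.
docker network inspect data-network \
  --format '{{range .Containers}}{{.Name}} {{.IPv4Address}}{{"\n"}}{{end}}'
```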
### Port Exposure Strategy

**Exposed to Host:**

- Web UIs: Grafana (3001), RabbitMQ (15672), Prometheus (9090), Loki (3100)
- Databases: PostgreSQL (5432), MySQL (3306), MongoDB (27017), Redis (6379+)
- Applications: APIs on 8000-8004 (HTTP) and 8443-8447 (HTTPS)
- Git: Forgejo HTTP (3000), SSH (2222)
- Vault: 8200

**Internal Only:**

- Container metrics (cAdvisor)
- Log shipping (Promtail)
- Exporters (internal scraping)

### DNS Resolution

Services resolve each other by service name across networks:

- `vault` resolves to 172.20.1.10 (vault-network)
- `postgres` resolves to 172.20.2.10 (data-network)
- `forgejo` resolves to 172.20.3.10 (app-network)
- `prometheus` resolves to 172.20.4.10 (observability-network)

Docker's embedded DNS handles resolution across all networks. Services connected to multiple networks can reach services on any of their connected networks. This can be checked from inside any container, as shown below.
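A quick resolution check from inside a container. The container name `dev-reference-api` is taken from the label examples later in this document, and `getent` is assumed to be present in the image:

```bash
# Resolve service names through Docker's embedded DNS.
docker exec dev-reference-api getent hosts postgres vault prometheus
```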
---

## Security Architecture

### Secrets Management Flow

```mermaid
graph TD
    Init["Vault Init<br/>(One-time: creates unseal keys & root token)"]
    Unseal["Vault Unseal<br/>(Auto: runs on container start)"]
    Bootstrap["Vault Bootstrap<br/>(Required: populates credentials)"]
    KV["Enable KV engine (secret/)"]
    PKI["Setup PKI (Root + Intermediate CA)"]
    Roles["Create certificate roles (9 services)"]
    Passwords["Generate & store passwords"]
    Policies["Create Vault policies"]
    Export["Export CA certificates"]
    Services["Services fetch credentials on startup:<br/>service → Vault API → secret/{service-name} → credentials"]
    Init --> Unseal
    Unseal --> Bootstrap
    Bootstrap --> KV
    Bootstrap --> PKI
    Bootstrap --> Roles
    Bootstrap --> Passwords
    Bootstrap --> Policies
    Bootstrap --> Export
    Export --> Services
```
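The final step of the diagram can be reproduced by hand with the Vault CLI (address and token values are illustrative):

```bash
# Read a service's credentials from the KV engine, as services do on startup.
export VAULT_ADDR=http://localhost:8200
export VAULT_TOKEN=<token-with-read-access>
vault kv get -format=json secret/postgresql | jq '.data.data'
```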
### PKI Architecture

**Two-Tier Certificate Authority:**

```mermaid
graph TD
    RootCA["Root CA (pki/)<br/>- TTL: 10 years (87600h)<br/>- Key: RSA 2048"]
    IntermediateCA["Intermediate CA (pki_int/)<br/>- TTL: 5 years (43800h)<br/>- Key: RSA 2048"]
    ServiceCerts["Service Certificates<br/>- TTL: 1 year (8760h)<br/>- Roles: postgres-role, mysql-role, redis-1-role, etc.<br/>- SANs: service name, IP address, localhost"]
    RootCA -->|Signs| IntermediateCA
    IntermediateCA -->|Issues| ServiceCerts
```

**Certificate Issuance Flow:**

1. Service requests cert from Vault PKI
2. Vault validates request against role
3. Intermediate CA signs certificate
4. Service receives cert + private key
5. Service configures TLS with cert

The manual equivalent of this flow is sketched below.
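A manual equivalent of steps 1-4, using the `postgres-role` named in the diagram; the SAN values mirror this document's IP plan and are illustrative:

```bash
# Ask the intermediate CA to issue a service certificate.
vault write pki_int/issue/postgres-role \
  common_name="postgres" \
  alt_names="localhost" \
  ip_sans="172.20.2.10" \
  ttl="8760h"
```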
### TLS Configuration

**Optional TLS (Development Mode):**

- Controlled by `tls_enabled` flag in Vault
- Default: `true` for all services
- Services check the flag on startup
- If enabled: configure TLS
- If disabled: plain connections

**TLS Endpoints** (a quick way to spot-check these is shown at the end of this section):

- PostgreSQL: Port 5432 (TLS)
- MySQL: Port 3306 (TLS)
- MongoDB: Port 27017 (preferTLS)
- Redis: Ports 6390-6392 (TLS on separate ports)
- RabbitMQ: Port 5671 (TLS)
- APIs: Ports 8443-8447 (HTTPS)

### Credential Storage

**In Vault (secret/ KV engine):**

```
secret/postgresql
├─ username: dev_admin
├─ password: <25-char random>
├─ database: dev_database
└─ tls_enabled: true

secret/mysql
├─ root_password: <25-char random>
├─ username: dev_user
├─ password: <25-char random>
├─ database: dev_database
└─ tls_enabled: true

secret/redis-1, redis-2, redis-3
├─ password: <25-char random>
└─ tls_enabled: true

secret/rabbitmq
├─ username: dev_user
├─ password: <25-char random>
├─ vhost: dev_vhost
└─ tls_enabled: true

secret/mongodb
├─ username: dev_user
├─ password: <25-char random>
├─ database: dev_database
└─ tls_enabled: true
```

### Network Security

- **No host network mode** - All services use bridge networks
- **Static IPs** - Predictable, no dynamic assignment
- **Internal-only services** - Many services not exposed to host
- **Firewall-ready** - Port exposure controlled via Docker
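To spot-check that a listener from the TLS endpoint list actually negotiates TLS (the CA file path is an assumption based on the "Export CA certificates" bootstrap step, and some services may additionally require client certificates):

```bash
# Print the certificate presented on Redis's TLS port.
openssl s_client -connect localhost:6390 -CAfile ./certs/ca.pem </dev/null \
  | openssl x509 -noout -subject -issuer -enddate
```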
---

## Data Flow

### Service Startup Flow

```mermaid
graph TD
    ColimaStart["1. Colima VM starts"]
    Docker["Docker daemon initializes"]
    VaultStart["Step 1: Vault Container starts"]
    VaultUnseal["Auto-unseal script runs"]
    VaultAPI["Vault API becomes available"]
    VaultHealthy["Health check: healthy"]
    DataStart["Step 2: Data Services (parallel)"]
    PG["PostgreSQL:<br/>- Waits for Vault<br/>- Fetches credentials<br/>- Initializes database<br/>- Configures TLS<br/>- Health check: healthy"]
    MySQL_DB["MySQL (same pattern)"]
    Mongo["MongoDB (same pattern)"]
    Redis["Redis-1/2/3 (same pattern)"]
    Rabbit["RabbitMQ (same pattern)"]
    AppStart["Step 3: Application Services"]
    Forgejo_App["Forgejo (depends on PostgreSQL)"]
    PgBouncer_App["PgBouncer (depends on PostgreSQL)"]
    APIs_App["Reference APIs (depend on all data services)"]
    ObsStart["Step 4: Observability (parallel)"]
    Prom["Prometheus (scrapes metrics)"]
    Graf["Grafana (visualizes from Prometheus)"]
    Loki_Obs["Loki (receives logs)"]
    Vector_Obs["Vector (collects & forwards)"]
    Promtail_Obs["Promtail (ships logs to Loki)"]
    cAdvisor_Obs["cAdvisor (collects container metrics)"]
    Exporters_Obs["Redis Exporters (expose Redis metrics)"]
    ColimaStart --> Docker
    Docker --> VaultStart
    VaultStart --> VaultUnseal
    VaultUnseal --> VaultAPI
    VaultAPI --> VaultHealthy
    VaultHealthy --> DataStart
    DataStart --> PG
    DataStart --> MySQL_DB
    DataStart --> Mongo
    DataStart --> Redis
    DataStart --> Rabbit
    PG --> AppStart
    MySQL_DB --> AppStart
    Mongo --> AppStart
    Redis --> AppStart
    Rabbit --> AppStart
    AppStart --> Forgejo_App
    AppStart --> PgBouncer_App
    AppStart --> APIs_App
    APIs_App --> ObsStart
    ObsStart --> Prom
    ObsStart --> Graf
    ObsStart --> Loki_Obs
    ObsStart --> Vector_Obs
    ObsStart --> Promtail_Obs
    ObsStart --> cAdvisor_Obs
    ObsStart --> Exporters_Obs
```

### Request Flow (FastAPI Example)

```mermaid
graph LR
    Client["Client"]
    FastAPI["FastAPI API (port 8000)"]
    subgraph HealthCheck["Health Check Request"]
        CheckVault["Check Vault connectivity"]
        CheckPG["Check PostgreSQL connectivity"]
        CheckMySQL["Check MySQL connectivity"]
        CheckMongo["Check MongoDB connectivity"]
        CheckRedis["Check Redis cluster status"]
        CheckRabbit["Check RabbitMQ connectivity"]
        ReturnHealth["Return aggregated health status"]
    end
    subgraph DBQuery["Database Query Request"]
        FetchCredsDB["Fetch credentials from Vault (cached)"]
        ConnectDB["Connect to database (connection pool)"]
        ExecQuery["Execute query over TLS"]
        ReturnResults["Return results"]
        RecordMetricsDB["Record metrics (Prometheus)"]
    end
    subgraph CacheOp["Cache Operation Request"]
        FetchCredsCache["Fetch Redis credentials from Vault (cached)"]
        ConnectCache["Connect to Redis cluster"]
        ExecCmd["Execute command (redirected to correct node)"]
        ReturnCache["Return result"]
        RecordMetricsCache["Record metrics"]
    end
    Client --> FastAPI
    FastAPI -.-> CheckVault
    CheckVault --> CheckPG
    CheckPG --> CheckMySQL
    CheckMySQL --> CheckMongo
    CheckMongo --> CheckRedis
    CheckRedis --> CheckRabbit
    CheckRabbit --> ReturnHealth
    FastAPI -.-> FetchCredsDB
    FetchCredsDB --> ConnectDB
    ConnectDB --> ExecQuery
    ExecQuery --> ReturnResults
    ReturnResults --> RecordMetricsDB
    FastAPI -.-> FetchCredsCache
    FetchCredsCache --> ConnectCache
    ConnectCache --> ExecCmd
    ExecCmd --> ReturnCache
    ReturnCache --> RecordMetricsCache
```

### Metrics Collection Flow

```mermaid
graph TD
    Services["Services expose metrics (Prometheus format)"]
    FastAPI_Metrics["FastAPI: /metrics"]
    Redis_Metrics["Redis Exporters: :9121/metrics (per node)"]
    cAdvisor_Metrics["cAdvisor: :8080/metrics"]
    App_Metrics["Application custom metrics"]
    Prometheus["Prometheus scrapes every 15s"]
    Store["Stores time-series data"]
    Query["Makes available for querying"]
    Grafana["Grafana queries Prometheus"]
    PromQL["Dashboard panels execute PromQL"]
    Visualize["Visualize metrics over time"]
    Present["Present to user (port 3001)"]
    Services --> FastAPI_Metrics
    Services --> Redis_Metrics
    Services --> cAdvisor_Metrics
    Services --> App_Metrics
    FastAPI_Metrics --> Prometheus
    Redis_Metrics --> Prometheus
    cAdvisor_Metrics --> Prometheus
    App_Metrics --> Prometheus
    Prometheus --> Store
    Prometheus --> Query
    Query --> Grafana
    Grafana --> PromQL
    PromQL --> Visualize
    Visualize --> Present
```
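A hedged sketch of the scrape configuration this flow implies. The project's real file presumably lives under `configs/prometheus/` (an assumption based on the bind-mount convention described later); the targets use the documented static IPs and ports:

```bash
# Write a minimal Prometheus configuration matching the 15s scrape flow.
cat > prometheus.sketch.yml <<'EOF'
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: fastapi
    static_configs:
      - targets: ['172.20.3.20:8000']
  - job_name: redis
    static_configs:
      - targets: ['172.20.4.16:9121', '172.20.4.17:9121', '172.20.4.18:9121']
  - job_name: cadvisor
    static_configs:
      - targets: ['172.20.4.15:8080']
EOF
```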
### Log Collection Flow

```mermaid
graph TD
    Stdout["Container stdout/stderr"]
    DockerLog["Docker logging driver"]
    Promtail["Promtail (reads Docker logs)"]
    Parse["Parses log format"]
    Labels["Adds labels (container, service)"]
    Ship["Ships to Loki"]
    Loki["Loki aggregates logs"]
    Index["Indexes by labels (not content)"]
    StoreLogs["Stores log data"]
    QueryLogs["Makes available for querying"]
    GrafanaLogs["Grafana queries Loki"]
    LogQL["LogQL queries"]
    Filter["Filter by service, time, etc."]
    Display["Display logs in Explore view"]
    Stdout --> DockerLog
    DockerLog --> Promtail
    Promtail --> Parse
    Parse --> Labels
    Labels --> Ship
    Ship --> Loki
    Loki --> Index
    Loki --> StoreLogs
    Loki --> QueryLogs
    QueryLogs --> GrafanaLogs
    GrafanaLogs --> LogQL
    LogQL --> Filter
    Filter --> Display
```

---

## Observability Architecture

### Three Pillars

1. **Metrics** (Prometheus + Grafana)
2. **Logs** (Loki + Promtail + Grafana)
3. **Traces** (Future: OpenTelemetry)

### Metrics Pipeline

```mermaid
graph TD
    Prometheus["Prometheus (collector)"]
    FastAPI_Met["FastAPI /metrics"]
    cAdvisor_Met["cAdvisor metrics"]
    Redis_Met["Redis Exporter"]
    FastAPI_Met -->|scrapes every 15s| Prometheus
    cAdvisor_Met -->|scrapes every 15s| Prometheus
    Redis_Met -->|scrapes every 15s| Prometheus
```

**Metric Types:**

- **Counters:** Request counts, error counts
- **Gauges:** Active connections, memory usage
- **Histograms:** Request durations, response sizes
- **Summaries:** Percentiles (p50, p95, p99)

### Grafana Dashboards

Pre-configured dashboards in `configs/grafana/dashboards/`:

1. **redis-cluster-dashboard.json** - Redis cluster health
2. **postgres-dashboard.json** - PostgreSQL metrics
3. **mysql-dashboard.json** - MySQL metrics
4. **mongodb-dashboard.json** - MongoDB metrics
5. **application-metrics.json** - API metrics
6. **infrastructure-overview.json** - Overall health

### Log Aggregation Strategy

**Structured Logging:**

- JSON format for all application logs
- Consistent fields: timestamp, level, message, service, request_id
- Easy to parse and query

**Label Strategy:**

```
{service="fastapi", container="dev-reference-api", level="error"}
{service="postgres", container="dev-postgres"}
{service="redis-1", container="dev-redis-1"}
```

**Retention:**

- Development: 7 days (configurable)
- Logs stored in Docker volumes

Labels like these can be queried directly against Loki's HTTP API, as shown below.
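The label selectors above can be queried against Loki's HTTP API on the documented port 3100 (the `service` label is assumed to be attached exactly as shown in the strategy above):

```bash
# Fetch recent error lines for the FastAPI service from Loki.
curl -fsS -G 'http://localhost:3100/loki/api/v1/query_range' \
  --data-urlencode 'query={service="fastapi"} |= "error"' \
  --data-urlencode 'limit=20' | jq '.data.result[].values'
```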
---

## Service Dependencies

### Dependency Graph

```mermaid
graph TD
    Vault["Vault (no dependencies)"]
    Vault --> Postgres["postgres"]
    Vault --> MySQL["mysql"]
    Vault --> MongoDB["mongodb"]
    Vault --> Redis["redis-1/2/3"]
    Vault --> RabbitMQ["rabbitmq"]
    Postgres --> Forgejo["forgejo"]
    Postgres --> PgBouncer["pgbouncer"]
    Redis --> RedisExporter["redis-exporter-1/2/3"]
    Postgres --> RefAPIs["5 Reference APIs<br/>(FastAPI x2, Go, Node.js, Rust)<br/>(depend on all data services)"]
    MySQL --> RefAPIs
    MongoDB --> RefAPIs
    Redis --> RefAPIs
    RabbitMQ --> RefAPIs
```

### Health Check Cascade

Each service has a health check that validates:

1. Process is running
2. Port is listening
3. Service-specific checks (e.g., DB can execute queries)

Docker Compose won't start dependent services until dependencies are healthy; the health state it gates on can be inspected directly, as shown below.
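The health state Compose gates on is visible via `docker inspect` (container name taken from this document's label examples):

```bash
# Show whether PostgreSQL is currently healthy from Docker's viewpoint.
docker inspect --format '{{.State.Health.Status}}' dev-postgres
```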
### AppRole Authentication Flow

Services using AppRole follow this authentication sequence:

```mermaid
sequenceDiagram
    participant Container as Service Container
    participant Script as init-approle.sh
    participant FS as Filesystem
    participant Vault as Vault API
    participant Service as Service Process
    Note over Container: Container starts with init-approle.sh entrypoint
    Container->>Script: Execute wrapper script
    Script->>Vault: Wait for Vault health check
    Vault-->>Script: Vault healthy (200 OK)
    Script->>FS: Read /vault-approles/{service}/role-id
    FS-->>Script: role-id (e.g., abc123...)
    Script->>FS: Read /vault-approles/{service}/secret-id
    FS-->>Script: secret-id (e.g., xyz789...)
    Script->>Vault: POST /v1/auth/approle/login
    Note over Script,Vault: {"role_id": "abc123...", "secret_id": "xyz789..."}
    Vault-->>Script: Service token (hvs.CAESIE..., 1h TTL)
    Script->>Vault: GET /v1/secret/data/{service}
    Note over Script,Vault: X-Vault-Token: hvs.CAESIE...
    Vault-->>Script: Credentials (user, password, database)
    Script->>Script: Export environment variables
    Note over Script: POSTGRES_USER=devuser<br/>POSTGRES_PASSWORD=***<br/>POSTGRES_DB=devdb
    Script->>Service: exec docker-entrypoint.sh
    Service->>Service: Service starts with credentials
    Note over Container,Service: Service token expires after 1 hour (renewable)
```

**AppRole Security Benefits:**

1. **No Root Token in Containers** - Core services never see the root token
2. **Least Privilege** - Each service policy allows access ONLY to its own secrets
3. **Short-Lived Tokens** - Service tokens expire after 1 hour
4. **Audit Trail** - All AppRole logins logged by Vault
5. **Policy Enforcement** - Cross-service access prevented by Vault policies

**Services Using AppRole (7):**

- PostgreSQL, MySQL, MongoDB, Redis (3 nodes), RabbitMQ, Forgejo, Reference API (FastAPI)

**Services Using Root Token (9):**

- PGBouncer, API-First, Golang API, Node.js API, Rust API, Redis Exporters (3), Vector

The login exchange from the diagram can be reproduced with curl, as shown below.
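Reproducing the exchange by hand (assumes `VAULT_ADDR` is set and the AppRole files are readable from where you run this; the `postgres` directory name and `postgresql` secret path follow this document's examples):

```bash
# Log in with AppRole credentials, then read the service's secrets.
ROLE_ID=$(cat /vault-approles/postgres/role-id)
SECRET_ID=$(cat /vault-approles/postgres/secret-id)
TOKEN=$(curl -fsS -X POST "$VAULT_ADDR/v1/auth/approle/login" \
  -d "{\"role_id\":\"$ROLE_ID\",\"secret_id\":\"$SECRET_ID\"}" \
  | jq -r '.auth.client_token')
curl -fsS -H "X-Vault-Token: $TOKEN" \
  "$VAULT_ADDR/v1/secret/data/postgresql" | jq '.data.data'
```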
### Startup Order

```mermaid
sequenceDiagram
    participant Colima as Colima VM
    participant Vault as Vault
    participant PG as PostgreSQL
    participant MYSQL as MySQL
    participant REDIS as Redis Cluster
    participant RABBIT as RabbitMQ
    participant MONGO as MongoDB
    participant FORGEJO as Forgejo
    participant API as Reference APIs
    participant PROM as Prometheus
    Note over Colima: User runs: ./devstack start --profile standard
    Colima->>Colima: Start VM (5-10s)
    Colima->>Vault: Start container
    Note over Vault: Initialize & Unseal (5-10s)
    Vault->>Vault: Vault healthy ✓
    par Data Services Start (depend on Vault)
        Vault->>PG: AppRole auth
        Note over PG: Fetch credentials from Vault
        PG->>PG: Initialize database (10-15s)
        PG->>PG: PostgreSQL healthy ✓
        Vault->>MYSQL: AppRole auth
        Note over MYSQL: Fetch credentials from Vault
        MYSQL->>MYSQL: Initialize database (10-15s)
        MYSQL->>MYSQL: MySQL healthy ✓
        Vault->>REDIS: AppRole auth (all 3 nodes)
        Note over REDIS: Fetch credentials from Vault
        REDIS->>REDIS: Start 3 nodes (10s)
        Note over REDIS: redis-cluster-init required
        REDIS->>REDIS: Redis nodes healthy ✓
        Vault->>RABBIT: AppRole auth
        Note over RABBIT: Fetch credentials from Vault
        RABBIT->>RABBIT: Initialize (15-20s)
        RABBIT->>RABBIT: RabbitMQ healthy ✓
        Vault->>MONGO: AppRole auth
        Note over MONGO: Fetch credentials from Vault
        MONGO->>MONGO: Initialize (10-15s)
        MONGO->>MONGO: MongoDB healthy ✓
    end
    par Application Services Start (depend on databases)
        PG->>FORGEJO: Database ready
        Vault->>FORGEJO: AppRole auth
        FORGEJO->>FORGEJO: Initialize (10-15s)
        FORGEJO->>FORGEJO: Forgejo healthy ✓
        PG-->>API: All data services ready
        MYSQL-->>API: All data services ready
        REDIS-->>API: All data services ready
        RABBIT-->>API: All data services ready
        MONGO-->>API: All data services ready
        Vault->>API: AppRole auth
        API->>API: Start 5 APIs (5-10s)
        API->>API: All APIs healthy ✓
    end
    Note over PROM: Observability starts independently
    PROM->>PROM: Start Prometheus, Grafana, Loki (5s)
    PROM->>PG: Begin scraping metrics
    PROM->>MYSQL: Begin scraping metrics
    PROM->>REDIS: Begin scraping metrics
    Note over Colima,PROM: Total Startup Time: ~90-120 seconds
```

**Startup Sequence Summary:**

```
1. Vault (5-10s to unseal)
2. Data Services (30-60s for initialization)
   - PostgreSQL, MySQL, MongoDB
   - Redis cluster (needs all 3 nodes)
   - RabbitMQ
3. Application Services (10-20s)
   - Forgejo (waits for PostgreSQL)
   - PgBouncer (waits for PostgreSQL)
   - 5 Reference APIs (wait for all data services)
     * FastAPI code-first (port 8000)
     * FastAPI API-first (port 8001)
     * Go/Gin (port 8002)
     * Node.js/Express (port 8003)
     * Rust/Actix-web (port 8004)
4. Observability (starts immediately, waits for targets)
   - Prometheus, Grafana, Loki start fast
   - Begin scraping/collecting once targets available
```

**Total Startup Time:** ~90-120 seconds from cold start

---

## Deployment Architecture

### Colima VM Specifications

**Default Configuration:**

- **CPU:** 4 cores
- **Memory:** 8 GB
- **Disk:** 60 GB
- **Architecture:** ARM64 (Apple Silicon)
- **Runtime:** Docker
- **Networking:** Bridged (VZ framework)

**Customizable via devstack.sh:**

```bash
COLIMA_CPU=8 COLIMA_MEMORY=16 COLIMA_DISK=100 ./devstack.sh start
```

### Volume Strategy

**Named Volumes (Persistent):**

- `postgres_data` - PostgreSQL database files
- `mysql_data` - MySQL database files
- `mongodb_data` - MongoDB database files
- `redis_data_1/2/3` - Redis persistence (3 volumes)
- `rabbitmq_data` - RabbitMQ message store
- `vault_data` - Vault storage backend
- `forgejo_data` - Git repositories
- `prometheus_data` - Time-series metrics
- `grafana_data` - Dashboard configs
- `loki_data` - Log storage

**Bind Mounts (Configuration):**

- `./configs/{service}/` → Container config directories
- Configuration files are version-controlled

**Benefits:**

- Data persists across container restarts
- Volumes can be backed up independently
- Easy to reset individual services

### Resource Allocation

**Per-Service Limits (if configured):**

- Not set by default (development mode)
- Can be added via `deploy.resources` in `docker-compose.yml` (sketched below)
- Recommended for resource-constrained environments

**Observed Resource Usage (all services):**

- **Total Memory:** ~4-6 GB
- **Total CPU:** ~1-2 cores average
- **Disk:** ~10-15 GB (with data)
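A hedged sketch of opt-in limits via `deploy.resources` (values are illustrative; this could live in a Compose override file):

```bash
# Write an override that caps PostgreSQL's CPU and memory.
cat > compose.limits.yml <<'EOF'
services:
  postgres:
    deploy:
      resources:
        limits:
          cpus: "1.0"
          memory: 512M
EOF
```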
---

## Scaling Considerations

### Current Limitations (Development Mode)

1. **Single-Node Redis Cluster**
   - 3 masters, no replicas
   - No high availability
   - Suitable for development only

2. **Single Instance Per Service**
   - No load balancing
   - No redundancy
   - Fast restarts instead

3. **File-Based Vault Storage**
   - Not HA-capable
   - Single point of failure
   - Fine for development

### Production Adaptation Strategies

**If adapting for production:**

1. **Redis Cluster**
   - Add replicas: 3 masters + 3 replicas minimum
   - Enable cluster failover
   - Use Redis Sentinel or Redis Cluster mode

2. **Database Replication**
   - PostgreSQL: Streaming replication (primary + standby)
   - MySQL: Primary-replica replication or Galera cluster
   - MongoDB: Replica sets (3+ nodes)

3. **Vault**
   - Consul or etcd storage backend
   - 3+ Vault nodes for HA
   - Auto-unsealing via cloud KMS

4. **Load Balancing**
   - Add nginx/traefik for API load balancing
   - Multiple API instances
   - Session affinity if needed

5. **Observability**
   - Prometheus federation for multiple clusters
   - Remote write to long-term storage (Thanos, Cortex)
   - Centralized Loki for multi-cluster logs

### Horizontal Scaling

**Services that can scale horizontally:**

- ✅ Reference APIs (stateless)
- ✅ PgBouncer (connection pooler)
- ⚠️ Forgejo (needs shared storage)

**Services that require special handling:**

- ❌ Databases (need replication setup)
- ❌ Redis (needs cluster reconfiguration)
- ❌ RabbitMQ (needs cluster mode)
- ❌ Vault (needs HA storage backend)

---

## Architectural Patterns

### Initialization Pattern

All stateful services follow this pattern (shown here as a runnable sketch):

```bash
#!/bin/bash
# init.sh - generic initialization pattern (sketch)
# 1. Wait for Vault to be ready (health check loop)
until curl -fsS "$VAULT_ADDR/v1/sys/health" >/dev/null; do sleep 2; done
# 2. Fetch credentials from Vault (secret/{service})
CREDS=$(curl -fsS -H "X-Vault-Token: $VAULT_TOKEN" \
  "$VAULT_ADDR/v1/secret/data/${SERVICE_NAME}")
# 3. Parse credentials (jq)
export SERVICE_USER=$(echo "$CREDS" | jq -r '.data.data.username')
export SERVICE_PASSWORD=$(echo "$CREDS" | jq -r '.data.data.password')
# 4-5. Configure the service with credentials, then start the process;
#      6. the container health check then validates the service is ready
exec docker-entrypoint.sh "$@"
```

### Configuration Pattern

```
Environment Variables (from docker-compose.yml)
        │
        ▼
Service init script (./init.sh)
        │
        ├─► Fetch secrets from Vault
        ├─► Generate config files
        └─► Export environment
        │
        ▼
Service starts with configuration
```

### Health Check Pattern

```yaml
healthcheck:
  test: ["CMD", "command", "to", "test", "health"]
  interval: 10s
  timeout: 5s
  retries: 5
  start_period: 30s
```

**Progressive Health Checks:**

1. Start period: Service initialization time
2. Interval: How often to check
3. Retries: How many failures before unhealthy
4. Timeout: Max time for check command

---

## Future Architecture Considerations

### Potential Enhancements

1. **Service Mesh (Istio/Linkerd)**
   - mTLS between services
   - Advanced traffic management
   - Observability built-in

2. **Kubernetes Migration**
   - Convert docker-compose to K8s manifests
   - Use Helm charts
   - Enable true cloud-native operations

3. **GitOps Integration**
   - ArgoCD or Flux
   - Declarative configuration management
   - Automated drift detection

4. **Multi-Environment Support**
   - Dev, staging, production configs
   - Environment-specific overrides
   - Promotion workflows

---

## Reference Documentation

- [Docker Compose Networking](https://docs.docker.com/compose/networking/)
- [HashiCorp Vault Architecture](https://www.vaultproject.io/docs/internals/architecture)
- [Prometheus Architecture](https://prometheus.io/docs/introduction/overview/)
- [Redis Cluster Specification](https://redis.io/docs/latest/operate/oss_and_stack/management/scaling/)
- [Colima Documentation](https://github.com/abiosoft/colima)

---

*For operational procedures, see [TROUBLESHOOTING.md](./TROUBLESHOOTING.md). For performance optimization, see [PERFORMANCE_TUNING.md](./PERFORMANCE_TUNING.md).*