
Reliability/HA: Resilience, Backups, Agent/Job Lifecycle, OTA, Kill Switch #102

@IgnacioPro


Lumo must support HA clusters (API/DB/Redis) with automated backup and restore, point-in-time recovery (PITR) and migrations, a documented agent lifecycle (heartbeats, backoff, offline queue, OTA updates, kill switch), and batch and job orchestration primitives.

Paths to start:

  • Cluster and failover: kube manifests, DB/Redis configurations (deployments/kubernetes/kind/deploy-saas, Helm chart, internal/database/)
  • Agent lifecycle: heartbeat, backoff, recovery and resync; see agent registration and reporting (internal/agent/, API jobs, agent CLI)
  • Backups: docs/scripts for DB and Redis, backup recipe in SaaS deploy docs
  • OTA: agent update channel/versioning, roll-forward/rollback hooks
  • Job orchestration: concurrency limits, timeouts, blackout windows; see job runner code

Test with chaos/fault injection (API/agent/DB/Redis) and load suites that measure p99 latency, recovery time, and queue depth.

References to update: internal/agent/, internal/database/, deployments, agent CLI.


Labels: enhancement
