Reliability/HA: Resilience, Backups, Agent/Job Lifecycle, OTA, Kill Switch

Lumo must support HA clusters (API, DB, Redis) with automated backup and restore, point-in-time recovery (PITR) and migrations, a documented agent lifecycle (heartbeats, backoff, offline queue, OTA updates, kill switch), and batch and job orchestration primitives.

Paths to start:
- Cluster and failover: kube manifests, DB/Redis configurations (`deployments/kubernetes/kind/deploy-saas`, Helm chart, `internal/database/`)
- Agent lifecycle: heartbeat, backoff, recovery and resync; see agent registration and reporting (`internal/agent/`, API jobs, agent CLI)
- Backups: docs/scripts for DB and Redis, backup recipe in SaaS deploy docs
- OTA: agent update channel/versioning, roll-forward/rollback hooks
- Job orchestration: concurrency, timeouts, blackout; see job runner code

Test with chaos and fault injection (API, agent, DB, Redis) and with load suites that measure p99 latency, recovery time, and queue depth.

References to update: `internal/agent/`, `internal/database/`, deployments, agent CLI.

Reliability/HA: Resilience, Backups, Agent/Job Lifecycle, OTA, Kill Switch #102
