System defibrillator - monitors health and auto-recovers from common failure modes.
When your containers stop responding, processes go runaway, or swap pressure threatens to freeze your system - defib detects the problem and fixes it automatically.
Works with Docker or Podman (auto-detects).
⚠️ Safety First: defib kills processes and restarts services. Don't run it as root. Don't use it on multi-user systems. Test patterns before enabling auto-kill. Full security guide →
defib has three monitoring commands, each targeting a different failure mode, plus `all` to run them together:
- `defib container` - Watches an HTTP health endpoint. If it stops responding, restarts the container via docker-compose/podman-compose.
- `defib processes` - Scans for runaway processes (high CPU or memory). Auto-kills processes that match your safe-to-kill patterns.
- `defib system` - Monitors swap pressure and stuck (D-state) processes. Can kill memory hogs or restart services to recover.
- `defib all` - Runs all three. Best used with a config file.
# Requires Bun (https://bun.sh)
curl -fsSL https://bun.sh/install | bash
# Clone and run
git clone https://github.com/alexknowshtml/defib.git
cd defib

# Monitor a container - restart if health check fails
bun run defib.ts container --health http://localhost:8000/health --compose-dir /home/deploy/my-app
# Monitor processes - kill runaway worker processes
bun run defib.ts processes --safe-to-kill "node /app/worker" --ignore "postgres"
# Monitor system - restart app when swap gets critical
bun run defib.ts system --swap-kill "leaky-app" --swap-restart-dir /home/deploy/my-app
# Monitor everything
bun run defib.ts all --config ./defib.config.json

Monitors container health via an HTTP endpoint. If the endpoint stops responding or responds too slowly, defib restarts the container via docker-compose/podman-compose.
bun run defib.ts container \
--health http://localhost:8000/health \
--compose-dir /home/deploy/my-app \
--timeout 10 \
--max-response 15 \
--backoff 10 \
--service web

Options:
| Flag | Default | Description |
|---|---|---|
| `--health <url>` | required | Health endpoint URL |
| `--compose-dir <path>` | required | Directory with docker-compose.yml |
| `--timeout <sec>` | 10 | Health check timeout |
| `--max-response <sec>` | 15 | Max acceptable response time |
| `--backoff <min>` | 10 | Cooldown between restart attempts |
| `--service <name>` | - | Specific service to restart |
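
The decision this command makes on each run can be sketched as two pure functions. This is a minimal illustration, not defib's actual source: `classifyProbe` and `inBackoff` are hypothetical names, and treating any non-2xx status as unhealthy is an assumption.

```typescript
// Hypothetical sketch; names and the non-2xx rule are assumptions, not defib's code.
type ProbeVerdict = { healthy: boolean; reason: string };

// status is null when the request timed out or the connection failed.
function classifyProbe(
  status: number | null,
  elapsedSeconds: number,
  maxResponseSeconds: number,
): ProbeVerdict {
  if (status === null) {
    return { healthy: false, reason: "no response" };
  }
  if (status < 200 || status >= 300) {
    return { healthy: false, reason: `HTTP ${status}` };
  }
  if (elapsedSeconds > maxResponseSeconds) {
    return { healthy: false, reason: `slow response (${elapsedSeconds}s)` };
  }
  return { healthy: true, reason: "ok" };
}

// True while the cooldown since the last restart has not yet elapsed.
function inBackoff(
  lastRestartMs: number | null,
  nowMs: number,
  backoffMinutes: number,
): boolean {
  if (lastRestartMs === null) return false;
  return nowMs - lastRestartMs < backoffMinutes * 60_000;
}
```

Under this sketch, a restart would be attempted only when `classifyProbe` reports unhealthy and `inBackoff` returns false.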
Monitors for runaway processes. When a process exceeds CPU or memory thresholds for too long, defib can automatically kill it if it matches a safe-to-kill pattern.
bun run defib.ts processes \
--cpu-threshold 80 \
--memory-threshold 2000 \
--max-runtime 2 \
--safe-to-kill "node mcp-" \
--safe-to-kill "python worker" \
--ignore "postgres" \
--ignore "ollama"

Options:
| Flag | Default | Description |
|---|---|---|
| `--cpu-threshold <pct>` | 80 | CPU % to flag as runaway |
| `--memory-threshold <mb>` | 2000 | Memory MB to flag |
| `--max-runtime <hours>` | 2 | Hours at high CPU before action |
| `--safe-to-kill <pattern>` | - | Process patterns safe to auto-kill (repeatable) |
| `--ignore <pattern>` | - | Process patterns to skip (repeatable) |
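
How the thresholds and patterns interact can be sketched roughly as follows. This is an illustration under stated assumptions, not defib's implementation: `decide` and `matchesAny` are hypothetical names, patterns are treated as regular expressions (the config example uses `node.*watchdog`, which suggests this), and process runtime stands in for "hours at high CPU".

```typescript
// Hypothetical sketch; regex matching and the runtime approximation are assumptions.
interface ProcInfo {
  pid: number;
  cpuPercent: number;
  memoryMB: number;
  runtimeHours: number;
  command: string;
}

interface ProcessOptions {
  cpuThreshold: number;
  memoryThresholdMB: number;
  maxRuntimeHours: number;
  safeToKillPatterns: string[];
  ignorePatterns: string[];
}

type Decision = "skip" | "ok" | "kill" | "alert";

function matchesAny(command: string, patterns: string[]): boolean {
  return patterns.some((p) => new RegExp(p).test(command));
}

function decide(proc: ProcInfo, opts: ProcessOptions): Decision {
  // Ignored processes are never flagged, whatever their usage.
  if (matchesAny(proc.command, opts.ignorePatterns)) return "skip";

  const highCpu =
    proc.cpuPercent >= opts.cpuThreshold &&
    proc.runtimeHours >= opts.maxRuntimeHours;
  const highMemory = proc.memoryMB >= opts.memoryThresholdMB;
  if (!highCpu && !highMemory) return "ok";

  // Only explicitly listed patterns are eligible for auto-kill.
  return matchesAny(proc.command, opts.safeToKillPatterns) ? "kill" : "alert";
}
```

Checking `ignorePatterns` first means an ignored process is skipped even if it also matches a safe-to-kill pattern; whether defib resolves that overlap the same way is not stated in this README.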
Monitors system health: swap pressure and stuck processes (D-state). When swap gets critical, defib can kill specified processes or restart a service to free memory.
bun run defib.ts system \
--swap-threshold 80 \
--swap-kill "electron" \
--swap-kill "chrome" \
--swap-restart-dir /home/deploy/my-app \
--swap-restart-service web

Options:
| Flag | Default | Description |
|---|---|---|
| `--swap-threshold <pct>` | 80 | Swap % to trigger action |
| `--swap-kill <pattern>` | - | Process patterns to kill when swap critical (repeatable) |
| `--swap-restart-dir <path>` | - | Compose dir to restart when swap critical |
| `--swap-restart-service <n>` | - | Specific service to restart |
| `--no-dstate` | false | Disable D-state monitoring |
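
As a rough illustration of what a swap percentage means here, the sketch below derives it from the `Swap:` line of `free -m` output. The function name `swapUsedPercent` is hypothetical, and defib's real parsing may differ.

```typescript
// Hypothetical sketch: compute swap usage (%) from `free -m` output text.
// Returns null when there is no Swap line or no swap is configured.
function swapUsedPercent(freeOutput: string): number | null {
  const line = freeOutput
    .split("\n")
    .map((l) => l.trimStart())
    .find((l) => l.startsWith("Swap:"));
  if (line === undefined) return null;

  // Columns on the Swap line are: label, total, used, free.
  const fields = line.split(/\s+/);
  const total = Number(fields[1]);
  const used = Number(fields[2]);
  if (!Number.isFinite(total) || !Number.isFinite(used) || total <= 0) {
    return null;
  }
  return (used / total) * 100;
}
```

A result at or above `--swap-threshold` would then count as critical.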
Runs all monitors. Best used with a config file for complex setups.
bun run defib.ts all --config ./defib.config.json

Suppress alerts for a specific process. Use this when you've investigated a process and decided it's fine.
bun run defib.ts dismiss 12345

The process will not be re-alerted until it exits and a new process takes its PID.
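
One way to tell a dismissed process apart from a new process that reuses its PID is to store the start time alongside the PID. The sketch below assumes that approach; `ProcessIdentity` and `isDismissed` are hypothetical names, and this README does not say how defib detects PID reuse.

```typescript
// Hypothetical sketch: a dismissal is keyed by PID plus process start time,
// so a new process that reuses the PID is not treated as dismissed.
interface ProcessIdentity {
  pid: number;
  startTime: string; // e.g. the value printed by `ps -o lstart=`
}

function isDismissed(
  dismissed: ProcessIdentity[],
  current: ProcessIdentity,
): boolean {
  return dismissed.some(
    (d) => d.pid === current.pid && d.startTime === current.startTime,
  );
}
```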
defib has three action modes that control how it responds to issues:
| Mode | Behavior |
|---|---|
| `auto` | Execute the fix immediately |
| `ask` | Print human-friendly guidance with commands to copy-paste |
| `deny` | Alert only, no action or guidance |

| Action | Default | Why |
|---|---|---|
| `restartContainer` | auto | Containers are designed to restart safely |
| `killRunaway` | auto | Only kills processes matching safe-to-kill patterns |
| `killUnknown` | ask | Unknown processes need human review |
| `killSwapHog` | ask | Swap remediation is invasive |
| `restartForSwap` | ask | Service restarts need human review |
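
The defaults in the table can be expressed as a typed map, with any configured overrides layered on top. This is a sketch assuming a simple shallow merge; `DEFAULT_ACTIONS` and `resolveActions` are hypothetical names, not defib's API.

```typescript
// Hypothetical sketch: default action modes with per-action overrides.
type ActionMode = "auto" | "ask" | "deny";
type ActionName =
  | "restartContainer"
  | "killRunaway"
  | "killUnknown"
  | "killSwapHog"
  | "restartForSwap";

const DEFAULT_ACTIONS: Record<ActionName, ActionMode> = {
  restartContainer: "auto",
  killRunaway: "auto",
  killUnknown: "ask",
  killSwapHog: "ask",
  restartForSwap: "ask",
};

// Overrides replace only the actions they name; the rest keep their defaults.
function resolveActions(
  overrides: Partial<Record<ActionName, ActionMode>> = {},
): Record<ActionName, ActionMode> {
  return { ...DEFAULT_ACTIONS, ...overrides };
}
```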
When an action is set to ask, defib prints detailed guidance instead of taking action:
============================================================
🔴 ISSUE DETECTED: Runaway Process
============================================================
PID 12345 is using 95% CPU and has been running for 3.5 hours.
Process: node /app/worker.js
WHY THIS IS A PROBLEM:
This process is consuming almost all available CPU, which slows down
everything else on your system. After 3+ hours at this level, it's
likely stuck in a loop rather than doing useful work.
RECOMMENDED FIX:
Kill the process. It will free up CPU immediately. If this is a managed
service (PM2, systemd, Docker), it will auto-restart fresh.
TO FIX, RUN:
kill 12345
TO INVESTIGATE FIRST:
ps -p 12345 -o pid,pcpu,pmem,etime,args
cat /proc/12345/wchan 2>/dev/null
ls -la /proc/12345/fd 2>/dev/null | wc -l
TO IGNORE THIS ALERT:
defib dismiss 12345
============================================================
In your config file, add an actions section:
{
"webhookUrl": "...",
"actions": {
"restartContainer": "auto",
"killRunaway": "auto",
"killUnknown": "deny",
"killSwapHog": "auto",
"restartForSwap": "ask"
}
}

When an action is set to ask mode, defib can optionally use an AI model to analyze the issue and provide a tailored diagnosis instead of generic guidance.
This is completely optional. Without AI configured, defib prints useful hardcoded guidance. AI adds context-specific analysis of why a process might be misbehaving and what to do about it.
| Provider | Cost | Setup |
|---|---|---|
| `none` | Free | Default. No AI, hardcoded guidance only. |
| `ollama` | Free | Local. Install Ollama, run `ollama pull llama3.1:8b` |
| `anthropic` | Paid | API key from console.anthropic.com. Uses Claude Haiku. |
| `openai` | Paid | API key from platform.openai.com. Uses GPT-4o Mini. |
# Free local AI via Ollama
bun run defib.ts processes --ai ollama --safe-to-kill "node /app/worker"
# Anthropic (paid, most capable)
bun run defib.ts processes --ai anthropic --ai-key sk-ant-... --safe-to-kill "node /app/worker"
# Override the default model
bun run defib.ts processes --ai ollama --ai-model mistral:7b

Or in your config file:
{
"ai": {
"provider": "ollama",
"model": "llama3.1:8b",
"ollamaUrl": "http://localhost:11434"
}
}

| Provider | Default Model |
|---|---|
| `anthropic` | claude-haiku-4-20250414 |
| `openai` | gpt-4o-mini |
| `ollama` | llama3.1:8b |
AI diagnosis only runs when an action is in ask mode. If all your actions are auto or deny, AI is never called even if configured.
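
That gating rule is small enough to state as a predicate. The sketch below is illustrative only; `shouldRunAi` is a hypothetical name.

```typescript
// Hypothetical sketch: AI diagnosis runs only for ask-mode actions
// and only when a provider other than "none" is configured.
type ResponseMode = "auto" | "ask" | "deny";
type AiProvider = "none" | "ollama" | "anthropic" | "openai";

function shouldRunAi(provider: AiProvider, mode: ResponseMode): boolean {
  return provider !== "none" && mode === "ask";
}
```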
For complex setups, use a JSON config file:
{
"webhookUrl": "https://discord.com/api/webhooks/...",
"stateFile": "~/.local/state/defib/state.json",
"container": {
"healthUrl": "http://localhost:8000/health",
"composeDir": "/path/to/app",
"timeoutSeconds": 10,
"maxResponseSeconds": 15,
"backoffMinutes": 10,
"serviceName": "web"
},
"processes": {
"cpuThreshold": 80,
"memoryThresholdMB": 2000,
"maxRuntimeHours": 2,
"safeToKillPatterns": ["mcp-", "node.*watchdog"],
"ignorePatterns": ["postgres", "ollama", "code-server"]
},
"ai": {
"provider": "ollama",
"model": "llama3.1:8b",
"ollamaUrl": "http://localhost:11434"
},
"system": {
"swapThreshold": 80,
"checkDState": true,
"swapKillPatterns": ["electron", "chrome"],
"swapRestartCompose": {
"composeDir": "/path/to/app",
"serviceName": "web"
}
}
}

export DEFIB_WEBHOOK_URL=https://discord.com/api/webhooks/...
export DEFIB_HEALTH_URL=http://localhost:8000/health
export DEFIB_COMPOSE_DIR=/path/to/app
export DEFIB_AI_API_KEY=sk-...

defib is designed to run periodically, not as a daemon. Use cron, systemd timers, or PM2.
# Check containers every 2 minutes
*/2 * * * * /path/to/bun /path/to/defib.ts container --health http://localhost:8000/health --compose-dir /app
# Check processes every 15 minutes
*/15 * * * * /path/to/bun /path/to/defib.ts processes --safe-to-kill "node mcp-"
# Full health check every 5 minutes
*/5 * * * * /path/to/bun /path/to/defib.ts all --config /etc/defib/config.json

pm2 start defib.ts --name defib-container --cron "*/2 * * * *" --no-autorestart -- container --health http://localhost:8000/health --compose-dir /app

# /etc/systemd/system/defib.timer
[Unit]
Description=Run defib health check
[Timer]
OnCalendar=*:0/2
Persistent=true
[Install]
WantedBy=timers.target

Container monitoring:
- HTTP GET to health endpoint with configurable timeout
- If unhealthy → `docker-compose down && docker-compose up -d`
- Verify health after restart
- Enter backoff period to prevent thrashing

Process monitoring:
- Parse `ps` output for CPU, memory, runtime
- Flag processes exceeding thresholds
- Auto-kill if matches `safe-to-kill` pattern
- Track known issues to avoid duplicate alerts

System monitoring:
- Check swap usage via `free -m`
- If critical → kill matching processes and/or restart compose stack
- Check for D-state processes via `ps`
- Skip kernel threads and short D-states (normal I/O)
- Alert on resolution when issues clear
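
The D-state filtering described above might look roughly like this. The sketch rests on two assumptions that are not confirmed by this README: kernel threads are recognized by the square brackets `ps` puts around their command, and a "short" D-state is one that was not already present on the previous run. `PsRow` and `stuckCandidates` are hypothetical names.

```typescript
// Hypothetical sketch: pick out processes that look genuinely stuck in D-state.
interface PsRow {
  pid: number;
  stat: string; // process state column from ps, e.g. "D", "Ds", "S+"
  command: string;
}

function isKernelThread(command: string): boolean {
  // Assumption: ps shows kernel threads as "[name]".
  return command.startsWith("[") && command.endsWith("]");
}

function stuckCandidates(
  rows: PsRow[],
  inDStateLastRun: Set<number>,
): PsRow[] {
  return rows.filter(
    (row) =>
      row.stat.startsWith("D") &&
      !isKernelThread(row.command) &&
      // Assumption: only report a process seen in D-state on two consecutive runs.
      inDStateLastRun.has(row.pid),
  );
}
```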
Supports Discord and Slack webhooks. Notifications include:
- Container Restarted - Service was down, now recovered
- Container Restart FAILED - Manual intervention needed
- Runaway Process Killed - Auto-killed a safe process
- Runaway Process Detected - High CPU, needs attention
- High Memory Process - Memory hog detected
- Swap Critical - Auto-Remediated - Killed processes/restarted services
- Swap Pressure Critical - No auto-fix configured, manual action needed
- Stuck Process Detected - Process in D-state (uninterruptible sleep)
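
Discord and Slack webhooks accept different JSON bodies: Discord reads the message from a `content` field, Slack incoming webhooks from a `text` field. The sketch below picks the shape from the webhook URL. The function name `buildWebhookPayload` and the URL-based detection are assumptions, not a description of defib's code.

```typescript
// Hypothetical sketch: build a minimal JSON body for a Discord or Slack webhook.
function buildWebhookPayload(
  webhookUrl: string,
  title: string,
  detail: string,
): Record<string, string> {
  const message = `${title} - ${detail}`;
  // Slack incoming webhooks live under hooks.slack.com and expect { text }.
  if (webhookUrl.includes("hooks.slack.com")) {
    return { text: message };
  }
  // Otherwise assume a Discord webhook, which expects { content }.
  return { content: message };
}
```

The resulting object would be sent as the JSON body of an HTTP POST to the webhook URL.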
defib maintains state in ~/.local/state/defib/state.json (configurable via --state-file):
- Tracks restart backoff timers
- Remembers known issues to avoid duplicate alerts
- Cleans up resolved issues automatically
- State directory and file are created with restrictive permissions (700/600)
defib kills processes and restarts services. Use with care.
- Pattern validation - Patterns must be at least 3 characters and cannot be common dangerous terms like "node", "python", "bash", or ".". This prevents accidentally matching all processes.
- Path validation - Compose directory paths must be absolute and cannot contain shell metacharacters (`;`, `&`, `|`, `$`, backticks, etc).
- Secure state file - State is stored in `~/.local/state/defib/` with owner-only permissions (not in world-readable `/tmp`).
- Conservative defaults - Only `restartContainer` and `killRunaway` (for explicit safe-to-kill patterns) are set to "auto". Everything else requires human review.
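
The pattern and path rules above can be sketched as two validators that return an error message or `null`. This is a minimal illustration: the function names are hypothetical, and both the blocklist and the metacharacter set contain only the examples this README names, so defib's real lists may be longer.

```typescript
// Hypothetical sketch of the validation rules; lists are limited to the
// examples given in this README.
const DANGEROUS_PATTERNS = new Set(["node", "python", "bash", "."]);
const SHELL_METACHARACTERS = /[;&|$`]/;

function validatePattern(pattern: string): string | null {
  const trimmed = pattern.trim();
  if (trimmed.length < 3) {
    return "pattern must be at least 3 characters";
  }
  if (DANGEROUS_PATTERNS.has(trimmed.toLowerCase())) {
    return `pattern "${trimmed}" is too broad`;
  }
  return null;
}

function validateComposeDir(path: string): string | null {
  if (!path.startsWith("/")) {
    return "compose directory must be an absolute path";
  }
  if (SHELL_METACHARACTERS.test(path)) {
    return "compose directory must not contain shell metacharacters";
  }
  return null;
}
```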

Do not use defib:
- On multi-user systems - Other users could potentially exploit the process-killing behavior
- As root - defib can kill any process on the system when run as root
- With untrusted config files - Config files can specify patterns and paths
- Without testing patterns first - Always verify patterns match only what you intend

Recommendations:
- Run as a dedicated non-root user with minimal privileges
- Test patterns with `--ignore` (detection-only) before enabling `--safe-to-kill`
- Start with `actions.killUnknown: "deny"` and review alerts before enabling auto-kill
- Keep config files readable only by the user running defib
- Use specific patterns like `"node /app/worker.js"` rather than broad ones like `"worker"`
defib has an integration test suite that verifies security validations, monitoring, and container health detection.
cd test && ./run-tests.sh

Tests auto-detect Docker or Podman. Container tests require a working compose setup; they're marked optional and skipped gracefully if unavailable.
Like a defibrillator shocks a stopped heart back to life, defib shocks your stopped services back to health. It's the tool you hope you never need, but when you do, it's there.
MIT
