fix: graceful bridge restarts — SIGTERM handling + EADDRINUSE retry#161
benvinegar merged 2 commits into main
Three layers of defense against port conflicts during bridge restarts:
1. broker-bridge.mjs: SIGTERM/SIGINT handler closes the HTTP server cleanly before exiting, so the port is released immediately instead of lingering in TIME_WAIT.
2. broker-bridge.mjs: EADDRINUSE retry with backoff (5 attempts, 2s apart) so if the port IS briefly held, the bridge waits instead of crashing.
3. startup-cleanup.sh: Kill the tmux restart loop FIRST (prevents respawning), then SIGTERM the port holder and wait up to 3s for graceful exit before falling back to SIGKILL. The restart loop also checks port availability before each relaunch.
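As a rough illustration of the first layer, a shutdown handler along these lines would close the server on SIGTERM/SIGINT and exit once the close completes. This is a sketch, not the code from broker-bridge.mjs: `installGracefulShutdown`, its option names, and the exit code 1 on the forced path are assumptions. The 5s forced-exit timer mirrors the safety net mentioned in the PR description. The `proc` and `schedule` parameters exist only so the logic can be exercised without sending real signals.

```javascript
// Hypothetical helper; the real handler in broker-bridge.mjs may differ.
function installGracefulShutdown(server, options = {}) {
  const {
    signals = ["SIGTERM", "SIGINT"],
    forceExitMs = 5000,      // safety-net timeout taken from the PR description
    proc = process,          // injectable so tests need no real signals
    schedule = setTimeout,   // injectable so tests need no real timers
  } = options;

  let shuttingDown = false;

  function gracefulShutdown() {
    if (shuttingDown) return; // ignore repeated signals
    shuttingDown = true;

    // Stop accepting connections; the callback runs once the server has closed.
    server.close(() => proc.exit(0));

    // Assumed fallback: force an exit if open connections keep close() pending.
    const timer = schedule(() => proc.exit(1), forceExitMs);
    if (timer && typeof timer.unref === "function") timer.unref();
  }

  for (const signal of signals) proc.on(signal, gracefulShutdown);
  return gracefulShutdown;
}
```

With a real server this would be called once after creation, for example `installGracefulShutdown(server)` where `server` comes from `http.createServer(...)`. Because `server.close()` waits for open connections to finish, the forced-exit timer keeps a stuck connection from holding the process (and the port) indefinitely.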
Greptile Summary
This PR eliminates a recurring EADDRINUSE crash loop that occurred on every deploy/restart by adding three layers of defense.
The changes are well-scoped and focused on the specific race condition. The three layers (graceful shutdown, bind retry, ordered cleanup) provide good defense-in-depth. Existing tests pass and the integration test suite already exercises SIGTERM teardown of the bridge.
Confidence Score: 4/5
Important Files Changed
Sequence Diagram
sequenceDiagram
participant SC as startup-cleanup.sh
participant TM as tmux session
participant OB as old bridge process
participant NB as new bridge process
participant Port as Port 7890
Note over SC: Deploy/restart triggered
SC->>TM: kill-session (stop restart loop)
TM-->>OB: (no longer supervised)
SC->>OB: SIGTERM
OB->>OB: gracefulShutdown()
OB->>Port: server.close()
Port-->>OB: port released
OB->>OB: process.exit(0)
SC->>SC: wait up to 3s for port free
alt Port still held
SC->>OB: SIGKILL (fallback)
end
SC->>TM: new tmux session
TM->>NB: start bridge
NB->>Port: server.listen(7890)
alt EADDRINUSE
NB->>NB: retry (up to 5x, 2s backoff)
NB->>Port: server.listen(7890)
end
Port-->>NB: listening
Note over NB: Bridge operational
Last reviewed commit: cccf3b6
server.on("error", (err) => {
  if (err.code === "EADDRINUSE" && bindAttempt < MAX_BIND_RETRIES) {
    logWarn(`⚠️ port ${API_PORT} in use, retrying in ${BIND_RETRY_DELAY_MS}ms (attempt ${bindAttempt}/${MAX_BIND_RETRIES})`);
    setTimeout(tryListen, BIND_RETRY_DELAY_MS);
  } else {
Missing server.close() before retry
The Node.js documentation recommends calling server.close() before retrying server.listen() after an EADDRINUSE error, to properly reset the server's internal state. While the current code may work in practice (the failed bind doesn't create resources), adding server.close() follows the documented pattern and avoids potential edge cases in future Node.js versions.
Suggested change
- server.on("error", (err) => {
-   if (err.code === "EADDRINUSE" && bindAttempt < MAX_BIND_RETRIES) {
-     logWarn(`⚠️ port ${API_PORT} in use, retrying in ${BIND_RETRY_DELAY_MS}ms (attempt ${bindAttempt}/${MAX_BIND_RETRIES})`);
-     setTimeout(tryListen, BIND_RETRY_DELAY_MS);
-   } else {
+   if (err.code === "EADDRINUSE" && bindAttempt < MAX_BIND_RETRIES) {
+     logWarn(`⚠️ port ${API_PORT} in use, retrying in ${BIND_RETRY_DELAY_MS}ms (attempt ${bindAttempt}/${MAX_BIND_RETRIES})`);
+     server.close();
+     setTimeout(tryListen, BIND_RETRY_DELAY_MS);
Path: slack-bridge/broker-bridge.mjs, lines 1072-1076
Node.js documentation referenced in the comment: https://nodejs.org/api/net.html#event-error
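For reference, the retry-with-close pattern this comment describes might look like the sketch below. `listenWithRetry` and its option names are hypothetical and are not the bridge's actual `tryListen`. The defaults (5 retries, 2s delay) come from the PR description, and treating "5" as a retry count rather than a total attempt count is an assumption. The `server.close()` call before each re-listen follows the EADDRINUSE example in the Node.js `net` documentation.

```javascript
// Hypothetical helper; names and defaults are illustrative only.
function listenWithRetry(server, port, options = {}) {
  const {
    maxRetries = 5,                      // assumed to mean retries after the first failed bind
    delayMs = 2000,                      // fixed pause between attempts
    schedule = setTimeout,               // injectable so the timing can be tested
    onGiveUp = (err) => { throw err; },  // default: let the process crash as before
  } = options;

  let retries = 0;

  server.on("error", (err) => {
    if (err.code === "EADDRINUSE" && retries < maxRetries) {
      retries += 1;
      schedule(() => {
        server.close();      // reset the server before binding again
        server.listen(port);
      }, delayMs);
    } else {
      // Any other error, or retries exhausted: hand the error to the caller.
      onGiveUp(err);
    }
  });

  server.listen(port);
}
```

Any object with `on`, `listen`, and `close` works here, so a server from `http.createServer(...)` can be passed directly, for example `listenWithRetry(server, 7890)`. Calling `close()` on a server that never finished binding does not throw when no callback is passed, which is what makes the close-then-listen order safe.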
Problem
Every deploy/restart, the bridge crash-loops on EADDRINUSE for 30+ seconds because:
startup-cleanup.sh kills port 7890 with kill -9: no graceful shutdown, port may linger
This happens reliably on every deploy and requires manual debugging each time.
Fix
Three layers of defense:
1. Bridge: SIGTERM/SIGINT handler (broker-bridge.mjs)
Catches SIGTERM, calls server.close() to release the port cleanly, then exits. 5s forced-exit timeout as safety net.
2. Bridge: EADDRINUSE retry (broker-bridge.mjs)
Instead of crashing on EADDRINUSE, retries up to 5 times with 2s backoff. If the port is briefly held by a dying predecessor, the bridge just waits.
3. Startup: ordered cleanup (startup-cleanup.sh)
Testing