Durable Object reset loop prevents gateway startup after rapid deployments #238

@Batsirai

Issue

When deploying moltworker to Cloudflare Workers, rapid deployments (multiple deployments within 5-10 minutes) cause a Durable Object reset loop that prevents the OpenClaw gateway from starting.

Error Messages

Failed to start process: Error: Durable Object reset because its code was updated.
[PROXY] Failed to start Moltbot: Error: Durable Object reset because its code was updated.

Environment

  • Platform: Cloudflare Workers with Durable Objects + Container bindings
  • OpenClaw Version: 2026.2.3-1
  • Moltworker: Based on cloudflare/moltworker architecture
  • Container: Docker with openclaw gateway running in Cloudflare Sandbox

Steps to Reproduce

  1. Deploy moltworker to Cloudflare
  2. Wait for gateway to start successfully
  3. Deploy again within 5 minutes (e.g., bug fix or feature change)
  4. Deploy a third time within another 5 minutes
  5. Observe: Gateway fails to start with "Durable Object reset" errors in a loop

Expected Behavior

Gateway should recover gracefully.

Actual Behavior

  • Gateway startup is interrupted by another DO reset

  • Gateway never becomes ready on port 18789
  • Process times out after 90 seconds
  • Only resolves after waiting 5-10+ minutes without any deployments

Impact

  • Production downtime during multiple deployments
  • Cannot do rapid iteration/bug fixes in production
  • Data is safe (R2 backup/restore works correctly), but service is unavailable during reset loop

Workaround

Wait 5-10 minutes between deployments to allow the Durable Object to fully stabilize before deploying again.
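Until the underlying behavior is better handled, the waiting period could be enforced in a deploy script rather than remembered by hand. The helper below is a minimal sketch: `msUntilNextDeploy` and the 10-minute default are hypothetical, chosen from the upper end of the range observed above, and are not part of moltworker or wrangler.

```typescript
// Hypothetical helper: how long to wait before the next deploy is "safe".
// The 10-minute default is an assumption taken from the 5-10 minute range
// observed in this issue, not a documented Cloudflare guarantee.
function msUntilNextDeploy(
  lastDeployMs: number,
  nowMs: number,
  minGapMs: number = 10 * 60 * 1000,
): number {
  // Never return a negative wait: 0 means the gap has already elapsed.
  return Math.max(0, lastDeployMs + minGapMs - nowMs);
}
```

A deploy script could record the timestamp of the last successful deploy and refuse (or sleep) while this returns a non-zero value.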

Proposed Solutions

  1. Better error handling: Detect DO reset scenarios and retry with exponential backoff
  2. Startup s[...] in progress [...] guide (batch changes, avoid rapid deploys)
  3. Graceful degradation: Return a "deployment in progress" status instead of timing out
  4. Gradual rollouts: Consider using Workers deploy_config.version_id for canary deployments
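
To make proposals 1 and 3 more concrete, here is a rough sketch of how the startup path might detect the reset error, retry with exponential backoff, and fall back to a "deployment in progress" result instead of timing out. All names here (`isDurableObjectReset`, `backoffDelayMs`, `startWithRetry`) and the default delays are hypothetical, not moltworker's actual API; the detection simply matches the error message quoted above, which is an assumption about how the error surfaces.

```typescript
// Sketch only: names and defaults are illustrative, not moltworker's API.

/** True when an error looks like the reset reported in this issue. */
function isDurableObjectReset(err: unknown): boolean {
  const message = err instanceof Error ? err.message : String(err);
  return message.includes("Durable Object reset because its code was updated");
}

/** Delay before retry number `attempt` (0-based): base * 2^attempt, capped. */
function backoffDelayMs(attempt: number, baseMs = 1000, maxMs = 30000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

type StartResult<T> =
  | { status: "started"; value: T }
  | { status: "deployment_in_progress"; attempts: number };

/**
 * Calls `start` until it succeeds or `maxAttempts` is exhausted.
 * Only DO-reset errors are retried; anything else is rethrown.
 * `sleep` is injectable so the loop can be exercised without real delays.
 */
async function startWithRetry<T>(
  start: () => Promise<T>,
  maxAttempts = 5,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<StartResult<T>> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return { status: "started", value: await start() };
    } catch (err) {
      if (!isDurableObjectReset(err)) throw err; // unrelated failure: surface it
      // Wait before the next attempt, but not after the final one.
      if (attempt < maxAttempts - 1) await sleep(backoffDelayMs(attempt));
    }
  }
  // Every attempt hit a reset: report that instead of hanging until timeout.
  return { status: "deployment_in_progress", attempts: maxAttempts };
}
```

The proxy could then map `deployment_in_progress` to an HTTP 503 with a `Retry-After` header, so clients see an explicit status rather than a 90-second timeout.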

Additional Context

  • Using R2 for persistent storage (config, skills, conversations)
  • R2 restore completes successfully before the reset occurs
  • The issue is purely with the Durable Object lifecycle during rapid code updates
  • This appears to be a Cloudflare platform limitation, but better handling would improve the deployment experience

Related

This might be related to how Durable Objects handle alarm() during code updates - our KeepAlive DO pings the Sandbox DO every 30 seconds, which may interact poorly with deployments.
