Skip to content

Conversation

@kptdobe
Copy link
Contributor

@kptdobe kptdobe commented Aug 28, 2025

This is a first attempt to fix #76

@karlpauls and I brainstormed and the current hypothesis is:

  • The DocRoom is the Durable Object.
  • It contains multiple properties, including the storage.
  • It creates ydoc instances which are cached in the docs map.
  • This docs object is a global object, not necessarily bound to the Durable Object.
  • Depending on how the worker memory is managed, it may happen that this global object still contains references to the DO (especially since the docs are attached to the storage)
  • Sometimes, it may happen that the DO instance or the global object are updated / refreshed / recreated / ? and then there is some old reference still around.
  • This "might" lead to the error

This is a pure hypothesis and we need to make sure:

  1. the code is doing the same thing - all tests still work as before, just needed to tweak some of them
  2. deploy in production and monitor to see if this occurs again

In any case, I think we should continue re-working the code: the DocRoom code should be isolated in its own file and, in the ideal scenario, contain all code it need - not sharing anything, to avoid those kind of memory mix. This would allow to clearly visualize what belongs to the DO (super singleton) vs what belong to the worker.

@github-actions
Copy link

github-actions bot commented Aug 28, 2025

LCOV of commit 4a1b60c during Install, lint, and test #206

Summary coverage rate:
  lines......: 95.3% (1652 of 1734 lines)
  functions..: 96.7% (58 of 60 functions)
  branches...: no data found

Files changed coverage rate:
                  |Lines       |Functions  |Branches    
  Filename        |Rate     Num|Rate    Num|Rate     Num
  ======================================================
  src/edge.js     | 2.5%    360| 0.0%     8|    -      0
  src/shareddoc.js| 3.5%    569| 0.0%    20|    -      0

Copy link
Contributor

@bosschaert bosschaert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thanks for looking into this!

src/edge.js Outdated
this.docs.set(docName, ydoc);
}

webSocket.addEventListener('close', () => {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is correct. Wouldn't this basically remove the doc if a connection goes away?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On connection close, the doc is removed from the map ("local cache"). This is the same logic as before, just moved the code here to have access to the cache.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, what was missing in the new version is the check if there is no more connections. Added.

// when bound.
doc.promise = persistence.bindState(docname, doc, conn, storage);
}
doc.promise = persistence.bindState(docname, doc, conn, storage, docsCache);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is not doing what happened before. IIRC, the idea was to use the promise to make sure the doc doesn't get bound to persistence called concurrently. @bosschaert, are you sure this is still correct (I would have expected to have this move up to the caller like it used to for the await)...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, the doc.promise should only be set if it wasn't there yet. Then we await on it in the next line of code. If it resolved already the await will be a no-op.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I restored the logic but split the method in 2: createYDoc and setupYDoc. Create is only called once, setup each time a new session starts.
Too bad this is not tested.

@bosschaert bosschaert self-requested a review August 28, 2025 14:55
Copy link
Contributor

@bosschaert bosschaert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, but I would also get approval from @karlpauls

@auniverseaway
Copy link
Member

Do we want to pull this down to stage given the comments?

@kptdobe
Copy link
Contributor Author

kptdobe commented Sep 4, 2025

We could go to stage first but this means someone would test there. I tested it locally as much as I can, basic features work as expected (editing, saving, awareness). What I cannot control is the long run on CF but this will be true also on staging. We can merge to prod and I'll monitor as I do for all changes I am doing on this project anyway.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Intermittent Durable Object Errors

6 participants