
Worker: after disconnect, runs don't seem to recover #1179

@josephjclark

Not sure about this; it may just be a duplicate of #1178.

In a flurry of lost runs yesterday, one pattern I noticed was:

  • The connection to Lightning is lost
  • The worker reconnects
  • Any messages on the wire are lost: the message in flight when the connection drops never seems to get delivered (citation needed)
  • If that message happened to be step:complete, we're dead
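
One possible mitigation for the sequence above, sketched under assumptions rather than taken from the worker's code: keep every sent event in a buffer until the server acknowledges it, and resend whatever is still unacknowledged after a reconnect. `ReplayBuffer`, `WorkerEvent`, and the `send` callback are hypothetical names, and the sketch assumes the server acknowledges each event by id.

```typescript
// Hypothetical sketch: none of these names come from the worker codebase.
type WorkerEvent = { id: number; name: string; payload: unknown };
type Send = (event: WorkerEvent) => void;

class ReplayBuffer {
  private nextId = 1;
  // Events sent but not yet acknowledged. A Map keeps insertion (send) order.
  private pending = new Map<number, WorkerEvent>();
  private send: Send;

  constructor(send: Send) {
    this.send = send;
  }

  // Record the event before sending, so it survives a dropped connection.
  push(name: string, payload: unknown): number {
    const event: WorkerEvent = { id: this.nextId++, name, payload };
    this.pending.set(event.id, event);
    this.send(event);
    return event.id;
  }

  // The server confirmed receipt (assumed behaviour), so forget the event.
  ack(id: number): void {
    this.pending.delete(id);
  }

  // Call after a reconnect: resend everything still unacknowledged, in order.
  replay(): void {
    for (const event of this.pending.values()) {
      this.send(event);
    }
  }

  get size(): number {
    return this.pending.size;
  }
}
```

With this shape, a step:complete that was in flight when the connection dropped would be resent by `replay()` once the worker reconnects. Replay can deliver the same event twice, so the receiving side would need to ignore ids it has already processed.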

There's also a risk at this point that the run has been marked as Lost by Lightning.

Actually, I think the real problem is that once a run is marked Lost, that's it, game over. If the worker was just held up and all the events come home an hour later for /reasons/, the run will still be Lost, even though all the information gets piped in.
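
To illustrate the alternative, here is a minimal sketch of a run state transition in which Lost is provisional rather than final. It is not how Lightning models runs: the state names other than Lost, and the functions `markLost` and `applyRunComplete`, are hypothetical and only show the idea that a late completion event could still settle a Lost run.

```typescript
// Hypothetical state names; only "lost" corresponds to a state named in this issue.
type RunState = "started" | "lost" | "success" | "failed";
type Outcome = "success" | "failed";

// Assumption: only a reported outcome is truly final. "lost" can be superseded.
const FINAL: ReadonlySet<RunState> = new Set<RunState>(["success", "failed"]);

// Called when the server gives up waiting on a run.
function markLost(current: RunState): RunState {
  // Never overwrite an outcome that has already been reported.
  return current === "started" ? "lost" : current;
}

// Called when the worker's completion event finally arrives, however late.
function applyRunComplete(current: RunState, reported: Outcome): RunState {
  // Ignore duplicate or conflicting completions once the run is settled.
  if (FINAL.has(current)) return current;
  // Covers both "started" and "lost": late events recover a lost run.
  return reported;
}
```

Under this sketch, a run that was held up for an hour would go from started to lost and then to success once its events arrive, instead of staying Lost.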

Metadata

Status: DevX Backlog