Not sure about this, and it may just be a duplicate of #1178
In a flurry of lost runs yesterday, one pattern I noticed is:
- Connection to lightning is lost
- The worker reconnects
- Any messages on the wire are lost: the message in flight when the connection drops never seems to get delivered (citation needed)
- If that message happened to be step:complete, we're dead
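One way to guard against the dropped in-flight message would be for the worker to keep every event until it is acknowledged, and resend anything unacknowledged after a reconnect. The sketch below is only an illustration of that idea: `Outbox`, `publish`, `ack` and `flush` are hypothetical names, not anything in the worker or lightning today, and it assumes lightning acks each event by id.

```typescript
// Hypothetical sketch: none of these names exist in the worker or lightning.
type WorkerEvent = { id: number; name: string; payload: unknown };
type Send = (event: WorkerEvent) => void;

class Outbox {
  private pending = new Map<number, WorkerEvent>();
  private nextId = 1;
  private send: Send;

  constructor(send: Send) {
    this.send = send;
  }

  // Record the event first, then attempt to send it.
  // It stays pending until ack() is called with its id.
  publish(name: string, payload: unknown): number {
    const event: WorkerEvent = { id: this.nextId++, name, payload };
    this.pending.set(event.id, event);
    this.trySend(event);
    return event.id;
  }

  // Assumed to be called when the server acknowledges an event.
  ack(id: number): void {
    this.pending.delete(id);
  }

  // Call after a reconnect: resend everything unacknowledged, oldest first.
  flush(): void {
    for (const event of this.pending.values()) {
      this.trySend(event);
    }
  }

  get size(): number {
    return this.pending.size;
  }

  private trySend(event: WorkerEvent): void {
    try {
      this.send(event);
    } catch {
      // Connection is down; the event stays pending for the next flush().
    }
  }
}
```

Resending gives at-least-once delivery, so the receiving side would need to ignore an event id it has already processed.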
There's also a risk at this point that the run has been marked as Lost by lightning.
Actually, I think the real problem is that once a run is marked Lost, that's it, game over. If the worker was just held up, and all the events come home an hour later for /reasons/, the run will still be Lost, even though all the information eventually gets piped in.
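To make the alternative concrete, here is a rough sketch of a state transition where Lost is recoverable rather than terminal. The state names, event shapes and `nextState` function are all made up for illustration and are not lightning's actual model; the only point is that a late completion event can still move a run out of `lost`.

```typescript
// Hypothetical sketch: these states and events are assumptions, not lightning's real model.
type RunState = "started" | "lost" | "success" | "failed";

type RunEvent =
  | { kind: "timeout" } // the janitor gave up waiting for the worker
  | { kind: "complete"; result: "success" | "failed" }; // the worker reported a final result

function nextState(state: RunState, event: RunEvent): RunState {
  switch (event.kind) {
    case "timeout":
      // Only a run that is still in progress can be marked lost.
      return state === "started" ? "lost" : state;
    case "complete":
      // A late completion rescues a lost run instead of being ignored.
      // Runs that already have a final result are left alone.
      return state === "started" || state === "lost" ? event.result : state;
  }
}
```

With this shape, a run that timed out and later receives its final event ends up with the real result, while a finished run is never flipped back to `lost`.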