-
-
Notifications
You must be signed in to change notification settings - Fork 260
fix: timing issue in S3 locker to prevent zombie locks #812
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Pull Request Test Coverage Report for Build 20110090962Details
💛 - Coveralls |
| } | ||
|
|
||
| async renewLock(id: string): Promise<boolean> { | ||
| async renewLock(id: string, checkLocked: () => boolean): Promise<boolean> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any specific reason why we need to pass a function a not check directly this.isLocked?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is in renewLock() wich is a part of S3Locker, but isLocked is part of S3Lock. So when S3Lock calls this.locker.renewLock() it passes in a function to allow S3Locker to check the lock state of S3Lock.
Alternatives would be
- move the lock states to S3Locker with a
map<string, boolean>(id -> isLocked) and share it that way - pass in the whole S3Lock instance (instead of checkLocked) and expose isLocked from S3Lock (currently private) so the locker can check the lock directly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this fix wouldn't work in a distributed environment, it would only work on a single instance setup.
I think the proper fix here is to use If-Match directive in the PutObject after retrieving the current lock, and only write the file if the etag matches with the current lock. ex:
// Get current lock to verify ownership
const response = await this.s3Client.send(
new GetObjectCommand({
Bucket: this.bucket,
Key: lockKey,
})
)
// Update the lock with new expiration
await this.s3Client.send(
new PutObjectCommand({
Bucket: this.bucket,
Key: lockKey,
IfMatch: response.Etag, // <--- here is the trick
Body: JSON.stringify(updatedLock),
ContentType: 'application/json',
Metadata: {
lockId: id,
expiresAt: updatedLock.expiresAt.toString(),
},
})
)This way it won't create the zombie lock if there isn't a file with the same etag and it should work in a distributed environment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might also want to add a check if (id === currentLock.id) before sending the put
What kind of change does this PR introduce?
Bug fix
What is the current behavior?
There is a potential race condition in the S3 locker that can cause zombie locks to be created if the lock is unlocked during the renewal operation (Between get and put)
What is the new behavior?
Check that the lock is still locked after the S3 get operation completes and exit if it was already unlocked
Additional context
This issue was causing intermittent test failures which would manifest in the "double unlock" test but was actually caused by the abort test leaving behind zombie objects. Running the S3 locker tests in a loop I was able to reproduce this semi-consistently. Usually it would take only a few test runs, but sometimes it'd go upwards of 50 runs without a failure.
The newly added test expands on the "handle abort signal" test and consistently fails without this fix.