Cache corruption protection: Atomic replacement of manifests, statistics and object files #286
Conversation
According to the documentation and some internet posts, os.replace(...) is atomic (https://docs.python.org/3/library/os.html#os.replace). This commit changes the code writing the manifest and the statistics files to write to a temporary file and then use a new atomicRename() function to atomically replace the old file with the new contents. It should avoid corrupting the cache when the clcache process crashes or is killed during a write, partly addressing the root cause of issue frerich#263. Note that the object files are still not updated atomically.
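For illustration, a minimal sketch of the write-then-replace pattern this commit describes; the function name writeJsonAtomically is made up here and does not match the actual clcache code.

import json
import os

def writeJsonAtomically(destFile, data):
    # Write the new contents to a sibling temporary file first ...
    tempFile = destFile + ".new"
    with open(tempFile, 'w') as f:
        json.dump(data, f, sort_keys=True, indent=4)
    # ... then swap it into place. os.replace() is documented to be atomic,
    # so readers see either the complete old file or the complete new file,
    # never a partially written one.
    os.replace(tempFile, destFile)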
Use atomicRename() to save and restore the object files from the cache. It should reduce problems with corrupted object files in the cache as reported in issue frerich#263. While it fixes the most important part (corrupted object files), it still does not completely fix the problem, as the cache entry is composed of the object file, stdout and stderr, which should be replaced atomically.
If writing the json files is atomic, reading can be done safely without locks.
Since the manifest is not locked for reading before calling touchEntry it may have changed on disk. It is necessary to lock, read the manifest, touch the entry and write it back as a single operation. touchEntry(...) also cannot rely on the index being valid after the read, so it was re-implemented to use the objectKey to find the correct entry.
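A rough sketch of what this commit describes, assuming a JSON manifest with an "entries" list and LRU-style touching (move the entry to the front); the file layout and helper names are illustrative, not the actual clcache implementation, and it reuses writeJsonAtomically from the sketch above.

def touchEntry(manifestPath, objectHash, lock):
    # Take the lock first and re-read the manifest from disk: without the
    # lock the file may have changed, so an index captured earlier is stale.
    with lock:
        with open(manifestPath, 'r') as f:
            entries = json.load(f)["entries"]
        # Locate the entry by its object hash instead of trusting an old index.
        entryIndex = next((i for i, e in enumerate(entries)
                           if e["objectHash"] == objectHash), None)
        if entryIndex is not None:
            # Assumed LRU-style bookkeeping: move the touched entry to the front.
            entries.insert(0, entries.pop(entryIndex))
            writeJsonAtomically(manifestPath, {"entries": entries})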
Hide locking inside these functions to have more fine-grained control. Since writing to the statistics file is now atomic, locking is not needed.
This looks interesting! Working on concurrency is always tricky business, so I'll need to think a bit about these patches to see whether I can understand what's going on. My first thought was that I recalled a comment explaining that
Just a quick heads up: alas, I still didn't find a quiet moment to review your changes. This kind of patch can be quite subtle so I'd rather do a careful review instead of just a superficial look. ;-)
Sure, there is no rush. I was considering making the writes to the cache dir atomic following the same pattern: writing to a temporary directory + atomic replace, but I didn't have time to implement anything useful yet.
clcache.py (Outdated)

def atomicRename(tempFile, destFile):
    os.replace(tempFile, destFile)
I'm not sure this function pulls its own weight; atomicRename is even longer than os.replace and chances are that the latter is already known to readers of the code.
ok, replaced by the context manager and os.replace(...)
def resetStatistics(cache):
-     with cache.statistics as stats:
+     with cache.statistics.lock, cache.statistics as stats:
If writing the statistics happens atomically now, is it really necessary to lock?
Yes, I think so. In general writes need to be protected to avoid other processes updating the information at the same time (based on old information). It is a bit different in this specific part of the code because the resulting statistics is all zeroes but we need to lock anyway because the atomic write always writes first to outputPath + ".new" and we need to prevent two processes doing so at the same time.
We could avoid the lock here by making atomicWrite write to new temporary unique files but I don't think it is worth it at the moment.
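A sketch of the alternative mentioned here: uniquely named temporary files so that two processes never collide on outputPath + ".new". The name atomicWriteUnique is hypothetical, not clcache's actual atomicWrite.

import os
import tempfile

def atomicWriteUnique(destFile, text):
    # Create a uniquely named temporary file in the destination directory so
    # that os.replace() stays on one filesystem and remains atomic.
    fd, tempPath = tempfile.mkstemp(dir=os.path.dirname(destFile) or '.')
    try:
        with os.fdopen(fd, 'w') as f:
            f.write(text)
        os.replace(tempPath, destFile)
    except Exception:
        os.unlink(tempPath)
        raise

As the next comment points out, this would only remove the collision on the temporary file itself; the reset still has to be serialized with concurrent read-modify-write updates.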
Thinking a bit more, even if the temporary file did not always have the same name we would need to lock. We don't want another process to write old information right after we reset the statistics.
It's true that the way the temporary file name is currently constructed requires locking. However, I'm not sure I agree with your second comment yet. Can you give an example of how two concurrent clcache processes (one of which is merely resetting the statistics) might interact in a way that yields broken statistics if no locking is performed when resetting the statistics? The only potential issue I could think of was:
- clcache instance A reads the statistics in order to bump a counter (e.g. increase the number of cache hits)
- clcache instance B resets the statistics
- clcache instance A writes out the new statistics, with an increased counter
If this were to happen, the number of cache hits at the end would not be 1 but rather 1 plus whatever the value was before the stats were reset.
However, from looking at the code, all the updates to the cache stats seem to follow the theme
with cache.statistics.lock, cache.statistics as stats:
    stats.registerCacheHit()

I.e. reading, updating and writing the cache is always an atomic operation.
Based on your example, instance A is going to register a cache hit and instance B is going to reset the statistics (without the lock):
- A: acquires the lock
- A: reads current statistics, number of hits = 123
- B: resets statistics without acquiring the lock, number of hits on disk = 0
- A: writes updated statistics, number of hits on disk = 123 + 1
Resulting in incorrect data. If resetStatistics() acquires the lock this situation is not possible as either A or B would have to wait for the other process to finish the write.
- A: acquires the lock
- A: reads current statistics, number of hits = 123
- B: about to reset statistics, blocks trying to acquire the lock
- A: writes updated statistics, number of hits on disk = 123 + 1
- A: releases the lock
- B: succeeds acquiring the lock, writes all zeros in statistics
- B: releases the lock
or
- B: acquires the lock
- A: about to bump cache hits, blocks trying to acquire the lock
- B: writes all zeros in statistics file
- B: releases the lock
- A: succeeds acquiring the lock
- A: reads current statistics, number of hits = 0
- A: writes updated number of hits = 0 + 1
- A: releases the lock
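To make the interleavings above concrete, a self-contained toy version of the two paths; in clcache the lock is an inter-process lock and the "disk" is the statistics JSON file, but the structure is the same: both writers take the same lock around the whole read-modify-write.

import threading

statsLock = threading.Lock()       # stand-in for cache.statistics.lock
statsOnDisk = {"CacheHits": 123}   # stand-in for the statistics file

def registerCacheHit():            # instance A's path
    with statsLock:
        current = dict(statsOnDisk)            # read
        current["CacheHits"] += 1              # update
        statsOnDisk.clear()
        statsOnDisk.update(current)            # write back

def resetStatistics():             # instance B's path
    with statsLock:                # same lock, so it cannot run between
        statsOnDisk.clear()        # A's read and A's write
        statsOnDisk.update({"CacheHits": 0})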
D'oh, of course you're right. My own example was giving an argument in favor of locking here.
clcache.py (Outdated)

for i, e in enumerate(self.entries()):
    if e.objectHash == objectHash:
        entryIndex = i
        break
I think you could shorten this to
entryIndex = next((i for i, e in enumerate(self.entries()) if e.objectHash == objectHash), 0)
Indeed.
clcache.py (Outdated)

tempFileName = self._fileName + ".new"
with open(tempFileName, 'w') as f:
    json.dump(self._dict, f, sort_keys=True, indent=4)
atomicRename(tempFileName, self._fileName)
Seeing this repetition makes me wonder whether it might be nice to have a little context manager like
@contextlib.contextmanager
def atomicallyWritten(fn):
    tempFileName = fn + ".new"
    with open(tempFileName, 'w') as f:
        yield f
    os.replace(tempFileName, fn)

...so you could just replace open(self._fileName, 'w') with atomicallyWritten(self._fileName). Maybe not worth it, given that it's just two occurrences. I'm mostly thinking out loud.
It makes sense; let's unify it for the moment. It can be unfolded later on if it is not needed anymore.
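Assuming the context manager sketched in the previous comment, the save path from the earlier diff would then read roughly like this (the enclosing method name is illustrative, not necessarily the actual clcache method).

def save(self):
    # open(self._fileName, 'w') becomes the atomic variant; the temporary
    # file is swapped into place only after json.dump() has completed.
    with atomicallyWritten(self._fileName) as f:
        json.dump(self._dict, f, sort_keys=True, indent=4)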
frerich left a comment
I think I understand it all now and it seems plausible to me. Just one thing I was wondering:
There are still two places where files are written without using atomicWrite yet. Is this intentional? I found CompilerArtifactsSection._setCachedCompilerConsoleOutput and CacheFileStrategy.deserializeCacheEntry.
I left them out only partly on purpose, as I feel that those are less important and require more discussion. The cache artifacts are 3 files (object, stdout, stderr); if we want to never corrupt the cache when the process is killed in the middle of a write we would:
- write the 3 files to a temporary directory
- atomically replace the cache entry with that directory
Which would guarantee that the 3 files are always in sync. Currently we have the atomic replacement of the object file (which is the biggest of the 3) but no protection for the other 2 and no protection for the cache entry as a whole. I think that replacing the whole cache entry is more interesting than adding individual file protection now, but I would do that in another PR if you agree. After the cache entry is replaced atomically we can remove the signal handlers ;)
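Not part of this PR, but a rough sketch of the "replace the whole cache entry" idea, assuming an entry is a directory holding the three files; the helper and file names are illustrative, and the final rename assumes the entry directory does not exist yet (replacing an existing entry would need extra handling).

import os
import shutil
import tempfile

def storeCacheEntry(entryDir, objectFile, stdout, stderr):
    # Build the complete entry next to its final location first ...
    tempDir = tempfile.mkdtemp(dir=os.path.dirname(entryDir))
    try:
        shutil.copyfile(objectFile, os.path.join(tempDir, "object"))
        with open(os.path.join(tempDir, "output.txt"), 'w') as f:
            f.write(stdout)
        with open(os.path.join(tempDir, "stderr.txt"), 'w') as f:
            f.write(stderr)
        # ... then publish all three files with a single rename, so a killed
        # process can never leave a half-written entry behind.
        os.rename(tempDir, entryDir)
    except Exception:
        shutil.rmtree(tempDir, ignore_errors=True)
        raise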
Makes sense!
Merged this now, thanks a lot for your work!
Unfortunately Windows does not allow replacing a file that is opened for reading by another process/thread. Reads and writes need to be serialized by using the locks. This reverts the locking of manifests and statistics to the state before pull request frerich#286.
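A minimal sketch of the serialization this revert restores: the read path and the write path take the same lock, so os.replace() never runs while another process still has the file open for reading. The lock is any context-manager lock; the file layout is simplified and the function names are illustrative.

import json
import os

def readStatistics(statsFile, lock):
    # Reading also happens under the lock: on Windows, os.replace() in a
    # concurrent writer fails while this process has the file open.
    with lock:
        with open(statsFile, 'r') as f:
            return json.load(f)

def writeStatistics(statsFile, lock, stats):
    with lock:
        tempFile = statsFile + ".new"
        with open(tempFile, 'w') as f:
            json.dump(stats, f, sort_keys=True, indent=4)
        os.replace(tempFile, statsFile)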
The signal handlers were introduced to avoid cache corruption when the clcache process is stopped in the middle of a write to the cache (statistics, manifests or objects). See frerich#233. Even if SIGINT and SIGTERM were ignored the cache still had a chance to be corrupted in the event of a SIGTERM (which cannot be ignored). Since frerich#233 the writing of files to the cache has been improved to replace the files atomically, reducing the risk of storing corrupted files in the cache. See pull requests frerich#286, frerich#292 and frerich#296. Therefore ignoring these signals is not needed anymore.
In this PR the writing of manifests, statistics and object files is done following this pattern:
- write the new contents to a temporary file next to the destination
- atomically replace the destination file with the temporary file using os.replace(...)
According to the documentation and what I read on the internet, os.replace(...) is an atomic operation, meaning that it can be used to prevent cache corruption if the process is stopped in the middle of a write operation. This should help with the root cause of #263.
In addition, if writing files is atomic, reading those files can be done without locking. The last 3 commits in the PR exploit that fact to avoid locking the manifests or statistics for reading. This may need some more testing, so I am fine moving them to another PR. Theoretically this would make processDirect faster with a cold cache because it doesn't need to lock the manifest to decide that it is a miss, but my measurements do not show any relevant change in performance. Another benefit of not locking is that clcache -s does not hang until it can lock the statistics file. I tested these changes locally and I didn't notice any problems.