Cache corruption protection: Atomic replacement of manifests, statistics and object files #286
Conversation
According to the documentation and some internet posts, os.replace(...) is atomic (https://docs.python.org/3/library/os.html#os.replace). This commit changes the code writing the manifest and the statistics files to write to a temporary file and then use a new atomicRename() function to atomically replace the old file with the new contents. It should avoid corrupting the cache when the clcache process crashes or is killed during a write, partly addressing the root cause of issue frerich#263. Note that the object files are still not updated atomically.
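For illustration, a minimal sketch of the write-then-replace pattern this commit describes; the function name writeJsonAtomically is made up here and does not match the actual clcache code.

import json
import os

def writeJsonAtomically(destFile, data):
    # Write the new contents to a sibling temporary file first ...
    tempFile = destFile + ".new"
    with open(tempFile, 'w') as f:
        json.dump(data, f, sort_keys=True, indent=4)
    # ... then swap it into place. os.replace() is documented to be atomic,
    # so readers see either the complete old file or the complete new file,
    # never a partially written one.
    os.replace(tempFile, destFile)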
Use atomicRename() to save and restore the object files from the cache. It should reduce problems with corrupted object files in the cache as reported in issue frerich#263. While it fixes the most important part (corrupted object files), it still does not completely fix the problem, as the cache entry is composed of the object file, stdout and stderr, which should be replaced atomically.
If writing the json files is atomic, reading can be done safely without locks.
Since the manifest is not locked for reading before calling touchEntry it may have changed on disk. It is necessary to lock, read the manifest, touch the entry and write it back as a single operation. touchEntry(...) also cannot rely on the index being valid after the read, so it was re-implemented to use the objectKey to find the correct entry.
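A rough sketch of what this commit describes, assuming a JSON manifest with an "entries" list and LRU-style touching (move the entry to the front); the file layout and helper names are illustrative, not the actual clcache implementation, and it reuses writeJsonAtomically from the sketch above.

def touchEntry(manifestPath, objectHash, lock):
    # Take the lock first and re-read the manifest from disk: without the
    # lock the file may have changed, so an index captured earlier is stale.
    with lock:
        with open(manifestPath, 'r') as f:
            entries = json.load(f)["entries"]
        # Locate the entry by its object hash instead of trusting an old index.
        entryIndex = next((i for i, e in enumerate(entries)
                           if e["objectHash"] == objectHash), None)
        if entryIndex is not None:
            # Assumed LRU-style bookkeeping: move the touched entry to the front.
            entries.insert(0, entries.pop(entryIndex))
            writeJsonAtomically(manifestPath, {"entries": entries})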
Hide locking inside these functions to have more fine-grained control. Since writing to the statistics file is now atomic, locking is not needed.
This looks interesting! Working on concurrency is always tricky business, so I'll need to think a bit about these patches to see whether I can understand what's going on. My first thought was that I recalled a comment explaining that
Just a quick heads up: alas, I still didn't find a quiet moment to review your changes. This kind of patch can be quite subtle so I'd rather do a careful review instead of just a superficial look. ;-)
Sure, there is no rush. I was considering making the writes to the cache dir atomic following the same pattern: writing to a temporary directory + atomic replace, but I didn't have time to implement anything useful yet.
clcache.py (Outdated)

def atomicRename(tempFile, destFile):
    os.replace(tempFile, destFile)
I'm not sure this function pulls its own weight; atomicRename is even longer than os.replace and chances are that the latter is already known to readers of the code.
ok, replaced by the context manager and os.replace(...)
def resetStatistics(cache):
-     with cache.statistics as stats:
+     with cache.statistics.lock, cache.statistics as stats:
If writing the statistics happens atomically now, is it really necessary to lock?
Yes, I think so. In general writes need to be protected to avoid other processes updating the information at the same time (based on old information). It is a bit different in this specific part of the code because the resulting statistics is all zeroes but we need to lock anyway because the atomic write always writes first to outputPath + ".new" and we need to prevent two processes doing so at the same time.
We could avoid the lock here by making atomicWrite write to new temporary unique files but I don't think it is worth it at the moment.
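A sketch of the alternative mentioned here: uniquely named temporary files so that two processes never collide on outputPath + ".new". The name atomicWriteUnique is hypothetical, not clcache's actual atomicWrite.

import os
import tempfile

def atomicWriteUnique(destFile, text):
    # Create a uniquely named temporary file in the destination directory so
    # that os.replace() stays on one filesystem and remains atomic.
    fd, tempPath = tempfile.mkstemp(dir=os.path.dirname(destFile) or '.')
    try:
        with os.fdopen(fd, 'w') as f:
            f.write(text)
        os.replace(tempPath, destFile)
    except Exception:
        os.unlink(tempPath)
        raise

As the next comment points out, this would only remove the collision on the temporary file itself; the reset still has to be serialized with concurrent read-modify-write updates.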
Thinking a bit more, even if the temporary file did not always have the same name we would need to lock. We don't want another process to write old information right after we reset the statistics.
It's true that the way the temporary file name is currently constructed requires locking. However, I'm not sure I agree with your second comment yet. Can you give an example of how two concurrent clcache processes (one of which is merely resetting the statistics) might interact in a way that yields broken statistics if no locking is performed when resetting the statistics? The only potential issue I could think of was:
- clcache instance A reads the statistics in order to bump a counter (e.g. increase the number of cache hits)
- clcache instance B resets the statistics
- clcache instance A writes out the new statistics, with an increased counter
If this were to happen, the number of cache hits at the end would not be 1 but rather 1 plus whatever the value was before the stats were reset.
However, from looking at the code, all the updates to the cache stats seem to follow the theme
with cache.statistics.lock, cache.statistics as stats:
    stats.registerCacheHit()

I.e. reading, updating and writing the cache is always an atomic operation.
Based on your example, instance A is going to register a cache hit and instance B is going to reset the statistics (without the lock):
- A: acquires the lock
- A: reads current statistics, number of hits = 123
- B: resets statistics without acquiring the lock, number of hits on disk = 0
- A: writes updated statistics, number of hits on disk = 123 + 1
Resulting in incorrect data. If resetStatistics() acquires the lock this situation is not possible as either A or B would have to wait for the other process to finish the write.
- A: acquires the lock
- A: reads current statistics, number of hits = 123
- B: about to reset statistics, blocks trying to acquire the lock
- A: writes updated statistics, number of hits on disk = 123 + 1
- A: releases the lock
- B: succeeds acquiring the lock, writes all zeros in statistics
- B: releases the lock
or
- B: acquires the lock
- A: about to bump cache hits, blocks trying to acquire the lock
- B: writes all zeros in statistics file
- B: releases the lock
- A: succeeds acquiring the lock
- A: reads current statistics, number of hits = 0
- A: writes updated number of hits = 0 + 1
- A: releases the lock
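To make the interleavings above concrete, a self-contained toy version of the two paths; in clcache the lock is an inter-process lock and the "disk" is the statistics JSON file, but the structure is the same: both writers take the same lock around the whole read-modify-write.

import threading

statsLock = threading.Lock()       # stand-in for cache.statistics.lock
statsOnDisk = {"CacheHits": 123}   # stand-in for the statistics file

def registerCacheHit():            # instance A's path
    with statsLock:
        current = dict(statsOnDisk)            # read
        current["CacheHits"] += 1              # update
        statsOnDisk.clear()
        statsOnDisk.update(current)            # write back

def resetStatistics():             # instance B's path
    with statsLock:                # same lock, so it cannot run between
        statsOnDisk.clear()        # A's read and A's write
        statsOnDisk.update({"CacheHits": 0})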
D'oh, of course you're right. My own example was giving an argument in favor of locking here.
clcache.py (Outdated)

for i, e in enumerate(self.entries()):
    if e.objectHash == objectHash:
        entryIndex = i
        break
I think you could shorten this to
entryIndex = next((i for i, e in enumerate(self.entries()) if e.objectHash == objectHash), 0)
Indeed.
clcache.py (Outdated)

tempFileName = self._fileName + ".new"
with open(tempFileName, 'w') as f:
    json.dump(self._dict, f, sort_keys=True, indent=4)
atomicRename(tempFileName, self._fileName)
Seeing this repetition makes me wonder whether it might be nice to have a little context manager like
@contextlib.contextmanager
def atomicallyWritten(fn):
    tempFileName = fn + ".new"
    with open(tempFileName, 'w') as f:
        yield f
    os.replace(tempFileName, fn)

...so you could just replace open(self._fileName, 'w') with atomicallyWritten(self._fileName). Maybe not worth it, given that it's just two occurrences. I'm mostly thinking out loud.
It makes sense; let's unify it for the moment. It can be unfolded later on if it is not needed anymore.
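Assuming the context manager sketched in the previous comment, the save path from the earlier diff would then read roughly like this (the enclosing method name is illustrative, not necessarily the actual clcache method).

def save(self):
    # open(self._fileName, 'w') becomes the atomic variant; the temporary
    # file is swapped into place only after json.dump() has completed.
    with atomicallyWritten(self._fileName) as f:
        json.dump(self._dict, f, sort_keys=True, indent=4)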
frerich left a comment
I think I understand it all now and it seems plausible to me. Just one thing I was wondering:
There are still two places where files are written without using atomicWrite yet. Is this intentional? I found CompilerArtifactsSection._setCachedCompilerConsoleOutput and CacheFileStrategy.deserializeCacheEntry.
I left them out only partly on purpose, as I feel that those are less important and require more discussion. The cache artifacts are 3 files (object, stdout, stderr); if we want to never corrupt the cache when the process is killed in the middle of a write we would:
- write the 3 files to a temporary directory
- atomically replace the cache entry with that directory
Which would guarantee that the 3 files are always in sync. Currently we have the atomic replacement of the object file (which is the biggest of the 3) but no protection for the other 2 and no protection for the cache entry as a whole. I think that replacing the whole cache entry is more interesting than adding individual file protection now, but I would do that in another PR if you agree. After the cache entry is replaced atomically we can remove the signal handlers ;)
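Not part of this PR, but a rough sketch of the "replace the whole cache entry" idea, assuming an entry is a directory holding the three files; the helper and file names are illustrative, and the final rename assumes the entry directory does not exist yet (replacing an existing entry would need extra handling).

import os
import shutil
import tempfile

def storeCacheEntry(entryDir, objectFile, stdout, stderr):
    # Build the complete entry next to its final location first ...
    tempDir = tempfile.mkdtemp(dir=os.path.dirname(entryDir))
    try:
        shutil.copyfile(objectFile, os.path.join(tempDir, "object"))
        with open(os.path.join(tempDir, "output.txt"), 'w') as f:
            f.write(stdout)
        with open(os.path.join(tempDir, "stderr.txt"), 'w') as f:
            f.write(stderr)
        # ... then publish all three files with a single rename, so a killed
        # process can never leave a half-written entry behind.
        os.rename(tempDir, entryDir)
    except Exception:
        shutil.rmtree(tempDir, ignore_errors=True)
        raise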
Makes sense!
Merged this now, thanks a lot for your work!
Unfortunately Windows does not allow replacing a file that is opened for reading by another process/thread. Reads and writes need to be serialized by using the locks. This reverts the locking of manifests and statistics to the state before pull request frerich#286.
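A minimal sketch of the serialization this revert restores: the read path and the write path take the same lock, so os.replace() never runs while another process still has the file open for reading. The lock is any context-manager lock; the file layout is simplified and the function names are illustrative.

import json
import os

def readStatistics(statsFile, lock):
    # Reading also happens under the lock: on Windows, os.replace() in a
    # concurrent writer fails while this process has the file open.
    with lock:
        with open(statsFile, 'r') as f:
            return json.load(f)

def writeStatistics(statsFile, lock, stats):
    with lock:
        tempFile = statsFile + ".new"
        with open(tempFile, 'w') as f:
            json.dump(stats, f, sort_keys=True, indent=4)
        os.replace(tempFile, statsFile)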
The signal handlers were introduced to avoid cache corruption when the clcache process is stopped in the middle of a write to the cache (statistics, manifests or objects). See frerich#233. Even if SIGINT and SIGTERM were ignored the cache still had a chance to be corrupted in the event of a SIGTERM (which cannot be ignored). Since frerich#233 the writing of files to the cache has been improved to replace the files atomically, reducing the risk of storing corrupted files in the cache. See pull requests frerich#286, frerich#292 and frerich#296. Therefore ignoring these signals is not needed anymore.
In this PR the writing of manifests, statistics and object files is done following this pattern:
- write the new contents to a temporary file next to the destination
- atomically replace the destination file with the temporary file using os.replace(...)
According to the documentation and what I read on the internet, os.replace(...) is an atomic operation, meaning that it can be used to prevent cache corruption if the process is stopped in the middle of a write operation. This should help with the root cause of #263.
In addition, if writing files is atomic, reading those files can be done without locking. The last 3 commits in the PR exploit that fact to avoid locking the manifests or statistics for reading. This may need some more testing, so I am fine moving them to another PR. Theoretically this would make processDirect faster with a cold cache because it doesn't need to lock the manifest to decide that it is a miss, but my measurements do not show any relevant change in performance. Another benefit of not locking is that clcache -s does not hang until it can lock the statistics file. I tested these changes locally and I didn't notice any problems.