This repository was archived by the owner on Feb 4, 2020. It is now read-only.

Conversation

@siu
Contributor

@siu siu commented Oct 3, 2017

In this PR the writing of manifests, statistics and object files is done following this pattern:

  • Write to temporary file
  • Call atomic os.replace(tmp, dst)

According to the documentation and what I read on the internet, os.replace(...) is an atomic operation, meaning it can be used to prevent cache corruption if the process is stopped in the middle of a write operation. This should help with the root cause of #263.
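The two-step pattern above can be sketched as follows (atomicWrite is an illustrative name, not necessarily the exact function from the PR; the JSON formatting mirrors how the statistics file is stored):

```python
import json
import os

def atomicWrite(path, data):
    # Write to a sibling temporary file first; the destination is
    # never touched until the data is complete on disk.
    tempPath = path + ".new"
    with open(tempPath, "w") as f:
        json.dump(data, f, sort_keys=True, indent=4)
    # os.replace() is atomic when source and destination are on the
    # same filesystem, so readers see either the old file or the new
    # file, never a partial write.
    os.replace(tempPath, path)
```

Because the temporary file lives next to the destination, both are on the same drive and the cross-drive limitation of os.replace() does not apply.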

In addition, if writing files is atomic, reading those files can be done without locking. The last 3 commits in the PR exploit that fact to avoid locking the manifests or statistics for reading. This may need some more testing, so I am fine taking them to another PR. Theoretically this would make processDirect faster with a cold cache because it doesn't need to lock the manifest to decide that it is a miss, although my measurements do not show any relevant change in performance in that case. Another benefit of not locking is that clcache -s does not hang until it can lock the statistics file.

I tested these changes locally and I didn't notice any problem.

siu added 5 commits October 3, 2017 15:38
According to the documentation and some internet posts os.replace(...) is atomic
(https://docs.python.org/3/library/os.html#os.replace). This commit changes
the code writing the manifest and the statistics files to write to a temporary
file and then use a new atomicRename() function to atomically replace the old
file with the new contents.

It should avoid corrupting the cache when the clcache process crashes or is
killed during a write, partly addressing the root cause issue frerich#263. Note that
the object files are still not updated atomically.
Use atomicRename() to save and restore the object files from the
cache.  It should reduce problems with corrupted object files in the
cache as reported in issue frerich#263.
While it fixes the most important part (corrupted object files), it
still does not completely fix the problem, as the cache entry is
composed of:

 * object file
 * stdout
 * stderr

which should be replaced atomically.
If writing the json files is atomic, reading can be done safely without locks.
Since the manifest is not locked for reading before calling touchEntry it
may have changed on disk. We need to lock, read the manifest, touch the
entry and write it back atomically. touchEntry(...) also cannot rely on the
index being valid after the read, so it was re-implemented to use the
objectKey to find the correct entry.
Hide locking inside these functions to have more fine-grained control.
Since writing to the statistics file is now atomic, locking is not
needed.
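The re-implemented touchEntry lookup described in the commit message above might look roughly like this (a sketch; ManifestEntry and the plain-list handling are simplified stand-ins for the actual manifest structures, and touching an entry is modeled as moving it to the most-recently-used position):

```python
from collections import namedtuple

# Simplified stand-in for a manifest entry; the real entry carries
# more fields than just the object hash.
ManifestEntry = namedtuple("ManifestEntry", ["objectHash"])

def touchEntry(entries, objectHash):
    # Any index computed before re-reading the manifest may be stale,
    # so locate the entry by its objectHash instead; fall back to
    # index 0 if the hash is not found.
    entryIndex = next(
        (i for i, e in enumerate(entries) if e.objectHash == objectHash), 0
    )
    # Move the entry to the front (most-recently-used position).
    entry = entries.pop(entryIndex)
    entries.insert(0, entry)
```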
@siu siu force-pushed the master-safeRename branch from f8db2a0 to 41be5a7 Compare October 3, 2017 14:53
@frerich
Owner

frerich commented Oct 4, 2017

This looks interesting! Working on concurrency is always tricky business, so I'll need to think a bit about these patches to see whether I can understand what's going on. My first thought was that I recalled a comment explaining that os.replace is subject to some limitations (I only now noticed it was you who wrote that comment :-) ) but I see that the issue related to renaming files across drives should not kick in here since you always store the temporary file on the same drive.

@frerich
Owner

frerich commented Oct 9, 2017

Just a quick heads up: alas, I still didn't find a quiet moment to review your changes. This kind of patch can be quite subtle so I'd rather do a careful review instead of just a superficial look. ;-)

@siu
Contributor Author

siu commented Oct 10, 2017

Sure, there is no rush. I was considering making the writes to the cache dir atomic following the same pattern: writing to temporary directory + atomic replace but I didn't have time to implement anything useful yet.

clcache.py Outdated


def atomicRename(tempFile, destFile):
    os.replace(tempFile, destFile)
Owner

I'm not sure this function pulls its own weight; atomicRename is even longer than os.replace and chances are that the latter is already known to readers of the code.

Contributor Author

ok, replaced by the context manager and os.replace(...)


def resetStatistics(cache):
    with cache.statistics as stats:
    with cache.statistics.lock, cache.statistics as stats:
Owner

If writing the statistics happens atomically now, is it really necessary to lock?

Contributor Author

@siu siu Oct 10, 2017

Yes, I think so. In general writes need to be protected to avoid other processes updating the information at the same time (based on old information). It is a bit different in this specific part of the code because the resulting statistics are all zeroes, but we need to lock anyway because the atomic write always writes first to outputPath + ".new" and we need to prevent two processes doing so at the same time.

We could avoid the lock here by making atomicWrite write to new temporary unique files but I don't think it is worth it at the moment.
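The lock-free variant mentioned above could be sketched with tempfile.mkstemp, which hands each writer a unique temporary name in the destination directory (this is not what the PR does, just an illustration of the alternative):

```python
import json
import os
import tempfile

def atomicWriteUnique(path, data):
    # A unique temporary name per writer means two concurrent
    # processes cannot clobber each other's temporary file; the
    # last os.replace() simply wins.
    fd, tempPath = tempfile.mkstemp(
        dir=os.path.dirname(path) or ".", suffix=".new"
    )
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f)
        os.replace(tempPath, path)
    except Exception:
        # Clean up the temporary file if writing or replacing failed.
        os.unlink(tempPath)
        raise
```

Note that, as discussed below, uniqueness of the temporary file alone does not make resetStatistics() safe; the read-modify-write cycle still needs the lock.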

Contributor Author

Thinking a bit more, even if the temporary file did not always have the same name we would need to lock. We don't want another process to write old information right after we reset the statistics.

Owner

It's true that the way the temporary file name is currently constructed requires locking. However, I don't think I fully understand your second comment yet. Can you give an example of how two concurrent clcache processes (one of which is merely resetting the statistics) might interact in a way that yields broken statistics in case no locking is performed when resetting the statistics? The only potential issue I could think of was:

  1. clcache instance A reads the statistics in order to bump a counter (e.g. increase the number of cache hits)
  2. clcache instance B resets the statistics
  3. clcache instance A writes out the new statistics, with an increased counter

If this were to happen, the number of cache hits at the end would be 1 plus whatever the value was before the stats were reset, instead of just 1.

However, from looking at the code, all the updates to the cache stats seem to follow the theme

   with cache.statistics.lock, cache.statistics as stats:
       stats.registerCacheHit()

I.e. reading, updating and writing the statistics always happen as one atomic operation.

Contributor Author

Based on your example, instance A is going to register a cache hit and instance B is going to reset the statistics (without the lock):

  1. A: acquires the lock
  2. A: reads current statistics, number of hits = 123
  3. B: resets statistics without acquiring the lock, number of hits on disk = 0
  4. A: writes updated statistics, number of hits on disk = 123 + 1

Resulting in incorrect data. If resetStatistics() acquires the lock this situation is not possible as either A or B would have to wait for the other process to finish the write.

  1. A: acquires the lock
  2. A: reads current statistics, number of hits = 123
  3. B: about to reset statistics, blocks trying to acquire the lock
  4. A: writes updated statistics, number of hits on disk = 123 + 1
  5. A: releases the lock
  6. B: succeeds acquiring the lock, writes all zeros in statistics
  7. B: releases the lock

or

  1. B: acquires the lock
  2. A: about to bump cache hits, blocks trying to acquire the lock
  3. B: writes all zeros in statistics file
  4. B: releases the lock
  5. A: succeeds acquiring the lock
  6. A: reads current statistics, number of hits = 0
  7. A: writes updated number of hits = 0 + 1
  8. A: releases the lock
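The lock-protected read-modify-write cycle in these interleavings can be sketched as follows (threading.Lock here is a stand-in for clcache's inter-process file lock, and registerCacheHit/the "CacheHits" key are illustrative names):

```python
import json
import os
import threading

# Stand-in for clcache's inter-process lock; in the real code this
# is a file-based lock shared between processes.
statisticsLock = threading.Lock()

def registerCacheHit(path):
    # Read, modify and write under the lock, so a concurrent reset
    # cannot interleave between the read and the write.
    with statisticsLock:
        with open(path) as f:
            stats = json.load(f)
        stats["CacheHits"] += 1
        # The write itself is still atomic: temp file + os.replace().
        tempPath = path + ".new"
        with open(tempPath, "w") as f:
            json.dump(stats, f)
        os.replace(tempPath, path)
```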

Owner

D'oh, of course you're right. My own example was giving an argument in favor of locking here.

clcache.py Outdated
for i, e in enumerate(self.entries()):
    if e.objectHash == objectHash:
        entryIndex = i
        break
Owner

I think you could shorten this to

entryIndex = next((i for i, e in enumerate(self.entries()) if e.objectHash == objectHash), 0)

Contributor Author

Indeed.

clcache.py Outdated
tempFileName = self._fileName + ".new"
with open(tempFileName, 'w') as f:
    json.dump(self._dict, f, sort_keys=True, indent=4)
atomicRename(tempFileName, self._fileName)
Owner

Seeing this repetition makes me wonder whether it might be nice to have a little context manager like

@contextlib.contextmanager
def atomicallyWritten(fn):
    tempFileName = fn + ".new"
    with open(tempFileName, 'w') as f:
        yield f
    os.replace(tempFileName, fn)

...so you could just replace open(self._fileName, 'w') with atomicallyWritten(self._fileName). Maybe not worth it, given that it's just two occurrences. I'm mostly thinking out loud.

Contributor Author

It makes sense; let's unify it for the moment. It can be unfolded later on if it is not needed anymore.

Owner

@frerich frerich left a comment

I think I understand it all now and it seems plausible to me. Just one thing I was wondering:

There are still two places where files are written without atomicWrite yet. Is this intentional? I found CompilerArtifactsSection._setCachedCompilerConsoleOutput and CacheFileStrategy.deserializeCacheEntry.

@siu
Contributor Author

siu commented Oct 10, 2017

I left them out only partly on purpose as I feel that those are less important and require more discussion.

The cache artifacts are 3 files (object, stdout, stderr). If we want to never corrupt the cache when the process is killed in the middle of a write, we would:

  • Write the 3 files to a temporary folder
  • Replace the old folder by the new one

Which would guarantee that the 3 files are always in sync. Currently we have the atomic replacement of object file (which is the biggest of the 3) but no protection for the other 2 and no protection for the cache entry as a whole. I think that replacing the whole cache entry is more interesting than adding individual file protection now but I would do that in another PR if you agree.

After the cache entry is replaced atomically we can remove the signal handlers ;)
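The temporary-folder approach sketched above might look like this (a sketch only; the file names and storeCacheEntry are illustrative, and note that os.replace() cannot atomically overwrite an existing non-empty directory, so this only covers publishing a new cache entry):

```python
import os
import tempfile

def storeCacheEntry(entryDir, objectData, stdout, stderr):
    # Write all three artifacts into a private temporary directory
    # next to the destination, then publish them with one rename, so
    # the three files always appear in sync.
    parent = os.path.dirname(entryDir)
    tempDir = tempfile.mkdtemp(dir=parent)
    with open(os.path.join(tempDir, "object"), "wb") as f:
        f.write(objectData)
    with open(os.path.join(tempDir, "output.txt"), "w") as f:
        f.write(stdout)
    with open(os.path.join(tempDir, "stderr.txt"), "w") as f:
        f.write(stderr)
    # Works only while entryDir does not exist yet; replacing an
    # existing entry atomically needs a different scheme.
    os.rename(tempDir, entryDir)
```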

@frerich
Owner

frerich commented Oct 11, 2017

Makes sense!

@frerich frerich merged commit e441305 into frerich:master Oct 11, 2017
@frerich
Owner

frerich commented Oct 11, 2017

Merged this now, thanks a lot for your work!

siu added a commit to siu/clcache that referenced this pull request Oct 11, 2017
Unfortunately Windows does not allow replacing a file that is opened
for reading by another process/thread. Reads and writes need to be
serialized by using the locks.

This reverts the locking of manifests and statistics to the state before
pull request frerich#286.
@siu siu deleted the master-safeRename branch October 11, 2017 19:56
frerich pushed a commit that referenced this pull request Oct 12, 2017

Unfortunately Windows does not allow replacing a file that is opened
for reading by another process/thread. Reads and writes need to be
serialized by using the locks.

This reverts the locking of manifests and statistics to the state before
pull request #286.
siu pushed a commit to siu/clcache that referenced this pull request Oct 16, 2017
The signal handlers were introduced to avoid cache corruption when the
clcache process is stopped in the middle of a write to the cache
(statistics, manifests or objects). See frerich#233. Even if SIGINT and SIGTERM
were ignored the cache still had a chance to be corrupted in the event
of a SIGTERM (which cannot be ignored).

Since frerich#233 the writing of files to the cache has been improved to
replace the files atomically, reducing the risk of storing corrupted
files in the cache. See pull requests frerich#286, frerich#292 and frerich#296. Therefore
ignoring these signals is not needed anymore.
@siu siu mentioned this pull request Oct 16, 2017
siu pushed a commit to siu/clcache that referenced this pull request Oct 30, 2017
siu pushed a commit to siu/clcache that referenced this pull request Oct 30, 2017