This repository was archived by the owner on Feb 4, 2020. It is now read-only.

Conversation

@siu (Contributor) commented Oct 25, 2016

This is a work-in-progress. This branch is based on #235 because cachekeys need to be deterministic for it to work. The only change w.r.t. #235 is the last commit, which adds a failing integration test showing one way to trigger #155 (there may be more ways to trigger it).

This race condition happens when two clcache instances decide that an object file is not in the cache. In that case both instances invoke the compiler and compute the cachekey for the object file. After that they lock the appropriate artifacts section and call addObjectToCache, which calls copyOrLink. The race is that the other process may have already run through addObjectToCache, so the object may already be in the cache, making the rename fail. In the end it is a simple bug, but demonstrating it in an automated test is challenging.
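For illustration, here is a minimal sketch of the pattern described above; the function and path names are made up and this is not clcache's actual copyOrLink code, just the shape of the clash:

import os

def addObjectToCacheSketch(cacheDir, cachekey, objectFile):
    # Both clcache instances computed the same cachekey and both decided the
    # entry was missing, so both arrive here with a freshly compiled objectFile.
    entryDir = os.path.join(cacheDir, cachekey)
    os.makedirs(entryDir, exist_ok=True)
    destination = os.path.join(entryDir, os.path.basename(objectFile))
    # Whichever instance gets here second fails: on Windows, os.rename raises
    # FileExistsError when the destination already exists.
    os.rename(objectFile, destination)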

@codecov-io commented Oct 25, 2016

Current coverage is 88.86% (diff: 100%)

Merging #236 into master will not change coverage

@@             master       #236   diff @@
==========================================
  Files             1          1          
  Lines           997        997          
  Methods           0          0          
  Messages          0          0          
  Branches        158        158          
==========================================
  Hits            886        886          
  Misses           83         83          
  Partials         28         28          

Powered by Codecov. Last update c852c14...684d549

@frerich (Owner) commented Oct 25, 2016

I think I understand the problem description - but wouldn't fixing this be a mere matter of applying a patch like

--- a/clcache.py
+++ b/clcache.py
@@ -1430,7 +1430,7 @@ def postprocessUnusableManifestMiss(
     section = cache.compilerArtifactsRepository.section(cachekey)
     with section.lock, cache.statistics.lock, cache.statistics as stats:
         reason(stats)
-        if returnCode == 0 and os.path.exists(objectFile):
+        if returnCode == 0 and os.path.exists(objectFile) and not section.hasEntry(cachekey):
             artifacts = CompilerArtifacts(objectFile, compilerOutput, compilerStderr)
             cleanupRequired = addObjectToCache(stats, cache, section, cachekey, artifacts)

?

I.e., locking the artifact section and then checking whether it already has an entry with that name?
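For what it's worth, the check in that patch only needs to ask whether the entry is already present on disk while the section lock is held; a minimal sketch of what hasEntry amounts to, assuming entries live in a per-key directory under the section (an assumption about the layout, not a quote from the code):

import os

class CompilerArtifactsSectionSketch:
    def __init__(self, sectionDir):
        self.sectionDir = sectionDir

    def cacheEntryDir(self, cachekey):
        return os.path.join(self.sectionDir, cachekey)

    def hasEntry(self, cachekey):
        # If the other process already finished its addObjectToCache while we
        # were compiling, the entry directory exists and we can skip adding it.
        return os.path.exists(self.cacheEntryDir(cachekey))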

@siu (Contributor, Author) commented Oct 25, 2016

@frerich I believe that is a valid fix. I was attempting to add a regression test before fixing it to avoid re-introducing the bug in the future.

@siu force-pushed the crashCopyOrLink branch from e606c0a to 684d549 on October 25, 2016 14:03
@siu (Contributor, Author) commented Oct 25, 2016

To reduce noise, I am closing this PR until I have something useful; I will reopen it or open another one later.

@siu closed this on Oct 25, 2016
@frerich (Owner) commented Oct 25, 2016

Thinking about this, I suppose that to trigger this issue, you need two concurrent clcache executions which both run in 'direct' mode and which both encounter a cache miss - however, they need to use slightly different command lines (such that they end up with different manifest sections being locked) but still end up with the same artifact cache key.

That way, there's a chance that both of them are executing postprocessUnusableManifestMiss concurrently and then clash in the addObjectToCache call.

@siu (Contributor, Author) commented Oct 25, 2016

There is something else I still don't understand: I was able to run the integration tests in this PR locally and get crashes 100% of the time, but they wouldn't crash on AppVeyor.

@frerich (Owner) commented Oct 25, 2016

I tried to test the theory mentioned in my last comment but failed to create a test which demonstrates the issue. By now, I think it's not possible to create two invocations with different manifest keys but the same artifact key, which was an assumption I made in my theory.

@siu (Contributor, Author) commented Oct 26, 2016

I have now managed to reproduce the bug consistently from a unit test on AppVeyor, but I didn't consider locking the manifest section before locking the artifact section, so the test is artificial. As you point out, if I introduced the locking and creation of the manifest as the real clcache does, the race condition would not be there.
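A sketch of what such an artificial test boils down to, reusing the names from the patch above and assuming clcache.py is importable as a module; the manifest handling is skipped entirely, which is exactly why the test is artificial:

from clcache import CompilerArtifacts, addObjectToCache

def provokeArtifactsClash(cache, stats, cachekey, firstObject, secondObject):
    # Two "compilations" that both missed the cache for the same cachekey:
    # the second addObjectToCache finds the entry already present, and the
    # rename inside copyOrLink fails. The real code path would have created
    # and locked the manifest first, which is deliberately skipped here.
    section = cache.compilerArtifactsRepository.section(cachekey)
    with section.lock:
        addObjectToCache(stats, cache, section, cachekey,
                         CompilerArtifacts(firstObject, '', ''))
    with section.lock:
        addObjectToCache(stats, cache, section, cachekey,
                         CompilerArtifacts(secondObject, '', ''))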

Some more background: we had problems with #155 on our CI server, but only after the cache got full or was shrunk. Usually we have more than one compilation task running per node, in different directories. In order to share the cache we use CLCACHE_BASEDIR.

(Side note: getManifestHash does not unify paths with CLCACHE_BASEDIR, so it creates different manifests for equivalent files. Since the cachekey contains the manifest hash, these will not end up pointing to the same object in the cache, but the duplication does reduce the number of cache hits.)
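A sketch of the kind of normalization that side note asks for; the helper name is made up, and the point is only that paths below CLCACHE_BASEDIR would need to be hashed in a directory-independent form:

import os

def canonicalizeSourcePath(path, baseDir):
    # Express paths under CLCACHE_BASEDIR relative to it before hashing, so
    # that equivalent files in different build directories yield the same
    # manifest hash; paths outside the base directory are left untouched.
    normPath = os.path.normcase(os.path.abspath(path))
    normBase = os.path.normcase(os.path.abspath(baseDir))
    if normPath.startswith(normBase + os.sep):
        return os.path.relpath(normPath, normBase)
    return normPath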

An important hint is that it only happened after the cache was full, and once the bug had happened once, it happened for every compilation. The clean process doesn't attempt to keep the cache "consistent", in the sense that it may remove manifests or objects without keeping the references between them updated. I think we are now getting close to the real issue: evicted objects are handled well by the code, but evicted manifests (whose objects are still in the cache) are not. I will try to write an integration test to prove this.
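A rough sketch of that integration test idea; the cache layout (a 'manifests' subdirectory) and the clcache command line are assumptions, not taken from the existing test suite:

import os
import subprocess

def testEvictedManifestWithObjectStillCached(cacheDir, sourceFile):
    # 1. Compile once so that both the manifest and the object end up cached.
    subprocess.check_call(['clcache', 'cl', '/nologo', '/c', sourceFile])
    # 2. Simulate the clean step evicting only the manifests.
    for root, _, files in os.walk(os.path.join(cacheDir, 'manifests')):
        for name in files:
            os.remove(os.path.join(root, name))
    # 3. Compile again: manifest miss, real compilation, and addObjectToCache
    #    then collides with the object that is still in the cache.
    subprocess.check_call(['clcache', 'cl', '/nologo', '/c', sourceFile])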

@frerich (Owner) commented Oct 26, 2016

Thanks a lot for your thoughtful comment - looking forward to that integration test!

As for being able to reproduce the race condition which this ticket is about: indeed, I noticed that if you consider postprocessUnusableManifestMiss() in isolation, it is possible to trigger a race condition. However, an integration test which exercises clcache as a whole doesn't seem to trigger it.

@sasobadovinac commented

I would just like to confirm that in my tests, too, this happens only in direct mode and when the cache gets full: https://ci.appveyor.com/project/sasobadovinac/freecad/history
