This repository was archived by the owner on Feb 4, 2020. It is now read-only.

Conversation

@siu (Contributor) commented Oct 25, 2016

This is a work-in-progress. This branch is based on #235 because cachekeys need to be deterministic for it to work. The only change w.r.t. #235 is the last commit, which adds a failing integration test showing one way to trigger #155 (there may be more ways to trigger it).

This race condition happens when two clcache instances decide that an object file is not in the cache. In that case both instances invoke the compiler and compute the cachekey for the object file. After that they lock the appropriate artifacts section and call addObjectToCache, which calls copyOrLink. The race is that the other process may have already run through addObjectToCache, so the object may already be in the cache, making the rename fail. In the end it is a simple bug, but demonstrating it in an automated test is challenging.
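For illustration, here is a minimal sketch of the pattern described above; the function and path names are made up and this is not clcache's actual copyOrLink code, just the shape of the clash:

import os

def addObjectToCacheSketch(cacheDir, cachekey, objectFile):
    # Both clcache instances computed the same cachekey and both decided the
    # entry was missing, so both arrive here with a freshly compiled objectFile.
    entryDir = os.path.join(cacheDir, cachekey)
    os.makedirs(entryDir, exist_ok=True)
    destination = os.path.join(entryDir, os.path.basename(objectFile))
    # Whichever instance gets here second fails: on Windows, os.rename raises
    # FileExistsError when the destination already exists.
    os.rename(objectFile, destination)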

@codecov-io commented Oct 25, 2016

Current coverage is 88.86% (diff: 100%)

Merging #236 into master will not change coverage

@@             master       #236   diff @@
==========================================
  Files             1          1          
  Lines           997        997          
  Methods           0          0          
  Messages          0          0          
  Branches        158        158          
==========================================
  Hits            886        886          
  Misses           83         83          
  Partials         28         28          

Powered by Codecov. Last update c852c14...684d549

@frerich (Owner) commented Oct 25, 2016

I think I understand the problem description - but wouldn't fixing this be a mere matter of applying a patch like

--- a/clcache.py
+++ b/clcache.py
@@ -1430,7 +1430,7 @@ def postprocessUnusableManifestMiss(
     section = cache.compilerArtifactsRepository.section(cachekey)
     with section.lock, cache.statistics.lock, cache.statistics as stats:
         reason(stats)
-        if returnCode == 0 and os.path.exists(objectFile):
+        if returnCode == 0 and os.path.exists(objectFile) and not section.hasEntry(cachekey):
             artifacts = CompilerArtifacts(objectFile, compilerOutput, compilerStderr)
             cleanupRequired = addObjectToCache(stats, cache, section, cachekey, artifacts)

?

I.e., locking the artifact section and then checking whether it already has an entry with that name?
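For what it's worth, the check in that patch only needs to ask whether the entry is already present on disk while the section lock is held; a minimal sketch of what hasEntry amounts to, assuming entries live in a per-key directory under the section (an assumption about the layout, not a quote from the code):

import os

class CompilerArtifactsSectionSketch:
    def __init__(self, sectionDir):
        self.sectionDir = sectionDir

    def cacheEntryDir(self, cachekey):
        return os.path.join(self.sectionDir, cachekey)

    def hasEntry(self, cachekey):
        # If the other process already finished its addObjectToCache while we
        # were compiling, the entry directory exists and we can skip adding it.
        return os.path.exists(self.cacheEntryDir(cachekey))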

@siu (Contributor, Author) commented Oct 25, 2016

@frerich I believe that is a valid fix. I was attempting to add a regression test before fixing it to avoid re-introducing the bug in the future.

@siu force-pushed the crashCopyOrLink branch from e606c0a to 684d549 on October 25, 2016 14:03
@siu (Contributor, Author) commented Oct 25, 2016

To reduce noise, I am closing this PR until I have something useful; I will reopen it or open another one later.

@siu closed this on Oct 25, 2016
@frerich (Owner) commented Oct 25, 2016

Thinking about this, I suppose that to trigger this issue, you need two concurrent clcache executions which both run in 'direct' mode and which both encounter a cache miss - however, they need to use slightly different command lines (such that they end up with different manifest sections being locked) but still end up with the same artifact cache key.

That way, there's a chance that both of them are executing postprocessUnusableManifestMiss concurrently and then clash in the addObjectToCache call.

@siu (Contributor, Author) commented Oct 25, 2016

There is something else I still don't understand: I was able to run the integration tests in this PR locally and get crashes 100% of the time, but they wouldn't crash on AppVeyor.

@frerich (Owner) commented Oct 25, 2016

I tried to test the theory mentioned in my last comment but failed to create a test which demonstrates the issue. By now, I think it's not possible to create two invocations with different manifest keys but the same artifact key, which was an assumption I made in my theory.

@siu (Contributor, Author) commented Oct 26, 2016

I have now managed to reproduce the bug consistently from a unit test on AppVeyor, but I didn't consider locking the manifest section before locking the artifact section, so the test is artificial. As you point out, if I introduced the locking and creation of the manifest as the real clcache does, the race condition would not be there.
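A sketch of what such an artificial test boils down to, reusing the names from the patch above and assuming clcache.py is importable as a module; the manifest handling is skipped entirely, which is exactly why the test is artificial:

from clcache import CompilerArtifacts, addObjectToCache

def provokeArtifactsClash(cache, stats, cachekey, firstObject, secondObject):
    # Two "compilations" that both missed the cache for the same cachekey:
    # the second addObjectToCache finds the entry already present, and the
    # rename inside copyOrLink fails. The real code path would have created
    # and locked the manifest first, which is deliberately skipped here.
    section = cache.compilerArtifactsRepository.section(cachekey)
    with section.lock:
        addObjectToCache(stats, cache, section, cachekey,
                         CompilerArtifacts(firstObject, '', ''))
    with section.lock:
        addObjectToCache(stats, cache, section, cachekey,
                         CompilerArtifacts(secondObject, '', ''))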

Some more background: we had problems with #155 on our CI server, but only after the cache got full or was shrunk. Usually we have more than one compilation task running per node, in different directories. In order to share the cache we use CLCACHE_BASEDIR.

(Side note: getManifestHash does not unify paths with CLCACHE_BASEDIR, so it creates different manifests for equivalent files. Since the cachekey contains the manifest hash, these will not end up pointing to the same object in the cache, but the duplication does reduce the number of cache hits.)
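A sketch of the kind of normalization that side note asks for; the helper name is made up, and the point is only that paths below CLCACHE_BASEDIR would need to be hashed in a directory-independent form:

import os

def canonicalizeSourcePath(path, baseDir):
    # Express paths under CLCACHE_BASEDIR relative to it before hashing, so
    # that equivalent files in different build directories yield the same
    # manifest hash; paths outside the base directory are left untouched.
    normPath = os.path.normcase(os.path.abspath(path))
    normBase = os.path.normcase(os.path.abspath(baseDir))
    if normPath.startswith(normBase + os.sep):
        return os.path.relpath(normPath, normBase)
    return normPath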

An important hint is that it only happened after the cache was full, and once the bug had happened once, it happened for every compilation. The clean process doesn't attempt to keep the cache "consistent", in the sense that it may remove manifests or objects without keeping the references between them updated. I think we are now getting close to the real issue: evicted objects are handled well by the code, but evicted manifests (whose objects are still in the cache) are not. I will try to write an integration test to prove this.
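A rough sketch of that integration test idea; the cache layout (a 'manifests' subdirectory) and the clcache command line are assumptions, not taken from the existing test suite:

import os
import subprocess

def testEvictedManifestWithObjectStillCached(cacheDir, sourceFile):
    # 1. Compile once so that both the manifest and the object end up cached.
    subprocess.check_call(['clcache', 'cl', '/nologo', '/c', sourceFile])
    # 2. Simulate the clean step evicting only the manifests.
    for root, _, files in os.walk(os.path.join(cacheDir, 'manifests')):
        for name in files:
            os.remove(os.path.join(root, name))
    # 3. Compile again: manifest miss, real compilation, and addObjectToCache
    #    then collides with the object that is still in the cache.
    subprocess.check_call(['clcache', 'cl', '/nologo', '/c', sourceFile])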

@frerich (Owner) commented Oct 26, 2016

Thanks a lot for your thoughtful comment - looking forward to that integration test!

As for being able to reproduce the race condition which this ticket is about: indeed, I noticed that if you consider postprocessUnusableManifestMiss() in isolation, it is possible to trigger a race condition. However, an integration test which exercises clcache as a whole doesn't seem to trigger it.

@sasobadovinac commented

I would just like to confirm that in my tests, too, this happens only in direct mode and when the cache gets full: https://ci.appveyor.com/project/sasobadovinac/freecad/history
