copyOrLink race condition #236
Conversation
Current coverage is 88.86% (diff: 100%)

@@            master     #236   diff @@
=======================================
  Files            1        1
  Lines          997      997
  Methods          0        0
  Messages         0        0
  Branches       158      158
=======================================
  Hits           886      886
  Misses          83       83
  Partials        28       28
I think I understand the problem description - but wouldn't fixing this be a mere matter of applying a patch like the following, i.e. locking the artifact section and then checking whether it already has an entry with the name?

--- a/clcache.py
+++ b/clcache.py
@@ -1430,7 +1430,7 @@ def postprocessUnusableManifestMiss(
     section = cache.compilerArtifactsRepository.section(cachekey)
     with section.lock, cache.statistics.lock, cache.statistics as stats:
         reason(stats)
-        if returnCode == 0 and os.path.exists(objectFile):
+        if returnCode == 0 and os.path.exists(objectFile) and not section.hasEntry(cachekey):
             artifacts = CompilerArtifacts(objectFile, compilerOutput, compilerStderr)
             cleanupRequired = addObjectToCache(stats, cache, section, cachekey, artifacts)
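For illustration, a minimal sketch of that check-then-add pattern; the wrapper function and its exact signature are hypothetical, while section, hasEntry and addObjectToCache are the names used in the diff above:

    import os

    def cacheObjectIfMissing(cache, cachekey, artifacts, returnCode, objectFile):
        # Hypothetical wrapper, not actual clcache code: re-check for the
        # entry while holding the section lock, so that a concurrent clcache
        # invocation which already published this cachekey cannot make the
        # rename inside copyOrLink() fail.
        section = cache.compilerArtifactsRepository.section(cachekey)
        with section.lock, cache.statistics.lock, cache.statistics as stats:
            if returnCode == 0 and os.path.exists(objectFile) and not section.hasEntry(cachekey):
                return addObjectToCache(stats, cache, section, cachekey, artifacts)
            return False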
@frerich I believe that is a valid fix. I was attempting to add a regression test before fixing it to avoid re-introducing the bug in the future.
I am closing this PR until I have something useful, to reduce noise; I will reopen or add another one later.
Thinking about this, I suppose that to trigger this issue, you need to have two concurrent invocations with different manifest keys but the same artifact key. That way, there's a chance that both of them are executing postprocessUnusableManifestMiss at the same time.
There is something else I still don't understand: I was able to run the integration tests in this PR locally and get crashes 100% of the time, but it wouldn't crash on AppVeyor.
I tried to test the theory mentioned in my last comment but failed to create a test which demonstrates the issue. By now, I think it's not possible to create two invocations with different manifest keys but the same artifact key, which was an assumption I made in my theory.
I have now managed to reproduce the bug consistently from a unit test on AppVeyor, but I didn't consider locking the manifest section before locking the artifact section, so the test is artificial. As you point out, if I introduced the locking and creation of the manifest the way the real clcache does, the race condition would not be there.

Some more background: we had problems with #155 on our CI server, but only after the cache got full or was shrunk. Usually we have more than one compilation task running per node, in different directories, and in order to share the cache we use CLCACHE_BASEDIR. (Side note: getManifestHash does not unify paths with CLCACHE_BASEDIR, so it creates different manifests for equivalent files. These do not end up pointing at the same object in the cache, because the cachekey contains the manifest hash, but they do reduce the number of cache hits.)

An important hint is that it only happened after the cache was full, and once the bug happened once, it happened for every compilation. The clean process doesn't attempt to keep the cache "consistent", in the sense that it may remove manifests or objects without keeping the references updated. I think we are now getting close to the real issue: evicted objects are handled well by the code, but evicted manifests (where the objects remain) are not. I will try to write an integration test to prove this.
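Regarding the CLCACHE_BASEDIR side note, here is a minimal sketch (the helper is hypothetical, not actual clcache code) of the kind of path normalization that would make equivalent files in different build directories produce the same manifest hash:

    import os

    def normalizedSourcePath(sourceFile):
        # Hypothetical helper, not actual clcache code: substitute the
        # CLCACHE_BASEDIR prefix with a placeholder before hashing, so that
        # equivalent files checked out under different directories map to
        # the same manifest.
        baseDir = os.environ.get("CLCACHE_BASEDIR")
        sourceFile = os.path.normcase(os.path.abspath(sourceFile))
        if baseDir:
            baseDir = os.path.normcase(os.path.abspath(baseDir))
            if sourceFile.startswith(baseDir):
                return "<BASEDIR>" + sourceFile[len(baseDir):]
        return sourceFile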
Thanks a lot for your thoughtful comment - looking forward to that integration test! As for being able to reproduce the race condition which this ticket is about: indeed, I noticed that if you consider
I would just like to confirm that my tests, too, show this happening only in direct mode and when the cache gets full: https://ci.appveyor.com/project/sasobadovinac/freecad/history
This is a work-in-progress. This branch is based on #235 because cachekeys need to be deterministic for it to work. The only change w.r.t. #235 is the last commit, which adds a failing integration test showing one way to trigger #155 (there may be more ways to trigger it).
This race condition happens when two clcache instances decide that an object file is not in the cache. In that case both instances invoke the compiler and compute the cachekey for the object file. After that, they lock the appropriate artifacts section and call addObjectToCache, which calls copyOrLink. The race is that the other process may already have run through addObjectToCache, so the object may already be in the cache, making the rename fail. In the end it is a simple bug, but demonstrating it in an automated test is challenging.
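For illustration, a minimal sketch of the failure mode (hypothetical names, not the real copyOrLink implementation): the object is staged and then renamed into its final location, and on Windows os.rename() raises FileExistsError if the destination already exists, which is exactly what happens when a second clcache instance has published the same cachekey in the meantime.

    import os
    import shutil

    def publishObject(objectFile, cachedObjectPath):
        # Hypothetical sketch of the publishing step, not the real
        # copyOrLink(): stage the freshly compiled object next to its
        # final location ...
        tempPath = cachedObjectPath + ".new"
        shutil.copyfile(objectFile, tempPath)
        # ... then move it into place. On Windows this raises
        # FileExistsError when cachedObjectPath already exists, i.e. when
        # a racing clcache instance got there first.
        os.rename(tempPath, cachedObjectPath)

Re-checking for the entry under the section lock, as proposed earlier in the conversation, lets the losing instance skip the publish step instead of crashing here.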