Per-section locking #217
Conversation
Current coverage is 87.33% (diff: 84.53%)

@@            master    #217   diff @@
=======================================
  Files            1       1
  Lines          969    1003    +34
  Methods          0       0
  Messages         0       0
  Branches       160     159     -1
=======================================
+ Hits           853     876    +23
- Misses          86      94     +8
- Partials        30      33     +3
|
|
I ran performancetests.py with the current PR and got this: With a bigger codebase (hardlink=1, 6 real cores, 12 logical cores, 1223 object files) I am experiencing crashes with this stack trace:
After retrying a few times the whole codebase compiled. I cleaned the working dir and recompiled with the hot cache. The speedup is about 15x compared to 5.6x in previous versions. A very similar speedup to #208. |
|
Ah, interesting! Did you change this variable to a lower value by any chance? It's surprising that ten seconds are not sufficient... |
|
@frerich I got the same error multiple times in a row:
File "clcache.py", line 1609, in <module>
File "clcache.py", line 1483, in main
File "clcache.py", line 1512, in processCompileRequest
File "clcache.py", line 1594, in processNoDirect
File "clcache.py", line 250, in __enter__
File "clcache.py", line 266, in acquire
__main__.CacheLockException: Error! WaitForSingleObject returns 258, last error 183
Failed to execute script clcache
Traceback (most recent call last):
File "clcache.py", line 1609, in <module>
File "clcache.py", line 1483, in main
File "clcache.py", line 1512, in processCompileRequest
File "clcache.py", line 1594, in processNoDirect
File "clcache.py", line 250, in __enter__
File "clcache.py", line 266, in acquire
__main__.CacheLockException: Error! WaitForSingleObject returns 258, last error 183
Failed to execute script clcache
Traceback (most recent call last):
File "clcache.py", line 1609, in <module>
File "clcache.py", line 1483, in main
File "clcache.py", line 1512, in processCompileRequest
File "clcache.py", line 1594, in processNoDirect
File "clcache.py", line 250, in __enter__
File "clcache.py", line 266, in acquire
__main__.CacheLockException: Error! WaitForSingleObject returns 258, last error 183
Failed to execute script clcache
Traceback (most recent call last):
File "clcache.py", line 1609, in <module>
File "clcache.py", line 1483, in main
File "clcache.py", line 1512, in processCompileRequest
File "clcache.py", line 1596, in processNoDirect
File "clcache.py", line 1340, in processCacheHit
File "clcache.py", line 250, in __enter__
File "clcache.py", line 266, in acquire
__main__.CacheLockException: Error! WaitForSingleObject returns 258, last error 183
Failed to execute script clcache
Traceback (most recent call last):
File "clcache.py", line 1609, in <module>
File "clcache.py", line 1483, in main
File "clcache.py", line 1512, in processCompileRequest
File "clcache.py", line 1594, in processNoDirect
File "clcache.py", line 250, in __enter__
File "clcache.py", line 266, in acquire
__main__.CacheLockException: Error! WaitForSingleObject returns 258, last error 183
Failed to execute script clcache
Traceback (most recent call last):
File "clcache.py", line 1609, in <module>
File "clcache.py", line 1483, in main
File "clcache.py", line 1512, in processCompileRequest
File "clcache.py", line 1594, in processNoDirect
File "clcache.py", line 250, in __enter__
File "clcache.py", line 266, in acquire
__main__.CacheLockException: Error! WaitForSingleObject returns 258, last error 183
Failed to execute script clcache
Traceback (most recent call last):
File "clcache.py", line 1609, in <module>
File "clcache.py", line 1483, in main
File "clcache.py", line 1512, in processCompileRequest
File "clcache.py", line 1594, in processNoDirect
File "clcache.py", line 250, in __enter__
File "clcache.py", line 266, in acquire
__main__.CacheLockException: Error! WaitForSingleObject returns 258, last error 183
Failed to execute script clcache
Traceback (most recent call last):
File "clcache.py", line 1609, in <module>
File "clcache.py", line 1483, in main
File "clcache.py", line 1512, in processCompileRequest
File "clcache.py", line 1594, in processNoDirect
File "clcache.py", line 250, in __enter__
File "clcache.py", line 266, in acquire
__main__.CacheLockException: Error! WaitForSingleObject returns 258, last error 183
I used the default value (10 seconds), so I think it's a kind of deadlock, because it happened at the same moment in several projects; it's not "dead" only because of the timeout. P.S. 16 cores. |
|
I am using the default timeout; I had to increase it in the past, but with #208 it wasn't needed anymore, so it is not set in the environment. Some of our targets take almost a minute to compile (that's one of the reasons why we need clcache). The locks still have a chance of being shared (less often), so if a lock is held during compilation, other processes may have to wait more than 10 seconds. |
|
@Jimilian That's very interesting, I agree - it looks like a deadlock which is eventually resolved when the timeout hits. It's very interesting that you manage to trigger this even with the default timeout. There are two locks involved for every cache access: the lock for the affected cache section and the global statistics lock.
My first idea was that maybe the order in which these locks are acquired is mixed up somewhere, i.e. if process A would first acquire 1 and then 2 while process B would first acquire 2 and then 1, then the two might deadlock each other. Alas, I couldn't spot such a bug in the code at first glance... |
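To make the suspected failure mode concrete, here is a minimal, self-contained sketch that uses plain threading locks as stand-ins for clcache's inter-process mutexes (illustration only, not clcache code); both workers take the two locks in opposite order and time out, much like the WaitForSingleObject timeouts above:

```python
import threading
import time

lockOne = threading.Lock()   # stands in for lock "1"
lockTwo = threading.Lock()   # stands in for lock "2"

def processA():
    with lockOne:
        time.sleep(0.1)                      # give B time to grab lockTwo
        if not lockTwo.acquire(timeout=2):   # times out, like WaitForSingleObject
            print("A: timed out waiting for lock 2")
        else:
            lockTwo.release()

def processB():
    with lockTwo:
        time.sleep(0.1)                      # give A time to grab lockOne
        if not lockOne.acquire(timeout=2):
            print("B: timed out waiting for lock 1")
        else:
            lockOne.release()

threads = [threading.Thread(target=processA), threading.Thread(target=processB)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```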
|
@Jimilian & @siu -- I just extended this PR with a little temporary commit adding some debug output for acquiring/releasing locks. This will generate a lot of output, but it may help us tell what's going on: clcache will print its PID and the name of the lock every time it starts to wait for a lock, acquires a lock, and releases a lock. I hope that this allows reconstructing the order of events which led to the deadlock. It would be much appreciated if you could give it a try and then upload the generated output somewhere. |
|
@siu wrote:
That's correct, but other processes will only have to wait more than 10 seconds if a) you have a cache miss and b) they access the same cache section that you blocked (a 1/256 chance assuming an even distribution of hash sums). |
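For illustration, a tiny sketch of how a hash could map to one of 256 sections; the exact scheme clcache uses may differ, but any even split over the first two hex digits of the hash yields the 1/256 figure:

```python
# Hypothetical mapping of a cache key to one of 256 cache sections.
import hashlib

def sectionForKey(cacheKey):
    # e.g. 'a3f9...' -> section 'a3', one of 256 possible values
    return cacheKey[:2]

key = hashlib.md5(b"cl.exe /c foo.cpp").hexdigest()
print(key, "->", sectionForKey(key))
```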
|
It's not related to waiting during compilation, because I didn't have the same problem before, with the same timeout and in a similar place. I did have broken JSON files, it's true, but not deadlocks. @frerich, I also added information about the mutex name to the exception locally, so now it prints: __main__.CacheLockException: Error! WaitForSingleObject returns 258 for Local\C--clcache-stats.txt, last error 0
I think the deadlock was created on the statistics. Also, the bug happens only with a cold cache; I failed to reproduce it with a hot one. Tail (I hope it's enough): https://gist.github.com/frerich/60b424950f66b3e41a4831d2f1e70b56 (edited by @frerich such that the log is a gist instead of shown inline) |
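For reference, a hedged sketch of the kind of local change described here, i.e. including the mutex name in the exception text. The error-message format and the standard Win32 constants (WAIT_TIMEOUT is 258) match the tracebacks above; the function and class layout is an assumption, not clcache's actual code:

```python
import ctypes  # Windows-only, like clcache itself

WAIT_OBJECT_0 = 0x00000000
WAIT_ABANDONED = 0x00000080
WAIT_TIMEOUT = 0x00000102   # decimal 258, the value seen in the tracebacks

class CacheLockException(Exception):
    pass

def acquireLock(mutexName, timeoutMs):
    # Hypothetical helper: create/open the named mutex and wait for it.
    kernel32 = ctypes.windll.kernel32
    handle = kernel32.CreateMutexW(None, False, mutexName)
    result = kernel32.WaitForSingleObject(handle, timeoutMs)
    if result not in (WAIT_OBJECT_0, WAIT_ABANDONED):
        raise CacheLockException(
            "Error! WaitForSingleObject returns {result} for {name}, last error {error}".format(
                result=result, name=mutexName, error=kernel32.GetLastError()))
    return handle
```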
|
It could also be related to cleanup, probably, when we need to lock all sections. |
|
Thanks for posting those logs - I believe I can explain this issue now: in non-direct mode, the following locks are acquired and released during a cache miss: the lock for the affected cache section, then the global statistics lock, and, if a cleanup is triggered, the locks for all sections.
So what can happen is this: two parallel clcache instances 'A' and 'B' can get executed e.g. like this:
1. 'A' locks cache section 'x'
2. 'A' locks the global statistics lock
3. 'B' locks cache section 'y'
4. 'A' needs to lock all sections to do a cleanup
At this point, 'B' cannot proceed because 'A' still holds the statistics lock, but 'A' cannot proceed because 'B' still holds the lock on section 'y'. So they deadlock, and the situation is resolved by 'B' timing out while waiting for the statistics. |
|
@frerich How are you going to solve it? I see only one option: check for cleanup after unlocking/before locking, probably when the script starts/ends. |
|
I think one alternative is to not nest the locks, i.e. only update the statistics after working on the actual cache. This means that there's a small time window in which the statistics don't reflect the actual state visible to other clcache processes, but I think that should not hurt. |
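A minimal runnable sketch of that non-nested variant; the names and data structures here are invented for illustration and are not clcache's API:

```python
import threading

sectionLocks = {s: threading.Lock() for s in range(256)}
statsLock = threading.Lock()
statistics = {"misses": 0}
cacheSections = {s: {} for s in range(256)}

def handleCacheMiss(section, cacheKey, artifact):
    with sectionLocks[section]:            # lock only the affected section
        cacheSections[section][cacheKey] = artifact
    # The section lock is released before the statistics lock is taken, so the
    # two are never nested; the statistics may briefly lag behind the cache.
    with statsLock:
        statistics["misses"] += 1

handleCacheMiss(0x3F, "a3f9deadbeef", "object file bytes")
print(statistics)
```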
|
FYI, for when you are back to direct mode: the problem I reported before vanishes if I configure a big timeout (5 mins), so I don't think it is a deadlock in my case. |
|
@Jimilian I now implemented the idea of cleaning the cache before starting a compile job, not after it. This is more like what e.g. Git does (checking whether it needs to GC before actually doing its work). It's a regression for cache misses which now may be a little bit slower than before (because they need to open two JSON files to test whether any cleaning is to be done), but it's conceptually a lot simpler. |
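A sketch of what such a pre-compile check might look like; the file names (stats.txt, config.txt) and JSON keys are assumptions loosely based on the error messages quoted earlier, not necessarily clcache's real on-disk format:

```python
import json
import os

def readJson(path, default):
    # Missing or corrupt files are treated as "no information".
    try:
        with open(path) as f:
            return json.load(f)
    except (OSError, ValueError):
        return default

def cleanupRequired(cacheDir):
    # The two small JSON reads per cache miss mentioned above.
    stats = readJson(os.path.join(cacheDir, "stats.txt"), {})
    config = readJson(os.path.join(cacheDir, "config.txt"), {})
    currentSize = stats.get("CurrentSizeBytes", 0)
    maximumSize = config.get("MaximumCacheSize", 1024 ** 3)
    return currentSize > maximumSize

# Usage idea: check (and clean) *before* invoking the compiler rather than after:
# if cleanupRequired(cacheDir):
#     with cache.lock:   # all sections plus statistics, in the canonical order
#         cache.clean()
```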
|
So, I failed to reproduce the deadlock. I tested it multiple times with a cold cache (the code was completely new, but the cache itself was full) and compilation passed without any issues, as expected. |
|
I got a new stack trace with the current version: I hope it helps. |
|
@siu Hmm, that's interesting. I see a timeout while waiting for a cache section lock. In master, the error message given for timeouts is a lot more useful because it tells which lock it wanted to acquire. This might shed some light; I'll rebase this PR onto master, could you then maybe try it again? |
Force-pushed from b95c123 to ef688a7
|
@siu Ok, rebased now - would be great if you could give it another shot. :-) |
|
@frerich sure, I will try tomorrow |
|
I am analyzing the code, trying to understand if deadlocks can occur. Deadlocks cannot occur if all possible executions of the software acquire the locks in the same order. For that reason I wrote down all functions that take locks and checked whether they respect the same order. With this information we can see that there are code paths that acquire the locks in different orders, and that will lead to deadlocks. I know manifestSection and artifactSection are different in the majority of cases, but for sure there will be collisions every once in a while. In addition, the operations on the whole cache need to acquire all of them. For example, let's say one process A has to clean the cache, which involves:
A1. acquiring the statistics lock
A2. acquiring the locks of all manifest sections (and then all artifact sections)
Another process B may be running processDirect, which may involve:
B1. acquiring the lock of a manifestSection
B2. compiling the object file
B3. acquiring the statistics lock (with the artifactSection lock taken inside addObjectToCache while the stats are held)
In this scenario, if A executes until A2 and B executes until B3, they will deadlock: A needs a manifestSection locked by B, and B needs the stats locked by A. In reality one of them would time out and crash and the other process would succeed. Note that increasing the timeout delay would not help in this situation; one of the processes would still time out. There are a few fixes that need to be applied here and there, but the most problematic one seems to be the lock of the artifactSection inside addObjectToCache. It could be removed from there and acquired in the postprocessing functions before the stats are locked (in the same with statement, for example). We can elaborate a full list of things to change and iterate if you wish. |
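One generic way to enforce such a single acquisition order is to give every lock a rank and always acquire through a helper that sorts by rank; this is an illustrative sketch with stand-in locks, not a patch against clcache:

```python
import contextlib
import threading

class RankedLock:
    def __init__(self, name, rank):
        self.name = name
        self.rank = rank
        self._lock = threading.Lock()

    def __enter__(self):
        self._lock.acquire()
        return self

    def __exit__(self, *exc):
        self._lock.release()

@contextlib.contextmanager
def acquireInOrder(*locks):
    # Sorting by rank means every caller takes the same locks in the same
    # order, which rules out the circular wait needed for a deadlock.
    with contextlib.ExitStack() as stack:
        for lock in sorted(locks, key=lambda l: l.rank):
            stack.enter_context(lock)
        yield

# Ranks chosen to match the order later documented in Cache.lock:
# manifests, then artifacts, then statistics.
manifestSection = RankedLock("manifest/a3", rank=0)
artifactSection = RankedLock("artifact/a3", rank=1)
statisticsLock = RankedLock("stats", rank=2)

with acquireInOrder(statisticsLock, manifestSection, artifactSection):
    pass  # acquired manifest -> artifact -> stats, regardless of call order
```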
|
More logs with stack traces: 20160829-cold-01-edited.zip. The first one is the original in the repository and the second contains timestamps in addition. |
|
I've run some preliminary tests and this time the compilation finished properly with cold and hot caches. I am running clcache with hardlink=1, 12 cores, timeout=300000ms, 1222 targets. The speedup with this PR is 9.7x compared to 5.6x of my last tests of master. I was expecting faster times, I will enable profiling and post the results. |
|
@siu Thanks for sharing those numbers! The strongest Windows build machine I have access to unfortunately only has two cores, so I very much depend on people with better hardware to try this stuff. I really appreciate it, and am looking forward to the profiling output. Did your build run all the way to the end without any exceptions this time? If so, that would be encouraging since it suggests we fixed all the locking issues. I agree that 'just' 9.7x is less than expected, but at least this is a good argument to merge the PR and then implement additional improvements (maybe to the statistics locking) in a separate PR. I'm curious how things are going for @Jimilian with this version. :-) |
|
Here are the results of the profiling with cold and hot caches: I will keep using this branch and report any issues.
Yes, this time it worked without problems, the only thing is that I have a huge timeout. I agree it is looking good for merging, let's see what the others say. |
|
@siu Ah, that's interesting! Looking at the profiling for the hot cache, I see it starts with this: ... I.e. of the 287 seconds the whole build took, I think a huge portion is spent in ... and I wonder about the 96 seconds spent in ... |
|
I'm planning to provide additional test results for this PR on a 6-core machine with a local SSD and hardlink mode within 2-3 days. |
|
I have bad news... Now it's significantly slower (13 minutes vs 10 minutes with hot cache). |
|
@siu In my experiments, simply changing 'HashAlgorithm' at the top of clcache.py ...
@Jimilian That's really surprising; looking forward to the profile data. |
This is very implementation dependent: the OpenSSL implementation for Linux64 has a SHA1 that is faster than MD5. For our purpose, we don't need a cryptographic hash function, so we could switch to something much faster. xxHash looks very interesting to me, especially because there is a Python implementation on PyPI that uses the hashlib interface. This would allow us to easily toggle between MD5 and xxHash using an environment variable. But what I would like to verify first is whether the hashing function has an impact at all; as far as I know, MD5 and SHA1 are usually faster than the available persistent I/O devices. |
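A sketch of what the proposed toggle could look like, assuming the third-party xxhash package from PyPI (whose xxh64 objects expose the hashlib-style update()/hexdigest() interface); the environment variable name CLCACHE_HASH_ALGORITHM is invented for illustration, not an existing setting:

```python
import hashlib
import os

def getHashAlgorithm():
    # Fall back to md5 unless the (hypothetical) variable asks for xxHash.
    if os.environ.get("CLCACHE_HASH_ALGORITHM", "md5").lower() == "xxhash":
        import xxhash
        return xxhash.xxh64
    return hashlib.md5

HashAlgorithm = getHashAlgorithm()

def getFileHash(filePath):
    hasher = HashAlgorithm()
    with open(filePath, "rb") as f:
        hasher.update(f.read())
    return hasher.hexdigest()
```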
|
With this PR being at 71 comments now, I think we should -- at least in this thread -- concentrate on getting the per-section locking right and then use separate PRs for other performance improvements so that this doesn't get out of hand. :-} This PR is already blocking some other work from progressing. |
Force-pushed from 64e50fe to 0954a47
This centralises the code for creating a system-wide lock given some path. The function also respects the lock timeout environment variable. I factored this code out into a separate method since I plan to introduce more, finer-grained locks, which will very much use the same logic.
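A rough sketch of what such a factory might compute; the timeout variable name and the path-to-mutex-name mangling are inferred from this thread (e.g. the mutex Local\C--clcache-stats.txt quoted above), so treat the details as assumptions rather than clcache's exact code:

```python
import os

def cacheLockFor(path, defaultTimeoutMs=10 * 1000):
    # Honour the timeout environment variable; the 10-second default matches
    # the value discussed earlier in the thread.
    timeoutMs = int(os.environ.get("CLCACHE_OBJECT_CACHE_TIMEOUT_MS", defaultTimeoutMs))
    # Derive a session-local mutex name from the path; replacing ':' and '\'
    # with '-' reproduces the name seen above, e.g.
    # 'C:\clcache\stats.txt' -> 'Local\C--clcache-stats.txt'.
    mutexName = "Local\\" + path.replace(":", "-").replace("\\", "-")
    return mutexName, timeoutMs

print(cacheLockFor(r"C:\clcache\stats.txt"))
```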
The plan is that these locks synchronise access to an individual section of the cache.
This is a system-wide lock to be used for synchronising accesses to the statistics of a cache.
This reimplements the global 'Cache.lock' lock such that it's defined in terms of the individual section and the statistics locks. This means that acquiring and releasing Cache.lock acquires and releases locks for all sections and for the statistics. This is slower than before (because it requires acquiring and releasing up to 513 locks) but it should be only rarely needed - mostly, when cleaning the cache.
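Conceptually, the reimplemented global lock could look like the following sketch, with plain threading locks standing in for the named system-wide mutexes (2 x 256 section locks plus the statistics lock is where the "up to 513" comes from); this is an illustration, not the actual implementation:

```python
import contextlib
import threading

class CompositeCacheLock:
    def __init__(self, manifestSectionLocks, artifactSectionLocks, statisticsLock):
        # Fixed order: manifests, then artifacts, then statistics.
        self._orderedLocks = list(manifestSectionLocks) + list(artifactSectionLocks) + [statisticsLock]
        self._stack = None

    def __enter__(self):
        self._stack = contextlib.ExitStack()
        for lock in self._orderedLocks:
            self._stack.enter_context(lock)
        return self

    def __exit__(self, *exc):
        self._stack.close()   # releases everything in reverse order

globalLock = CompositeCacheLock(
    [threading.Lock() for _ in range(256)],
    [threading.Lock() for _ in range(256)],
    threading.Lock(),
)
with globalLock:
    pass  # e.g. perform a full cache clean while everything is locked
```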
This (big) patch attempts to improve concurrency by avoiding the global cache lock. Instead, we will lock only those manifest and compiler artifact sections which we deal with. The only case where we need to synchronize all concurrent processes is when updating the statistics file, because there's only a single statistics file. At least for cache hits this could be avoided by tracking the number of cache hits per section, too.

To avoid deadlocks, the locks have to be acquired in the same order for all execution paths (the order is defined in Cache.lock, i.e. first manifests, then artifacts, then statistics). Hence, locking of the artifact section had to be pulled out of the addObjectToCache() function since that function was called with the stats locked already - a violation of the locking order.

Furthermore, we can no longer perform cache.clean() in addObjectToCache() because cache.clean() acquires the global lock, so e.g. this sequence of steps was possible in non-direct mode:

1. 'A' locks cache section 'x'
2. 'A' locks the global statistics lock
3. 'B' locks cache section 'y'
4. 'A' needs to lock all sections to do a cleanup

At this point, 'B' cannot proceed because 'A' still holds the statistics lock, but 'A' cannot proceed because 'B' still holds the lock on section 'y'. This issue is caused by -- from B's view -- the statistics lock being locked before a section lock. This must never happen.

At the point addObjectToCache() is called, we already have the statistics locked and we know that the cache size limit may just have been exceeded, so it's a good moment to determine that a cleanup is needed. It's not a good moment to *perform* the cleanup though. Instead, let the function return a flag which is propagated all the way back to processCompileRequest(). The flag indicates whether cleanup is needed, and if so, processCompileRequest() will acquire Cache.lock (which acquires all sections and statistics in the correct order) to do the cleanup.
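The resulting control flow, in very condensed form; the function names echo the tracebacks above, but the bodies are simplified stand-ins rather than clcache's real implementation:

```python
MAXIMUM_CACHE_SIZE = 1024 ** 3  # placeholder limit for illustration

def addObjectToCache(stats, cache, cacheKey, objectSize):
    # Called while the section and statistics locks are already held: record
    # the new entry and only *report* whether a cleanup is due.
    cache[cacheKey] = objectSize
    stats["currentSize"] += objectSize
    return stats["currentSize"] >= MAXIMUM_CACHE_SIZE

def processCompileRequest(stats, cache, cacheKey, objectSize):
    cleanupRequired = addObjectToCache(stats, cache, cacheKey, objectSize)
    if cleanupRequired:
        # Only here, with no other lock held, would Cache.lock be acquired
        # (all sections plus statistics, in the canonical order) to clean up.
        pass

stats, cache = {"currentSize": 0}, {}
processCompileRequest(stats, cache, "a3f9deadbeef", 4096)
print(stats)
```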
By checking for 'manifest == None' early and returning early, we can reduce the indentation depth for the larger 'manifest != None' branch.
It's really an exceptional issue, let's handle that on the caller side so that we get the 'forward to real compiler' logic for free.
It's just ManifestRepository.getIncludesContentHashForFiles() which can raise IncludeChangedException, so only that needs to be covered by try/except; doing so, moving code out of the 'try', allows reducing the indentation depth.
These functions, just like postprocessObjectEvicted(), don't need to lock the cache section: they are already called while the cache section is locked (in processDirect()).
printTraceStatement("Cannot cache invocation as {}: called for preprocessing".format(cmdLine))
updateCacheStatistics(cache, Statistics.registerCallForPreprocessing)
except IncludeNotFoundException:
    pass
Could you add a simple tracing message here? I just run into this and did not have a clue, why there are no hits.
Good point, will do!
Actually, I think I rather won't do it right now. The code seems fishy and there might be a patch needed here which is worth its own PR.
This exception can apparently be raised in two situations:
- During a cache miss, in case an include file which cl.exe printed (via /showIncludes) was deleted (or otherwise became unreadable) right after the compiler finished. Very unlikely, I guess.
- During a cache hit, in case a manifest references a non-existent include file. In this case, however, shouldn't the code rather update (i.e. rewrite) the manifest?
|
Okay, I guess my 2011 CPU Phenom II X6 1055T is just too slow for this kind of measurement. I get no significant difference in performance (this vs. master) using #226. In the 13 seconds of restoring, all 6 cores are under full load, both on master and using this. In performancetests.py, this at least brings some 22 %. However, everything compiles solid as expected. So feel free to merge, if you want to. |
|
Two other people confirmed this brings a modest improvement, and it does not seem to hurt. So let's merge this and see how it goes. |
This PR attempts to improve concurrency by making clcache not lock the entire cache but only affected sections of it.
At the same time, it actually increases the duration for which a section is locked in case of a cache miss: we will hold the lock even while the compiler is running, something which seems more correct but which was not viable before because it would greatly impact concurrency.
This PR is meant to supersede #208. It resolves #160 .