This repository was archived by the owner on Feb 4, 2020. It is now read-only.

Conversation

@siu
Contributor

@siu siu commented Aug 16, 2016

First of all, thanks for developing and maintaining clcache. We are checking the feasibility of using clcache for everyday use and on our Jenkins server, and while the performance boost is great, we noticed that it is still far behind ccache.

We are profiling clcache (#207) (hardlink=on, directmode=on) and we found that runs that have everything in cache spend a lot of time in processDirect. More revealing is that a lot of time is spent trying to acquire the lock at the beginning of the function, which probably means that other processes are holding the lock for too long. This change is an attempt to reduce the time spent in the critical section by releasing the lock while computing the manifest hash. Access to the cache is still protected by the lock, but I am not sure whether the computation of the hash is still 100% safe.
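To illustrate the idea, here is a minimal, self-contained sketch of the pattern (the names computeManifestHash, cacheLock and manifestStore are made up for the example; this is not the actual clcache code or the exact diff of this PR):

    import hashlib
    import threading

    cacheLock = threading.Lock()   # stand-in for clcache's cache-wide lock
    manifestStore = {}             # stand-in for the on-disk manifest section

    def computeManifestHash(compilerBinary, commandLine, sourceFile):
        # Hashing the command line and the source file is comparatively slow,
        # so it should not happen while other invocations wait on the lock.
        hasher = hashlib.sha1()
        hasher.update(compilerBinary.encode())
        hasher.update(" ".join(commandLine).encode())
        with open(sourceFile, "rb") as f:
            hasher.update(f.read())
        return hasher.hexdigest()

    def processDirect(compilerBinary, commandLine, sourceFile):
        # Expensive work happens outside the critical section ...
        manifestHash = computeManifestHash(compilerBinary, commandLine, sourceFile)
        # ... and only the actual cache lookup is guarded by the lock.
        with cacheLock:
            return manifestStore.get(manifestHash)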

With this change the performance test goes from:

Compiling 30 source files sequentially, cold cache: 8.551584128381222 seconds
Compiling 30 source files sequentially, hot cache: 1.8987483154978264 seconds
Compiling 30 source files concurrently via /MP12, hot cache: 1.2492085815035718 seconds

to

Compiling 30 source files sequentially, cold cache: 8.512393530093654 seconds
Compiling 30 source files sequentially, hot cache: 1.8344766762600795 seconds
Compiling 30 source files concurrently via /MP12, hot cache: 0.6385067195241199 seconds

This has a big impact on our main build times, making them a lot faster (14.8x faster with a hot cache, compared to 5.6x faster with the version in master).

Could you help review whether these changes are safe?

@siu siu force-pushed the directModeLocking branch from 81838cb to 1a35d8e on August 17, 2016 07:53
@codecov-io

codecov-io commented Aug 17, 2016

Current coverage is 78.40% (diff: 69.23%)

Merging #208 into master will decrease coverage by 9.47%

@@             master       #208   diff @@
==========================================
  Files             1          1          
  Lines           974        968     -6   
  Methods           0          0          
  Messages          0          0          
  Branches        162        162          
==========================================
- Hits            856        759    -97   
- Misses           86        172    +86   
- Partials         32         37     +5   

Powered by Codecov. Last update dc5c6bd...ad573a5

@webmaster128
Contributor

I think this is safe to do; after looking at the diff, I did not find an issue. Looks like a great catch to me.

Could you help us by reducing the diff? If you restore the original

                if cache.compilerArtifactsRepository.section(cachekey).hasEntry(cachekey):
                    return processCacheHit(cache, objectFile, cachekey)
                else:
                    postProcessing = lambda compilerResult: postprocessObjectEvicted(
                        cache, objectFile, cachekey, compilerResult)

the diff should get much smaller and the logic is easier to follow. The additional method assignment is irrelevant for the locking time.

@siu
Contributor Author

siu commented Aug 18, 2016

Sure, done.

@frerich
Owner

frerich commented Aug 21, 2016

Hi @siu, thanks a lot for your analysis and your proposed patch! I suspect this introduces a race condition, though.

What happens when two concurrent clcache invocations both compute the same manifest hash and then look for the same manifest - but it does not exist? In that case, won't they both end up trying to add the same manifest file, potentially at the same time? Does that work?

For what it's worth, an alternative solution I was toying with is to not have a cache-global lock but rather one lock per cache (and manifest) 'section'. That way, the chance of two concurrent invocations blocking each other is reduced from 100% to, uh, a lot less -- they would only block if the manifestHash of the two invocations starts with the same two characters.
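A toy sketch of that per-section idea (the names are illustrative and the in-process threading.Lock is only a stand-in -- clcache actually needs inter-process locking):

    import threading

    # One lock per cache section, keyed by the first two hex digits of the
    # manifest hash -- up to 256 sections, as described above.
    sectionLocks = {format(i, "02x"): threading.Lock() for i in range(256)}

    def lockForManifestHash(manifestHash):
        return sectionLocks[manifestHash[:2].lower()]

    # Two invocations only block each other if their hashes fall into the
    # same section:
    # with lockForManifestHash(manifestHash):
    #     ...  # look up / update entries of that section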

@webmaster128
Contributor

In that case, won't they both end up trying to add the same manifest file, potentially at the same time? Does that work?

The manifest content is generated during the compiler call (no lock held there anyway). Storing the manifest happens in a postprocess method where the cache is locked. This would lead to one new manifest overwriting the other new manifest, which isn't an issue I think.

@siu
Contributor Author

siu commented Aug 22, 2016

@frerich, I come to the same conclusion as @webmaster128: the cache is locked for any write, so it should be safe even if two processes generate the same manifest/object files.

This PR attempts to reduce contention while still using the same lock; per-section locking would further reduce the chance of contending for the same lock.

@siu
Contributor Author

siu commented Aug 22, 2016

Giving this more thought, I think you are right: there are some race conditions in the handling of the manifest files, but they are not introduced by this PR.

The postprocess functions should not assume that the object or manifest cache is in the same state as before. For example, postprocessNoManifestMiss is run because the manifest didn't exist, but by the time it is run the manifest file may already exist. The same applies to postprocessHeaderChangedMiss. That means the read-update-write of the manifest files needs to be atomic and retried by the postprocess functions for each change in the manifests. A writeOrUpdateManifest function could be used for this (a rough sketch follows below).

I have the feeling that things are simpler for the object cache, as its entries don't aggregate data from several processes.
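A rough sketch of what such a retry-safe helper could look like (the name writeOrUpdateManifest comes from the comment above; the in-memory dictionary and the threading.Lock are stand-ins for clcache's real manifest files and locking):

    import threading

    manifestLock = threading.Lock()   # stand-in for the cache/section lock
    manifestStore = {}                # stand-in for the manifest files on disk

    def writeOrUpdateManifest(manifestHash, newEntry):
        # Re-read the manifest inside the lock so that entries added by other
        # invocations since our own (unlocked) lookup are not lost, then merge
        # the new entry and write the result back while still holding the lock.
        with manifestLock:
            entries = manifestStore.get(manifestHash, [])
            if newEntry not in entries:
                entries.append(newEntry)
            manifestStore[manifestHash] = entries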

@frerich
Owner

frerich commented Aug 22, 2016

Indeed, there is a general conflict between correctly coordinating concurrent accesses to the cache and avoiding that concurrent invocations block each other and drag down performance.

I'm currently contemplating a different, more 'transactional' way of managing concurrency. The downside of using locks is that it's a very pessimistic model: it pays a price (creating, acquiring, releasing locks) to handle the exceptional case of two concurrent processes accessing the same cache entry. Right now, it does so very naively by just managing a single lock -- that's reasonably easy to do, but greatly impacts performance (see #160).

A while ago, we started de-escalating this issue by not using a single lock but rather dividing the cache into multiple (up to 256) sections, based on the hash key computed for an invocation. That way, only invocations of clcache hitting the same cache section would block each other. However, this work is not finished yet (we did introduce the notion of sections, but they have no dedicated locks yet), and it's also not trivial since some operations (e.g. cleaning the cache) affect all sections.

In other software projects (notably the ones written in Haskell), I got to enjoy STM (software transactional memory) for managing concurrency a lot. It is an optimistic model in that a 'transaction' (a modification of some sort) only becomes expensive if there actually is a conflict. I'm not sure this can be pulled off though.

Yet another idea might be to use dedicated readers-writers locks, such that e.g. concurrent reads to the cache are perfectly fine and can happen 100% concurrently, but write accesses are synchronised. This could maybe be combined with the idea of dividing the cache into sections.
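For illustration, a minimal in-process readers-writer lock could look like the sketch below (Python's standard library has no such primitive; this is a toy example, not the cross-process lock clcache would actually need, and combining it with per-section locking would mean keeping one such lock per section):

    import threading

    class ReadersWriterLock:
        def __init__(self):
            self._readerCountLock = threading.Lock()
            self._writerLock = threading.Lock()
            self._readers = 0

        def acquireRead(self):
            with self._readerCountLock:
                self._readers += 1
                if self._readers == 1:
                    self._writerLock.acquire()   # first reader blocks writers

        def releaseRead(self):
            with self._readerCountLock:
                self._readers -= 1
                if self._readers == 0:
                    self._writerLock.release()   # last reader unblocks writers

        def acquireWrite(self):
            self._writerLock.acquire()           # writers are fully exclusive

        def releaseWrite(self):
            self._writerLock.release()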

@Jimilian
Contributor

Anyway, I have seen broken manifest and statistics files several times (that's why I introduced #199). It means that we already had some race conditions before this PR.

@frerich
Owner

frerich commented Aug 22, 2016

By now, I'm really curious how well a good database (e.g. PostgreSQL) would work. It might take care of a couple of issues which came up in the past (or still exist):

  • It should be quick to determine the least recently used cache entries whose cumulative size brings the overall database size to 90% of the maximum size (i.e. for incremental trimming of the database, something which used to be notoriously slow in the past).
  • It hopefully takes care of handling concurrency automatically, and efficiently. In particular, I hope that cache reads don't require any locking at all.
  • It simplifies generating cache statistics because some of the things can easily be tracked per entry (e.g. how often an entry was hit) or over the entire cache (e.g. the number of cache entries and the total cache size).
  • It makes it easy to share a cache across multiple machines.

My main concern really is performance: I wonder how long it would take to query a local database (or a database which is on a LAN, i.e. with reasonably low latency)...

I suspect the average size of object files (a couple of MB in debug builds) would permit storing the object files straight away in the database, but to keep the possibility of hardlinking (i.e. reducing the number of bytes to shovel around), it's probably better to keep a file name and some meta information (file size, date of last access, number of hits) in the database.
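Purely as an illustration of the kind of per-entry metadata this would involve (the schema is an assumption, and sqlite3 is used only to keep the sketch self-contained -- the comment above talks about PostgreSQL):

    import sqlite3

    conn = sqlite3.connect("clcache-meta.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS cache_entries (
            cachekey     TEXT PRIMARY KEY,
            object_path  TEXT NOT NULL,    -- object file stays on disk so hardlinking keeps working
            size_bytes   INTEGER NOT NULL,
            last_access  TIMESTAMP NOT NULL,
            hit_count    INTEGER NOT NULL DEFAULT 0
        )
    """)

    # Incremental trimming: walk entries from least to most recently used and
    # evict until the cumulative size fits the 90% target mentioned above.
    rows = conn.execute(
        "SELECT cachekey, size_bytes FROM cache_entries ORDER BY last_access ASC"
    ).fetchall()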

@akleber
Contributor

akleber commented Aug 22, 2016

When working on the lock-file-based locking with the possibility in mind to use the cache across several machines (I promise I will get back to work on it soon), I also thought about a server application taking care of synchronization and statistics. I looked a bit into DBMSs with regard to storing blobs but couldn't find much. I also searched a bit for existing NoSQL key-value stores for this purpose but could not find anything fitting what I had in mind. And lastly, I thought about writing a custom Python app acting as a server explicitly for clcache. I think that should not be too hard, at least to judge whether it might be a reasonable solution for sharing a cache across multiple machines. But first I will get back to the lock-file-based solution.

@webmaster128
Contributor

How does this relate to #78? I never worked with memcached, but it might solve some of the problems.

Can those storage and locking interfaces be generalized in a way that supports different backends? I guess we don't want every user to install huge amounts of dependencies that only pay off with octa-cores, a ramdisk and InfiniBand.

@frerich
Owner

frerich commented Aug 23, 2016

@akleber One of the earliest (in fact: the earliest) clcache contributors, @vchigrin, already implemented some sort of daemon (called 'cacodaemon'). I never tried it myself, but AFAIK he uses it with great success. That was a couple of years ago, but I see that his fork is still active, so there might be something to reuse.

@webmaster128 indeed, I'd like to keep clcache's benefit of being reasonably easy to deploy. However, it's (as always) a matter of striking the right balance. I'd be willing to make things a 'little' (maybe even optionally) more complex to set up if we get more correctness, speed - or less code. At this point, I think it's too early to really exclude any alternatives. I guess we'll just have to try it out. In one of my local branches, I'm experimenting with something which uses PostgreSQL as the backend. So far, the setup was quite straightforward; let's see where this goes.

@frerich
Owner

frerich commented Aug 23, 2016

@webmaster128 As for how this relates to memcached support: I have no idea, but it's a good point! I have no experience with memcached myself.

@siu
Contributor Author

siu commented Aug 23, 2016

I don't think the current approach is too far off from a correctness point of view; it should be possible to fix the remaining issues before going to more complex implementations. The plan to add per-section locks seems solid, and I think it will bring the biggest improvement with the fewest changes.

I still think this PR is worth merging; I will update it later today to sync it with the latest changes in master. I will also start working on a PR to fix the issues described in #208 (comment).

@siu siu force-pushed the directModeLocking branch from 1a83a7e to ad573a5 on August 23, 2016 16:34
@siu
Contributor Author

siu commented Aug 23, 2016

Patch updated.

@frerich
Owner

frerich commented Aug 24, 2016

@siu I have now had a first shot at implementing finer-grained locks in order to improve concurrency. The code is available in #217 -- it seems to work, but I only have a dual-core system at hand, where it doesn't seem to make much of a difference. Somebody with more cores would need to give it a try -- can you maybe see whether it helps for you? The code is still a bit rough, but I think it's conceptually sound.

@siu
Contributor Author

siu commented Aug 24, 2016

Testing #217. Where would you prefer me to report? I just ran performancetests.py and got this:

Compiling 30 source files sequentially, cold cache: 8.87619953641643 seconds
Compiling 30 source files sequentially, hot cache: 1.84543584907858 seconds
Compiling 30 source files concurrently via /MP12, hot cache: 0.6135658206123011 seconds

Let me try with a bigger code base.

@frerich
Owner

frerich commented Aug 24, 2016

@siu any place is equally good for reporting; I'm happy that volunteers are willing to try things for me. :-)

Those numbers are encouraging and disappointing at the same time:

  • On the one hand, it's encouraging that the numbers are very much like what your proposed patch achieves - however, the patch in Per-section locking #217 is actually a lot more defensive. In fact, it keeps the lock on the section even while the compiler is running, i.e. the lock is held longer -- but we lock a smaller portion of the cache.
  • On the other hand, a 3x performance increase when running 12 compilers in parallel is somewhat disappointing. I would have hoped for a lot more. Right now, the concurrent invocations are still synchronized in one place (writing to the central statistics file); I wonder whether that's to blame. For the fun of it, I could do a patch which disables updating the cache - we could see how much of a difference it makes.

In any case (I'm biased, of course), I tend to prefer #217 over #208 since it achieves a speedup in concurrent scenarios without moving code out of the locked section (in fact, more code is executed under the lock, something which was not viable before because there was just a single lock). So if you agree, I'd like to eventually close #208 in favor of #217.

@siu siu force-pushed the directModeLocking branch from ad573a5 to 6ca23dc on August 30, 2016 09:14
@frerich
Owner

frerich commented Sep 2, 2016

#217 was just merged, which, to the best of my knowledge, obsoletes this PR - hence, I'll close it.

@frerich frerich closed this Sep 2, 2016
@siu siu deleted the directModeLocking branch October 11, 2017 19:53