Skip to content
This repository was archived by the owner on Feb 4, 2020. It is now read-only.

Conversation

@siu
Copy link
Contributor

@siu siu commented Oct 26, 2016

CLCACHE_BASEDIR is not used to cleanup the command line when computing the manifest hash. This is a problem because invocations of clcache from different basedirs will create different manifests for equivalent object files.

@codecov-io
Copy link

codecov-io commented Oct 26, 2016

Current coverage is 88.96% (diff: 100%)

Merging #238 into master will increase coverage by 0.09%

@@             master       #238   diff @@
==========================================
  Files             1          1          
  Lines           997       1006     +9   
  Methods           0          0          
  Messages          0          0          
  Branches        158        164     +6   
==========================================
+ Hits            886        895     +9   
  Misses           83         83          
  Partials         28         28          

Powered by Codecov. Last update 718c651...7d45cb3

Copy link
Owner

@frerich frerich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks basically perfect to me, except for a missed opportunity to reuse the existing collapseBasedirToPlaceholder function.

This is the first time I use the new GitHub code review feature, let's see how it goes. :-)

clcache.py Outdated
commandLine = [arg for arg in commandLine if not arg.startswith("/MP")]
baseDir = os.environ.get('CLCACHE_BASEDIR')
if baseDir:
commandLine = [arg.replace(baseDir, BASEDIR_REPLACEMENT) for arg in commandLine]
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could reuse the existing collapseBasedirToPlaceholder here and use

commandLine = [collapseBasedirToPlaceholder(arg) for arg in commandLine if not arg.startswith("/MP")]

in line 234.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That wouldn't work with the current implementation of collapseBasedirToPlaceholder as it assumes that the BASEDIR appears in the first position of the argument and that the argument is in normcase.

Here arg can be of the form -IC:\basedir... which doesn't satisfy any of those conditions. We shouldn't change the case either.

collapseBasedirToPlaceholder could be changed like this to make it work:

 def collapseBasedirToPlaceholder(path):
     baseDir = normalizeBaseDir(os.environ.get('CLCACHE_BASEDIR'))
-    if baseDir is None:
-        return path
-    else:
-        assert path == os.path.normcase(path)
-        assert baseDir == os.path.normcase(baseDir)
-        if path.startswith(baseDir):
-            return path.replace(baseDir, BASEDIR_REPLACEMENT, 1)
-        else:
-            return path
+    if baseDir:
+        replacement = re.compile(re.escape(baseDir), re.IGNORECASE)
+        return replacement.sub(BASEDIR_REPLACEMENT, path)
+    return path

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, true - as it is, collapseBasedirToPlaceholder is not applicable. While thinking about whether it makes sense to adjust it as you suggested, it occurred to me that maybe clcache shouldn't always blindly replace the basedir value.

For instance, should it really replace the basedir in preprocessor definitions passed via -D on the command line?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@siu -- ping? :)

Copy link
Contributor Author

@siu siu Nov 2, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I don't think it is correct to apply the replacement blindly. I can instead add a list of parameters and apply the replacement only to those.

In that case, which arguments should be in the white list? I had a look here (https://msdn.microsoft.com/en-us/library/fwkeyyhe.aspx), these are the arguments that seem good candidates:

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, the first three sound like good candidates. I suggest the replacement should be applied to their argument -- we can extend (or shorten) that list later as we learn about new cases.

/Yu doesn't seem important right now since clcache doesn't support invocations making use of precompiled headers anyway.

@siu siu force-pushed the handleBaseDirInGetManifestHash branch 2 times, most recently from d9b4ba3 to cc20435 Compare November 3, 2016 11:29
@siu
Copy link
Contributor Author

siu commented Nov 3, 2016

Summary of the changes in this PR:

  • Added integration tests to show how clcache should behave with /I and /D arguments that include CLCACHE_BASEDIR
  • Apply collapseBasedirToPlaceholder to arguments that begin with /I, /AI, /FU
  • Change collapseBasedirToPlaceholder to find the basedir at any position and ignore the capitalization
  • Change normalizeBaseDir to remove trailing "/" and adapt unit tests. I am not sure about this one, will this break anything? Should I increase the MANIFEST_FILE_FORMAT_VERSION again?
  • Cleanup in TestBasedir
  • Make ASSETS_DIR in integration tests an absolute path (to simplify TestBasedir)

I also noticed that the integration tests have a lot of duplicated code, this could be simplified if all of them inherited from a common base class. Anyway that is for another PR.

Note: github shows the commits in a different order than they were applied.

Copy link
Owner

@frerich frerich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, it's looking promising - but I must admit that with some of the changes, the Why isn't clear to me. It would be great if you could be a bit more verbose in your commit logs such that they don't just consist of one sentence saying What was changed (the What is usually evident from the diff, the commits are very small) but also Why.

Thanks a lot though for not only improving the basedir feature but also adding tests!

if not baseDir.endswith(os.path.sep):
baseDir += os.path.sep
if baseDir.endswith(os.path.sep):
baseDir = baseDir[0:-1]
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you need to change this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was needed to make the integration test TestBasedir.testBasedirIncludeArg pass. There the include path is generated by os.getcwd() which doesn't include the trailing separator. As normalizedBasedir is used in collapseBasedirToPlaceholder the version with the trailing slash would not match.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I first wanted to insist on rather adjusting the test such that it appends a path separator (if needed) to the os.getcwd() return value instead of adjusting this function, but I guess if os.getcwd() does not return a trailing slash then that's a good indicator that we shouldn't do so either. So I'm okay with this. 👍

clcache.py Outdated
if baseDir:
replacement = re.compile(re.escape(baseDir), re.IGNORECASE)
return replacement.sub(BASEDIR_REPLACEMENT, path)
return path
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason you dropped the assert statements? They seem to express that the function expects that both the given path as well as the baseDir are normalized, i.e. have not only the same upper/lowercase but also forward slashes converted to backslashes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed I applied this change as a quick test and didn't give it too much thought. With the proposed implementation the assert path == os.path.normcase(path) does not hold. I can add back the other assert assert baseDir == os.path.normcase(baseDir)

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would the assert path == os.path.normcase(path) not hold anymore? Ah, I guess it's because this function is not given paths anymore, but command line arguments, e.g. /IC:\Foo\Bar. That suggests that either the argument should be renamed (path does not fit anymore), or maybe the newly introduced collapseBasedirInCmd should not pass arguments to collapseBasedirToPlaceholder but rather the actual paths.

To extract the paths from the arguments, the code in CommandLineAnalyzer.parseArgumentsAndFiles could maybe get refactored to be more reusable (it already solves the issue of getting the values for command line arguments).

I guess for now, simply renaming the path argument to e.g. argument or maybe even just s would be fine, at least it's not misleading anymore.

PYTHON_BINARY = sys.executable
CLCACHE_SCRIPT = os.path.join(os.path.dirname(os.path.realpath(__file__)), "clcache.py")
ASSETS_DIR = os.path.join("tests", "integrationtests")
ASSETS_DIR = os.path.join(os.path.dirname(os.path.realpath(__file__)), "tests", "integrationtests")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a suspicion, but I'm not quite sure - why did you change this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was changed only to simplify the code in TestBasedir. The cwd is changed with cd() and the relative path to the assets wouldn't work. If I rearrange the code there a bit this change is not needed.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, makes sense! 👍

@siu
Copy link
Contributor Author

siu commented Nov 4, 2016

That is true, some changes look pretty random without an explanation. I will try to explain myself better. I can reply you in the PR this time, but just to be clear, in the future do you prefer to have the explanation in each of the commits?

@frerich
Copy link
Owner

frerich commented Nov 4, 2016

Thanks - yes, it's very much appreciated if the commits themselves explain the reasoning behind the change. This is very useful when debugging things later on, where you typically check the history to see why some changes were done -- and it might be in a time or place where GitHub is not even available, so all the things discussed here are inaccessible.

I very much recommend this page for some good guidelines on how to write commit logs, all future contributors (and your future self :-)) will be grateful! :-)

For what it's worth, it's very easy to update your commits with logs. With the command line client, you can run git rebase --interactive <commit_on_which_you_based_your_branch> and then change the action for each commit to perform (the word in the first column) to reword. Git will then replay all your commits, but on each commit it will pause and open an editor which allows you to edit the commit log.

@siu
Copy link
Contributor Author

siu commented Nov 4, 2016

Thanks for the clarification. I will amend the commits once the review is done.

This change makes easier to write integration tests that change the cmd.
@siu siu force-pushed the handleBaseDirInGetManifestHash branch from cc20435 to 982b8f9 Compare November 7, 2016 11:13
@siu
Copy link
Contributor Author

siu commented Nov 7, 2016

@frerich I rewrote the PR with your suggestions.

  • Thanks for pointing to CommandLineAnalyzer.parseArgumentsAndFiles, it can be used from getManifestHash as it is and makes the changes in collapseBasedirToPlaceholder unnecessary.
  • The trailing separator still must be removed from normalizeBaseDir as the include arguments may not have it and it should match.
  • I kept the absolute ASSETS_DIR as it makes writing tests easier.

siu added 3 commits November 7, 2016 12:31
Show a few scenarios where using CLCACHE_BASEDIR should produce hits
and misses.
In direct mode the command line arguments are hashed to compute the
manifest hash. CLCACHE_BASEDIR should be used to unify arguments in
order to produce hits between different invocations.  This should only
be applied to arguments that specify where to find the source code to
the compiler or we risk creating hits for compiler tasks that generate
different object files.

For example, these two invocations should reuse the object in the cache:

"cl.exe /IC:\Project1\include\lib1 main.cpp" with CLCACHE_BASEDIR=C:\Project1
"cl.exe /IC:\Project2\include\lib1 main.cpp" with CLCACHE_BASEDIR=C:\Project2

But these two should not reuse it:

"cl.exe /IC:\Project1\include\lib1 /DFOLDER=C:\Project1\ main.cpp" with CLCACHE_BASEDIR=C:\Project1
"cl.exe /IC:\Project2\include\lib1 /DFOLDER=C:\Project2\ main.cpp" with CLCACHE_BASEDIR=C:\Project2

In this example the /D argument does not specify the compiler where to
find the source code but it may specify a default path (for example).
The include paths in the command line may not include a trailing
separator. normalizedBaseDir should not include a trailing separator in
order to match.
@siu siu force-pushed the handleBaseDirInGetManifestHash branch from 982b8f9 to ea01aef Compare November 7, 2016 11:32

commandLineArgs = []
projectSpecificArgs = ("AI", "I", "FU")
for k in sorted(arguments.keys()):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you use sorted here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually the sorted is very important to have repeatable manifest hashes. As dictionaries in python are unordered the only way to guarantee that we generate the same hash every invocation is to iterate over the keys in the same order. It is the same case that #235 solved.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, so this basically improves the manifest cache hit rate, i.e. it's an improvement but not directly related to the BASEDIR feature? Ok.

Copy link
Contributor Author

@siu siu Nov 8, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is going to have that side effect but at same time it is needed for correctness since the implementation changed in this PR to use a dictionary.

@frerich frerich merged commit 2e4abc9 into frerich:master Nov 8, 2016
@frerich
Copy link
Owner

frerich commented Nov 8, 2016

Thanks for being so patient and explaining all the changes to me, I don't see any reason not to merge this nice improvement - thanks a lot! :-)

@siu
Copy link
Contributor Author

siu commented Nov 8, 2016

I am happy to iterate and give explanations if things are not clear, the end result is better ;) Thanks for merging.

@siu siu deleted the handleBaseDirInGetManifestHash branch October 11, 2017 19:53
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants