BUG: fix source file duplication in command line #290

ghost · 2017-10-07T01:06:31Z

Previously, if the /Tp or /Tc option were passed, clcache would
duplicate the source file in the command line, making MSVC
think that it was compiling two separate files (/Tcexample.c
and example.c), when it was really compiling one.

The problem with this was that MSVC only allows some
options if it is compiling one file, which caused some
invocations of clcache to fail.

Closes #289.

Previously, if the /Tp or /Tc option were passed, clcache would duplicate the source file in the command line, making MSVC think that it was compiling two separate files (/Tcexample.c and example.c), when it was really compiling one. The problem with this was that MSVC only allows some options if it is compiling one file, which caused some invocations of clcache to fail. Closes #289.

ghost · 2017-10-07T01:16:47Z

Test failure seems unrelated.

frerich · 2017-10-07T10:06:09Z

clcache.py

+
+    sourceFile = realCmdline[-1]
+    if '/Tc' + sourceFile in realCmdline or '/Tp' + sourceFile in realCmdline:
+        printTraceStatement("Removing last argument becuase of /Tc (Issue #289)")


Small typo here (becuase) and I guess /Tc is imprecise because it seems this could also be because of /Tp.

I'm going to ignore this for now because it's not relevant to the discussion.

frerich · 2017-10-07T10:09:03Z

clcache.py

+    sourceFile = realCmdline[-1]
+    if '/Tc' + sourceFile in realCmdline or '/Tp' + sourceFile in realCmdline:
+        printTraceStatement("Removing last argument becuase of /Tc (Issue #289)")
+        realCmdline = realCmdline[:-1]


Is it really safe to assume that the last argument is always a source file argument? I wonder whether maybe this fix should be done on the caller side - but it's not clear to me on first glance where clcache would implicitly add a source file argument to the command line.

Is it really safe to assume that the last argument is always a source file argument?

No. This is an example that fixes the problem but I had hoped that you would have a suggestion about a better fix.

I wonder whether maybe this fix should be done on the caller side

Yes. Do you have any pointers for where to look?

ghost · 2017-10-07T18:04:07Z

@frerich Found it (I think):

https://github.com/xoviat/clcache/blob/f0227bb98a16a9804e2b71ef2a6d100c1c7b2719/clcache.py#L1633

Would it be okay to move the if statement there?

if '/Tc' + sourceFile in realCmdline or '/Tp' + sourceFile in realCmdline:

frerich · 2017-10-07T21:16:53Z

@xoviat Good catch, that place indeed looks like it might end up constructing incorrect command lines like the one you noticed. I think an if statement along the lines of what you propose would be much more appropriate in scheduleJobs since that's where the issue is actually introduced right now.

However, maybe it would be even nicer to not use an if but rather reuse the knowledge about which source files are specified via /Tp resp. /Tc: the program already knows this in CommandLineAnalyzer.analyze but discards that knowledge. The function only yields the list of source files but does not track which language they use. This knowledge could be used in scheduleJobs to construct a baseCmdLine value which not only omits all 'plain' source files but also those specified via /Tc and /Tp.

ghost · 2017-10-07T22:04:06Z

Test failure is due to "too many local variables." Not sure what to do about that.

frerich · 2017-10-09T09:39:51Z

The typical fix for the "too many local variables" error is, well, to not use as many local variables. :-) Instead of removing variables though, it's usually better to rather introduce a new function (i.e. the check is a kind of code complexity measure).

However, I must admit that I like the older version with the if better than the new one. My idea triggering the suggestion to reuse the knowledge from CommandLineAnalyzer.analyze was that there are actually two issues right now:

The code which constructs the baseCmdLine is defective: it attempts to filter all source files from the command line (such that lateron it's safe to add a source file to the command line) but fails to do so for source files which were specified via /Tcxxx or /Tpxxx. Furthermore, if an optional space is used (e.g. /Tc xxx or /Tp xxx), I believe the filtered command is even incorrect since e.g. cl /Tp foo.cpp will get turned into cl /Tc foo.cpp if I read correctly.
The code which builds the jobCmdLine is defective, too: it fails to consider any explicitly specified language (i.e. /Tc or /Tp) for the source file appended to the command line. So when invoking clcache like e.g. cl /Tc a.x /Tp b.c I suspect that the jobCmdLine values will compiler will be invoked like cl a.x and cl b.c, the former probably triggering an error and the latter (incorrectly) deducing C (instead of the specified C++) from the file extension .c.

My hope was that it might be viable to change CommandLineAnalyzer.analyze such that instead of a plain list of source files, it also returns an explicitly configured language (if any) for the source files. E.g. cl a.cpp /Tpx.c /Tc foo.y yields a list of tuples like [('a.cpp', ''), ('x.c', 'Tp'), ('foo.y', 'Tc')] or something like that. And this language information could then be used both when filtering the command line (to produce the baseCmdLine value) as well as when appending a single source file to the base command line (because by having the language, the code ensures to not append e.g. just foo.y but rather /Tc foo.y).

ghost · 2017-10-09T17:17:53Z

@frerich Before I fix up the unit tests, does that seem okay?

frerich

I think your latest version is a lot better already! However, the formBaseCmdLine seems tricky enough to justify a few unit tests in unittests.py.

frerich · 2017-10-09T18:25:41Z

clcache.py


+def formBaseCommandLine(cmdLine, sourceFiles):
+    # type: (List[str], List[Tuple[str, str]]) -> List[str]
+    setOfSources = set([sourceFile for sourceFile, _ in sourceFiles])


Minor cosmetic: you could shorten this by using a generator expression:

setOfSources = set(sourceFile for sourceFile, _ in sourceFiles)

frerich · 2017-10-09T18:28:02Z

clcache.py

+def formBaseCommandLine(cmdLine, sourceFiles):
+    # type: (List[str], List[Tuple[str, str]]) -> List[str]
+    setOfSources = set([sourceFile for sourceFile, _ in sourceFiles])
+    skippedArgs = ('/MP', '/Tc', '/Tp')


I think that -Tc and -Tp work, too. At least we support either form in CommandLineAnalyzer.analyze, and with Visual Studio 2013 (the only compiler I have at hand right now) you can indeed use either form.

frerich · 2017-10-09T18:30:02Z

clcache.py

+    baseCmdLine = [
+        arg for arg in cmdLine
+        if not (arg in setOfSources or arg.startswith(skippedArgs))
+    ]


This list comprehension works for cases like /Tcfoo.c but it's also possible to have an optional(!) space, i.e. /Tc foo.c. In this case both elements should get removed (since just /Tc on its own is a command line error).

Ahh, but in the second case, it would be in setOfSources.

frerich · 2017-10-09T18:32:29Z

clcache.py

        jobs = []
-        for srcFile, objFile in zip(sourceFiles, objectFiles):
-            jobCmdLine = baseCmdLine + [srcFile]
+        for srcFile, srcLanguage, objFile in zip(zip(*sourceFiles), objectFiles):


This double-zip is confusing to me. How about just

for (srcFile, srcLanguage), objFile in zip(sourceFiles, objectFiles):

frerich · 2017-10-09T18:34:14Z

clcache.py

-        for srcFile, objFile in zip(sourceFiles, objectFiles):
-            jobCmdLine = baseCmdLine + [srcFile]
+        for srcFile, srcLanguage, objFile in zip(zip(*sourceFiles), objectFiles):
+            jobCmdLine = baseCmdLine + list(filter(None, [srcLanguage, srcFile]))


Since srcLanguage is either an empty string or one of /Tp or /Tc, couldn't you drop the composition of list and filter and just use

jobCmdLine = baseCmdLine + [srcLanguage + srcFile]

frerich · 2017-10-09T18:36:17Z

clcache.py

-        if not (arg in setOfSources or arg.startswith("/MP")):
-            baseCmdLine.append(arg)
+    # type: (???, str, List[str], ???, List[Tuple[str, str]], List[str]) -> int
+    # Filter out all source files from the command line to form baseCmdLine


I think this comment is misleading: the formBaseCommandLine function does not only filter out all source files, it also filters out /MP. However, I agree that the function should filter out just the source files because then it could get the more descriptive name filterSourceFiles and the /MP could be filtered on the caller side:

baseCmdLine = [arg for arg in filterSourceFiles(cmdLine, sourceFiles) if arg != '/MP']

ghost · 2017-10-09T19:05:29Z

@frerich Third round of concerns addressed.

frerich · 2017-10-09T21:05:35Z

Thanks for your persistence! 👍 I think by now, the patch looks plausible and conceptually consistent. I believe that after fixing up the unit tests (and maybe adding four, five small tests for filterSourceFiles) this stuff is ready to get merged. It technically fixes two issues by now: it avoids constructing invalid command lines in which a source file is mentioned twice, and it ensures that when compiling in parallel, each source file gets the correct input language specified.

frerich · 2017-10-09T21:12:01Z

unittests.py

    def _testFull(self, cmdLine, expectedSourceFiles, expectedOutputFile):
        sourceFiles, outputFile = CommandLineAnalyzer.analyze(cmdLine)
-        self.assertEqual(sourceFiles, expectedSourceFiles)
+        self.assertEqual([s for s, _ in sourceFiles], expectedSourceFiles)


Wouldn't it be nicer to rather adjust the caller of _testFull such that expectedSourceFiles is actually a list of tuples, specifying the expected language (if any) specification for every source file?

ghost · 2017-10-09T21:20:42Z

Unfortunately, one of the integration tests failed. I'm not sure I understand this failure but the cache statistics may be omitting the fully specified source file arguments. I looked at numCacheHits and could not find how it was calculated.

ghost · 2017-10-10T00:26:54Z

In fact clcache appears not to reuse the cache files. I looked at this some more and unfortunately it appears to be a bit too dense for me.

siu · 2017-10-10T16:09:51Z

clcache.py

            compl = True

+        # Now collect the inputFiles into the return format
+        inputFiles = list(inputFiles.items())


Could it be that the failure in tests is introduced here because inputFiles.items() does not always return the items in the same order? either sorting them or using OrderedDict in line 1253 could help.

I will try that.

Even if it didn't fix the tests I think it still makes sense to keep the OrderedDict. The output of analyze() should return the same order across executions 1) to keep the tests passing and 2) to generate the same hash for the compilation task (if it is computed after, I am not sure about this). I don't know enough about this part of the code, to be checked with @frerich.

The test failures have nothing to do with the order, and if I understand correctly, the hash is calculated per-file ("This causes clcache to reinvoke itself recursively for each of the source files"), making the order irrelevant. There is no reason to tell Python to preserve the order when it doesn't matter.

ghost · 2017-10-10T18:20:09Z

@siu That still did not fix the problem. There are still zero cache hits on the concurrent test (ignore the stdout tests, they're not relevant).

frerich · 2017-10-11T06:38:48Z

clcache.py

-            baseCmdLine.append(arg)
+    # type: (???, str, List[str], ???, List[Tuple[str, str]], List[str]) -> int
+    # Filter out all source files from the command line to form baseCmdLine
+    baseCmdLine = [arg for arg in filterSourceFiles(cmdLine, sourceFiles) if arg != '/MP']


The arg != '/MP' test is what breaks the TestRunParallel.testHitsViaMpConcurrent test.

The test first compiles two source files individually using the command lines

/nologo /EHsc /c fibonacci01.cpp

/nologo /EHsc /c fibonacci02.cpp

The test then tries to compile both files again via

/nologo /EHsc /c /MP2 fibonacci01.cpp fibonacci02.cpp

...expecting that this triggers two cache hits. This causes clcache to reinvoke itself recursively for each of the source files. However, due to the arg != '/MP' test, the recursive invocations still include the /MP2 flag, i.e. the two nested instances use the command lines

/nologo /EHsc /c /MP2 fibonacci01.cpp

/nologo /EHsc /c /MP2 fibonacci02.cpp

Thus, the command line are different than they were when the files were originally compiled, hence the hash sum for the cache entry is different and hence there is no cache hit.

It seems this can be fixed by using not args.startswith('/MP') instead of arg != '/MP'.

The lists are modified to use the new tuple format

ghost · 2017-10-11T18:23:29Z

Okay, this should finally be ready to merge.

frerich · 2017-10-12T06:23:35Z

Looks good, thanks! Just two questions which cam to my mind during a last review:

I see you're adding 'type specifiers' like # type: (str) -> None in a couple of places. Are these processed by some sort of static analysis tool? For this PR, I'd like to keep them out - but if they are processed by a tool then a separate PR might be nice which also etends the CI system such that it runs the tool.
You modified the printTraceStatement function along the way -- did you mean to include this change in this PR? It might be useful (I personally don't really need the line numbers, but I guess they also don't hurt) but a bit out of scope for this PR.

ghost · 2017-10-12T14:05:50Z

You modified the printTraceStatement function along the way -- did you mean to include this change in this PR? It might be useful (I personally don't really need the line numbers, but I guess they also don't hurt) but a bit out of scope for this PR.

Done.

I see you're adding 'type specifiers' like # type: (str) -> None in a couple of places. Are these processed by some sort of static analysis tool? For this PR, I'd like to keep them out - but if they are processed by a tool then a separate PR might be nice which also etends the CI system such that it runs the tool.

First, these type specifiers are the pet project of the BDFL himself (see mypy). I won't provide an in-depth explanation here, but the short version is that most developers spend time reading code to understand what is happening (as I had to do here to repair the bug, and you had to do to review my changes), and the biggest problem with Python is that you have to figure out what the types are of the function arguments to understand what is happening. Static typing solves that problem: it's not for computers, it's for people.

For this PR, I'd like to keep them out

I'm going to disagree here. Of course, at the end of the day, it's your project, so if you insist, I will remove them. But what I will say is that the typing is done per-file, and it won't be useful until most of the file has it. That means that you need to build the typing over time (for example, the question marks ??? aren't even valid, but I did not know the type, so I put them there so say that) until you have something reasonable that mypy can check (you don't even need to use mypy because as I said, the type annotations are for people, but it's there as an option). Every time to work on a function, you have the typing knowledge in your mind, so you might as well spend an extra few seconds to write it down. But if you're trying to type the entire file in one go, then you don't have that knowledge, and therefore you need to learn the types by checking the caller (which is what is done without type annotations), and therefore it's not just "an extra few seconds."

frerich · 2017-10-12T15:58:14Z

First, these type specifiers are the pet project of the BDFL himself (see mypy). I won't provide an in-depth explanation here, but the short version is that most developers spend time reading code to understand what is happening (as I had to do here to repair the bug, and you had to do to review my changes), and the biggest problem with Python is that you have to figure out what the types are of the function arguments to understand what is happening. Static typing solves that problem: it's not for computers, it's for people.

As someone who greatly enjoys writing programs in Haskell in his free time and C++ for a living, I very much agree - the lack of static typing is one of my main griefs with Python. Thanks for mentioning mypy, I wasn't even aware of that effort - it sounds exciting! I'd love to learn how expressive it is.

After looking at the mypy homepage, it seems to me that it's built on top of type hints as mentioned in PEP 483, PEP 484 and variable annotations as described in PEP 526. The typing module seems to implement this as of Python 3.5. Does that sounds accurate?

If so, what do you think about omitting the comments you wrote from the PR for now and I'll instead raise the minimum supported Python version to 3.5 such that we can use 'real' type annotations as understood by mypy (and we can then also introduce mypy to the CI builds)? My impression is that the comments are a bit of a cludge due to the fact that we currently have to support Python 3.3. Or is there maybe a way to get type hints prior to Python 3.5?

I greatly appreciate you contributing this type information, it just seems to me that 1.) the type hints are a bit 'out of scope' for this particular PR which attempts to address a very specific bug (#289) and 2.) my current understanding is that we could have 'real' type hints (not in a comment) which would require dropping support for Python < 3.5. What do you think?

ghost · 2017-10-12T16:04:23Z

I greatly appreciate you contributing this type information, it just seems to me that 1.) the type hints are a bit 'out of scope' for this particular PR which attempts to address a very specific bug (#289) and 2.) my current understanding is that we could have 'real' type hints (not in a comment) which would require dropping support for Python < 3.5. What do you think?

There's no need to drop support for Python < 3.5, as Python 3 in general supports static typing. There is a need to "fix" the package format (#288) before switching to the nicer inline type annotations, but removing the comments will just mean that someone has to re-discover the types the next time around. And the comments are real type annotations for Python 2 that are fully supported by mypy (just not the question marks: ???).

frerich · 2017-10-12T16:05:57Z

Ah, I just skimmed the mypy examples and noticed that mypy seems to recognise both comments as in d = {} # type: Dict[str, int] as well as what I thought were called type hints in def overdrawn(self) -> bool. And in PEP 483, the pragmatics section says

Type declaration in comments, e.g.:

lst = [] # type: Sequence[int]

...so that would kinda answer my question: this # type: ... notation is not something you made up yourself but it's actually a convention recognised by tools? Nice!

frerich · 2017-10-12T16:08:36Z

The Fundamental building blocks section says that

Every type is consistent with Any; and it is also consistent with every type (see above).

...so instead of ??? you could maybe use Any?

ghost · 2017-10-12T16:11:45Z

Will do.

frerich · 2017-10-12T16:13:58Z

For the record, there's a dedicated discussion of type comments as one way to articulate type hints.

ghost · 2017-10-12T17:37:52Z

Updated.

ghost · 2017-10-12T18:16:20Z

Test failure appears unrelated.

frerich · 2017-10-12T18:44:17Z

Yes, the test failure should be fixed in master already after merging @sui's PR #292 . I think we're good to go - thanks a lot for your persistence (and patience)! 👍

frerich reviewed Oct 7, 2017

View reviewed changes

xoviat added 2 commits October 7, 2017 16:56

BUG: improve handling of specified sources

40cd447

BUG: revert f0227bb

cc8a52b

xoviat added 5 commits October 9, 2017 11:53

BUG: revise analyze method to detect language

ba1f7d1

MAINT: cleanup zip iterator

3df60a6

BUG: fix syntax error

662f7ed

MAINT: cleanup formBaseCommandLine

bbdd257

🎨

94218fd

frerich reviewed Oct 9, 2017

View reviewed changes

xoviat added 2 commits October 9, 2017 13:51

MAINT: cleanup source file filtering

f69b709

MAINT: improve jobCmdLine readability

df74660

frerich added the bugfix label Oct 9, 2017

TST: fix existing unittests

0aac12b

frerich reviewed Oct 9, 2017

View reviewed changes

xoviat added 3 commits October 9, 2017 19:00

appveyor: define CLCACHE_LOG

ce9de5e

TST: printTraceStatement: print caller line

ec0fae1

BUG: fix printTraceStatement

2e10786

siu reviewed Oct 10, 2017

View reviewed changes

FIX: switch to OrderedDict

019ae2f

FIX: move import

6c3a855

frerich reviewed Oct 11, 2017

View reviewed changes

xoviat added 6 commits October 11, 2017 08:15

FIX: check whether cmdLine arg startswith /MP

56be5e2

FIX: revert use of OrderedDict

5de5fcc

FIX: appveyor: revert CLCACHE_LOG

725410e

TST: revert changes

19e7126

TST: unit: correct comparisons

9079f32

The lists are modified to use the new tuple format

TST: integration: print output

7fb4480

siu mentioned this pull request Oct 11, 2017

Revert reading manifests and statistics without acquiring the lock #292

Merged

xoviat added 2 commits October 11, 2017 10:15

TST: unit: add some reasonable tests for filterSourceFiles

4a7ee82

TST: unit: fix

34fcb5d

xoviat added 2 commits October 12, 2017 08:52

FIX: revert changes to printTraceStatement

9ebb89b

FIX: remove unused import

084b23b

FIX: update ??? to type: Any

275ae15

frerich merged commit 3ef6643 into frerich:master Oct 12, 2017

ghost deleted the issue-289 branch October 12, 2017 18:44

BUG: fix source file duplication in command line #290

BUG: fix source file duplication in command line #290

Uh oh!

Conversation

ghost commented Oct 7, 2017

Uh oh!

ghost commented Oct 7, 2017

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ghost commented Oct 7, 2017

Uh oh!

frerich commented Oct 7, 2017

Uh oh!

ghost commented Oct 7, 2017

Uh oh!

frerich commented Oct 9, 2017

Uh oh!

ghost commented Oct 9, 2017

Uh oh!

frerich left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ghost commented Oct 9, 2017

Uh oh!

frerich commented Oct 9, 2017

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ghost commented Oct 9, 2017 • edited by ghost Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ghost commented Oct 10, 2017

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ghost commented Oct 10, 2017 • edited by ghost Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

frerich Oct 11, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ghost commented Oct 11, 2017

Uh oh!

frerich commented Oct 12, 2017

Uh oh!

ghost commented Oct 12, 2017

ghost commented Oct 9, 2017 •

edited by ghost

Loading

ghost commented Oct 10, 2017 •

edited by ghost

Loading

frerich Oct 11, 2017 •

edited

Loading

ghost commented Oct 12, 2017 •

edited by ghost

Loading