Skip to content
This repository was archived by the owner on Feb 4, 2020. It is now read-only.

Conversation

@ghost
Copy link

@ghost ghost commented Oct 7, 2017

Previously, if the /Tp or /Tc option were passed, clcache would
duplicate the source file in the command line, making MSVC
think that it was compiling two separate files (/Tcexample.c
and example.c), when it was really compiling one.

The problem with this was that MSVC only allows some
options if it is compiling one file, which caused some
invocations of clcache to fail.

Closes #289.

Previously, if the /Tp or /Tc option were passed, clcache would
duplicate the source file in the command line, making MSVC
think that it was compiling two separate files (/Tcexample.c 
and example.c), when it was really compiling one.

The problem with this was that MSVC only allows some
options if it is compiling one file, which caused some
invocations of clcache to fail.

Closes #289.
@ghost
Copy link
Author

ghost commented Oct 7, 2017

Test failure seems unrelated.

clcache.py Outdated

sourceFile = realCmdline[-1]
if '/Tc' + sourceFile in realCmdline or '/Tp' + sourceFile in realCmdline:
printTraceStatement("Removing last argument becuase of /Tc (Issue #289)")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small typo here (becuase) and I guess /Tc is imprecise because it seems this could also be because of /Tp.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to ignore this for now because it's not relevant to the discussion.

clcache.py Outdated
sourceFile = realCmdline[-1]
if '/Tc' + sourceFile in realCmdline or '/Tp' + sourceFile in realCmdline:
printTraceStatement("Removing last argument becuase of /Tc (Issue #289)")
realCmdline = realCmdline[:-1]
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it really safe to assume that the last argument is always a source file argument? I wonder whether maybe this fix should be done on the caller side - but it's not clear to me on first glance where clcache would implicitly add a source file argument to the command line.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it really safe to assume that the last argument is always a source file argument?

No. This is an example that fixes the problem but I had hoped that you would have a suggestion about a better fix.

I wonder whether maybe this fix should be done on the caller side

Yes. Do you have any pointers for where to look?

@ghost
Copy link
Author

ghost commented Oct 7, 2017

@frerich Found it (I think):

https://github.com/xoviat/clcache/blob/f0227bb98a16a9804e2b71ef2a6d100c1c7b2719/clcache.py#L1633

Would it be okay to move the if statement there?

if '/Tc' + sourceFile in realCmdline or '/Tp' + sourceFile in realCmdline:

@frerich
Copy link
Owner

frerich commented Oct 7, 2017

@xoviat Good catch, that place indeed looks like it might end up constructing incorrect command lines like the one you noticed. I think an if statement along the lines of what you propose would be much more appropriate in scheduleJobs since that's where the issue is actually introduced right now.

However, maybe it would be even nicer to not use an if but rather reuse the knowledge about which source files are specified via /Tp resp. /Tc: the program already knows this in CommandLineAnalyzer.analyze but discards that knowledge. The function only yields the list of source files but does not track which language they use. This knowledge could be used in scheduleJobs to construct a baseCmdLine value which not only omits all 'plain' source files but also those specified via /Tc and /Tp.

@ghost
Copy link
Author

ghost commented Oct 7, 2017

Test failure is due to "too many local variables." Not sure what to do about that.

@frerich
Copy link
Owner

frerich commented Oct 9, 2017

The typical fix for the "too many local variables" error is, well, to not use as many local variables. :-) Instead of removing variables though, it's usually better to rather introduce a new function (i.e. the check is a kind of code complexity measure).

However, I must admit that I like the older version with the if better than the new one. My idea triggering the suggestion to reuse the knowledge from CommandLineAnalyzer.analyze was that there are actually two issues right now:

  1. The code which constructs the baseCmdLine is defective: it attempts to filter all source files from the command line (such that lateron it's safe to add a source file to the command line) but fails to do so for source files which were specified via /Tcxxx or /Tpxxx. Furthermore, if an optional space is used (e.g. /Tc xxx or /Tp xxx), I believe the filtered command is even incorrect since e.g. cl /Tp foo.cpp will get turned into cl /Tc foo.cpp if I read correctly.

  2. The code which builds the jobCmdLine is defective, too: it fails to consider any explicitly specified language (i.e. /Tc or /Tp) for the source file appended to the command line. So when invoking clcache like e.g. cl /Tc a.x /Tp b.c I suspect that the jobCmdLine values will compiler will be invoked like cl a.x and cl b.c, the former probably triggering an error and the latter (incorrectly) deducing C (instead of the specified C++) from the file extension .c.

My hope was that it might be viable to change CommandLineAnalyzer.analyze such that instead of a plain list of source files, it also returns an explicitly configured language (if any) for the source files. E.g. cl a.cpp /Tpx.c /Tc foo.y yields a list of tuples like [('a.cpp', ''), ('x.c', 'Tp'), ('foo.y', 'Tc')] or something like that. And this language information could then be used both when filtering the command line (to produce the baseCmdLine value) as well as when appending a single source file to the base command line (because by having the language, the code ensures to not append e.g. just foo.y but rather /Tc foo.y).

@ghost
Copy link
Author

ghost commented Oct 9, 2017

@frerich Before I fix up the unit tests, does that seem okay?

Copy link
Owner

@frerich frerich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think your latest version is a lot better already! However, the formBaseCmdLine seems tricky enough to justify a few unit tests in unittests.py.

clcache.py Outdated

def formBaseCommandLine(cmdLine, sourceFiles):
# type: (List[str], List[Tuple[str, str]]) -> List[str]
setOfSources = set([sourceFile for sourceFile, _ in sourceFiles])
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor cosmetic: you could shorten this by using a generator expression:

setOfSources = set(sourceFile for sourceFile, _ in sourceFiles)

clcache.py Outdated
def formBaseCommandLine(cmdLine, sourceFiles):
# type: (List[str], List[Tuple[str, str]]) -> List[str]
setOfSources = set([sourceFile for sourceFile, _ in sourceFiles])
skippedArgs = ('/MP', '/Tc', '/Tp')
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that -Tc and -Tp work, too. At least we support either form in CommandLineAnalyzer.analyze, and with Visual Studio 2013 (the only compiler I have at hand right now) you can indeed use either form.

clcache.py Outdated
baseCmdLine = [
arg for arg in cmdLine
if not (arg in setOfSources or arg.startswith(skippedArgs))
]
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This list comprehension works for cases like /Tcfoo.c but it's also possible to have an optional(!) space, i.e. /Tc foo.c. In this case both elements should get removed (since just /Tc on its own is a command line error).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, but in the second case, it would be in setOfSources.

clcache.py Outdated
jobs = []
for srcFile, objFile in zip(sourceFiles, objectFiles):
jobCmdLine = baseCmdLine + [srcFile]
for srcFile, srcLanguage, objFile in zip(zip(*sourceFiles), objectFiles):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This double-zip is confusing to me. How about just

for (srcFile, srcLanguage), objFile in zip(sourceFiles, objectFiles):

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

clcache.py Outdated
for srcFile, objFile in zip(sourceFiles, objectFiles):
jobCmdLine = baseCmdLine + [srcFile]
for srcFile, srcLanguage, objFile in zip(zip(*sourceFiles), objectFiles):
jobCmdLine = baseCmdLine + list(filter(None, [srcLanguage, srcFile]))
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since srcLanguage is either an empty string or one of /Tp or /Tc, couldn't you drop the composition of list and filter and just use

jobCmdLine = baseCmdLine + [srcLanguage + srcFile]

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

if not (arg in setOfSources or arg.startswith("/MP")):
baseCmdLine.append(arg)
# type: (???, str, List[str], ???, List[Tuple[str, str]], List[str]) -> int
# Filter out all source files from the command line to form baseCmdLine
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this comment is misleading: the formBaseCommandLine function does not only filter out all source files, it also filters out /MP. However, I agree that the function should filter out just the source files because then it could get the more descriptive name filterSourceFiles and the /MP could be filtered on the caller side:

baseCmdLine = [arg for arg in filterSourceFiles(cmdLine, sourceFiles) if arg != '/MP']

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

@ghost
Copy link
Author

ghost commented Oct 9, 2017

@frerich Third round of concerns addressed.

@frerich
Copy link
Owner

frerich commented Oct 9, 2017

Thanks for your persistence! 👍 I think by now, the patch looks plausible and conceptually consistent. I believe that after fixing up the unit tests (and maybe adding four, five small tests for filterSourceFiles) this stuff is ready to get merged. It technically fixes two issues by now: it avoids constructing invalid command lines in which a source file is mentioned twice, and it ensures that when compiling in parallel, each source file gets the correct input language specified.

@frerich frerich added the bugfix label Oct 9, 2017
unittests.py Outdated
def _testFull(self, cmdLine, expectedSourceFiles, expectedOutputFile):
sourceFiles, outputFile = CommandLineAnalyzer.analyze(cmdLine)
self.assertEqual(sourceFiles, expectedSourceFiles)
self.assertEqual([s for s, _ in sourceFiles], expectedSourceFiles)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be nicer to rather adjust the caller of _testFull such that expectedSourceFiles is actually a list of tuples, specifying the expected language (if any) specification for every source file?

@ghost
Copy link
Author

ghost commented Oct 9, 2017

Unfortunately, one of the integration tests failed. I'm not sure I understand this failure but the cache statistics may be omitting the fully specified source file arguments. I looked at numCacheHits and could not find how it was calculated.

@ghost
Copy link
Author

ghost commented Oct 10, 2017

In fact clcache appears not to reuse the cache files. I looked at this some more and unfortunately it appears to be a bit too dense for me.

compl = True

# Now collect the inputFiles into the return format
inputFiles = list(inputFiles.items())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could it be that the failure in tests is introduced here because inputFiles.items() does not always return the items in the same order? either sorting them or using OrderedDict in line 1253 could help.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will try that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if it didn't fix the tests I think it still makes sense to keep the OrderedDict. The output of analyze() should return the same order across executions 1) to keep the tests passing and 2) to generate the same hash for the compilation task (if it is computed after, I am not sure about this). I don't know enough about this part of the code, to be checked with @frerich.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test failures have nothing to do with the order, and if I understand correctly, the hash is calculated per-file ("This causes clcache to reinvoke itself recursively for each of the source files"), making the order irrelevant. There is no reason to tell Python to preserve the order when it doesn't matter.

@ghost
Copy link
Author

ghost commented Oct 10, 2017

@siu That still did not fix the problem. There are still zero cache hits on the concurrent test (ignore the stdout tests, they're not relevant).

clcache.py Outdated
baseCmdLine.append(arg)
# type: (???, str, List[str], ???, List[Tuple[str, str]], List[str]) -> int
# Filter out all source files from the command line to form baseCmdLine
baseCmdLine = [arg for arg in filterSourceFiles(cmdLine, sourceFiles) if arg != '/MP']
Copy link
Owner

@frerich frerich Oct 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The arg != '/MP' test is what breaks the TestRunParallel.testHitsViaMpConcurrent test.

The test first compiles two source files individually using the command lines

  1. /nologo /EHsc /c fibonacci01.cpp
  2. /nologo /EHsc /c fibonacci02.cpp

The test then tries to compile both files again via

  • /nologo /EHsc /c /MP2 fibonacci01.cpp fibonacci02.cpp

...expecting that this triggers two cache hits. This causes clcache to reinvoke itself recursively for each of the source files. However, due to the arg != '/MP' test, the recursive invocations still include the /MP2 flag, i.e. the two nested instances use the command lines

  1. /nologo /EHsc /c /MP2 fibonacci01.cpp
  2. /nologo /EHsc /c /MP2 fibonacci02.cpp

Thus, the command line are different than they were when the files were originally compiled, hence the hash sum for the cache entry is different and hence there is no cache hit.

It seems this can be fixed by using not args.startswith('/MP') instead of arg != '/MP'.

@ghost
Copy link
Author

ghost commented Oct 11, 2017

Okay, this should finally be ready to merge.

@frerich
Copy link
Owner

frerich commented Oct 12, 2017

Looks good, thanks! Just two questions which cam to my mind during a last review:

  1. I see you're adding 'type specifiers' like # type: (str) -> None in a couple of places. Are these processed by some sort of static analysis tool? For this PR, I'd like to keep them out - but if they are processed by a tool then a separate PR might be nice which also etends the CI system such that it runs the tool.

  2. You modified the printTraceStatement function along the way -- did you mean to include this change in this PR? It might be useful (I personally don't really need the line numbers, but I guess they also don't hurt) but a bit out of scope for this PR.

@ghost
Copy link
Author

ghost commented Oct 12, 2017

You modified the printTraceStatement function along the way -- did you mean to include this change in this PR? It might be useful (I personally don't really need the line numbers, but I guess they also don't hurt) but a bit out of scope for this PR.

Done.

I see you're adding 'type specifiers' like # type: (str) -> None in a couple of places. Are these processed by some sort of static analysis tool? For this PR, I'd like to keep them out - but if they are processed by a tool then a separate PR might be nice which also etends the CI system such that it runs the tool.

First, these type specifiers are the pet project of the BDFL himself (see mypy). I won't provide an in-depth explanation here, but the short version is that most developers spend time reading code to understand what is happening (as I had to do here to repair the bug, and you had to do to review my changes), and the biggest problem with Python is that you have to figure out what the types are of the function arguments to understand what is happening. Static typing solves that problem: it's not for computers, it's for people.

For this PR, I'd like to keep them out

I'm going to disagree here. Of course, at the end of the day, it's your project, so if you insist, I will remove them. But what I will say is that the typing is done per-file, and it won't be useful until most of the file has it. That means that you need to build the typing over time (for example, the question marks ??? aren't even valid, but I did not know the type, so I put them there so say that) until you have something reasonable that mypy can check (you don't even need to use mypy because as I said, the type annotations are for people, but it's there as an option). Every time to work on a function, you have the typing knowledge in your mind, so you might as well spend an extra few seconds to write it down. But if you're trying to type the entire file in one go, then you don't have that knowledge, and therefore you need to learn the types by checking the caller (which is what is done without type annotations), and therefore it's not just "an extra few seconds."

@frerich
Copy link
Owner

frerich commented Oct 12, 2017

First, these type specifiers are the pet project of the BDFL himself (see mypy). I won't provide an in-depth explanation here, but the short version is that most developers spend time reading code to understand what is happening (as I had to do here to repair the bug, and you had to do to review my changes), and the biggest problem with Python is that you have to figure out what the types are of the function arguments to understand what is happening. Static typing solves that problem: it's not for computers, it's for people.

As someone who greatly enjoys writing programs in Haskell in his free time and C++ for a living, I very much agree - the lack of static typing is one of my main griefs with Python. Thanks for mentioning mypy, I wasn't even aware of that effort - it sounds exciting! I'd love to learn how expressive it is.

After looking at the mypy homepage, it seems to me that it's built on top of type hints as mentioned in PEP 483, PEP 484 and variable annotations as described in PEP 526. The typing module seems to implement this as of Python 3.5. Does that sounds accurate?

If so, what do you think about omitting the comments you wrote from the PR for now and I'll instead raise the minimum supported Python version to 3.5 such that we can use 'real' type annotations as understood by mypy (and we can then also introduce mypy to the CI builds)? My impression is that the comments are a bit of a cludge due to the fact that we currently have to support Python 3.3. Or is there maybe a way to get type hints prior to Python 3.5?

I greatly appreciate you contributing this type information, it just seems to me that 1.) the type hints are a bit 'out of scope' for this particular PR which attempts to address a very specific bug (#289) and 2.) my current understanding is that we could have 'real' type hints (not in a comment) which would require dropping support for Python < 3.5. What do you think?

@ghost
Copy link
Author

ghost commented Oct 12, 2017

I greatly appreciate you contributing this type information, it just seems to me that 1.) the type hints are a bit 'out of scope' for this particular PR which attempts to address a very specific bug (#289) and 2.) my current understanding is that we could have 'real' type hints (not in a comment) which would require dropping support for Python < 3.5. What do you think?

There's no need to drop support for Python < 3.5, as Python 3 in general supports static typing. There is a need to "fix" the package format (#288) before switching to the nicer inline type annotations, but removing the comments will just mean that someone has to re-discover the types the next time around. And the comments are real type annotations for Python 2 that are fully supported by mypy (just not the question marks: ???).

@frerich
Copy link
Owner

frerich commented Oct 12, 2017

Ah, I just skimmed the mypy examples and noticed that mypy seems to recognise both comments as in d = {} # type: Dict[str, int] as well as what I thought were called type hints in def overdrawn(self) -> bool. And in PEP 483, the pragmatics section says

Type declaration in comments, e.g.:

lst = [] # type: Sequence[int]

...so that would kinda answer my question: this # type: ... notation is not something you made up yourself but it's actually a convention recognised by tools? Nice!

@frerich
Copy link
Owner

frerich commented Oct 12, 2017

The Fundamental building blocks section says that

Every type is consistent with Any; and it is also consistent with every type (see above).

...so instead of ??? you could maybe use Any?

@ghost
Copy link
Author

ghost commented Oct 12, 2017

Will do.

@frerich
Copy link
Owner

frerich commented Oct 12, 2017

For the record, there's a dedicated discussion of type comments as one way to articulate type hints.

@ghost
Copy link
Author

ghost commented Oct 12, 2017

Updated.

@ghost
Copy link
Author

ghost commented Oct 12, 2017

Test failure appears unrelated.

@frerich frerich merged commit 3ef6643 into frerich:master Oct 12, 2017
@frerich
Copy link
Owner

frerich commented Oct 12, 2017

Yes, the test failure should be fixed in master already after merging @sui's PR #292 . I think we're good to go - thanks a lot for your persistence (and patience)! 👍

@ghost ghost deleted the issue-289 branch October 12, 2017 18:44
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants