Implement second-order jack-knife lower bound on number of branches #44
Conversation
Liam-DeVoe left a comment
Thanks @CyberiaResurrection! Could you also sign the CLA by adding your name?
I started to make some cleanup changes to this branch, before realizing I'm not actually a maintainer. @Zac-HD do you think I could get perms? 😄
```python
@property
def singletons(self) -> int:
    # Because _every_ arc hit is counted at least once, singletons are those
    # arcs that have that base hit, and then _one_ more on top of it.
    singletons = [item for item in self.overall_arc_counts.values() if 2 == item]
```
Is this true? I believe we only store arcs that we have executed (via CustomCollectionContext), which means singletons should be those with a count of 1, right?
Arcs outside of the package, like in Python itself, are also counted, and (since they're only run once during a given run) those will only ever have a count of 1.
We should still use == 1 for singletons, and deal with "branches run only once" as I describe here by only counting branches which we can hit again by replaying the same input.
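(A toy sketch of the two counting conventions under debate here, using invented data; the `overall_arc_counts` structure is assumed from the diff above, and the arc names are made up.)

```python
from collections import Counter

# With a guaranteed "base" observation per executed arc, a true singleton
# shows up with a count of 2 ("arc-a" and "arc-c" here); with plain counting
# from zero, it would show up with a count of 1 instead.
overall_arc_counts = Counter({"arc-a": 2, "arc-b": 5, "arc-c": 2})

singletons_with_base_hit = [a for a, c in overall_arc_counts.items() if c == 2]
singletons_plain = [a for a, c in overall_arc_counts.items() if c == 1]
assert singletons_with_base_hit == ["arc-a", "arc-c"]
assert singletons_plain == []
```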
I think I've now done that in "Update CONTRIBUTING.md". Anything else you guys need?
I wrote up a longer note in #11 (comment), but basically I'm not convinced that we can estimate reachable coverage accurately enough to be useful. Papers on this tend to start thousands of inputs into the run, and still see +100% / -50% variation against ground truth. Rather than estimators, I'd prefer to aim for observability that makes it easy for users to see what's currently covered vs what isn't, e.g. through a highlighted-and-numbered lines view, and then let the graph of (improved!) coverage vs log-time speak for itself.
(and, uh... it really sucks to have done a bunch of work and then have me pop in and argue against including it. I'm sorry it took me so long to reply to the conversation; I really appreciate your engagement and contributions, and hope this isn't off-putting 😭)
That is a bit off-putting, @Zac-HD. Could some of my work in this PR be salvaged? (e.g. the singleton/doubleton counts)
How about only reporting estimated coverage when (say) 10k inputs into a run? With one instance I've got locally, I'm getting 20k inputs into a run on 24.9.1 before 3 minutes are up.
(sorry for vanishing; I'm freshly back from a long family camping trip) Coming back to this again, I'm going mostly off this paper - I think that showing estimates after 10k examples would make sense. I'd also be inclined to go for the jack-knife estimators over Chao, but better yet would be if we can record the number of (working out how to compute a decent estimator across maybe-different restarts seems pretty annoying, really, but it's something I'd like to try eventually)
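(For contrast with the jack-knife, the Chao lower bound mentioned above is a one-liner; this sketch uses the standard Chao1 formula with a function name of my own, not anything from the codebase.)

```python
def chao1_estimate(observed: int, f1: int, f2: int) -> float:
    """Chao1 lower bound on the total number of branches, where f1 and f2
    are the singleton and doubleton counts respectively."""
    if f2 == 0:
        # Bias-corrected variant, which avoids dividing by zero.
        return observed + f1 * (f1 - 1) / 2
    return observed + f1**2 / (2 * f2)
```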
Sorry for dropping off this conversation as well; Zac knows this area better than I do, so I'll follow his recommendation here. I do think we should focus on better tests and heuristics for coverage exclusion soon, as I saw some somewhat-worrying one-off compiled lines getting classified as singletons sometimes-but-not-always when testing this locally (which is not the fault of this pull!).
Yeah, memoization will create a lot of "fake singletons" if handled naively; we probably need to do some coverage-stability stuff and only consider n-tons in terms of the number of distinct inputs that trigger them (rather than raw observation count), and maybe also only for stable coverage. See #5 for brief notes on that.
@Zac-HD, I'll add tripleton and quadrupleton counts (since they were used in at least one measure in at least one of Boehme's papers), then change the bound from Chao to (second-order?) jack-knife. At the moment I have no idea how to tweak the docs to account for this change.
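(A sketch of the second-order jack-knife estimator referred to above, using the standard Burnham & Overton formula; the function name and signature are mine, not the PR's API.)

```python
def jackknife2_estimate(observed: int, f1: int, f2: int, n: int) -> float:
    """Second-order jack-knife estimate of the total number of branches,
    given the observed branch count, singletons f1, doubletons f2, and
    n inputs run so far."""
    if n < 2:
        return float(observed)
    return observed + f1 * (2 * n - 3) / n - f2 * (n - 2) ** 2 / (n * (n - 1))
```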
Boehme notes (in "Reachable Coverage: Estimating Saturation in Fuzzing") that all SOTA estimators have serious biases until many thousands of runs are gathered. This commit attempts to mitigate the worst of those biases by waiting for 10,000 runs to be gathered.
@tybug, this is with singletons requiring two hits, as per the property definition?
That's with singletons defined as one execution. Are singletons defined as two executions in this pull to avoid this issue? If so, I'd prefer to improve the underlying coverage counts first (with e.g. coverage blacklists or Zac's suggestions) before reporting a metric which is actually based on n+1-tons and may be quite off as a result.
Yes, in this PR, singletons require two hits, doubletons three, etc.
Well, for a start, @tybug and @Zac-HD, what sorts of coverage should be blacklisted?
Ah, I think we have a confusion of definitions here. If I'm right, the singletons we want are branches executed by exactly one distinct input, not branches with a raw execution count of one.
So... we need to track some notion of "which inputs have executed this branch", at least for branches where that number is small, so that we can count the singletons! It's probably OK to revert back to counting the number of times executed once we've seen >5 distinct inputs hit the branch; that's much cheaper, and at that point we can assume some diversity I think.
The downside is basically that this is a bunch more work, but I think that's what it takes to get a useful estimator (and incidentally some progress towards #5 🙏).
@Zac-HD - is reporting the 2nd-order jack-knife estimator as in this PR, after 10k inputs, a worthwhile, if small, intermediate step towards better estimators? Are you wanting to track inputs triggering specific branches because (iirc) that's how AFL et al do it?
Yes! I'm keen to get the estimator in. I also think it's important to feed it the correct observational data - GIGO, as the saying goes.
More "I want to do it the way that works, and AFL et al happen to do the same thing because it works". Imagine for example that Hypofuzz immediately replayed each input which reached new coverage (we don't currently do this, but we might randomly replay some old inputs soon). If we're using execution count to define singletons then there won't be any! So we'd better switch to tracking the number of distinct inputs; we can safely assume that the inputs we're predicting will be unique (not quite true but close enough, and we can multiply by the duplication-rate after estimating if we want a correction), but we can't assume that past inputs are all unique because Hypothesis will moderately-often replay past inputs for various reasons including those in #5. |
@Zac-HD, fair enough, you've made your point. Do you have any suggestions for tracking the number of distinct inputs?
@tybug - what's the best way to derive a hash from a choice sequence these days? And then, instead of a bare execution count, something like:

```python
def add_branch_counts(counts, branches, hash_of_this_input, *, limit=5) -> None:
    for branch in branches:
        seen = counts.setdefault(branch, set())
        if isinstance(seen, int):
            # Past the limit: we've stopped tracking hashes, so just count executions.
            counts[branch] += 1
        else:
            seen.add(hash_of_this_input)
            if len(seen) >= limit:
                # Seen enough distinct inputs; collapse to a cheap integer count.
                counts[branch] = limit
```
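(A hypothetical usage of the sketch above, with invented hashes and branch names, showing how n-ton counts would then be read off the mixed set/int values.)

```python
counts: dict = {}
for input_hash, branches in [
    ("hash-a", ["branch-1", "branch-2"]),
    ("hash-b", ["branch-2"]),
]:
    add_branch_counts(counts, branches, input_hash)

def n_ton_count(counts: dict, n: int) -> int:
    # Distinct-input count: a set while below the limit, an int afterwards.
    return sum(
        1
        for seen in counts.values()
        if (len(seen) if isinstance(seen, set) else seen) == n
    )

assert n_ton_count(counts, 1) == 1  # branch-1 is a singleton
assert n_ton_count(counts, 2) == 1  # branch-2 is a doubleton
```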
As mentioned in #11 (comment), add singleton and doubleton counting to Pool. After that, calculate the second-order jack-knife bound on the number of branches after the later of: 10k inputs into a run, or all previously-saved examples having been replayed.