Conversation

@CyberiaResurrection CyberiaResurrection commented Dec 18, 2024

As mentioned in #11 (comment), add singleton and doubleton counting to Pool.
After that, calculate the second-order jack-knife lower bound on the number of branches, starting from the later of: 10k inputs into the run, or all previously-saved examples having been replayed.
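
For concreteness, the singleton/doubleton ("n-ton") counts can be derived from a per-arc hit counter in one pass. A minimal sketch (the `overall_arc_counts` name follows the PR; the helper itself is illustrative, not the PR's actual code):

```python
from collections import Counter

def n_ton_counts(overall_arc_counts: Counter, max_n: int = 2) -> dict[int, int]:
    """Return {n: number of arcs hit exactly n times} for n = 1..max_n.

    f[1] is the singleton count and f[2] the doubleton count, before any
    adjustment for the extra "base hit" discussed later in this thread.
    """
    freq_of_freqs = Counter(overall_arc_counts.values())
    return {n: freq_of_freqs.get(n, 0) for n in range(1, max_n + 1)}
```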

@CyberiaResurrection CyberiaResurrection marked this pull request as ready for review December 19, 2024 11:31
Collaborator

@Liam-DeVoe Liam-DeVoe left a comment


Thanks @CyberiaResurrection! Could you also sign the CLA by adding your name?

I started to make some cleanup changes to this branch, before realizing I'm not actually a maintainer. @Zac-HD do you think I could get perms? 😄

Comment on lines +138 to +142
    @property
    def singletons(self) -> int:
        # Because _every_ arc hit is counted at least once, singletons are those
        # arcs that have that base hit, and then _one_ more on top of it.
        singletons = [item for item in self.overall_arc_counts.values() if 2 == item]
Collaborator

Is this true? I believe we only store arcs that we have executed (via CustomCollectionContext), which means singletons should be those with a count of 1, right?


Arcs outside of the package, like in Python itself, are also counted; since they're only run once during a given run, they will only ever have a count of 1.

Owner

We should still use == 1 for singletons, and deal with "branches run only once" as I describe here by only counting branches which we can hit again by replaying the same input.

@CyberiaResurrection
Author

Thanks @CyberiaResurrection! Could you also sign the CLA by adding your name?

Think I've now done that in "Update CONTRIBUTING.md". Anything else you guys need?

@Zac-HD
Owner

Zac-HD commented Dec 23, 2024

I wrote up a longer note in #11 (comment), but basically I'm not convinced that we can estimate reachable coverage accurately enough to be useful. Papers on this tend to start thousands of inputs into the run, and still see +100% / -50% variation against ground truth.

I think rather than estimators, I'd rather aim for observability that makes it easy for users to see what's currently covered vs what isn't; e.g. through a highlight-and-number-lines view, and then let the graph of (improved!) coverage vs log-time speak for itself.

@Zac-HD
Owner

Zac-HD commented Dec 23, 2024

(and, uh... it really sucks to have done a bunch of work and then have me pop in and argue against including it. I'm sorry it took me so long to reply to the conversation; I really appreciate your engagement and contributions, and hope this isn't offputting 😭)

@CyberiaResurrection
Author

That is a bit off-putting, @Zac-HD . Could some of my work in this PR be salvaged? (eg the singleton/doubleton counts)

@CyberiaResurrection
Author

I wrote up a longer note in #11 (comment), but basically I'm not convinced that we can estimate reachable coverage accurately enough to be useful. Papers on this tend to start thousands of inputs into the run, and still see +100% / -50% variation against ground truth.

How about only reporting estimated coverage once we're (say) 10k inputs into a run? With one instance I've got locally, I'm getting 20k inputs into a run on 24.9.1 before 3 minutes are up.

@Zac-HD
Owner

Zac-HD commented Jan 6, 2025

(sorry for vanishing; I'm freshly back from a long family camping trip)

Coming back to this again, I'm going mostly off this paper - I think that showing estimates after 10k examples would make sense. I'd also be inclined to go for the jack-knife estimators over Chao, but better yet would be if we can record the number of n-tons (in the current run) for n=1,2,3... and thus compute whatever estimators we want after the fact.

(working out how to compute a decent estimator across maybe-different restarts seems pretty annoying, really, but it's something I'd like to try eventually)
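
For reference, the second-order jack-knife estimator discussed here has a standard closed form over the singleton and doubleton counts. A sketch of the textbook (Burnham–Overton) form, not necessarily exactly what the PR implements:

```python
def jackknife2_estimate(observed: int, f1: int, f2: int, n: int) -> float:
    """Second-order jack-knife estimate of total branch richness.

    observed -- distinct branches seen so far
    f1, f2   -- singleton and doubleton counts
    n        -- number of inputs (samples) drawn so far
    """
    if n < 2:
        return float(observed)
    return (
        observed
        + f1 * (2 * n - 3) / n
        - f2 * (n - 2) ** 2 / (n * (n - 1))
    )
```

For large n this approaches the familiar approximation `observed + 2*f1 - f2`.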

@Liam-DeVoe
Collaborator

Liam-DeVoe commented Jan 6, 2025

sorry for dropping off this conversation as well; zac knows this area better than I do, so I'll follow his recommendation here. I do think we should soon focus on better tests and heuristics for coverage exclusion, as when testing this locally I saw some somewhat worrying one-off compiled lines getting classified as singletons sometimes-but-not-always (which is not the fault of this pull!).

@Zac-HD
Owner

Zac-HD commented Jan 6, 2025

Yeah, memoization will create a lot of "fake singletons" if handled naively; we probably need to do some coverage-stability stuff and only consider n-tons in terms of the number of distinct inputs that trigger them (rather than raw observation count) and maybe also only for stable coverage. See #5 for brief notes on that.

@CyberiaResurrection
Author

@Zac-HD , I'll add tripleton and quadrupleton counts (since they were used in at least one measure in at least one of Boehme's papers), then change the bound from Chao to (second-order?) jack-knife.

At the moment I have no idea how to tweak the docs to account for this change.

Boehme notes (in "Reachable Coverage: Estimating Saturation in Fuzzing") that all SOTA estimators have serious biases until many thousands of runs are gathered. This commit attempts to mitigate the worst of those biases by waiting for 10,000 runs to be gathered.
@CyberiaResurrection CyberiaResurrection changed the title Implement Chao lower bound on number of branches Implement second-order jack-knife lower bound on number of branches Jan 7, 2025
@CyberiaResurrection
Author

as I saw some somewhat worrying one-off compiled lines getting classified as singletons sometimes-but-not-always when testing this locally (which is not the fault of this pull!).

@tybug , this is with singletons requiring two hits, as per the property definition?

@Liam-DeVoe
Collaborator

That's with singletons defined as one execution. Are singletons defined as two executions in this pull to avoid this issue? I'd prefer to improve the underlying coverage counts first if so (with e.g. coverage blacklists or zac's suggestions) before reporting a metric which is actually based on n+1-tons and may be quite off as a result.

@CyberiaResurrection
Author

That's with singletons defined as one execution. Are singletons defined as two executions in this pull to avoid this issue?

Yes, in this PR, singletons require two hits, doubletons three, etc.

I'd prefer to improve the underlying coverage counts first if so (with e.g. coverage blacklists or zac's suggestions) before reporting a metric which is actually based on n+1-tons and may be quite off as a result.

Well, for a start, @tybug and @Zac-HD , what sorts of coverage should be blacklisted?

@Zac-HD
Owner

Zac-HD commented Jan 11, 2025

Ah, I think we have a confusion of definitions here. If I'm right, I think

  • @CyberiaResurrection is defining a singleton as "a branch which has been observed in exactly two executions" (to avoid spurious cases where we're e.g. populating a cache)
  • I think we need to define a singleton as "a branch which is reproducibly executed by exactly one input"

So... we need to track some notion of "which inputs have executed this branch", at least for branches where that number is small, so that we can count the singletons! It's probably OK to revert back to counting number of times executed once we've seen >5 distinct inputs hit the branch; that's much cheaper and at that point we can assume some diversity I think.

  • currently
    • on any execution where we observe a previously-unseen branch, add it to the seed pool. Assume that replaying inputs will always have the same coverage, which is false (most often due to populating and then hitting a cache)
  • alternative
    • "forkserver trick" - execute each iteration in a fresh process. We'd never have cache hits from previous inputs (for better and worse), but it's tricky to set up + has performance implications + doesn't really work on Windows
  • my proposal
    • when we observe a new branch, don't add that input to the pool immediately - first try to replay it. See also my comments on "Report coverage stability in the dashboard" (#5).
      • if we get the same coverage the second time, add it to the pool.
      • otherwise, replay again - if the second to fourth are consistent and also new, add it to the pool for that rather than the first coverage.
      • if it's just inconsistent, add it to a new flaky-coverage-pool; it's still worth exploring sometimes I guess even though it's unclear how.
      • because we're working off the number of distinct inputs now, we need to know which inputs led to this branch. .arc_counts must therefore track set[hash-of[choice-sequence]]. For efficiency, I'd switch to | int once the set reaches five or maybe ten elements - this is going to be annoying to deal with, so maybe we make a custom class with nice accessors?
      • A singleton is then a branch with exactly one hash in the set, doubleton has two, etc.
      • Using uint32 hash digests so that we can afford to persist this between restarts etc. That's mostly useful when we're checking whether something changed between runs.
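
The replay-before-pooling step in the proposal above might look roughly like this (function name and return conventions are hypothetical, not existing Hypofuzz APIs):

```python
def classify_new_coverage(execute, choices, first_coverage, replays=3):
    """Replay an input that produced new coverage before trusting it.

    Returns ("stable", coverage) if the replays agree on a coverage set,
    or ("flaky", None) if coverage keeps changing (e.g. cache effects).
    """
    observed = [execute(choices) for _ in range(replays)]
    if all(cov == first_coverage for cov in observed):
        # Same coverage on replay: safe to add to the pool as-is.
        return ("stable", first_coverage)
    if all(cov == observed[0] for cov in observed):
        # The first run likely warmed a cache; the replays agree with
        # each other, so pool the input under the replayed coverage.
        return ("stable", observed[0])
    # Inconsistent across replays: route to a flaky-coverage pool.
    return ("flaky", None)
```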

The downside is basically that this is a bunch more work, but I think that's what it takes to get a useful estimator (and incidentally some progress towards #5 🙏).

@CyberiaResurrection
Author

@Zac-HD - is reporting the 2nd-order jack-knife estimator as in this PR, after 10k inputs, a worthwhile, if small, intermediate step towards better estimators?

Are you wanting to track inputs triggering specific branches because (iirc) that's how AFL et al do it?

@Zac-HD
Owner

Zac-HD commented Jan 12, 2025

is reporting the 2nd-order jack-knife estimator as in this PR, after 10k inputs, a worthwhile, if small, intermediate step towards better estimators?

Yes! I'm keen to get the estimator in. I also think it's important to feed it the correct observational data - GIGO, as the saying goes.

Are you wanting to track inputs triggering specific branches because (iirc) that's how AFL et al do it?

More "I want to do it the way that works, and AFL et al happen to do the same thing because it works". Imagine for example that Hypofuzz immediately replayed each input which reached new coverage (we don't currently do this, but we might randomly replay some old inputs soon). If we're using execution count to define singletons then there won't be any!

So we'd better switch to tracking the number of distinct inputs; we can safely assume that the inputs we're predicting will be unique (not quite true but close enough, and we can multiply by the duplication-rate after estimating if we want a correction), but we can't assume that past inputs are all unique because Hypothesis will moderately-often replay past inputs for various reasons including those in #5.

@CyberiaResurrection
Author

@Zac-HD , fair enough, you've made your point. Do you have any suggestions for tracking number of distinct inputs?

@Zac-HD
Owner

Zac-HD commented Jan 14, 2025

Do you have any suggestions for tracking number of distinct inputs?

@tybug - what's the best way to derive a hash from a choice sequence these days? set[bytes] worked with the buffer, but we want to treat False/0/0.0 distinctly.

and then instead of self.overall_arc_counts = Counter(), we'll want to make it a dict[Arc, int | set[the-hash-above]], and:

def add_branch_counts(counts, branches, hash_of_this_input, *, limit=5) -> None:
    for branch in branches:
        seen = counts.setdefault(branch, set())
        if isinstance(seen, int):
            # Already collapsed: just count raw executions from here on.
            counts[branch] += 1
        else:
            seen.add(hash_of_this_input)
            if len(seen) >= limit:
                # Enough distinct inputs seen - collapse the set to an int.
                counts[branch] = limit

@Liam-DeVoe
Collaborator

choices_key([1, True, ""]) (added/renamed in HypothesisWorks/hypothesis#4241) is intended for this purpose - and I am now realizing that it does not distinguish 0/1 and False/True, so I'll fix that in that pull by adding an extra key to bools.
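
As an illustration of the type-tagging idea (the real mechanism is Hypothesis's `choices_key`; this standalone sketch just shows why tagging values with their type avoids the 0 == False == 0.0 collisions, and truncates to a uint32 as suggested above):

```python
import hashlib

def choice_sequence_hash(choices) -> int:
    """Stable uint32 hash of a choice sequence, distinguishing 0/0.0/False.

    Tagging each value with its type name defeats Python's cross-type
    equality; sha256 keeps the digest stable across process restarts
    (unlike the salted built-in hash()).
    """
    key = "\x00".join(f"{type(c).__name__}:{c!r}" for c in choices)
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:4], "big")
```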
