Conversation

@CyberiaResurrection CyberiaResurrection commented Dec 18, 2024

As mentioned in #11 (comment), add singleton and doubleton counting to Pool.
After that, calculate the second-order jack-knife lower bound on the number of branches, starting from the later of: 10k inputs into the run, or all previously-saved examples having been replayed.
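
For concreteness, the singleton/doubleton ("n-ton") counts can be derived from a per-arc hit counter in one pass. A minimal sketch (the `overall_arc_counts` name follows the PR; the helper itself is illustrative, not the PR's actual code):

```python
from collections import Counter

def n_ton_counts(overall_arc_counts: Counter, max_n: int = 2) -> dict[int, int]:
    """Return {n: number of arcs hit exactly n times} for n = 1..max_n.

    f[1] is the singleton count and f[2] the doubleton count, before any
    adjustment for the extra "base hit" discussed later in this thread.
    """
    freq_of_freqs = Counter(overall_arc_counts.values())
    return {n: freq_of_freqs.get(n, 0) for n in range(1, max_n + 1)}
```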

@CyberiaResurrection CyberiaResurrection marked this pull request as ready for review December 19, 2024 11:31
Collaborator

@Liam-DeVoe Liam-DeVoe left a comment


Thanks @CyberiaResurrection! Could you also sign the CLA by adding your name?

I started to make some cleanup changes to this branch, before realizing I'm not actually a maintainer. @Zac-HD do you think I could get perms? 😄

Comment on lines +138 to +142
    @property
    def singletons(self) -> int:
        # Because _every_ arc hit is counted at least once, singletons are those
        # arcs that have that base hit, and then _one_ more on top of it.
        singletons = [item for item in self.overall_arc_counts.values() if 2 == item]
Collaborator

Is this true? I believe we only store arcs that we have executed (via CustomCollectionContext), which means singletons should be those with a count of 1, right?


Arcs outside of the package, like in Python itself, are also counted; since they're only run once during a given run, they will only ever have a count of 1.

Owner

We should still use == 1 for singletons, and deal with "branches run only once" as I describe here by only counting branches which we can hit again by replaying the same input.

@CyberiaResurrection
Author

Thanks @CyberiaResurrection! Could you also sign the CLA by adding your name?

Think I've now done that in "Update CONTRIBUTING.md". Anything else you guys need?

@Zac-HD
Owner

Zac-HD commented Dec 23, 2024

I wrote up a longer note in #11 (comment), but basically I'm not convinced that we can estimate reachable coverage accurately enough to be useful. Papers on this tend to start thousands of inputs into the run, and still see +100% / -50% variation against ground truth.

I think rather than estimators, I'd rather aim for observability that makes it easy for users to see what's currently covered vs what isn't; e.g. through a highlight-and-number-lines view, and then let the graph of (improved!) coverage vs log-time speak for itself.

@Zac-HD
Owner

Zac-HD commented Dec 23, 2024

(and, uh... it really sucks to have done a bunch of work and then have me pop in and argue against including it. I'm sorry it took me so long to reply to the conversation; I really appreciate your engagement and contributions, and hope this isn't offputting 😭)

@CyberiaResurrection
Author

That is a bit off-putting, @Zac-HD . Could some of my work in this PR be salvaged? (eg the singleton/doubleton counts)

@CyberiaResurrection
Author

I wrote up a longer note in #11 (comment), but basically I'm not convinced that we can estimate reachable coverage accurately enough to be useful. Papers on this tend to start thousands of inputs into the run, and still see +100% / -50% variation against ground truth.

How about only reporting estimated coverage once we're (say) 10k inputs into a run? With one instance I've got locally, I'm getting 20k inputs into a run on 24.9.1 before 3 minutes are up.

@Zac-HD
Owner

Zac-HD commented Jan 6, 2025

(sorry for vanishing; I'm freshly back from a long family camping trip)

Coming back to this again, I'm going mostly off this paper - I think that showing estimates after 10k examples would make sense. I'd also be inclined to go for the jack-knife estimators over Chao, but better yet would be if we can record the number of n-tons (in the current run) for n=1,2,3... and thus compute whatever estimators we want after the fact.

(working out how to compute a decent estimator across maybe-different restarts seems pretty annoying, really, but it's something I'd like to try eventually)
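
For reference, the second-order jack-knife estimator discussed here has a standard closed form over the singleton and doubleton counts. A sketch of the textbook (Burnham–Overton) form, not necessarily exactly what the PR implements:

```python
def jackknife2_estimate(observed: int, f1: int, f2: int, n: int) -> float:
    """Second-order jack-knife estimate of total branch richness.

    observed -- distinct branches seen so far
    f1, f2   -- singleton and doubleton counts
    n        -- number of inputs (samples) drawn so far
    """
    if n < 2:
        return float(observed)
    return (
        observed
        + f1 * (2 * n - 3) / n
        - f2 * (n - 2) ** 2 / (n * (n - 1))
    )
```

For large n this approaches the familiar approximation `observed + 2*f1 - f2`.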

@Liam-DeVoe
Collaborator

Liam-DeVoe commented Jan 6, 2025

sorry for dropping off this conversation as well; zac knows this area better than I do, so I'll follow his recommendation here. I do think we should soon focus on better tests and heuristics for coverage exclusion, as when testing this locally I saw some somewhat worrying one-off compiled lines getting classified as singletons sometimes-but-not-always (which is not the fault of this pull!).

@Zac-HD
Owner

Zac-HD commented Jan 6, 2025

Yeah, memoization will create a lot of "fake singletons" if handled naively; we probably need to do some coverage-stability stuff and only consider n-tons in terms of the number of distinct inputs that trigger them (rather than raw observation count) and maybe also only for stable coverage. See #5 for brief notes on that.

@CyberiaResurrection
Author

@Zac-HD , I'll add tripleton and quadrupleton counts (since they were used in at least one measure in at least one of Boehme's papers), then change the bound from Chao to (second-order?) jack-knife.

At the moment I have no idea how to tweak the docs to account for this change.

Boehme notes (in "Reachable Coverage: Estimating Saturation in Fuzzing") that all SOTA estimators have serious biases until many thousands of runs are gathered. This commit attempts to mitigate the worst of those biases by waiting for 10,000 runs to be gathered.
@CyberiaResurrection CyberiaResurrection changed the title Implement Chao lower bound on number of branches Implement second-order jack-knife lower bound on number of branches Jan 7, 2025
@CyberiaResurrection
Author

as I saw some somewhat worrying one-off compiled lines getting classified as singletons sometimes-but-not-always when testing this locally (which is not the fault of this pull!).

@tybug , this is with singletons requiring two hits, as per the property definition?

@Liam-DeVoe
Collaborator

That's with singletons defined as one execution. Are singletons defined as two executions in this pull to avoid this issue? I'd prefer to improve the underlying coverage counts first if so (with e.g. coverage blacklists or zac's suggestions) before reporting a metric which is actually based on n+1-tons and may be quite off as a result.

@CyberiaResurrection
Author

That's with singletons defined as one execution. Are singletons defined as two executions in this pull to avoid this issue?

Yes, in this PR, singletons require two hits, doubletons three, etc.

I'd prefer to improve the underlying coverage counts first if so (with e.g. coverage blacklists or zac's suggestions) before reporting a metric which is actually based on n+1-tons and may be quite off as a result.

Well, for a start, @tybug and @Zac-HD , what sorts of coverage should be blacklisted?

@Zac-HD
Owner

Zac-HD commented Jan 11, 2025

Ah, I think we have a confusion of definitions here. If I'm right, I think

  • @CyberiaResurrection is defining a singleton as "a branch which has been observed in exactly two executions" (to avoid spurious cases where we're e.g. populating a cache)
  • I think we need to define a singleton as "a branch which is reproducibly executed by exactly one input"

So... we need to track some notion of "which inputs have executed this branch", at least for branches where that number is small, so that we can count the singletons! It's probably OK to revert back to counting number of times executed once we've seen >5 distinct inputs hit the branch; that's much cheaper and at that point we can assume some diversity I think.

  • currently
    • on any execution where we observe a previously-unseen branch, add it to the seed pool. Assume that replaying inputs will always have the same coverage, which is false (most often due to populating and then hitting a cache)
  • alternative
    • "forkserver trick" - execute each iteration in a fresh process. We'd never have cache hits from previous inputs (for better and worse), but it's tricky to set up + has performance implications + doesn't really work on Windows
  • my proposal
    • when we observe a new branch, don't add that input to the pool immediately - first try to replay it. See also my comments on "Report coverage stability in the dashboard" (#5).
      • if we get the same coverage the second time, add it to the pool.
      • otherwise, replay again - if the second to fourth are consistent and also new, add it to the pool for that rather than the first coverage.
      • if it's just inconsistent, add it to a new flaky-coverage-pool; it's still worth exploring sometimes I guess even though it's unclear how.
      • because we're working off the number of distinct inputs now, we need to know which inputs led to this branch. .arc_counts must therefore track set[hash-of[choice-sequence]]. For efficiency, I'd switch to | int once the set reaches five or maybe ten elements - this is going to be annoying to deal with, so maybe we make a custom class with nice accessors?
      • A singleton is then a branch with exactly one hash in the set, doubleton has two, etc.
      • Using uint32 hash digests so that we can afford to persist this between restarts etc. That's mostly useful when we're checking whether something changed between runs.
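
The replay-before-pooling step in the proposal above might look roughly like this (function name and return conventions are hypothetical, not existing Hypofuzz APIs):

```python
def classify_new_coverage(execute, choices, first_coverage, replays=3):
    """Replay an input that produced new coverage before trusting it.

    Returns ("stable", coverage) if the replays agree on a coverage set,
    or ("flaky", None) if coverage keeps changing (e.g. cache effects).
    """
    observed = [execute(choices) for _ in range(replays)]
    if all(cov == first_coverage for cov in observed):
        # Same coverage on replay: safe to add to the pool as-is.
        return ("stable", first_coverage)
    if all(cov == observed[0] for cov in observed):
        # The first run likely warmed a cache; the replays agree with
        # each other, so pool the input under the replayed coverage.
        return ("stable", observed[0])
    # Inconsistent across replays: route to a flaky-coverage pool.
    return ("flaky", None)
```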

The downside is basically that this is a bunch more work, but I think that's what it takes to get a useful estimator (and incidentally some progress towards #5 🙏).

@CyberiaResurrection
Author

@Zac-HD - is reporting the 2nd-order jack-knife estimator as in this PR, after 10k inputs, a worthwhile, if small, intermediate step towards better estimators?

Are you wanting to track inputs triggering specific branches because (iirc) that's how AFL et al do it?

@Zac-HD
Owner

Zac-HD commented Jan 12, 2025

is reporting the 2nd-order jack-knife estimator as in this PR, after 10k inputs, a worthwhile, if small, intermediate step towards better estimators?

Yes! I'm keen to get the estimator in. I also think it's important to feed it the correct observational data - GIGO, as the saying goes.

Are you wanting to track inputs triggering specific branches because (iirc) that's how AFL et al do it?

More "I want to do it the way that works, and AFL et al happen to do the same thing because it works". Imagine for example that Hypofuzz immediately replayed each input which reached new coverage (we don't currently do this, but we might randomly replay some old inputs soon). If we're using execution count to define singletons then there won't be any!

So we'd better switch to tracking the number of distinct inputs; we can safely assume that the inputs we're predicting will be unique (not quite true but close enough, and we can multiply by the duplication-rate after estimating if we want a correction), but we can't assume that past inputs are all unique because Hypothesis will moderately-often replay past inputs for various reasons including those in #5.

@CyberiaResurrection
Author

@Zac-HD , fair enough, you've made your point. Do you have any suggestions for tracking number of distinct inputs?

@Zac-HD
Owner

Zac-HD commented Jan 14, 2025

Do you have any suggestions for tracking number of distinct inputs?

@tybug - what's the best way to derive a hash from a choice sequence these days? set[bytes] worked with the buffer, but we want to treat False/0/0.0 distinctly.

and then instead of self.overall_arc_counts = Counter(), we'll want to make it a dict[Arc, int | set[the-hash-above]], and:

def add_branch_counts(counts, branches, hash_of_this_input, *, limit=5) -> None:
    for branch in branches:
        seen = counts.setdefault(branch, set())
        if isinstance(seen, int):
            # Already collapsed: just count raw executions from here on.
            counts[branch] += 1
        else:
            seen.add(hash_of_this_input)
            if len(seen) >= limit:
                # Enough distinct inputs seen - collapse the set to an int.
                counts[branch] = limit

@Liam-DeVoe
Collaborator

choices_key([1, True, ""]) (added/renamed in HypothesisWorks/hypothesis#4241) is intended for this purpose - and I am now realizing that it does not distinguish 0/1 and False/True, so I'll fix that in that pull by adding an extra key to bools.
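
As an illustration of the type-tagging idea (the real mechanism is Hypothesis's `choices_key`; this standalone sketch just shows why tagging values with their type avoids the 0 == False == 0.0 collisions, and truncates to a uint32 as suggested above):

```python
import hashlib

def choice_sequence_hash(choices) -> int:
    """Stable uint32 hash of a choice sequence, distinguishing 0/0.0/False.

    Tagging each value with its type name defeats Python's cross-type
    equality; sha256 keeps the digest stable across process restarts
    (unlike the salted built-in hash()).
    """
    key = "\x00".join(f"{type(c).__name__}:{c!r}" for c in choices)
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:4], "big")
```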
