266 changes: 266 additions & 0 deletions premerge/premerge-advisor.md
@@ -0,0 +1,266 @@
# Premerge Advisor

## Introduction

While the premerge testing infrastructure is generally reliable, tests that are
broken at HEAD or that are flaky can significantly degrade the user experience.
Reporting failures unrelated to the PR being worked on can lead to warning
fatigue. Warning fatigue can self-reinforce when people land failing tests into
main believing they are unrelated, which causes more false-positive failures on
other premerge testing invocations and, in turn, more warning fatigue. To
address these issues we propose implementing a "premerge advisor". The premerge
advisor runs after premerge testing and signals to the user whether the failures
can be safely ignored because they are flaky or broken at head, or whether they
should be investigated further.

## Background/Motivation

The usefulness of the premerge system is significantly impacted by how much
people trust it. People trust the system less when the issues reported to them
are unrelated to the changes that they are currently working on. False positives
occur regularly, whether due to flaky tests or to tests already being broken at
head (perhaps by a previous commit). Efforts to fix these issues at the source,
keeping main always green and deflaking tests, are ongoing, but they are not a
scalable solution to the problem of false-positive failures. It is also not
reasonable to expect PR authors to spend time familiarizing themselves with all
known flaky tests and to dig through postcommit testing logs to see if the
failures in their premerge run also occur in main. These alternatives are
further explored in the section on
[alternatives considered](#alternatives-considered). Having tooling to
automatically run through the steps that one would otherwise need to perform
manually ensures the analysis on every failed premerge run is thorough and
likely to be correct.

## Design

The premerge advisor will consist of three main parts: jobs uploading failure
information, a web server and database to store and query that information, and
tooling to write a verdict about the failures into comments on GitHub. When a
premerge or postcommit job encounters build/test failures, it will upload
details about each failure, such as the test/build action that failed and the
exact log, to the web server. The web server will then store this information in
a format that makes it easy to query later. Every premerge run that encounters
build/test failures will then query the web server to see if there are any
matching build/test failures for the version of `main` that the PR has been
merged into. If there are, the web server can respond with the failure
information and signal that the failures can be safely ignored for that premerge
run. If the failures are novel, the web server can signal that they should be
investigated more thoroughly. If there are failures, the premerge advisor can
then write a comment on the PR explaining its findings.
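
As a concrete illustration, the failure information a job uploads might look
roughly like the following. This is a sketch only; the field names and values
are assumptions for illustration, not a finalized schema.

```python
# Hypothetical example of the failure information a premerge job might upload.
# Field names and values are illustrative and do not represent a final schema.
example_failure_upload = {
    "source_type": "premerge",          # "premerge" or "postcommit"
    "base_commit_sha": "abc123",        # placeholder SHA of the base commit in main
    "pull_request": 123456,             # premerge only; omitted for postcommit runs
    "source_id": "17290018742",         # workflow run ID or buildbot run number
    "platform": "linux-x86_64",
    "failures": [
        {
            "test_file": "llvm/test/SomeDir/some_test.ll",
            "failure_message": "error: expected <something> but got <something else>",
        }
    ],
}
```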

### Processing and Storing Failures

A key part of the premerge advisor is infrastructure to store and process build
failure information so that it can be queried later. We plan on having jobs
extract failure logs and upload them to a web server. In addition to an endpoint
that accepts uploads of failure information without returning any analysis, the
web server will have an explanation endpoint that accepts test failure
information (logs and file names) and returns whether the failures are broken at
`main`, flaky, or novel failures caused by the PR.

For the premerge jobs running through GitHub Actions, we plan on using the
existing `generate_test_report` scripts that currently generate summaries on job
failures. When a job ends with a failure, a script will run that uses the
`generate_test_report` library to extract failure information and then uploads
that information to the web server. Information on how the premerge jobs will
query the server and display results about flaky/already-broken tests is in [the
section below](#informing-the-user). We plan on having both the premerge and
[postcommit jobs](post-submit-testing.md) upload failure information to the web
server. This lets the web server learn about failures that are not the result of
mid-air collisions before postcommit testing has completed. Postcommit testing
can take upwards of 30 minutes, during which the premerge advisor would
otherwise be unable to advise that these failures are due to a broken `main`.
Data from both postcommit and premerge testing will be associated with the SHA
of the base commit in `main`. Premerge testing data will additionally have a PR
associated with it to disambiguate between failures occurring in `main` at a
specific commit and failures only occurring in a PR. All failure information
will be stored in the database. When testing for a PR has completed and there
are failures, premerge will query the advisor to explain any outstanding
failures. The advisor will compare the logs from the premerge run against the
database and return its results, namely whether any of the failures exist at
`HEAD` or are known flakes.
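
A minimal sketch of the job-side flow under these assumptions follows. The
endpoint paths, payload fields, and the idea that failures arrive pre-extracted
from the `generate_test_report` library are all assumptions, not the final
interface.

```python
"""Sketch of the job-side premerge advisor client. Endpoint paths and payload
fields are assumptions rather than a finalized interface."""
import os

import requests

# Placeholder address for the cluster-local advisor instance.
ADVISOR_URL = os.environ.get("PREMERGE_ADVISOR_URL", "http://premerge-advisor.example")


def report_and_explain(failures: list[dict], base_commit_sha: str, pr_number: int) -> dict:
    """Upload failure info extracted from the job logs, then ask the advisor to explain it."""
    payload = {
        "source_type": "premerge",
        "base_commit_sha": base_commit_sha,
        "pull_request": pr_number,
        "platform": "linux-x86_64",
        "failures": failures,  # e.g. produced by the generate_test_report library
    }
    # Store the failures so future runs can match against them.
    requests.post(f"{ADVISOR_URL}/failures", json=payload, timeout=30).raise_for_status()
    # Ask whether the failures are already broken at main, flaky, or novel.
    response = requests.post(f"{ADVISOR_URL}/explain", json=payload, timeout=30)
    response.raise_for_status()
    return response.json()  # e.g. {"verdicts": [{"test_file": ..., "verdict": "flaky"}]}
```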

We plan on implementing the web server in Python with the `flask` library. All
contributors to the premerge infrastructure are already familiar with Python,
building web servers in Python with `flask` is relatively easy, and we do not
need high performance or advanced features. For storage, we plan on using
SQLite, as it has support built into Python, does not require any additional
infrastructure setup, and is reliable.
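
A minimal sketch of what such a Flask server could look like, assuming the
`/failures` and `/explain` endpoint names used above; the storage and
classification helpers here are stubs, not the real implementation.

```python
"""Sketch of the advisor web server. Endpoint names and helper functions are
assumptions; the real schema and matching logic are described later in this doc."""
from flask import Flask, jsonify, request

app = Flask(__name__)


def store_failures(payload: dict) -> None:
    """Stub: the real version would persist the failure rows to SQLite."""


def classify_failure(failure: dict) -> dict:
    """Stub: the real version would compare against stored failures
    (see the classification section below)."""
    return {"test_file": failure["test_file"], "verdict": "novel"}


@app.post("/failures")
def upload_failures():
    # Accept failure information from premerge/postcommit jobs without analysis.
    store_failures(request.get_json())
    return "", 204


@app.post("/explain")
def explain_failures():
    # Return a verdict for each failure: broken at main, flaky, or novel.
    payload = request.get_json()
    return jsonify({"verdicts": [classify_failure(f) for f in payload["failures"]]})
```
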
Comment on lines +84 to +89

Contributor:

Could we consider going without the DB? A DB, even SQLite, means we have a
schema, and if we change it, we have a migration path.

What we need is history on the N last commit runs on main. Could this be stored
in memory in the flask process? (assuming memory shared across the N workers if
we use waitress)

It means if we restart the cluster, the first commit won't have history, but
this shouldn't happen that often and would greatly simplify incremental work on
this advisor.

Contributor Author:

Whether we keep things in memory or inside a DB, we still need a schema. The
schema that we would need is also not complicated. We just need the file, the
workflow run number/commit SHA, the failure message, and maybe a
timestamp/otherwise unique ID. Keeping the data for all of eternity also isn't
that necessary. We can just dump the existing DB and start fresh, just like we
would if we were storing info in memory and the server restarted.

Storing the information in memory makes it much harder to keep track of flaky
tests over long periods of time, and we lose all information on flaky tests
every time the server restarts, which isn't ideal. Server restarts are normal
and expected with k8s, so the design needs to take them into account.

Given we have two identical clusters that need to be able to function
independently of each other, we also need to duplicate the web server for the
premerge advisor. Queries will go to the cluster-local premerge advisor web
server. Failure uploads will write directly to both premerge advisor containers.
This could lead to situations where uploading to one server succeeds while the
write to the other fails, resulting in inconsistent state between the clusters.
This is acceptable because, in the rare event we do end up with inconsistent
state, the worst outcome is that the clusters return inconsistent results.
Uploading directly to both premerge containers and explicitly allowing
inconsistent state means we do not need to maintain any state
replication/consistency machinery, which would add significant complexity and
maintenance costs.
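
A sketch of how an upload could tolerate one cluster being unreachable; the
cluster addresses are placeholders and the error handling is illustrative only.

```python
import requests

# Placeholder addresses for the per-cluster advisor instances.
ADVISOR_URLS = [
    "http://advisor.cluster-a.example",
    "http://advisor.cluster-b.example",
]


def upload_to_all_clusters(payload: dict) -> None:
    """Best-effort upload to both advisor instances; a failed upload to one
    cluster is logged and tolerated rather than rolled back."""
    for url in ADVISOR_URLS:
        try:
            requests.post(f"{url}/failures", json=payload, timeout=30).raise_for_status()
        except requests.RequestException as err:
            print(f"warning: failed to upload failure info to {url}: {err}")
```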

For the database, we will have two tables: one called `failures` to store
failure information, and one called `commits` that maps commit SHAs to SVN-like
indices. The `commits` table will simply have two columns, mapping SHAs to
indices. The `failures` table is designed to have rows describing individual
test failures, with the following schema (a sketch of the corresponding SQLite
tables follows the list):

* `source_type` - Whether the failure information comes from a postcommit run or
from a premerge run.
* `base_commit_sha` - The SHA of the base commit. For premerge runs, this is the
commit in `main` (or the user branch) that the PR is based on top of. For
postcommit runs, this is simply the SHA of the commit being tested.
* `commit_index` - The index of the base commit, assuming the base commit is in
`main`.
* `source_id` - The ID of the source. For buildbot runs, this is the buildbot
  run number. For premerge runs, this is the GitHub Actions workflow run ID.
* `test_file` - The test file/compilation unit where the failure occurred.
* `failure_message` - Any failure message text associated with the failure.
* `platform` - The platform (including OS and architecture) where the failure
occurred.
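
A sketch of what the two tables might look like in SQLite; column types, and any
column names beyond those listed above, are assumptions.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS commits (
    sha TEXT PRIMARY KEY,  -- commit SHA in main
    idx INTEGER NOT NULL   -- monotonically increasing, SVN-like index
);

CREATE TABLE IF NOT EXISTS failures (
    id              INTEGER PRIMARY KEY AUTOINCREMENT,
    source_type     TEXT NOT NULL,  -- 'premerge' or 'postcommit'
    base_commit_sha TEXT NOT NULL,
    commit_index    INTEGER,        -- NULL if the base commit is not in main
    source_id       TEXT NOT NULL,  -- workflow run ID or buildbot run number
    test_file       TEXT NOT NULL,
    failure_message TEXT NOT NULL,
    platform        TEXT NOT NULL
);
"""

connection = sqlite3.connect("premerge_advisor.db")
connection.executescript(SCHEMA)
```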

### Classifying Test Failures

When the web server is sent a request to explain a list of failures, it needs to
be able to determine whether a failure has occurred previously. To do this, the
web server will keep a list of currently active failures in `main` and a list of
flaky tests. The list of flaky tests will be computed on startup from a rolling
window of historical data. The exact heuristic is left to implementation, as it
will likely require some experimentation, but it will look at whether the test
has been failing intermittently over a long period of time. The list of
currently active failures will initially be taken from the last known postcommit
run. If a new postcommit run shows additional failures that are not in the flake
list, they will be added to the currently active failure list. Failures in the
currently active list that are not present in the latest postcommit run will be
removed from it. In the future, we want to look at information from PRs to
reduce the latency of failures/passes making their way into the system, since
postcommit testing currently has a minimum latency of about 30 minutes.

Failures will be identified as being the same through a combination of the
test name and likely a fuzzy match of the test error output. The exact details
are left to implementation as they will likely require some experimentation
to get right.
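
One way the matching could look, assuming `difflib`-based fuzzy matching and an
in-memory view of the currently active failures and flaky tests; the threshold,
data structures, and verdict strings are illustrative rather than a committed
design.

```python
from difflib import SequenceMatcher

# Illustrative threshold; the real heuristic is left to implementation.
MESSAGE_SIMILARITY_THRESHOLD = 0.8


def messages_match(message_a: str, message_b: str) -> bool:
    """Fuzzy-match two failure messages."""
    return SequenceMatcher(None, message_a, message_b).ratio() >= MESSAGE_SIMILARITY_THRESHOLD


def classify_failure(failure: dict, active_failures: list[dict], flaky_tests: set[str]) -> str:
    """Return 'flaky', 'broken-at-main', or 'novel' for a single premerge failure."""
    if failure["test_file"] in flaky_tests:
        return "flaky"
    for known in active_failures:
        if known["test_file"] == failure["test_file"] and messages_match(
            known["failure_message"], failure["failure_message"]
        ):
            return "broken-at-main"
    return "novel"
```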

These tradeoffs do leave open the possibility that we incorrectly identify tests
as being broken at head in certain extenuating circumstances. Consider a
situation where someone lands commit A breaking a test, immediately afterwards
lands commit B fixing that test, and then a PR, C, is opened around the same
time that also breaks the test in the same way. If commit B was directly pushed,
or merged without waiting for premerge to finish, then the test failure will
still be in the currently active failure list. The failure for PR C will then be
marked as failing at HEAD despite it not actually failing at HEAD. Given the low
probability of this occurring, since it requires the same test to be broken, an
extremely similar error message, and exact timing, we deem this an acceptable
consequence of the current design. Alleviating it would mean waiting on
postcommit results, and their latency would cause many more failures that really
are broken at `main` to be reported as true failures in the meantime.

### Informing the User

Once a workflow has completed, no actions related to the premerge advisor will
be performed if there are no failures. If there are failures, they will be
uploaded to the web server. Afterwards, the premerge workflow then makes a
request to the server asking if it can explain the failures as either existing
in `main` or as flaky.

After the response from the server has been received, the workflow will
construct a comment. It will contain the failure information and, where
relevant, information/links showing that the tests are flaky (and the flakiness
rate) or are broken in `main`. We plan to construct the comment in a manner
similar to the code formatting action: we will create one comment on the first
workflow failure and then update that comment every time we get new data. This
prevents creating too much noise. It does mean the comment might get buried in
long-running reviews, but those are not the common case, and people can learn to
look for the comment earlier in the thread in such cases.

If all of the failures in the workflow were successfully explained by the
premerge advisor as flaky or already broken in `main`, then the premerge
workflow will be marked as successful despite the failures. This will be
achieved by always marking the build/test step as successful; the premerge
advisor will then exit with a non-zero exit code if it comes across
unexplainable failures.
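
A sketch of how the workflow could turn the advisor's response into a comment
body and an exit code; the verdict strings and response shape follow the
hypothetical sketches above, not a finalized interface.

```python
import sys


def build_comment(verdicts: list[dict]) -> str:
    """Render the advisor's findings as a Markdown comment body."""
    lines = ["## Premerge advisor results", ""]
    for verdict in verdicts:
        lines.append(f"- `{verdict['test_file']}`: {verdict['verdict']}")
    return "\n".join(lines)


def finish(verdicts: list[dict]) -> None:
    # The real implementation would create or update a PR comment with this body.
    print(build_comment(verdicts))
    unexplained = [v for v in verdicts if v["verdict"] == "novel"]
    # Exit non-zero only if some failures could not be explained as flaky or
    # already broken in main, so fully-explained runs stay green.
    sys.exit(1 if unexplained else 0)
```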

## Alternatives Considered
Comment:

It looks like there are a number of off-the-shelf commercially available and
open source flake detector systems. Here's a quote from Gemini:

> A unit test flake detector identifies tests that pass and fail inconsistently
> without code changes by rerunning tests multiple times and analyzing the
> results for differing outcomes. These tools use historical data and CI/CD
> pipeline logs to flag unreliable tests, with features like auto-rerunning,
> historical analysis, and test quarantine for managing flagged tests. Popular
> examples include features within CI platforms like Azure DevOps, Mergify, and
> TeamCity, as well as third-party tools like Trunk.io and the
> flaky-tests-detection PyPI package.

I think it's worth discussing the build/buy/re-use tradeoffs here with at least
a few of these.

Contributor Author:

Added a section to cover this. This actually ended up being a pretty useful
exercise.

trunk.io looks almost exactly like what we want, other than that it (along with
all the other alternatives) only works with flaky tests rather than failures at
head. We really need both to create a cohesive solution for LLVM.

There are a few alternatives to this work. One would be to do nothing and let
users figure out these failures on their own, potentially with documentation to
better inform them of the process. Another would be to work on keeping `main`
green all the time and deflaking tests rather than building a premerge advisor.
A third would be to adopt an existing off-the-shelf system. Each of these
alternatives is considered below.

### Deflaking Tests and Keeping Main Green
Contributor:

Since this section appears to be describing something we considered but did not
do (and are not planning on doing?), I would prefix the section title with
"Alternative: ". Just a suggestion.

Contributor Author:

It's a subtitle under the Alternatives Considered section, so it should be clear
that it is nested within that section when actually rendered (although that does
depend on the markdown renderer).

Instead of putting effort into building the premerge advisor, we could also be
putting effort into deflaking tests and making process changes to ensure `main`
is not broken. These fixes have the bonus of being useful for more than just
premerge, also improving reliability for buildbots and any downstream testing.
While we probably could achieve this at least temporarily with process changes
and a concentrated deflaking effort, we do not believe this is feasible or
scalable.

In order to ensure that `main` is not broken by new patches, we would need to
test every commit directly on top of the current `main` before landing it;
otherwise "mid-air collisions" remain possible, where a patch passes premerge
testing but fails on `main` when landed because the patch in its original state
is not compatible with the new state of `main`. Preventing these would most
likely require merge queues, which would introduce new CI load and are also not
compatible with LLVM's existing practices.

A one-time effort to deflake tests is also not scalable from an engineering
perspective. While we might be able to deflake existing tests, additional flaky
tests will get added in the future, and it is likely not feasible to dedicate
enough resources to deflake them. Policy improvements around reverting patches
that introduce flaky tests might make this more scalable, but this relies on
quick detection of flaky tests, which can be difficult for tests that fail
flakily only very rarely.

### Not Doing Anything

Alternatively, we could not implement this at all. This system adds quite a bit
of complexity and introduces new failure modes, and false-positive failures are
not that frequent. However, even the relatively small percentage of false
positives we are observing significantly impacts trust in the premerge system,
which compounds the problem: people learn not to trust the results, ignore true
failures caused by their patch, and land it anyway, causing many downstream
failures. The frequency of these incidents (typically multiple times per week)
means that most LLVM developers are likely to run into this class of issue
sooner or later.

The complexity is also well confined to the components specific to this new
infrastructure, and the new failure modes can be mitigated through proper error
handling at the interface between the existing premerge system and the new
premerge advisor infrastructure.

### Using an Existing System

There are several platforms that offer post-workflow test analysis around issues
like flaky tests, including [trunk.io](https://trunk.io),
[mergify](https://mergify.com), and
[DataDog Test Optimization](https://docs.datadoghq.com/tests/). However, only
one of these platforms, trunk.io, supports marking jobs as green when the only
tests that failed were flaky, through its
[quarantining feature](https://docs.trunk.io/flaky-tests/quarantining). Even
then, it supports little customization of how PRs are commented on (commenting
is either always on, even for successful runs, or always off), and our
notification story is of high importance. We want the signal-to-noise ratio of
premerge notifications to be high: the system should simply say nothing if
everything has gone well.

In addition to being able to mark jobs as green if only flaky tests failed, we
also need to be able to keep track of failures in `main` and not block merging
on them. None of the systems considered support this functionality; they assume
that `main` should always be green (through mechanisms like a merge queue) or
will be fixed quickly. Due to LLVM's scale and contribution norms, keeping
`main` always green is hard to enforce, so we need to support this feature.

The cost of using these systems is also nontrivial given how frequently the
premerge pipeline runs and the number of tests involved. These systems are
typically priced per "test span" (a single test invocation), with mergify,
trunk.io, and DataDog pricing ingestion at $3 per 1M test spans. We run ~300k
tests per premerge run. Using a conservative estimate of 400 premerge runs per
weekday and no runs on weekends, we end up with a cost of almost $100k per year,
not including any potential open source discounts.
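
For reference, a back-of-the-envelope version of that estimate using the figures
above:

```python
tests_per_run = 300_000        # ~300k tests per premerge run
runs_per_weekday = 400         # conservative estimate, no weekend runs
weekdays_per_year = 5 * 52
cost_per_million_spans = 3     # USD per 1M test spans

spans_per_year = tests_per_run * runs_per_weekday * weekdays_per_year
annual_cost = spans_per_year / 1_000_000 * cost_per_million_spans
print(f"${annual_cost:,.0f} per year")  # roughly $93,600, i.e. almost $100k
```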