
Add recall@k metric to simple bench and milvus-cluster.yaml config #7

Open

idevasena wants to merge 7 commits into main from di/vdb-test

Conversation

@idevasena
Collaborator

  1. CSV Fields Updated: Added recall_at_10, recall_at_5, recall_at_1 to the output data
  2. Recall Calculation Function: New calculate_recall() function that computes recall@k between search results and ground truth (see the sketch below)
  3. Brute-Force Ground Truth: For each query batch, performs both:
    • Regular search with the index (ef: 50)
    • Brute-force search as ground truth (ef: 1000)
  4. Batch-Level Recall: Calculates recall for each query in the batch, then averages across the batch
  5. Added recall statistics:
    • Mean, median, min, max for recall@1, recall@5, recall@10
  6. Output Display: New "RECALL STATISTICS" section in the benchmark summary

Details:

  • Recall@k = (number of retrieved top-k items that appear in the ground-truth top-k) / (size of the ground-truth top-k)
  • Ground truth comes from a brute-force search (an exhaustive comparison against the collection)
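A minimal sketch of what this computation could look like (calculate_recall is the name given in the summary above; its exact signature in the patch may differ, and the batch-averaging helper is illustrative):

```python
def calculate_recall(retrieved_ids, ground_truth_ids, k):
    """Recall@k: fraction of the ground-truth top-k found in the retrieved top-k."""
    truth = set(ground_truth_ids[:k])
    if not truth:
        return 0.0
    return len(set(retrieved_ids[:k]) & truth) / len(truth)

# Batch-level recall (step 4 above): score each query in the batch, then average.
def batch_recall(batch_results, batch_ground_truth, k):
    scores = [calculate_recall(r, g, k) for r, g in zip(batch_results, batch_ground_truth)]
    return sum(scores) / len(scores)
```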

@idevasena idevasena requested a review from wvaske September 19, 2025 12:23
@wvaske
Owner

wvaske commented Oct 13, 2025

Functionally, this works, but the recall stats are going to mess with the performance numbers, since each batch now has two queries associated with it -- the actual query and the higher-limit query. I did a quick test, and this change shows throughput dropping by half, since we're doubling the queries but not counting them.

Calculating the ground truth needs to happen outside the timed portion of the benchmark. The way vectordb bench does this is by pre-calculating the ground truth for the queries it will execute. The other option would be to capture each query vector and its responses to an output file, then calculate the ground truth from that record of queries and responses after the benchmark run.
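A sketch of the pre-calculation option, assuming the query and corpus vectors are available as NumPy arrays before the run (all names here are illustrative, not from the patch):

```python
import numpy as np

def brute_force_top_k(queries, corpus, corpus_ids, k=10):
    """Exact top-k neighbours by L2 distance, computed once before the timed loop."""
    # Pairwise squared L2 distances: shape (num_queries, num_corpus_vectors).
    d2 = ((queries[:, None, :] - corpus[None, :, :]) ** 2).sum(axis=-1)
    top = np.argsort(d2, axis=1)[:, :k]
    return [[corpus_ids[j] for j in row] for row in top]

# Runs before the benchmark starts, so the timed loop only issues the ef=50 searches.
ground_truth = brute_force_top_k(query_vectors, corpus_vectors, corpus_ids, k=10)
```

For a large corpus this full broadcast would need chunking, but it shows the ordering: ground truth first, timed searches after.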

And we can't use an ANN index to calculate the ground truth, as the recall metric is effectively trying to measure whether the ANN is accurate. Capturing more results doesn't guarantee that we've got the 'ground truth', as the algorithm could be bad all the way down.

We would need to either generate the list of queries ahead of time and compute the ground truth for that set outside Milvus, or duplicate the collection but use FLAT for the index, which gives brute-force results and should give ground-truth accuracy (I think).
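For the duplicate-collection option, a rough sketch with the pymilvus ORM API (the collection and field names are made up for illustration):

```python
from pymilvus import Collection

# Hypothetical copy of the benchmark collection holding the same vectors.
gt_collection = Collection("bench_vectors_flat")
gt_collection.create_index(
    field_name="embedding",  # illustrative field name
    index_params={"index_type": "FLAT", "metric_type": "L2", "params": {}},
)
gt_collection.load()

# FLAT scans every vector, so these results are exact rather than approximate.
ground_truth = gt_collection.search(
    data=query_vectors,
    anns_field="embedding",
    param={"metric_type": "L2", "params": {}},
    limit=10,
    output_fields=["id"],
)
```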

As it stands we can't merge this in, as it greatly affects the measured performance.

What's the primary goal of the recall metric? Since we're using synthetic data we know our recall is going to be wonky.

Review comment on the search call in the patch:

```python
limit=10,
output_fields=["id"]
)
```


The way timing is measured in your patch has an issue that will affect your benchmark results. The goal of this script is to report how fast the primary search (ef=50) is.

Place the batch_end time measurement before the ground_truth search. Ideally, this should fix the performance issue!
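In loop terms, the suggested ordering would look roughly like this (assuming a pymilvus Collection named collection and a batch_queries list; variable and field names are illustrative):

```python
import time

batch_start = time.perf_counter()
results = collection.search(
    data=batch_queries, anns_field="embedding",
    param={"metric_type": "L2", "params": {"ef": 50}},
    limit=10, output_fields=["id"],
)
batch_end = time.perf_counter()  # stop the clock before any recall work

# The ground-truth search (ef: 1000 in the current patch) now falls outside the timing.
ground_truth = collection.search(
    data=batch_queries, anns_field="embedding",
    param={"metric_type": "L2", "params": {"ef": 1000}},
    limit=10, output_fields=["id"],
)
```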

@idevasena
Collaborator Author

Addressed the review comments for this PR and updated the recall calculation in the storage repo's TF_VectorDB branch; see mlcommons/storage@f9ab288. Please review. Thank you!
