Store and Access HVG Gene Names in AnnData #246

nick-youngblut · 2025-12-19T22:41:34Z

Summary

This PR enhances STATE's handling of highly variable genes (HVGs) by storing gene names directly in AnnData objects alongside the HVG expression matrix. This enables downstream tools like pdex to properly map predictions back to gene IDs without requiring additional metadata files.

Changes

Core Infrastructure

- New constants module (src/state/tx/constants.py): Centralizes shared constants for TX workflows
- New HVG utilities (src/state/tx/utils/hvg.py): Provides functions to retrieve and validate HVG gene names with fallback mechanisms

Preprocessing Enhancements

- Enhanced preprocess_train: Now stores HVG gene names in adata.uns["X_hvg_var_names"] alongside the HVG matrix in adata.obsm["X_hvg"]
- Updated inference preprocessing: Added validation and a warning when HVG names are missing

CLI Improvements

- Enhanced infer command: Added --verbose flag to show HVG name mapping details and status reporting
- Updated predict command: Preserves HVG names in prediction outputs

Documentation & Migration

- New repository guidelines (AGENTS.md): Comprehensive development guidelines
- Migration guide (docs/migration/hvg_var_names.md): Backward compatibility notes and backfill script for existing data
- Updated README: Added section on accessing HVG gene names

Testing

- New test suites: Comprehensive tests for HVG utilities, inference pipeline, prediction outputs, and preprocessing workflows
- All existing tests pass: No regressions introduced

Backward Compatibility

This change is fully backward compatible:

- Existing preprocessed data: Inference commands continue working without modification
- Non-blocking warnings: Users are notified when HVG names are missing, but execution proceeds
- Fallback mechanisms: Code can still recover gene names from adata.var.highly_variable when available
- No API changes: Existing workflows continue functioning unchanged
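The fallback order described above can be sketched roughly as follows. This is an illustration only, not the code in src/state/tx/utils/hvg.py: the helper name `get_hvg_names_or_warn` and the `SimpleNamespace` stand-ins for AnnData objects are assumptions made for the example. Only the `uns["X_hvg_var_names"]` key and the `var.highly_variable` fallback come from the description.

```python
import logging
from types import SimpleNamespace

import numpy as np

logger = logging.getLogger("hvg_fallback_example")


def get_hvg_names_or_warn(adata):
    """Return HVG names, or None (with a warning) if they cannot be recovered.

    Hypothetical helper: `adata` is anything exposing AnnData-like
    `.uns`, `.var`, and `.var_names` attributes.
    """
    # 1. Prefer the explicit key written by newer preprocessing runs.
    names = adata.uns.get("X_hvg_var_names")
    if names is not None:
        return np.asarray(names, dtype=object)

    # 2. Fall back to the boolean HVG mask in `var`, when it exists.
    if "highly_variable" in adata.var:
        mask = np.asarray(adata.var["highly_variable"], dtype=bool)
        return np.asarray(adata.var_names, dtype=object)[mask]

    # 3. Nothing to recover: warn, but let the caller keep going.
    logger.warning("HVG names not found; outputs will lack gene labels")
    return None


# Stand-ins for three kinds of input (toy gene names, assumed layout).
new_file = SimpleNamespace(uns={"X_hvg_var_names": ["G1", "G3"]}, var={}, var_names=[])
old_file = SimpleNamespace(
    uns={},
    var={"highly_variable": [True, False, True]},
    var_names=["G1", "G2", "G3"],
)
bare_file = SimpleNamespace(uns={}, var={}, var_names=[])

from_uns = get_hvg_names_or_warn(new_file)   # explicit key wins
from_var = get_hvg_names_or_warn(old_file)   # recovered from the mask
missing = get_hvg_names_or_warn(bare_file)   # warning only, no exception
```

Returning `None` instead of raising is what keeps the warning non-blocking; callers that need names can check for `None` themselves.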

Usage Examples

Accessing HVG Gene Names

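import anndata as ad
import pandas as pd

# After preprocessing with the latest STATE version
adata = ad.read_h5ad("preprocessed.h5ad")
hvg_names = adata.uns.get("X_hvg_var_names")
if hvg_names is None:
    # Older files lack the key; see the migration guide to backfill it
    raise KeyError("adata.uns['X_hvg_var_names'] is missing")

# Construct a downstream AnnData for tools like pdex
adata_for_pdex = ad.AnnData(
    X=adata.obsm["X_hvg"],
    obs=adata.obs,
    var=pd.DataFrame(index=hvg_names),
)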

Backfilling Existing Data

For datasets preprocessed before this change, use the backfill script in the migration guide (docs/migration/hvg_var_names.md) to add HVG names to existing files.
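For orientation, a backfill along these lines might look like the sketch below. It is not the script from the migration guide: the function name `backfill_hvg_names`, the `overwrite` parameter, and the `SimpleNamespace` stand-in are assumptions, and it presumes the file still carries a `highly_variable` mask in `var`.

```python
from types import SimpleNamespace

import numpy as np

HVG_KEY = "X_hvg"
HVG_VAR_NAMES_KEY = "X_hvg_var_names"


def backfill_hvg_names(adata, overwrite=False):
    """Write HVG names into `adata.uns` when they are absent.

    Hypothetical helper. Returns True if names were written,
    False if the key already existed and was left untouched.
    """
    if HVG_VAR_NAMES_KEY in adata.uns and not overwrite:
        return False

    # Recover names from the boolean HVG mask stored in `var`.
    mask = np.asarray(adata.var["highly_variable"], dtype=bool)
    names = np.asarray(adata.var_names, dtype=object)[mask]

    # Refuse to write names that do not line up with the HVG matrix columns.
    n_cols = adata.obsm[HVG_KEY].shape[1]
    if len(names) != n_cols:
        raise ValueError(
            f"{len(names)} HVG names recovered, but {HVG_KEY} has {n_cols} columns"
        )

    adata.uns[HVG_VAR_NAMES_KEY] = names
    return True


# Toy stand-in for an older preprocessed file (assumed layout).
demo = SimpleNamespace(
    uns={},
    obsm={"X_hvg": np.zeros((3, 2))},
    var={"highly_variable": [True, False, True]},
    var_names=["GENE_A", "GENE_B", "GENE_C"],
)
written = backfill_hvg_names(demo)
written_again = backfill_hvg_names(demo)  # second call is a no-op
```

With real files, the same function could be wrapped between `ad.read_h5ad(path)` and `adata.write_h5ad(path)`.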

Technical Details

- HVG names are stored as NumPy arrays of Python strings (object dtype, as in np.array(hvg_names, dtype=object)) for h5ad compatibility
- Naming convention {obsm_key}_var_names allows extension to other embedding types
- Validation ensures gene name arrays match the embedding's column count
- Fallback logic prioritizes explicit uns keys over implicit var-based recovery
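As a rough sketch of the naming convention and the dimension check listed above: the helper names `var_names_key` and `validated_var_names` are invented for this example and are not claimed to match the functions in src/state/tx/utils/hvg.py.

```python
from types import SimpleNamespace

import numpy as np


def var_names_key(obsm_key: str) -> str:
    """Derive the uns key for an embedding, e.g. 'X_hvg' -> 'X_hvg_var_names'."""
    return f"{obsm_key}_var_names"


def validated_var_names(adata, obsm_key: str = "X_hvg") -> np.ndarray:
    """Return the stored names after checking them against the embedding width.

    Hypothetical helper: raises KeyError when the names are missing and
    ValueError when their count differs from the number of columns.
    """
    key = var_names_key(obsm_key)
    if key not in adata.uns:
        raise KeyError(f"adata.uns[{key!r}] is missing")

    names = np.asarray(adata.uns[key], dtype=object)
    n_cols = adata.obsm[obsm_key].shape[1]
    if names.ndim != 1 or names.shape[0] != n_cols:
        raise ValueError(
            f"{key} has shape {names.shape}, expected ({n_cols},) to match {obsm_key}"
        )
    return names


# Toy AnnData stand-in: 4 cells, 3 HVGs.
demo = SimpleNamespace(
    obsm={"X_hvg": np.zeros((4, 3))},
    uns={"X_hvg_var_names": np.array(["G1", "G2", "G3"], dtype=object)},
)
names = validated_var_names(demo)
```

Because the key is derived from the `obsm` key, the same check would apply unchanged to any other embedding that stores its own `*_var_names` entry.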

Testing

All existing tests pass
New test coverage includes:
- HVG name retrieval with multiple fallback scenarios
- Inference pipeline preservation of HVG metadata
- Prediction output includes HVG names
- Preprocessing correctly stores HVG names
- End-to-end workflow validation

Risks & Mitigations

- Low risk: Fully backward compatible with existing workflows
- Data integrity: Validation ensures HVG name arrays match embedding dimensions
- Performance: Minimal overhead, since only additional metadata is stored
- Migration: Clear documentation and backfill scripts provided

Note

Implements explicit storage and propagation of highly variable gene (HVG) names to enable downstream mapping and compatibility.

- Core: Add state/tx/constants.py and state/tx/utils/hvg.py for HVG name keys, retrieval, fallbacks, and validation
- Preprocessing: tx preprocess_train now writes the HVG matrix to obsm['X_hvg'] and names to uns['X_hvg_var_names']; tx preprocess_infer logs presence and warns if missing
- Inference/Prediction: tx infer gains --verbose, reports HVG mapping status, and writes uns['X_hvg_var_names'] to outputs; tx predict includes HVG names in both adata_pred and adata_real
- Dataset: scgpt_perturbation_dataset.py defaults to reading HVG names from uns['X_hvg_var_names']
- Docs: Update README.md with HVG access examples and CLI fixes; add migration guide docs/migration/hvg_var_names.md; add AGENTS.md
- Tests: New tests for HVG utilities, preprocess (train/infer), inference pipeline, and prediction outputs

Written by Cursor Bugbot for commit 98d205d. This will update automatically on new commits.

…mmands, coding style, testing, and security

- Added a `--verbose` flag to the inference CLI for detailed gene name mapping output.
- Implemented HVG name retrieval and validation in preprocessing and prediction scripts.
- Introduced constants for HVG variable names and updated relevant functions to utilize them.
- Enhanced logging for HVG name availability and warnings for missing data.
- Updated dataset class to default to the new HVG names key.

- Added storage of highly variable gene (HVG) names in `adata.uns["X_hvg_var_names"]` for improved downstream mapping.
- Updated documentation to reflect changes in HVG gene name access and backward compatibility.
- Introduced tests to validate HVG name retrieval and ensure compatibility with existing workflows.
- Enhanced inference and preprocessing scripts to preserve HVG names during data processing.

gemini-code-assist · 2025-12-19T22:41:58Z

Summary of Changes

Hello @nick-youngblut, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the STATE framework's ability to manage Highly Variable Genes (HVGs) by ensuring their names are consistently stored and accessible within AnnData objects. This explicit storage facilitates seamless integration with downstream analysis tools that require gene ID mapping, improving data integrity and usability. The changes are implemented with full backward compatibility, providing clear migration paths and robust error handling for existing datasets.

Highlights

- HVG Gene Name Storage: Highly Variable Gene (HVG) names are now explicitly stored in AnnData objects under adata.uns['X_hvg_var_names'] alongside the HVG expression matrix, enabling direct mapping of predictions to gene IDs.
- New Utility Modules: Introduced src/state/tx/constants.py for centralized shared constants and src/state/tx/utils/hvg.py for functions to retrieve, validate, and detect preprocessing versions of HVG gene names with fallback mechanisms.
- Preprocessing Enhancements: The preprocess_train command now automatically stores HVG gene names, and preprocess_infer includes validation and warnings for missing HVG names.
- CLI Improvements: The infer command gained a --verbose flag to display HVG name mapping details, and the predict command now preserves HVG names in its output.
- Documentation and Migration: Added new repository guidelines (AGENTS.md), a detailed migration guide (docs/migration/hvg_var_names.md) with a backfill script for existing data, and updated the README.md to reflect the new HVG name access.
- Comprehensive Testing: New test suites have been added for HVG utilities, inference pipeline, prediction outputs, and preprocessing workflows, ensuring no regressions and proper functionality.

gemini-code-assist

Code Review

This pull request is a well-structured enhancement to store highly variable gene (HVG) names directly within AnnData objects, which is a valuable improvement for downstream tooling. The changes include new constants and utility modules, updates to preprocessing and CLI commands, and comprehensive documentation and testing. While the overall implementation is solid, I've identified a couple of critical issues, including a hardcoded file path and a logic error in a default parameter that could lead to incorrect behavior. I've also noted some opportunities for code simplification to improve maintainability. Addressing these points will make this a very strong contribution.

gemini-code-assist · 2025-12-19T22:43:39Z

src/state/_cli/_tx/_predict.py

            gene_names = np.load(
                "/large_storage/ctc/userspace/aadduri/datasets/tahoe_19k_to_2k_names.npy", allow_pickle=True
            )


This code contains a hardcoded absolute file path. This is a critical issue as it makes the code non-portable and will cause it to fail on any system where this specific path does not exist. This path should be removed or provided through a configuration option.

@abhinadduri
I'm not sure how to update this
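One possible direction, sketched under assumptions: take the path from a CLI flag or an environment variable and treat a missing file as "no names available" rather than failing. The flag `--gene-names-path` and the variable `STATE_GENE_NAMES_PATH` are invented for this example and do not exist in STATE; how the real `_predict.py` should be wired is for the maintainers to decide.

```python
import argparse
import os
import tempfile
from pathlib import Path
from typing import Optional

import numpy as np

# Hypothetical names, chosen for illustration only.
GENE_NAMES_FLAG = "--gene-names-path"
GENE_NAMES_ENV = "STATE_GENE_NAMES_PATH"


def resolve_gene_names_path(
    cli_value: Optional[str], env_var: str = GENE_NAMES_ENV
) -> Optional[Path]:
    """Pick the CLI value first, then the environment; None if neither is a file."""
    candidate = cli_value or os.environ.get(env_var)
    if not candidate:
        return None
    path = Path(candidate).expanduser()
    return path if path.is_file() else None


def load_gene_names(path: Optional[Path]) -> Optional[np.ndarray]:
    """Load a .npy array of gene names, or return None when no path is configured."""
    if path is None:
        return None
    # allow_pickle=True mirrors the original call; only use it on trusted files.
    return np.load(path, allow_pickle=True)


# Demo: write a toy names file, then load it through the flag.
parser = argparse.ArgumentParser()
parser.add_argument(GENE_NAMES_FLAG, default=None)

with tempfile.TemporaryDirectory() as tmp_dir:
    names_file = os.path.join(tmp_dir, "gene_names.npy")
    np.save(names_file, np.array(["G1", "G2"], dtype=object), allow_pickle=True)
    args = parser.parse_args([GENE_NAMES_FLAG, names_file])
    loaded = load_gene_names(resolve_gene_names_path(args.gene_names_path))

# Without the flag (and with an unset variable) nothing is loaded.
no_flag = parser.parse_args([])
absent = load_gene_names(
    resolve_gene_names_path(no_flag.gene_names_path, env_var="UNSET_EXAMPLE_VAR")
)
```

Returning `None` lets the caller fall back to whatever names are already stored in the AnnData object instead of crashing on machines where the file is absent.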

src/state/tx/data/dataset/scgpt_perturbation_dataset.py

src/state/_cli/_tx/_infer.py

src/state/_cli/_tx/_predict.py

cursor

src/state/_cli/_tx/_predict.py

src/state/tx/data/dataset/scgpt_perturbation_dataset.py

- Moved HVG name assignment to a single conditional block for clarity and consistency.
- Removed redundant code for HVG name storage in `adata.uns` to enhance maintainability.
- Cleaned up test file by removing unused numpy import.

- Added a `--verbose` flag to the inference CLI for detailed output on gene name mapping.
- Implemented checks and logging for the presence of highly variable gene (HVG) names during inference.
- Updated prediction script to store HVG names in `adata.uns` for improved data consistency.
- Refactored code to streamline HVG name handling and ensure compatibility with existing workflows.

- Added logic to store highly variable gene (HVG) names in `adata.uns` using the defined constant `HVG_VAR_NAMES_KEY`.
- Initialized `hvg_names` variable to handle cases where HVG names may not be present.
- Enhanced the inference process to ensure HVG names are preserved for downstream analysis.

- Added 'tasks/' to .gitignore to prevent tracking of task-related files.
- Ensured that temporary files related to tasks are excluded from version control.

- Updated conditions for storing highly variable gene (HVG) names in `adata.uns` to check that the length of `hvg_uns_names` matches the shape of prediction arrays.
- This change prevents potential mismatches and ensures data integrity during the prediction process.

- Changed the argument type of `run_tx_predict` from `ap.ArgumentParser` to `ap.Namespace` for better clarity and functionality.
- Added additional parameters in the `_make_args` function to enhance flexibility in test cases.
- Updated test cases to include a new `toml` parameter for improved configuration handling.

- Revised command examples for model inference and embedding transformation to reflect updated argument names and paths.
- Enhanced clarity in data splitting logic and configuration validation sections, including required and optional parameters.
- Added new sections for preprocessing datasets and evaluating embedding models, providing users with comprehensive guidance on usage.

cursor · 2026-01-06T18:44:23Z

src/state/_cli/_tx/_infer.py


+    # Store HVG names if available
+    if hvg_names is not None:
+        adata.uns[HVG_VAR_NAMES_KEY] = np.array(hvg_names, dtype=object)


Missing dimension validation when storing HVG names in infer

Medium Severity

When there's a dimension mismatch between the input obsm["X_hvg"] and model output (lines 840-847), sim_counts is reinitialized with the model's output dimension. However, at line 936, hvg_names (retrieved earlier from the input with potentially different length) is stored unconditionally without validating dimensions. This differs from _predict.py which validates len(hvg_uns_names) == final_preds.shape[1] before storing. The result could be an output file where uns["X_hvg_var_names"] length doesn't match obsm["X_hvg"] columns, causing incorrect gene mappings for downstream tools.

Additional Locations (1)

src/state/_cli/_tx/_infer.py#L839-L847
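A guard of the kind this comment asks for could look roughly like the sketch below. The helper name `store_hvg_names_if_aligned` is hypothetical, and `sim_counts` here is just a toy array standing in for the model output; only `HVG_VAR_NAMES_KEY` and the `np.array(hvg_names, dtype=object)` call are taken from the diff.

```python
import logging

import numpy as np

logger = logging.getLogger("hvg_infer_example")

HVG_VAR_NAMES_KEY = "X_hvg_var_names"


def store_hvg_names_if_aligned(uns, hvg_names, n_output_genes):
    """Store HVG names only when their count matches the output width.

    Hypothetical helper. Returns True when names were written to `uns`.
    """
    if hvg_names is None:
        return False
    if len(hvg_names) != n_output_genes:
        # Mismatched names would mislabel columns downstream, so skip them.
        logger.warning(
            "Skipping HVG names: %d names for %d output columns",
            len(hvg_names),
            n_output_genes,
        )
        return False
    uns[HVG_VAR_NAMES_KEY] = np.array(hvg_names, dtype=object)
    return True


# Toy model output with 3 gene columns.
sim_counts = np.zeros((5, 3))

aligned_uns, mismatched_uns = {}, {}
stored = store_hvg_names_if_aligned(aligned_uns, ["G1", "G2", "G3"], sim_counts.shape[1])
skipped = store_hvg_names_if_aligned(mismatched_uns, ["G1", "G2"], sim_counts.shape[1])
```

This mirrors the length check the comment attributes to `_predict.py`, comparing against the output width rather than the input width.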

nick-youngblut added 3 commits December 19, 2025 13:12

docs: add repository guidelines for project structure, development co…

0ff67a0

…mmands, coding style, testing, and security

nick-youngblut requested a review from a team as a code owner December 19, 2025 22:41

gemini-code-assist bot reviewed Dec 19, 2025

cursor bot reviewed Dec 19, 2025

src/state/_cli/_tx/_predict.py (resolved)

src/state/tx/data/dataset/scgpt_perturbation_dataset.py (resolved)

nick-youngblut added 8 commits December 19, 2025 14:58

Merge upstream/main - resolved conflicts with upstream versions

2a7e0bb

chore: update .gitignore to include tasks directory

0a8bd53

cursor bot reviewed Jan 6, 2026
