@nick-youngblut nick-youngblut commented Dec 19, 2025

Summary

This PR enhances STATE's handling of highly variable genes (HVGs) by storing gene names directly in AnnData objects alongside the HVG expression matrix. This enables downstream tools like pdex to properly map predictions back to gene IDs without requiring additional metadata files.

Changes

Core Infrastructure

  • New constants module (src/state/tx/constants.py): Centralizes shared constants for TX workflows
  • New HVG utilities (src/state/tx/utils/hvg.py): Provides functions to retrieve and validate HVG gene names with fallback mechanisms

Preprocessing Enhancements

  • Enhanced preprocess_train: Now stores HVG gene names in adata.uns["X_hvg_var_names"] alongside the HVG matrix in adata.obsm["X_hvg"]
  • Updated inference preprocessing: Added validation and warning when HVG names are missing

CLI Improvements

  • Enhanced infer command: Added --verbose flag to show HVG name mapping details and status reporting
  • Updated predict command: Preserves HVG names in prediction outputs
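A rough sketch of how a `--verbose` switch could gate the HVG status report. This is not the STATE CLI itself: `build_parser` and `hvg_status_lines` are hypothetical names, the message wording is invented for illustration, and only the `--verbose` flag comes from the PR description.

```python
import argparse
from typing import List, Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical stand-in parser; only --verbose is taken from the PR."""
    parser = argparse.ArgumentParser(prog="infer-sketch")
    parser.add_argument(
        "--verbose",
        action="store_true",
        help="Show HVG name mapping details.",
    )
    return parser


def hvg_status_lines(hvg_names: Optional[Sequence[str]], verbose: bool) -> List[str]:
    """Build the status lines to report; the wording here is illustrative."""
    if hvg_names is None:
        return ["HVG names: missing (predictions cannot be mapped to gene IDs)"]
    lines = [f"HVG names: found ({len(hvg_names)} genes)"]
    if verbose:
        # Only the verbose mode prints a preview of the actual names.
        preview = ", ".join(str(name) for name in list(hvg_names)[:5])
        lines.append(f"First names: {preview}")
    return lines


if __name__ == "__main__":
    args = build_parser().parse_args()
    for line in hvg_status_lines(["GeneA", "GeneB"], args.verbose):
        print(line)
```

Note that `parse_args` returns an `argparse.Namespace`, so downstream functions would receive the parsed namespace rather than the parser.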

Documentation & Migration

  • New repository guidelines (AGENTS.md): Comprehensive development guidelines
  • Migration guide (docs/migration/hvg_var_names.md): Backward compatibility notes and backfill script for existing data
  • Updated README: Added section on accessing HVG gene names

Testing

  • New test suites: Comprehensive tests for HVG utilities, inference pipeline, prediction outputs, and preprocessing workflows
  • All existing tests pass: No regressions introduced

Backward Compatibility

This change is fully backward compatible:

  • Existing preprocessed data: Inference commands continue working without modification
  • Non-blocking warnings: Users are notified when HVG names are missing but execution proceeds
  • Fallback mechanisms: Code can still recover gene names from adata.var.highly_variable when available
  • No API changes: Existing workflows continue functioning unchanged

Usage Examples

Accessing HVG Gene Names

import anndata as ad
import pandas as pd

# After preprocessing with latest STATE version
adata = ad.read_h5ad("preprocessed.h5ad")
hvg_names = adata.uns.get("X_hvg_var_names")

# Construct downstream AnnData for tools like pdex
adata_for_pdex = ad.AnnData(
    X=adata.obsm["X_hvg"],
    obs=adata.obs,
    var=pd.DataFrame(index=hvg_names),
)

Backfilling Existing Data

For datasets preprocessed before this change, use the backfill script provided in the migration guide (docs/migration/hvg_var_names.md) to add HVG names to the existing files.

Technical Details

  • HVG names are stored as NumPy arrays of Python strings for h5ad compatibility
  • Naming convention {obsm_key}_var_names allows extension to other embedding types
  • Comprehensive validation ensures gene name arrays match embedding dimensions
  • Fallback logic prioritizes explicit uns keys over implicit var-based recovery
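The naming convention, length validation, and fallback order listed above could be sketched roughly as follows. This is not the code in src/state/tx/utils/hvg.py: `var_names_key` and `resolve_var_names` are hypothetical names, plain dicts and lists stand in for `adata.uns` and `adata.var`, and raising on a length mismatch is an assumption about how validation might behave.

```python
from typing import List, Mapping, Optional, Sequence


def var_names_key(obsm_key: str) -> str:
    """Derive the uns key from an obsm key: 'X_hvg' -> 'X_hvg_var_names'."""
    return f"{obsm_key}_var_names"


def resolve_var_names(
    uns: Mapping[str, Sequence[str]],
    obsm_key: str,
    n_columns: int,
    var_names: Optional[Sequence[str]] = None,
    highly_variable: Optional[Sequence[bool]] = None,
) -> Optional[List[str]]:
    """Return one name per embedding column, or None if none can be recovered."""
    # 1. Prefer the explicit uns entry when it exists.
    stored = uns.get(var_names_key(obsm_key))
    if stored is not None:
        names = [str(name) for name in stored]
        if len(names) != n_columns:
            # Assumption: an explicit but inconsistent entry is an error.
            raise ValueError(
                f"{var_names_key(obsm_key)} has {len(names)} names "
                f"but the embedding has {n_columns} columns"
            )
        return names

    # 2. Fall back to the implicit var-based mask, accepted only if it fits.
    if var_names is not None and highly_variable is not None:
        names = [str(n) for n, keep in zip(var_names, highly_variable) if keep]
        if len(names) == n_columns:
            return names

    return None
```

With a real AnnData object the call might look like `resolve_var_names(adata.uns, "X_hvg", adata.obsm["X_hvg"].shape[1], adata.var_names, adata.var["highly_variable"])`. The fallback assumes the columns of the HVG matrix follow the order of the masked `var` rows.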

Testing

  • All existing tests pass
  • New test coverage includes:
    • HVG name retrieval with multiple fallback scenarios
    • Inference pipeline preservation of HVG metadata
    • Prediction output includes HVG names
    • Preprocessing correctly stores HVG names
    • End-to-end workflow validation

Risks & Mitigations

  • Low risk: Fully backward compatible with existing workflows
  • Data integrity: Validation ensures HVG name arrays match embedding dimensions
  • Performance: Minimal overhead - only stores additional metadata
  • Migration: Clear documentation and backfill scripts provided

Note

Implements explicit storage and propagation of highly variable gene (HVG) names to enable downstream mapping and compatibility.

  • Core: Add state/tx/constants.py and state/tx/utils/hvg.py for HVG name keys, retrieval, fallbacks, and validation
  • Preprocessing: tx preprocess_train now writes HVG matrix to obsm['X_hvg'] and names to uns['X_hvg_var_names']; tx preprocess_infer logs presence and warns if missing
  • Inference/Prediction: tx infer gains --verbose, reports HVG mapping status, and writes uns['X_hvg_var_names'] to outputs; tx predict includes HVG names in both adata_pred and adata_real
  • Dataset: scgpt_perturbation_dataset.py defaults to reading HVG names from uns['X_hvg_var_names']
  • Docs: Update README.md with HVG access examples and CLI fixes; add migration guide docs/migration/hvg_var_names.md; add AGENTS.md
  • Tests: New tests for HVG utilities, preprocess (train/infer), inference pipeline, and prediction outputs

Written by Cursor Bugbot for commit 98d205d. This will update automatically on new commits.

- Added a `--verbose` flag to the inference CLI for detailed gene name mapping output.
- Implemented HVG name retrieval and validation in preprocessing and prediction scripts.
- Introduced constants for HVG variable names and updated relevant functions to utilize them.
- Enhanced logging for HVG name availability and warnings for missing data.
- Updated dataset class to default to the new HVG names key.
- Added storage of highly variable gene (HVG) names in `adata.uns["X_hvg_var_names"]` for improved downstream mapping.
- Updated documentation to reflect changes in HVG gene name access and backward compatibility.
- Introduced tests to validate HVG name retrieval and ensure compatibility with existing workflows.
- Enhanced inference and preprocessing scripts to preserve HVG names during data processing.
@nick-youngblut nick-youngblut requested a review from a team as a code owner December 19, 2025 22:41
@gemini-code-assist

Summary of Changes

Hello @nick-youngblut, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the STATE framework's ability to manage Highly Variable Genes (HVGs) by ensuring their names are consistently stored and accessible within AnnData objects. This explicit storage facilitates seamless integration with downstream analysis tools that require gene ID mapping, improving data integrity and usability. The changes are implemented with full backward compatibility, providing clear migration paths and robust error handling for existing datasets.

Highlights

  • HVG Gene Name Storage: Highly Variable Gene (HVG) names are now explicitly stored in AnnData objects under adata.uns['X_hvg_var_names'] alongside the HVG expression matrix, enabling direct mapping of predictions to gene IDs.
  • New Utility Modules: Introduced src/state/tx/constants.py for centralized shared constants and src/state/tx/utils/hvg.py for functions to retrieve, validate, and detect preprocessing versions of HVG gene names with fallback mechanisms.
  • Preprocessing Enhancements: The preprocess_train command now automatically stores HVG gene names, and preprocess_infer includes validation and warnings for missing HVG names.
  • CLI Improvements: The infer command gained a --verbose flag to display HVG name mapping details, and the predict command now preserves HVG names in its output.
  • Documentation and Migration: Added new repository guidelines (AGENTS.md), a detailed migration guide (docs/migration/hvg_var_names.md) with a backfill script for existing data, and updated the README.md to reflect the new HVG name access.
  • Comprehensive Testing: New test suites have been added for HVG utilities, inference pipeline, prediction outputs, and preprocessing workflows, ensuring no regressions and proper functionality.


@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request is a well-structured enhancement to store highly variable gene (HVG) names directly within AnnData objects, which is a valuable improvement for downstream tooling. The changes include new constants and utility modules, updates to preprocessing and CLI commands, and comprehensive documentation and testing. While the overall implementation is solid, I've identified a couple of critical issues, including a hardcoded file path and a logic error in a default parameter that could lead to incorrect behavior. I've also noted some opportunities for code simplification to improve maintainability. Addressing these points will make this a very strong contribution.

Comment on lines 337 to 339
gene_names = np.load(
    "/large_storage/ctc/userspace/aadduri/datasets/tahoe_19k_to_2k_names.npy", allow_pickle=True
)

critical

This code contains a hardcoded absolute file path. This is a critical issue as it makes the code non-portable and will cause it to fail on any system where this specific path does not exist. This path should be removed or provided through a configuration option.
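One possible way to make such a path configurable is sketched below, under stated assumptions: the helper `resolve_gene_names_path` and the environment variable `STATE_GENE_NAMES_PATH` are hypothetical and not part of STATE. An explicit argument takes precedence, then the environment variable, and a missing file fails with a clear error instead of a hardcoded location.

```python
import os
from pathlib import Path
from typing import Optional

# Hypothetical environment variable name, used only for this illustration.
GENE_NAMES_ENV_VAR = "STATE_GENE_NAMES_PATH"


def resolve_gene_names_path(cli_path: Optional[str] = None) -> Optional[Path]:
    """Pick the gene-names file from an explicit argument or the environment.

    Returns None when nothing is configured, so the caller can skip the
    optional mapping step instead of crashing on a machine-specific path.
    """
    candidate = cli_path or os.environ.get(GENE_NAMES_ENV_VAR)
    if not candidate:
        return None
    path = Path(candidate).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"Gene names file not found: {path}")
    return path
```

A caller could then load the array only when a path is returned, for example `np.load(path, allow_pickle=True)` as in the snippet under review.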

@nick-youngblut nick-youngblut Dec 19, 2025

@abhinadduri
I'm not sure how to update this

@cursor cursor bot left a comment

- Moved HVG name assignment to a single conditional block for clarity and consistency.
- Removed redundant code for HVG name storage in `adata.uns` to enhance maintainability.
- Cleaned up test file by removing unused numpy import.
- Added a `--verbose` flag to the inference CLI for detailed output on gene name mapping.
- Implemented checks and logging for the presence of highly variable gene (HVG) names during inference.
- Updated prediction script to store HVG names in `adata.uns` for improved data consistency.
- Refactored code to streamline HVG name handling and ensure compatibility with existing workflows.
- Added logic to store highly variable gene (HVG) names in `adata.uns` using the defined constant `HVG_VAR_NAMES_KEY`.
- Initialized `hvg_names` variable to handle cases where HVG names may not be present.
- Enhanced the inference process to ensure HVG names are preserved for downstream analysis.
- Added 'tasks/' to .gitignore to prevent tracking of task-related files.
- Ensured that temporary files related to tasks are excluded from version control.
- Updated conditions for storing highly variable gene (HVG) names in `adata.uns` to check that the length of `hvg_uns_names` matches the shape of prediction arrays.
- This change prevents potential mismatches and ensures data integrity during the prediction process.
- Changed the argument type of `run_tx_predict` from `ap.ArgumentParser` to `ap.Namespace` for better clarity and functionality.
- Added additional parameters in the `_make_args` function to enhance flexibility in test cases.
- Updated test cases to include a new `toml` parameter for improved configuration handling.
- Revised command examples for model inference and embedding transformation to reflect updated argument names and paths.
- Enhanced clarity in data splitting logic and configuration validation sections, including required and optional parameters.
- Added new sections for preprocessing datasets and evaluating embedding models, providing users with comprehensive guidance on usage.

# Store HVG names if available
if hvg_names is not None:
    adata.uns[HVG_VAR_NAMES_KEY] = np.array(hvg_names, dtype=object)
Missing dimension validation when storing HVG names in infer

Medium Severity

When there's a dimension mismatch between the input obsm["X_hvg"] and model output (lines 840-847), sim_counts is reinitialized with the model's output dimension. However, at line 936, hvg_names (retrieved earlier from the input with potentially different length) is stored unconditionally without validating dimensions. This differs from _predict.py which validates len(hvg_uns_names) == final_preds.shape[1] before storing. The result could be an output file where uns["X_hvg_var_names"] length doesn't match obsm["X_hvg"] columns, causing incorrect gene mappings for downstream tools.
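A guarded store along the lines the review describes might look like the sketch below. It is an illustration, not the fix that was applied: `store_hvg_names_if_consistent` is a hypothetical helper, a plain dict stands in for `adata.uns`, and skipping with a warning on a mismatch is an assumption about the desired behavior.

```python
import logging
from typing import MutableMapping, Optional, Sequence

logger = logging.getLogger(__name__)

# Key named in the PR description; defined locally for this sketch.
HVG_VAR_NAMES_KEY = "X_hvg_var_names"


def store_hvg_names_if_consistent(
    uns: MutableMapping[str, object],
    hvg_names: Optional[Sequence[str]],
    n_output_columns: int,
) -> bool:
    """Store names only when there is exactly one per output column.

    Returns True if the names were written, False if they were skipped.
    """
    if hvg_names is None:
        return False
    if len(hvg_names) != n_output_columns:
        # Mirrors the check the review attributes to _predict.py:
        # len(hvg_uns_names) == final_preds.shape[1]
        logger.warning(
            "Skipping HVG names: %d names for %d output columns",
            len(hvg_names),
            n_output_columns,
        )
        return False
    uns[HVG_VAR_NAMES_KEY] = [str(name) for name in hvg_names]
    return True
```

The snippet under review stores `np.array(hvg_names, dtype=object)`; the sketch keeps a plain list only to stay dependency-free. In the inference path, `n_output_columns` would be the column count of the matrix actually written to `obsm["X_hvg"]`.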
