Store and Access HVG Gene Names in AnnData #246
…mmands, coding style, testing, and security
- Added a `--verbose` flag to the inference CLI for detailed gene name mapping output.
- Implemented HVG name retrieval and validation in preprocessing and prediction scripts.
- Introduced constants for HVG variable names and updated relevant functions to utilize them.
- Enhanced logging for HVG name availability and warnings for missing data.
- Updated dataset class to default to the new HVG names key.

- Added storage of highly variable gene (HVG) names in `adata.uns["X_hvg_var_names"]` for improved downstream mapping.
- Updated documentation to reflect changes in HVG gene name access and backward compatibility.
- Introduced tests to validate HVG name retrieval and ensure compatibility with existing workflows.
- Enhanced inference and preprocessing scripts to preserve HVG names during data processing.
Summary of Changes
Hello @nick-youngblut, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances the STATE framework's ability to manage Highly Variable Genes (HVGs) by ensuring their names are consistently stored and accessible within AnnData objects. This explicit storage facilitates seamless integration with downstream analysis tools that require gene ID mapping, improving data integrity and usability. The changes are implemented with full backward compatibility, providing clear migration paths and robust error handling for existing datasets.
Code Review
This pull request is a well-structured enhancement to store highly variable gene (HVG) names directly within AnnData objects, which is a valuable improvement for downstream tooling. The changes include new constants and utility modules, updates to preprocessing and CLI commands, and comprehensive documentation and testing. While the overall implementation is solid, I've identified a couple of critical issues, including a hardcoded file path and a logic error in a default parameter that could lead to incorrect behavior. I've also noted some opportunities for code simplification to improve maintainability. Addressing these points will make this a very strong contribution.
src/state/_cli/_tx/_predict.py
Outdated
    gene_names = np.load(
        "/large_storage/ctc/userspace/aadduri/datasets/tahoe_19k_to_2k_names.npy", allow_pickle=True
    )
@abhinadduri
I'm not sure how to update this
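One possible direction for removing the hardcoded path, sketched under assumptions: prefer the names stored in `adata.uns["X_hvg_var_names"]` (the key this PR introduces) and only read a `.npy` file when a path is passed in explicitly. The helper name `resolve_gene_names` and the `fallback_path` parameter are hypothetical and are not taken from the PR; the `SimpleNamespace` object stands in for an AnnData object, which exposes `.uns` as a dict-like mapping.

```python
import os
from types import SimpleNamespace

import numpy as np

HVG_VAR_NAMES_KEY = "X_hvg_var_names"  # key named in this PR


def resolve_gene_names(adata, fallback_path=None):
    """Hypothetical helper: return gene names without a hardcoded path.

    Order of preference:
    1. names stored in adata.uns[HVG_VAR_NAMES_KEY]
    2. a .npy file at a caller-supplied path (e.g. from a CLI argument)
    3. None, so the caller can decide how to handle missing names
    """
    names = adata.uns.get(HVG_VAR_NAMES_KEY)
    if names is not None:
        return np.asarray(names, dtype=object)
    if fallback_path is not None and os.path.exists(fallback_path):
        return np.load(fallback_path, allow_pickle=True)
    return None


# Stand-ins for AnnData objects (assumption: only `.uns` is needed here).
with_names = SimpleNamespace(uns={HVG_VAR_NAMES_KEY: ["GeneA", "GeneB"]})
without_names = SimpleNamespace(uns={})

names_from_uns = resolve_gene_names(with_names)
names_missing = resolve_gene_names(without_names)
```

With this shape, the path becomes an explicit input rather than a constant baked into `_predict.py`, and files that already carry HVG names need no external file at all.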
- Moved HVG name assignment to a single conditional block for clarity and consistency.
- Removed redundant code for HVG name storage in `adata.uns` to enhance maintainability.
- Cleaned up test file by removing unused numpy import.

- Added a `--verbose` flag to the inference CLI for detailed output on gene name mapping.
- Implemented checks and logging for the presence of highly variable gene (HVG) names during inference.
- Updated prediction script to store HVG names in `adata.uns` for improved data consistency.
- Refactored code to streamline HVG name handling and ensure compatibility with existing workflows.

- Added logic to store highly variable gene (HVG) names in `adata.uns` using the defined constant `HVG_VAR_NAMES_KEY`.
- Initialized `hvg_names` variable to handle cases where HVG names may not be present.
- Enhanced the inference process to ensure HVG names are preserved for downstream analysis.

- Added 'tasks/' to .gitignore to prevent tracking of task-related files.
- Ensured that temporary files related to tasks are excluded from version control.

- Updated conditions for storing highly variable gene (HVG) names in `adata.uns` to check that the length of `hvg_uns_names` matches the shape of prediction arrays.
- This change prevents potential mismatches and ensures data integrity during the prediction process.

- Changed the argument type of `run_tx_predict` from `ap.ArgumentParser` to `ap.Namespace` for better clarity and functionality.
- Added additional parameters in the `_make_args` function to enhance flexibility in test cases.
- Updated test cases to include a new `toml` parameter for improved configuration handling.

- Revised command examples for model inference and embedding transformation to reflect updated argument names and paths.
- Enhanced clarity in data splitting logic and configuration validation sections, including required and optional parameters.
- Added new sections for preprocessing datasets and evaluating embedding models, providing users with comprehensive guidance on usage.
    # Store HVG names if available
    if hvg_names is not None:
        adata.uns[HVG_VAR_NAMES_KEY] = np.array(hvg_names, dtype=object)
Missing dimension validation when storing HVG names in infer
Medium Severity
When there's a dimension mismatch between the input `obsm["X_hvg"]` and the model output (lines 840-847), `sim_counts` is reinitialized with the model's output dimension. However, at line 936, `hvg_names` (retrieved earlier from the input, with a potentially different length) is stored unconditionally without validating dimensions. This differs from `_predict.py`, which validates `len(hvg_uns_names) == final_preds.shape[1]` before storing. The result could be an output file where the length of `uns["X_hvg_var_names"]` doesn't match the number of `obsm["X_hvg"]` columns, causing incorrect gene mappings for downstream tools.
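The check described above could look roughly like the following. This is a minimal sketch, not the code that landed in the PR: the function name `store_hvg_names_if_consistent` and its parameters are hypothetical, and a `SimpleNamespace` stands in for the output AnnData object. The idea is simply to compare the number of names with the number of output columns before writing to `uns`, mirroring the condition quoted from `_predict.py`.

```python
import logging
from types import SimpleNamespace

import numpy as np

logger = logging.getLogger(__name__)
HVG_VAR_NAMES_KEY = "X_hvg_var_names"  # key named in this PR


def store_hvg_names_if_consistent(adata, hvg_names, n_output_features):
    """Hypothetical guard: store names only if they match the output width.

    Returns True when the names were written to adata.uns, False otherwise.
    """
    if hvg_names is None:
        return False
    if len(hvg_names) != n_output_features:
        logger.warning(
            "Skipping HVG names: %d names for %d output columns",
            len(hvg_names),
            n_output_features,
        )
        return False
    adata.uns[HVG_VAR_NAMES_KEY] = np.array(hvg_names, dtype=object)
    return True


# Matching case: two names, two output columns.
out_ok = SimpleNamespace(uns={}, obsm={"X_hvg": np.zeros((3, 2))})
stored_ok = store_hvg_names_if_consistent(
    out_ok, ["GeneA", "GeneB"], out_ok.obsm["X_hvg"].shape[1]
)

# Mismatch case: the model output has four columns but only two names exist.
out_bad = SimpleNamespace(uns={}, obsm={"X_hvg": np.zeros((3, 4))})
stored_bad = store_hvg_names_if_consistent(
    out_bad, ["GeneA", "GeneB"], out_bad.obsm["X_hvg"].shape[1]
)
```

Skipping the write on a mismatch, rather than raising, keeps inference running while avoiding an output file whose names silently disagree with its columns; whether a warning or a hard error is preferable is a design choice for the maintainers.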
Summary
This PR enhances STATE's handling of highly variable genes (HVGs) by storing gene names directly in AnnData objects alongside the HVG expression matrix. This enables downstream tools like `pdex` to properly map predictions back to gene IDs without requiring additional metadata files.
Changes
Core Infrastructure
- `src/state/tx/constants.py`: Centralizes shared constants for TX workflows
- `src/state/tx/utils/hvg.py`: Provides functions to retrieve and validate HVG gene names with fallback mechanisms
Preprocessing Enhancements
- `preprocess_train`: Now stores HVG gene names in `adata.uns["X_hvg_var_names"]` alongside the HVG matrix in `adata.obsm["X_hvg"]`
CLI Improvements
- `infer` command: Added `--verbose` flag to show HVG name mapping details and status reporting
- `predict` command: Preserves HVG names in prediction outputs
Documentation & Migration
- `AGENTS.md`: Comprehensive development guidelines
- `docs/migration/hvg_var_names.md`: Backward compatibility notes and backfill script for existing data
Testing
Backward Compatibility
This change is fully backward compatible:
- `adata.var.highly_variable` when available
Usage Examples
Accessing HVG Gene Names
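A minimal sketch of how the stored names might be read and used to index the HVG matrix. It relies only on the two keys named in this PR (`obsm["X_hvg"]` and `uns["X_hvg_var_names"]`) and on the `{obsm_key}_var_names` naming pattern; the helper names `var_names_key` and `get_hvg_names` are hypothetical, and a `SimpleNamespace` stands in for an AnnData object, which exposes the same `.obsm` and `.uns` mappings.

```python
from types import SimpleNamespace

import numpy as np


def var_names_key(obsm_key):
    """Build the uns key for an obsm matrix, e.g. X_hvg -> X_hvg_var_names."""
    return f"{obsm_key}_var_names"


def get_hvg_names(adata, obsm_key="X_hvg"):
    """Hypothetical accessor: return stored names, or None if absent."""
    names = adata.uns.get(var_names_key(obsm_key))
    if names is None:
        return None
    names = np.asarray(names, dtype=object)
    n_cols = adata.obsm[obsm_key].shape[1]
    if len(names) != n_cols:
        raise ValueError(
            f"{len(names)} names stored for {n_cols} columns in obsm[{obsm_key!r}]"
        )
    return names


# Stand-in for an AnnData object with two cells and three HVGs.
adata = SimpleNamespace(
    obsm={"X_hvg": np.array([[1.0, 0.0, 2.0], [0.5, 3.0, 0.0]])},
    uns={"X_hvg_var_names": np.array(["GeneA", "GeneB", "GeneC"], dtype=object)},
)

hvg_names = get_hvg_names(adata)

# Map each gene name to its column so values can be looked up by gene.
column_of = {name: i for i, name in enumerate(hvg_names)}
gene_b_values = adata.obsm["X_hvg"][:, column_of["GeneB"]]
```

Here `gene_b_values` holds the `GeneB` column for every cell; with a real file, the same lookups would apply to an object returned by `anndata.read_h5ad`.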
Backfilling Existing Data
For pre-existing datasets, use the provided backfill script to add HVG names to existing files.
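The backfill script itself is not reproduced on this page. The sketch below only illustrates the general idea under stated assumptions: that the columns of `obsm["X_hvg"]` follow the order of `var_names` filtered by `var["highly_variable"]`, and that the names are simply absent from `uns`. The function name `backfill_hvg_names` is hypothetical, and a `SimpleNamespace` with plain dicts and arrays stands in for an AnnData object.

```python
from types import SimpleNamespace

import numpy as np


def backfill_hvg_names(adata, obsm_key="X_hvg"):
    """Hypothetical backfill: derive HVG names from var['highly_variable'].

    Returns True if names were added, False if they were already present.
    Assumes the obsm columns are ordered like var_names[highly_variable].
    """
    key = f"{obsm_key}_var_names"
    if key in adata.uns:
        return False
    if "highly_variable" not in adata.var:
        raise KeyError("var has no 'highly_variable' column to backfill from")
    mask = np.asarray(adata.var["highly_variable"], dtype=bool)
    names = np.asarray(adata.var_names, dtype=object)[mask]
    n_cols = adata.obsm[obsm_key].shape[1]
    if len(names) != n_cols:
        raise ValueError(
            f"{len(names)} highly variable genes but {n_cols} columns in obsm[{obsm_key!r}]"
        )
    adata.uns[key] = names
    return True


# Stand-in for a pre-existing dataset without stored HVG names.
legacy = SimpleNamespace(
    var_names=np.array(["GeneA", "GeneB", "GeneC"]),
    var={"highly_variable": np.array([True, False, True])},
    obsm={"X_hvg": np.zeros((2, 2))},
    uns={},
)

added_first_time = backfill_hvg_names(legacy)
added_second_time = backfill_hvg_names(legacy)  # no-op: names already present
```

For files on disk, a wrapper would presumably load the object with `anndata.read_h5ad`, call a function like this, and save with `write_h5ad`; the script referenced in `docs/migration/hvg_var_names.md` remains the authoritative version.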
Technical Details
- `{obsm_key}_var_names` allows extension to other embedding types
Testing
Risks & Mitigations
Note
Implements explicit storage and propagation of highly variable gene (HVG) names to enable downstream mapping and compatibility.
- `state/tx/constants.py` and `state/tx/utils/hvg.py` for HVG name keys, retrieval, fallbacks, and validation
- `tx preprocess_train` now writes HVG matrix to `obsm['X_hvg']` and names to `uns['X_hvg_var_names']`; `tx preprocess_infer` logs presence and warns if missing
- `tx infer` gains `--verbose`, reports HVG mapping status, and writes `uns['X_hvg_var_names']` to outputs; `tx predict` includes HVG names in both `adata_pred` and `adata_real`
- `scgpt_perturbation_dataset.py` defaults to reading HVG names from `uns['X_hvg_var_names']`
- `README.md` with HVG access examples and CLI fixes; add migration guide `docs/migration/hvg_var_names.md`; add `AGENTS.md`
Written by Cursor Bugbot for commit 98d205d. This will update automatically on new commits.