biomapper2

This is a package for mapping biomedical entities to the KRAKEN knowledge graph, whether starting from text names or vocabulary/ontology IDs (local IDs or CURIEs).

It supports both single-entity lookups and dataset-level batch processing, and performs three tasks:

  1. entity linking (text name → CURIE)
  2. ID normalization (messy local ID → CURIE)
  3. entity resolution (CURIE → canonical CURIE, by leveraging the CURIE equivalencies in the KRAKEN knowledge graph)

All CURIEs are represented in Biolink-standard format.
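To make the ID normalization step concrete, here is a minimal sketch of turning a local ID into a CURIE. The `BIOLINK_PREFIXES` table and the `to_curie` helper are hypothetical names invented for illustration; they are not biomapper2's API, and the package's own normalizer (with per-vocabulary validators and cleaners) is likely more thorough.

```python
# Illustrative only: this table and helper are assumptions, not biomapper2 code.
BIOLINK_PREFIXES = {
    "kegg": "KEGG.COMPOUND",
    "pubchem": "PUBCHEM.COMPOUND",
}


def to_curie(vocab: str, local_id: str) -> str:
    """Turn a (vocabulary, local ID) pair into a 'PREFIX:local_id' CURIE."""
    prefix = BIOLINK_PREFIXES[vocab.strip().lower()]
    # Drop surrounding whitespace and any prefix the ID already carries.
    cleaned = local_id.strip().split(":")[-1]
    return f"{prefix}:{cleaned}"


print(to_curie("kegg", " C00487 "))  # KEGG.COMPOUND:C00487
print(to_curie("PubChem", "10917"))  # PUBCHEM.COMPOUND:10917
```

An unknown vocabulary raises `KeyError` in this sketch, which keeps unmapped ID columns from passing through silently.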

⚠️ Note: This package is in active development. Feedback and issues welcome!

Setup

Install uv (if not already installed)

macOS/Linux:

curl -LsSf https://astral.sh/uv/install.sh | sh

For other platforms, see uv installation docs.

Clone and install

git clone https://github.com/Phenome-Health/biomapper2.git
cd biomapper2
uv sync --dev

This will create a virtual environment and install all dependencies.

Next, create a .env file from the provided template. Run this from the repository root (the biomapper2 directory you entered above):

cp .env.example .env

Then edit .env, replacing the placeholders with the actual secrets.
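If it helps, a quick way to confirm that .env defines every variable listed in .env.example is sketched below. The helpers `env_keys` and `missing_keys` are hypothetical and not part of biomapper2; the sketch assumes both files use plain `KEY=value` lines and only compares names, so it cannot tell whether a value is still a placeholder.

```python
from pathlib import Path


def env_keys(path: str) -> set[str]:
    """Return the variable names defined in a KEY=value style file."""
    keys = set()
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not an assignment.
        if line and not line.startswith("#") and "=" in line:
            keys.add(line.split("=", 1)[0].strip())
    return keys


def missing_keys(example_path: str = ".env.example", env_path: str = ".env") -> set[str]:
    """Names present in the example file but absent from the real one."""
    return env_keys(example_path) - env_keys(env_path)
```

Calling `missing_keys()` from the repository root returns an empty set when every name from the template is present.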

Finally, run the pytest suite to confirm that everything works.

Usage

Map a single entity to knowledge graph

from biomapper2.mapper import Mapper

mapper = Mapper()

item = {
    'name': 'carnitine',
    'kegg': ['C00487'],
    'pubchem': '10917'
}

mapped_item = mapper.map_entity_to_kg(
    item=item,
    name_field='name',
    provided_id_fields=['kegg', 'pubchem'],
    entity_type='metabolite'
)

Map a dataset to knowledge graph

from biomapper2.mapper import Mapper

mapper = Mapper()

mapper.map_dataset_to_kg(
    dataset='data/examples/olink_protein_metadata.tsv',
    entity_type='protein',
    name_column='Assay',
    provided_id_columns=['UniProt'],
    array_delimiters=['_']
)
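The `array_delimiters` argument suggests that a single cell may hold several IDs joined by a delimiter. A rough sketch of that kind of splitting follows; `split_ids` is a hypothetical helper written for illustration, and biomapper2's actual handling may differ. The cell values are made up and are not taken from the example dataset.

```python
import re


def split_ids(cell: str, delimiters: list[str]) -> list[str]:
    """Split one cell into individual IDs on any of the given delimiters."""
    if not delimiters:
        return [cell.strip()] if cell.strip() else []
    # Escape each delimiter so characters like '|' or '.' are taken literally.
    pattern = "|".join(re.escape(d) for d in delimiters)
    return [part.strip() for part in re.split(pattern, cell) if part.strip()]


print(split_ids("P01137_P10600", ["_"]))  # ['P01137', 'P10600']
print(split_ids("P05231", ["_"]))         # ['P05231']
```

Empty fragments are dropped, so a trailing delimiter does not produce a blank ID.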

See examples/ for complete working examples.

Visualize KG performance across datasets

from biomapper2.visualizer import Visualizer

viz = Visualizer()

# Collect metrics from JSON files named {dataset}_{entity}_MAPPED_a_summary_stats.json
stats_df = viz.aggregate_stats(
    stats_dir='data/examples/synthetic_stats/'
)

viz.render_heatmap(
    df=stats_df,
    output_path='docs/assets/comparison_viz'  # defaults to producing pdf and png; configurable via Visualizer(...)
)
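For readers curious how stats files following the naming pattern above could be gathered by hand, here is a standard-library sketch. `collect_stats` is a hypothetical function, not the implementation of `aggregate_stats`, and it assumes the entity type is the last underscore-separated token before `_MAPPED` (so dataset names may contain underscores, but entity types may not).

```python
import json
from pathlib import Path

SUFFIX = "_MAPPED_a_summary_stats.json"


def collect_stats(stats_dir: str) -> list[dict]:
    """Load every summary-stats JSON in stats_dir and tag it with dataset/entity."""
    rows = []
    for path in sorted(Path(stats_dir).glob(f"*{SUFFIX}")):
        stem = path.name[: -len(SUFFIX)]
        # Assumption: the entity type is the last underscore-separated token.
        dataset, _, entity = stem.rpartition("_")
        with path.open() as handle:
            stats = json.load(handle)
        rows.append({"dataset": dataset, "entity": entity, "stats": stats})
    return rows
```

Calling `collect_stats('data/examples/synthetic_stats/')` would return one dictionary per matching file, which can then be loaded into a DataFrame or inspected directly.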

Run examples

uv run python examples/basic_entity_kg_mapping.py
uv run python examples/basic_dataset_kg_mapping.py

Run tests

uv run pytest          # Run all tests
uv run pytest -v       # Run with verbose output
uv run pytest -vs      # Run with verbose output and logging/prints displayed

Note: Tests run automatically on every commit via GitHub Actions (CI/CD).

Development

Quick Start

Run all code quality checks before committing:

./scripts/check.sh     # Run ruff, black, pyright, and pytest
./scripts/fix.sh       # Auto-fix formatting and linting issues

For detailed contribution guidelines, code style standards, and workflow practices, see docs/CONTRIBUTING.md.

Project structure

src/biomapper2/
├── mapper.py                   # Main Mapper class - entry point for entity/dataset mapping
├── config.py                   # Configuration (KG API endpoint, logging, etc.)
├── core/
│   ├── annotation_engine.py    # Orchestrates annotation of entities with ontology local IDs
│   ├── annotators/             # Individual annotator implementations (Kestrel text search, etc.)
│   │   ├── base.py             # Base annotator interface
│   │   └── kestrel_text.py     # Kestrel text search annotator
│   ├── normalizer/             # ID normalization package
│   │   ├── normalizer.py       # Main Normalizer class
│   │   ├── validators.py       # ID validation functions for different vocabularies
│   │   ├── cleaners.py         # ID cleaning/standardization functions
│   │   └── vocab_config.py     # Biolink prefix mappings and validator configurations
│   ├── linker.py               # Links CURIEs to knowledge graph nodes
│   └── resolver.py             # Resolves one-to-many entity→KG matches
├── utils.py                    # Utility functions
└── visualizer.py               # Visualize KG performance across datasets

examples/                       # Working code examples
tests/                          # Pytest test suite
data/                           # Example and groundtruth datasets
scripts/                        # Development scripts (check.sh, fix.sh)

Configuration

Edit src/biomapper2/config.py to customize:

  • KESTREL_API_URL - Knowledge graph API endpoint (default: production server)
  • BIOLINK_VERSION_DEFAULT - Default Biolink model version
  • LOG_LEVEL - Logging verbosity (DEBUG, INFO, WARNING, ERROR, CRITICAL)
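As a rough illustration of how a string setting such as LOG_LEVEL is typically applied, the sketch below validates a level name and sets it on the root logger. `configure_logging` is a hypothetical helper; how biomapper2 itself consumes LOG_LEVEL is not documented here, so treat this as an assumption.

```python
import logging

LOG_LEVEL = "INFO"  # assumed to be a plain string, as in the list above


def configure_logging(level_name: str) -> int:
    """Validate a level name and apply it to the root logger."""
    # getLevelName returns the numeric level for a known name,
    # and a string such as 'Level XYZ' for an unknown one.
    level = logging.getLevelName(level_name.upper())
    if not isinstance(level, int):
        raise ValueError(f"Unknown log level: {level_name!r}")
    logging.getLogger().setLevel(level)
    return level


configure_logging(LOG_LEVEL)
```

Rejecting unknown names early means a typo in the config surfaces as an error instead of silently falling back to a default level.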
