BioMapper is a general-purpose plugin- and strategy-based orchestration framework, with its first application in biological data harmonization. Architecturally, it blends elements of workflow engines (Nextflow, Snakemake, Kedro, Dagster) with a lightweight service-oriented API and a plugin registry backed by a unified UniversalContext. Its standout differentiator is an AI-native developer experience: CLAUDE.md, .claude/ scaffolding, custom slash commands, and the BioSherpa guide. This potentially makes it the first open-source bioinformatics orchestration platform with built-in LLM-assisted contributor workflows.
The result is a platform that is modular, extensible, and uniquely AI-augmented, well-positioned for long-term ecosystem growth. Built on a self-registering action system and YAML-based workflow definitions, it features a modern src-layout architecture with comprehensive test coverage and 2025 standardizations for production reliability.
- Self-registering action system - Actions automatically register via decorators
- Type-safe parameters - Pydantic models provide validation and IDE support
- YAML workflow definition - Declarative strategies without coding
- Real-time progress tracking - SSE events for long-running jobs
- Extensible architecture - Easy to add new actions and entity types
- AI-ready design - Built for integration with Claude Code and LLM assistance
BioMapper follows a modern microservices architecture with clear separation of concerns:
Core Design:
- YAML Strategies - Declarative configs defining pipelines of actions
- Action Registry - Self-registering via decorators; plug-and-play extensibility
- UniversalContext - Normalizes state access across heterogeneous action types (see the sketch after this list)
- Pydantic Models (v2) - Typed parameter models per action category
- Progressive Mapping - Iterative enrichment stages (65% → 80% coverage)
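To make the UniversalContext idea concrete, here is a minimal, illustrative sketch of the kind of wrapper it describes: uniform get/set access whether an action receives a plain dict or an attribute-based object. The real class lives under src/core/ and its API may differ.

```python
# Illustrative sketch only; not the actual UniversalContext implementation.
from typing import Any


class ContextWrapper:
    """Uniform access to heterogeneous execution contexts (dicts or objects)."""

    def __init__(self, raw: Any) -> None:
        self._raw = raw

    def get(self, key: str, default: Any = None) -> Any:
        if isinstance(self._raw, dict):
            return self._raw.get(key, default)
        return getattr(self._raw, key, default)

    def set(self, key: str, value: Any) -> None:
        if isinstance(self._raw, dict):
            self._raw[key] = value
        else:
            setattr(self._raw, key, value)
```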
Comparison to Known Patterns:
- Similar to: Nextflow & Snakemake (declarative pipelines), Kedro (typed configs + reproducibility), Dagster (observability and orchestration)
- Different from: Heavy orchestrators (Airflow, Beam); BioMapper is lighter, service/API-first, domain-agnostic, and tailored for interactive workflows
- Unique: Combines API service with strategy-based pipeline engine; domain-specific operations first (bio), but extensible beyond
Three-Layer Design:
- Client Layer - Python client library (`src/client/`) provides programmatic access
- API Layer - FastAPI service (`src/api/`) handles HTTP requests, job management, and background processing
- Core Layer - Business logic (`src/core/`) with self-registering actions and strategy execution engine
Key Architectural Patterns:
- Registry Pattern - Actions self-register at import time using decorators, eliminating manual registration (sketched after this list)
- Strategy Pattern - YAML configurations define workflows as sequences of pluggable actions
- Pipeline Pattern - Actions process data through a shared execution context, enabling complex workflows
- Type Safety - Pydantic models provide runtime validation and static type checking
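As a sketch of the registry pattern (the actual code lives in src/actions/registry.py and may differ in detail), a decorator-based registry can be as small as:

```python
# Illustrative sketch of decorator-based self-registration; see
# src/actions/registry.py for the real implementation.
from typing import Dict, Type

ACTION_REGISTRY: Dict[str, Type] = {}


def register_action(name: str):
    """Class decorator that records an action class under a string name."""
    def decorator(cls: Type) -> Type:
        ACTION_REGISTRY[name] = cls  # runs at import time, so importing the module registers the action
        return cls
    return decorator


@register_action("EXAMPLE_ACTION")
class ExampleAction:
    """Later looked up as ACTION_REGISTRY["EXAMPLE_ACTION"]."""
```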
Data Flow:
- User defines a workflow in YAML (strategy) or calls the API directly
- Client sends request to FastAPI server
- Server validates request and creates a background job
- MinimalStrategyService loads the strategy and executes actions sequentially (see the sketch after this list)
- Each action reads from and writes to a shared execution context
- Results persist to SQLite for recovery and progress tracking
- Client receives results via REST response or Server-Sent Events (SSE)
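A simplified sketch of this loop, reusing the illustrative registry above (the real MinimalStrategyService adds parameter substitution, job persistence, progress events, and error handling):

```python
# Illustrative execution loop; not the actual MinimalStrategyService.
# Assumes ACTION_REGISTRY from the registry sketch above and that each action
# exposes an async execute-style method (the real base class uses execute_typed).
async def run_strategy(strategy: dict) -> dict:
    context: dict = {"datasets": {}}                           # shared execution context
    for step in strategy["steps"]:
        action_cls = ACTION_REGISTRY[step["action"]["type"]]   # lookup by name
        params = step["action"].get("params", {})
        await action_cls().execute(params, context)            # each action mutates the shared context
    return context
```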
Design Principles:
- Modularity - Each action is independent and reusable
- Extensibility - New actions can be added without modifying core code
- Type Safety - Strong typing prevents runtime errors
- Reproducibility - YAML strategies ensure consistent execution
- Fault Tolerance - Job persistence enables recovery from failures
```mermaid
flowchart TB
    Client[Client Request] --> BiomapperClient
    BiomapperClient --> API[FastAPI Server]
    API --> MapperService[MapperService]
    MapperService --> MSS[MinimalStrategyService]
    MSS --> |Loads at startup| Config[(configs/strategies/*.yaml)]
    MSS --> |Executes actions| Registry["ACTION_REGISTRY<br/>Global Dict"]
    Registry --> |Lookup by name| ActionClass[Action Classes]
    ActionClass --> |Self-register| Decorator["@register_action"]
    ActionClass --> Execute[Execute Action]
    Execute --> Context["Execution Context<br/>Dict[str, Any]"]
    Context --> |Shared state| NextAction[Next Action]
    ActionClass -.-> TypedAction[TypedStrategyAction]
    ActionClass -.-> Pydantic[Pydantic Models]
    MSS --> |Job persistence| DB[(SQLite biomapper.db)]
```
| Component | Description | Location |
|---|---|---|
| src/actions/ | Action implementations with entity-based organization | src/actions/ |
| src/api/ | FastAPI REST service | src/api/ |
| src/client/ | Python client library | src/client/ |
| ACTION_REGISTRY | Global action registry | src/actions/registry.py |
| MinimalStrategyService | Strategy execution engine | src/core/minimal_strategy_service.py |
- Python 3.11+
- Poetry for dependency management
- Git for version control
```bash
# Clone repository
git clone https://github.com/biomapper/biomapper.git
cd biomapper

# Install dependencies with Poetry
poetry install --with dev,docs,api

# Activate virtual environment
poetry shell

# Run tests to verify installation
poetry run pytest

# Start the API server
poetry run uvicorn src.api.main:app --reload --port 8000
```

```bash
# Basic CLI commands
poetry run biomapper --help
poetry run biomapper health
poetry run biomapper metadata list
# Check API health
curl http://localhost:8000/health
# Run a strategy via API
# (Use Python client or direct HTTP calls)
```

```python
from src.client.client_v2 import BiomapperClient

# Synchronous usage (recommended for scripts)
client = BiomapperClient(base_url="http://localhost:8000")
result = client.run("test_metabolite_simple", parameters={
    "input_file": "/data/metabolites.csv",
    "output_dir": "/results"
})
print(f"Results: {result}")

# Async usage (for integration)
import asyncio

async def run_async():
    async with BiomapperClient() as client:
        result = await client.execute_strategy(
            "test_metabolite_simple",
            parameters={"input_file": "/data/metabolites.csv"}
        )
        return result

asyncio.run(run_async())
```

Create strategies in src/configs/strategies/:
```yaml
name: metabolite_harmonization
description: Harmonize metabolite identifiers across platforms

parameters:
  input_file: "${DATA_DIR}/metabolites.tsv"
  output_dir: "${OUTPUT_DIR}"
  fuzzy_threshold: 0.85

steps:
  - name: load_metabolites
    action:
      type: LOAD_DATASET_IDENTIFIERS
      params:
        file_path: "${parameters.input_file}"
        identifier_column: "compound_name"
        output_key: "input_metabolites"

  - name: enrich_with_cts
    action:
      type: METABOLITE_CTS_BRIDGE
      params:
        input_key: "input_metabolites"
        output_key: "cts_enriched"
        from_format: "Chemical Name"
        to_format: "InChIKey"

  - name: export_results
    action:
      type: EXPORT_DATASET_V2
      params:
        input_key: "cts_enriched"
        output_file: "${parameters.output_dir}/harmonized.csv"
        format: "csv"
```

| Action | Description |
|---|---|
| `LOAD_DATASET_IDENTIFIERS` | Load biological identifiers from CSV/TSV files |
| `MERGE_DATASETS` | Combine multiple datasets with deduplication |
| `FILTER_DATASET` | Apply filtering criteria to datasets |
| `EXPORT_DATASET_V2` | Export results to CSV/TSV/JSON formats |
| `CUSTOM_TRANSFORM_EXPRESSION` | Apply Python expressions to transform data (example below) |
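For example, a data operation plugs into a strategy as one step. The snippet below is illustrative: input_key/output_key follow the pattern used elsewhere in this README, but the exact field names (including expression) should be checked against the action's Pydantic parameter model.

```yaml
- name: normalize_names
  action:
    type: CUSTOM_TRANSFORM_EXPRESSION
    params:
      input_key: "input_metabolites"        # dataset produced by an earlier step
      output_key: "normalized_metabolites"
      expression: "name.strip().lower()"    # hypothetical expression field
```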
| Action | Description |
|---|---|
| `PROTEIN_EXTRACT_UNIPROT_FROM_XREFS` | Extract UniProt IDs from compound reference fields |
| `PROTEIN_NORMALIZE_ACCESSIONS` | Standardize protein accession formats |
| `PROTEIN_MULTI_BRIDGE` | Multi-source protein identifier resolution |
| `MERGE_WITH_UNIPROT_RESOLUTION` | Map identifiers to UniProt accessions |
| Action | Description |
|---|---|
| `METABOLITE_CTS_BRIDGE` | Chemical Translation Service API integration |
| `METABOLITE_EXTRACT_IDENTIFIERS` | Extract metabolite IDs from text fields |
| `METABOLITE_NORMALIZE_HMDB` | Standardize HMDB identifier formats |
| `METABOLITE_MULTI_BRIDGE` | Multi-database metabolite resolution |
| `NIGHTINGALE_NMR_MATCH` | Nightingale NMR platform matching |
| `SEMANTIC_METABOLITE_MATCH` | AI-powered semantic matching |
| `VECTOR_ENHANCED_MATCH` | Vector embedding similarity matching |
| `METABOLITE_API_ENRICHMENT` | Enrich via external metabolite APIs |
| `COMBINE_METABOLITE_MATCHES` | Merge results from multiple strategies |
| Action | Description |
|---|---|
| `CHEMISTRY_EXTRACT_LOINC` | Extract LOINC codes from clinical data |
| `CHEMISTRY_FUZZY_TEST_MATCH` | Fuzzy matching for clinical test names |
| `CHEMISTRY_VENDOR_HARMONIZATION` | Harmonize vendor-specific test codes |
| `CHEMISTRY_TO_PHENOTYPE_BRIDGE` | Link chemistry results to phenotypes |
| Action | Description |
|---|---|
| `CALCULATE_SET_OVERLAP` | Calculate Jaccard similarity between datasets |
| `CALCULATE_THREE_WAY_OVERLAP` | Three-way dataset comparison analysis |
| `CALCULATE_MAPPING_QUALITY` | Assess mapping quality metrics |
| `GENERATE_METABOLOMICS_REPORT` | Generate comprehensive metabolomics reports |
| `GENERATE_ENHANCEMENT_REPORT` | Create validation and enhancement reports |
| Action | Description |
|---|---|
| `SYNC_TO_GOOGLE_DRIVE_V2` | Upload results to Google Drive |
| `CHUNK_PROCESSOR` | Process large datasets in chunks |
Follow a Test-Driven Development (TDD) approach:
```python
# 1. Write test first (tests/unit/actions/test_my_action.py)
import pytest

from src.actions.my_action import MyAction, MyActionParams
from src.core.exceptions import ActionResult


@pytest.mark.asyncio  # requires pytest-asyncio (or asyncio_mode=auto)
async def test_my_action():
    params = MyActionParams(input_key="test", threshold=0.8)
    context = {"datasets": {"test": [{"id": "1", "name": "test", "score": 0.9}]}}

    action = MyAction()
    result = await action.execute_typed(params, context)

    assert result.success
    assert "processed" in context["datasets"]
    assert len(context["datasets"]["processed"]) == 1
```

```python
# 2. Implement action (src/actions/my_action.py)
from typing import Dict

from pydantic import Field

from src.actions.typed_base import TypedStrategyAction
from src.actions.registry import register_action
from src.core.standards.base_models import ActionParamsBase
from src.core.exceptions import ActionResult


class MyActionParams(ActionParamsBase):
    input_key: str = Field(..., description="Input dataset key")
    threshold: float = Field(0.8, ge=0.0, le=1.0)
    output_key: str = Field("processed", description="Output dataset key")


@register_action("MY_ACTION")
class MyAction(TypedStrategyAction[MyActionParams, ActionResult]):
    """Process biological data with threshold filtering."""

    def get_params_model(self) -> type[MyActionParams]:
        return MyActionParams

    async def execute_typed(self, params: MyActionParams, context: Dict) -> ActionResult:
        # Get input data
        input_data = context["datasets"].get(params.input_key, [])

        # Process with threshold
        processed = [item for item in input_data
                     if item.get("score", 0) >= params.threshold]

        # Store results
        context["datasets"][params.output_key] = processed

        return ActionResult(
            success=True,
            message=f"Processed {len(processed)} items",
            data={"filtered_count": len(input_data) - len(processed)}
        )
```

```bash
# Run all tests with coverage
poetry run pytest --cov=src --cov-report=html
# Run specific test categories
poetry run pytest tests/unit/ # Unit tests only
poetry run pytest tests/integration/ # Integration tests
poetry run pytest -k "test_my_action" # Specific test by name
# Debug failing test
poetry run pytest -xvs --pdb tests/unit/actions/test_my_action.py
```

```bash
# Format code
poetry run ruff format .
# Check and fix linting issues
poetry run ruff check . --fix
# Type checking
poetry run mypy src/
# Run all checks (recommended before committing)
make check  # Runs format, lint, typecheck, test, and docs
```

```bash
make test       # Run tests with coverage
make format     # Format code with ruff
make lint-fix   # Auto-fix linting issues
make typecheck  # Run mypy type checking
make check      # Run all checks
make docs       # Build documentation
make clean      # Clean cache files
```

```
biomapper/
├── src/                                 # Source code (src-layout)
│   ├── actions/                         # Action implementations
│   │   ├── entities/                    # Entity-specific actions
│   │   │   ├── proteins/                # Protein actions
│   │   │   ├── metabolites/             # Metabolite actions
│   │   │   └── chemistry/               # Chemistry actions
│   │   ├── algorithms/                  # Reusable algorithms
│   │   ├── utils/                       # Utilities
│   │   ├── io/                          # Input/output actions
│   │   └── registry.py                  # Action registry
│   ├── api/                             # FastAPI service
│   │   ├── routes/                      # API endpoints
│   │   ├── services/                    # API services
│   │   ├── core/                        # Core API logic
│   │   └── main.py                      # FastAPI app
│   ├── client/                          # Python client
│   │   ├── client_v2.py                 # Main client class
│   │   ├── exceptions.py                # Client exceptions
│   │   └── models.py                    # Client models
│   ├── core/                            # Core business logic
│   │   ├── algorithms/                  # Core algorithms
│   │   ├── infrastructure/              # Infrastructure components
│   │   ├── standards/                   # 2025 standardizations
│   │   └── minimal_strategy_service.py  # Strategy execution
│   ├── configs/
│   │   └── strategies/                  # YAML strategy definitions
│   ├── integrations/                    # External service clients
│   │   └── clients/                     # API clients (UniProt, etc.)
│   └── cli/                             # Command-line interface
├── tests/                               # Comprehensive test suite
│   ├── unit/                            # Unit tests (1,209 passing)
│   ├── integration/                     # Integration tests (8 passing)
│   └── performance/                     # Performance tests
├── scripts/                             # Development utilities
├── dev/                                 # Development standards
│   └── standards/                       # 2025 standardization guides
├── docs/                                # Documentation
├── CLAUDE.md                            # Claude Code instructions
├── Makefile                             # Development commands
└── pyproject.toml                       # Project configuration
```
Create a .env file in the project root:
```bash
# API Configuration
API_HOST=0.0.0.0
API_PORT=8000

# Data Directories
DATA_DIR=/path/to/data
OUTPUT_DIR=/path/to/output

# External Services (optional)
GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
OPENAI_API_KEY=your-api-key
LANGFUSE_PUBLIC_KEY=your-public-key
LANGFUSE_SECRET_KEY=your-secret-key
```

Strategies support variable substitution:
- `${parameters.key}` - Access strategy parameters
- `${env.VAR_NAME}` - Environment variables
- `${DATA_DIR}` - Shorthand for environment variables (combined example below)
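An illustrative snippet combining the three forms in one strategy (verify the exact env.* syntax against existing strategies in src/configs/strategies/):

```yaml
parameters:
  input_file: "${DATA_DIR}/metabolites.tsv"    # shorthand environment variable
  api_key: "${env.OPENAI_API_KEY}"             # explicit environment variable

steps:
  - name: load_metabolites
    action:
      type: LOAD_DATASET_IDENTIFIERS
      params:
        file_path: "${parameters.input_file}"  # strategy parameter reference
        identifier_column: "compound_name"
        output_key: "input_metabolites"
```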
BioMapper features an AI-native developer experience that sets it apart from traditional orchestration frameworks:
Three specialized frameworks provide automatic isolation and safe development:
Surgical Framework - Fix internal action logic while preserving interfaces
- Activated by: "surgical" or describing counting/calculation issues
- Guarantees: External interfaces remain unchanged
- Example: "The statistics are counting duplicates" → Fixes logic without breaking pipeline
Circuitous Framework - Repair pipeline orchestration and parameter flow
- Activated by: "circuitous" or describing parameter flow issues
- Guarantees: Traces and fixes YAML strategy execution
- Example: "Parameters not passing between steps" → Diagnoses and repairs flow
Interstitial Framework - Ensure 100% backward compatibility
- Activated by: "interstitial" or describing compatibility issues
- Guarantees: All existing strategies continue working
- Example: "New parameter broke old code" → Creates compatibility layers
These frameworks activate automatically based on natural language or can be called directly by name.
- CLAUDE.md - Project "constitution" providing role-defining guidance for AI agents
- .claude/ folder - Structured agent configs and scaffolding
- BioSherpa guide - AI-powered onboarding and project navigation
- Type-safe actions - Enable better code completion and error detection
- Self-documenting - Pydantic models include descriptions
- TDD approach - Tests provide clear specifications
- Copilot/Cody: Offer IDE assistance but don't ship with per-project scaffolding
- Claude-Orchestrator/Flow frameworks: Orchestrate multiple Claude agents, but not tied to strategy orchestration
- BioMapper: First to embed LLM-native scaffolding inside an orchestration framework repo, making the AI "part of the project contract"
- Contributor on-ramp: Immediate AI guidance without context overload
- Explicit developer ergonomics: CLAUDE.md provides consistent AI-facing knowledge
- Balanced with traditional docs: Non-AI contributors aren't excluded
```text
# Standard development requests
"Help me create a new action that extracts gene symbols from protein descriptions"
"Debug why my metabolite matching strategy is returning empty results"
"Optimize the CTS API calls to handle rate limiting better"

# Framework-specific requests (by name)
"Use surgical mode to fix the duplicate counting in statistics"
"Apply circuitous analysis to trace why parameters aren't substituting"
"Activate interstitial framework to add a new required parameter safely"

# Framework activation (by description)
"The protein count shows 3675 but should be 1172 unique" → Surgical
"The ${parameters.input_file} isn't being replaced in step 3" → Circuitous
"We renamed dataset_key to input_key and broke production" → Interstitial
```
- AI setup assumes Claude access, a barrier for external contributors
- Single-agent model; lacks multi-agent orchestration (architect/coder/tester roles)
- AI hooks primarily developer-facing; limited in runtime/CI integration
Comprehensive Test Suite Restoration (January 2025)
- Unit Tests: 1,209 passed, 86 skipped (99.3% success rate)
- Integration Tests: 8 passed, 7 skipped (100% success rate)
- Test Coverage: 79.69% (approaching 80% target)
- Total Passing Tests: 1,217 across all categories
Production-Ready Architecture Achieved:
- ✅ Barebones Architecture: Client → API → MinimalStrategyService → Self-Registering Actions
- ✅ Src-Layout Structure: Modern Python package organization
- ✅ Type Safety: Comprehensive Pydantic v2 migration
- ✅ Async HTTP Mocking: Migrated from `responses` to `respx` for httpx compatibility
- ✅ Standards Compliance: All 10 biomapper 2025 standardizations implemented
- ✅ Biological Data Testing: Real-world protein, metabolite, and chemistry data patterns
- ✅ Error Handling: Comprehensive edge case coverage and validation
Architectural Strengths:
- Clean modularity (strategy vs action vs context)
- Low barrier for extension (just register a new action)
- Declarative configuration approachable to non-programmers
- Pragmatic service orientation (FastAPI, Poetry, pytest, Pydantic)
Gaps & Opportunities:
- No DAG/conditional execution in YAML
- Limited provenance/lineage tracking
- Potential performance bottlenecks at scale (10K–1M records)
- Observability/logging not yet first-class
Next Development Priorities:
- Environment Setup Tools - Restore setup wizards and configuration automation
- Performance Monitoring - Rebuild complexity audit and optimization detection
- CLI Enhancement - Expand command-line interface capabilities
- External Integrations - Enhanced Google Drive sync and external API connections
- Multi-agent workflows - Expand beyond single-agent AI model
- AI-driven reviews - Automated code and strategy review capabilities
- End-user AI features - Runtime AI assistance and intelligent suggestions
- CLAUDE.md - Instructions for Claude Code and development
- dev/standards/ - 2025 standardization guides
- src/actions/CLAUDE.md - Creating new actions
- API Documentation - Interactive API docs (when server running)
- Test Coverage Report - Detailed coverage analysis
- Parameter Naming Standard
- Algorithm Complexity Guide
- Context Handling Guide
- Pydantic Migration Guide
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Write tests first (TDD approach)
- Implement your feature
- Run checks (`make check`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Built with FastAPI and Pydantic
- Testing with pytest
- Code quality with ruff
- Dependency management with Poetry
For questions, issues, or contributions, please open an issue on GitHub or contact the maintainers.
Note: This project has achieved production-ready stability with comprehensive test coverage (1,217 passing tests) and standardized architecture. The 2025 standardizations ensure reliable biological data processing workflows.