A command-line tool to detect untagged SBIR Phase III transitions by analyzing federal spending data and identifying potential commercialization patterns.
This system processes bulk federal spending data using a combination of heuristics and configuration-driven detection to identify potential SBIR Phase III transitions. The primary goal is a reliable, auditable process for surfacing SBIR commercialization that is not officially flagged, backed by comprehensive data analysis and export capabilities.
- High-Confidence Detection: Identifies likely transitions using strong structural signals like sole-source contracts
- Broad Search Capabilities: Detects competed and cross-service transitions
- Command-Line Interface: Easy-to-use CLI tools for data loading, processing, and export
- Auditable Evidence: Each detection includes a comprehensive evidence bundle
- Bulk Data Processing: Efficiently processes large federal spending datasets with progress tracking
- Multiple Export Formats: Export results in JSONL, CSV, or Excel format
- Rich Progress Indicators: Visual progress bars and detailed status updates for long-running operations
- Configuration-Driven: Customizable detection thresholds and scoring parameters via YAML config
- Python 3.11+
- Poetry (for dependency management)
1. Clone the repository:

   git clone <repository-url>
   cd sbir-transition-classifier

2. Install dependencies:

   poetry install

3. Prepare data:

   - Place SBIR awards data in data/awards.csv
   - Add federal contract data files to the data/ directory; the system will automatically discover CSV files
   - See the Data Setup section below for detailed requirements
# Run complete detection pipeline
poetry run sbir-detect bulk-process --verbose
# Or step-by-step:
# Load SBIR awards
poetry run sbir-detect data load-sbir --file-path data/awards.csv --verbose
# Load contract data
poetry run sbir-detect data load-contracts --file-path data/contracts.csv --verbose
# Export results
poetry run sbir-detect export jsonl --output-path output/detections.jsonl --verbose
poetry run sbir-detect export csv --output-path output/summary.csv --verbose
# View system info
poetry run sbir-detect info

- Language: Python 3.11
- CLI Framework: Click
- Progress Indicators: Rich, tqdm
- Database: SQLite
- Data Processing: Pandas
- Logging: Loguru
- Configuration: YAML-based schema with Pydantic
- Dependency Management: Poetry
src/sbir_transition_classifier/
├── cli/ # Command-line interface
│ ├── main.py # CLI entry point
│ ├── bulk.py # Bulk processing command
│ ├── run.py # Single-run command
│ ├── data.py # Data loading commands
│ ├── export.py # Export commands
│ ├── reports.py # Reporting commands
│ ├── validate.py # Configuration validation
│ ├── reset.py # Configuration reset/templates
│ ├── hygiene.py # Data quality commands
│ └── output.py # Output generation utilities
├── core/ # Shared configuration and models
│ ├── models.py # Data models (Vendor, Contract, Award)
│ └── vendor_matching.py # Vendor name normalization
├── config/ # Configuration management
│ ├── loader.py # Config file loading
│ ├── schema.py # Config schema and validation
│ └── reset.py # Config templates and reset
├── data/ # Data schemas and validation
│ ├── models.py # Pydantic schemas
│ ├── evidence.py # Evidence bundle definitions
│ └── schemas.py # Data transfer schemas
├── detection/ # Detection algorithms
│ ├── main.py # Main detection logic
│ ├── pipeline.py # Detection pipeline
│ ├── scoring.py # Scoring algorithms
│ └── heuristics.py # Heuristic rules
├── db/ # Database access layer
│ ├── database.py # SQLAlchemy setup
│ ├── config.py # DB configuration
│ └── queries.py # SQL queries
├── ingestion/ # Data loading
│ ├── sbir.py # SBIR data ingestion
│ ├── contracts.py # Contract data ingestion
│ ├── base.py # Base ingester
│ └── factory.py # Ingester factory
└── utils/ # Utility functions
├── dates.py # Date calculations
└── __init__.py
tests/
├── unit/ # Unit tests
└── integration/ # Integration tests
output/ # Generated reports and exports
data/ # Input data files (not in git)
The system uses five main database entities:
- vendors: Commercial entities receiving awards
- vendor_identifiers: Cross-walking between ID systems (UEI, CAGE, DUNS)
- sbir_awards: SBIR Phase I and II awards
- contracts: Federal contract vehicles from FPDS/USAspending
- detections: Identified potential transitions with evidence bundles
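The five entities above map naturally onto a small relational schema. A minimal sketch using the stdlib `sqlite3` module (the real project uses SQLAlchemy models in `core/models.py` and `db/database.py`; table and column names here are illustrative):

```python
import sqlite3

# Illustrative schema for the five entities; not the project's actual DDL.
SCHEMA = """
CREATE TABLE vendors (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE vendor_identifiers (
    id INTEGER PRIMARY KEY,
    vendor_id INTEGER REFERENCES vendors(id),
    scheme TEXT,   -- e.g. 'UEI', 'CAGE', 'DUNS'
    value TEXT
);
CREATE TABLE sbir_awards (
    id INTEGER PRIMARY KEY,
    vendor_id INTEGER REFERENCES vendors(id),
    piid TEXT, phase TEXT, agency TEXT,
    award_date TEXT, completion_date TEXT
);
CREATE TABLE contracts (
    id INTEGER PRIMARY KEY,
    vendor_id INTEGER REFERENCES vendors(id),
    piid TEXT, agency TEXT, start_date TEXT
);
CREATE TABLE detections (
    id INTEGER PRIMARY KEY,
    award_id INTEGER REFERENCES sbir_awards(id),
    contract_id INTEGER REFERENCES contracts(id),
    score REAL,
    evidence_json TEXT   -- serialized evidence bundle
);
CREATE INDEX idx_contracts_vendor ON contracts(vendor_id);
"""

def init_db(path=":memory:"):
    """Create a fresh database with the sketched schema."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

The index on `contracts(vendor_id)` reflects the access pattern the detector needs: given a vendor with a Phase II award, look up its later contracts quickly.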
- Same agency (service branch match)
- Sole-source contract awards
- Timing within configured window (default: 1-24 months) after Phase II completion
- Service/topic continuity
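The timing signal above can be sketched as a simple calendar-month delta check (function names and window defaults are illustrative, not the actual `utils/dates.py` API):

```python
from datetime import date

def months_between(start: date, end: date) -> int:
    """Whole calendar months from start to end."""
    return (end.year - start.year) * 12 + (end.month - start.month)

def within_transition_window(phase2_completion: date, contract_start: date,
                             min_months: int = 1, max_months: int = 24) -> bool:
    """True if the contract starts inside the configured window after Phase II completion."""
    delta = months_between(phase2_completion, contract_start)
    return min_months <= delta <= max_months
```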
- Cross-service transitions (same department, different branch)
- Competed contracts with SBIR indicators
- Department-level continuity
- Text-based description analysis
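Conceptually, these signals are combined into a single confidence score and compared against the configured thresholds. A hedged sketch (weights and thresholds mirror the sample YAML configuration; the actual logic lives in `detection/scoring.py` and is more nuanced than boolean signals):

```python
# Weights and thresholds mirror the sample config; signals are booleans for simplicity.
WEIGHTS = {
    "sole_source": 0.30,
    "timing": 0.25,
    "agency_match": 0.20,
    "vendor_match": 0.25,
}
THRESHOLDS = {"high_confidence": 0.85, "likely_transition": 0.65}

def score(signals: dict) -> float:
    """Weighted sum of the signals that fired."""
    return sum(w for name, w in WEIGHTS.items() if signals.get(name))

def classify(signals: dict) -> str:
    """Bucket a candidate transition by its score."""
    s = score(signals)
    if s >= THRESHOLDS["high_confidence"]:
        return "high_confidence"
    if s >= THRESHOLDS["likely_transition"]:
        return "likely_transition"
    return "no_detection"
```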
Detection thresholds and parameters are configured via YAML:
detection:
  eligible_phases: ["Phase II"]
  thresholds:
    high_confidence: 0.85
    likely_transition: 0.65
  timing:
    min_months_after_phase2: 1
    max_months_after_phase2: 24
  scoring:
    sole_source_weight: 0.30
    timing_weight: 0.25
    agency_match_weight: 0.20
    vendor_match_weight: 0.25

bulk-process - Complete end-to-end detection pipeline
poetry run sbir-detect bulk-process [OPTIONS]
Options:
--data-dir PATH Directory containing input CSV files [default: ./data]
--output-dir PATH Output directory for results [default: ./output]
--chunk-size INTEGER Batch size for processing [default: 5000]
--export-format TEXT Format: jsonl|csv|both [default: both]
--verbose, -v Enable detailed progress logging
--quiet, -q Minimal output

run - Execute detection with explicit configuration
poetry run sbir-detect run [OPTIONS]
Options:
--config PATH Path to YAML configuration file
--output PATH Output directory or file for results (required)
--data-dir PATH Directory with input data [default: ./data]
--verbose, -v Enable verbose logging

validate-config - Validate configuration file
poetry run sbir-detect validate-config [OPTIONS]
Options:
--config PATH Configuration file to validate (required)
--verbose, -v Show detailed validation results

reset-config - Generate configuration from template
poetry run sbir-detect reset-config [OPTIONS]
Options:
--output PATH Output path for config file (required)
--template TEXT Template: default|high-precision|broad-discovery

list-templates - Show available configuration templates
poetry run sbir-detect list-templates

show-template - Display template content
poetry run sbir-detect show-template --template default

data load-sbir - Load SBIR award data
poetry run sbir-detect data load-sbir [OPTIONS]
Options:
--file-path PATH Path to SBIR CSV file (required)
--chunk-size INTEGER Records per batch [default: 5000]
--verbose, -v Enable verbose logging

data load-contracts - Load federal contract data
poetry run sbir-detect data load-contracts [OPTIONS]
Options:
--file-path PATH Path to contracts CSV file (required)
--chunk-size INTEGER Records per batch [default: 50000]
--verbose, -v Enable verbose logging

export jsonl - Export detections as JSONL
poetry run sbir-detect export jsonl [OPTIONS]
Options:
--output-path PATH Output file path (required)
--verbose, -v Enable verbose logging

export csv - Export detection summary as CSV
poetry run sbir-detect export csv [OPTIONS]
Options:
--output-path PATH Output file path (required)
--verbose, -v Enable verbose logging

export excel - Export as Excel with multiple sheets
poetry run sbir-detect export excel [OPTIONS]
Options:
--output-path PATH Output file path (required)
--verbose, -v Enable verbose logging

reports summary - Generate summary report
poetry run sbir-detect reports summary [OPTIONS]
Options:
--results-dir PATH Results directory (required)
--output PATH Output file (prints to stdout if omitted)
--format TEXT Format: text|markdown|json [default: text]
--include-details Include detailed analysis

reports stats - Show detection statistics
poetry run sbir-detect reports stats [OPTIONS]
Options:
--json, -j Output as JSON

reports perspectives - Analyze transitions from multiple perspectives
poetry run sbir-detect reports perspectives [OPTIONS]
Options:
--output PATH Output file path
--format TEXT Format: text|markdown|json [default: text]

info - Display system and configuration information
poetry run sbir-detect info

hygiene check-dates - Validate data quality and detect anomalies
poetry run sbir-detect hygiene check-dates [OPTIONS]
Options:
--data-dir PATH Data directory to check
--output PATH Output report path

version - Show version information
poetry run sbir-detect version

--help - Show all available commands
poetry run sbir-detect --help
poetry run sbir-detect data --help
poetry run sbir-detect export --help
poetry run sbir-detect reports --help

# Install dependencies
poetry install
# Enter virtual environment
poetry shell
# Run CLI commands
sbir-detect --help

# Unit tests only
poetry run pytest tests/unit/ -v
# Unit tests with coverage
poetry run pytest tests/unit/ -v --tb=short --cov=sbir_transition_classifier --cov-report=term-missing
# Integration tests
poetry run pytest tests/integration/ -v
# Full test suite
poetry run pytest
# Run specific test
poetry run pytest -k test_name -v

# 1. Create sample config
poetry run sbir-detect reset-config --output config/dev.yaml --template default
# 2. Validate config
poetry run sbir-detect validate-config --config config/dev.yaml --verbose
# 3. Load sample data
poetry run sbir-detect data load-sbir --file-path data/sample_awards.csv --verbose
poetry run sbir-detect data load-contracts --file-path data/sample_contracts.csv --verbose
# 4. Run detection
poetry run sbir-detect run --config config/dev.yaml --output output/dev_results/ --verbose
# 5. Export and review results
poetry run sbir-detect export jsonl --output-path output/dev_results.jsonl --verbose
poetry run sbir-detect reports summary --results-dir output/ --format markdown --include-details

- Target: Backtest a full fiscal year in < 8 hours
- Processing Rate: 1000+ records/minute on modern hardware
- Scale: Processes 10-100GB yearly data files
- Memory Efficiency: Streaming processing with configurable chunk sizes
- Database: SQLite with indexed queries for fast lookups
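The memory efficiency comes from never materializing a whole file: records are read and inserted in fixed-size chunks. A sketch of the pattern with pandas (the real ingesters live in `ingestion/`; the table and column names here are assumptions):

```python
import sqlite3
import pandas as pd

def load_contracts(csv_path, conn: sqlite3.Connection, chunk_size: int = 50_000) -> int:
    """Stream a large contracts CSV into SQLite in fixed-size chunks.

    Only one chunk of rows is held in memory at a time, so the same code
    handles a 1 MB sample file and a 100 GB yearly archive.
    """
    total = 0
    for chunk in pd.read_csv(csv_path, chunksize=chunk_size):
        chunk.to_sql("contracts", conn, if_exists="append", index=False)
        total += len(chunk)
    return total
```

Tuning `chunk_size` trades memory for per-batch overhead, which is why the CLI exposes `--chunk-size` on the loading commands.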
The project uses three data directories for different purposes:
- data/ - Production data files (excluded from git due to size)
- output/ - Generated reports, exports, and logs
- test_data/ - Test fixtures and mock data for unit/integration tests
- File: data/awards.csv
- Source: SBIR.gov database export
- Columns: award_piid, phase, agency, award_date, completion_date, topic, vendor_name, etc.

- File: data/contracts.csv
- Source: https://www.usaspending.gov/download_center/award_data_archive
- Columns: piid, agency, start_date, vendor_name, naics_code, psc_code, etc.
Both CSV files must contain headers and use UTF-8 encoding. The system is tolerant of minor format variations but requires:
- SBIR Data: PIID, phase, agency, completion date
- Contract Data: PIID, agency, start date, vendor name
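A lightweight pre-flight check for the required columns can be done with the stdlib `csv` module before committing to a full load (a sketch; the required-column names are taken from the lists above, and the real validation lives in the ingestion layer):

```python
import csv

# Minimum columns per file kind, per the data requirements above.
REQUIRED = {
    "sbir": {"award_piid", "phase", "agency", "completion_date"},
    "contracts": {"piid", "agency", "start_date", "vendor_name"},
}

def missing_columns(csv_file, kind: str) -> set:
    """Return required columns absent from the CSV header (empty set means OK)."""
    header = {col.strip().lower() for col in next(csv.reader(csv_file))}
    return REQUIRED[kind] - header
```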
Contact the project maintainer for access to awards.csv.
- Visit https://www.usaspending.gov/download_center/award_data_archive
- Select desired fiscal year(s)
- Download contract data (CSV format)
- Place in the data/ directory
# Load SBIR data
poetry run sbir-detect data load-sbir \
--file-path data/awards.csv \
--chunk-size 5000 \
--verbose
# Load contract data (may be large)
poetry run sbir-detect data load-contracts \
--file-path data/contracts.csv \
--chunk-size 50000 \
--verbose

If no configuration is specified, the system uses built-in defaults. To customize:
# Generate default config
poetry run sbir-detect reset-config --output config/custom.yaml
# Edit config/custom.yaml with your parameters
# Run with custom config
poetry run sbir-detect run --config config/custom.yaml --output output/results/

- default: Balanced detection approach
- high-precision: Higher thresholds, fewer false positives
- broad-discovery: Lower thresholds, more detections
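For example, relative to the default sample config, a high-precision template would be expected to raise the detection thresholds (the exact values below are illustrative, not the template's actual contents):

```yaml
detection:
  thresholds:
    high_confidence: 0.92    # raised from the default 0.85
    likely_transition: 0.80  # raised from the default 0.65
```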
# Show template
poetry run sbir-detect show-template --template high-precision
# Create config from template
poetry run sbir-detect reset-config --output config/strict.yaml --template high-precision

- Follow existing CLI patterns and command structure
- Use Rich progress indicators for long-running operations
- Include comprehensive logging and error handling
- Add tests for new functionality (see AGENTS.md for testing guidelines)
- Update documentation and help text
- Ensure Poetry builds succeed and all tests pass
- See AGENTS.md for detailed contribution guidelines
This project is licensed under the MIT License - see the LICENSE file for details.
- AGENTS.md - Complete development and contribution guidelines
- Configuration: See the config/ directory for example YAML files
- Testing: See the tests/ directory for unit and integration tests
# Initial setup
poetry install
# Full pipeline (recommended)
poetry run sbir-detect bulk-process --verbose
# Manual steps
poetry run sbir-detect data load-sbir --file-path data/awards.csv --verbose
poetry run sbir-detect data load-contracts --file-path data/contracts.csv --verbose
poetry run sbir-detect run --output output/results/ --verbose
# Export results
poetry run sbir-detect export jsonl --output-path output/detections.jsonl --verbose
poetry run sbir-detect export csv --output-path output/summary.csv
# View results
poetry run sbir-detect reports summary --results-dir output/ --format markdown

- data/ - Input CSV files
- output/ - Generated reports and exports
- src/sbir_transition_classifier/cli/ - Command implementations
- src/sbir_transition_classifier/detection/ - Detection logic
- tests/ - Unit and integration tests
# All commands
poetry run sbir-detect --help
# Command group help
poetry run sbir-detect data --help
poetry run sbir-detect export --help
poetry run sbir-detect reports --help
# System info
poetry run sbir-detect info