hollomancer/sbir-analytics

SBIR ETL Pipeline

Analyze $50B+ in SBIR/STTR funding data: Track technology transitions, patent outcomes, and economic impact of federal R&D investments.

CI Python 3.11+ License: MIT

What This Does

  • πŸ” 533K+ SBIR awards from 1983-present across all federal agencies
  • πŸš€ 40K-80K technology transitions detected using 6 independent signals
  • πŸ“Š CET classification for Critical & Emerging Technology trend analysis
  • πŸ’° Economic impact analysis with ROI and federal tax receipt estimates
  • πŸ”— Patent ownership chains tracking SBIR-funded innovation outcomes

Prerequisites

  • Python 3.11+ (required)
  • Docker (optional, for local Neo4j database)
  • AWS credentials (optional, for cloud features and S3 data)

Quick Start

Local Development

Get started in 2 minutes:

git clone https://github.com/hollomancer/sbir-analytics
cd sbir-analytics
make install      # Install dependencies with uv
make dev          # Start Dagster UI
# Open http://localhost:3000

Next steps:

  1. Materialize raw_sbir_awards asset in Dagster UI
  2. Explore data in Neo4j Browser (http://localhost:7474)
  3. See Getting Started Guide for detailed walkthrough

Production Deployment

For production use, see Deployment Guide for:

  • GitHub Actions (orchestrates ETL pipelines via dagster job execute)
  • AWS Lambda (serverless, for scheduled data downloads)

Key Features

Pipeline Architecture

  • Five-stage ETL: Extract → Validate → Enrich → Transform → Load
  • Asset-based orchestration: Dagster with dependency management
  • Data quality gates: Comprehensive validation at each stage
  • Cloud-first design: AWS S3 + Neo4j Aura + GitHub Actions
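To make the stage ordering and quality gates concrete, here is a minimal sketch in plain Python. It is not the project's Dagster implementation: the `Stage` class, the gate functions, and the sample rows are all hypothetical, and the real pipeline defines its stages as Dagster assets.

```python
from dataclasses import dataclass
from typing import Callable

Record = dict[str, object]


class QualityGateError(Exception):
    """Raised when a stage's output fails its quality check."""


@dataclass
class Stage:
    name: str
    run: Callable[[list[Record]], list[Record]]
    # Hypothetical gate: returns True when the stage output is acceptable.
    gate: Callable[[list[Record]], bool] = lambda rows: True


def run_pipeline(stages: list[Stage], rows: list[Record]) -> list[Record]:
    """Run stages in order and stop at the first failed quality gate."""
    for stage in stages:
        rows = stage.run(rows)
        if not stage.gate(rows):
            raise QualityGateError(f"quality gate failed after {stage.name}")
    return rows


def extract(_: list[Record]) -> list[Record]:
    # Stand-in for reading raw award data; these rows are made up.
    return [
        {"award_id": "A-1", "company": " acme robotics ", "amount": "150000"},
        {"award_id": "A-2", "company": "Beta Labs", "amount": "not-a-number"},
    ]


def validate(rows: list[Record]) -> list[Record]:
    # Drop rows whose amount is not a plain non-negative integer string.
    return [r for r in rows if str(r["amount"]).isdigit()]


def enrich(rows: list[Record]) -> list[Record]:
    # Add a normalized company name alongside the original value.
    return [{**r, "company_normalized": str(r["company"]).strip().upper()} for r in rows]


def transform(rows: list[Record]) -> list[Record]:
    # Convert the amount from text to an integer.
    return [{**r, "amount": int(str(r["amount"]))} for r in rows]


loaded: list[Record] = []  # stand-in for the graph database


def load(rows: list[Record]) -> list[Record]:
    loaded.extend(rows)
    return rows


STAGES = [
    Stage("extract", extract, gate=lambda rows: len(rows) > 0),
    Stage("validate", validate, gate=lambda rows: len(rows) > 0),
    Stage("enrich", enrich),
    Stage("transform", transform,
          gate=lambda rows: all(isinstance(r["amount"], int) for r in rows)),
    Stage("load", load),
]

result = run_pipeline(STAGES, [])
```

Checking a gate immediately after each stage means a bad batch stops the run before it reaches the load step.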

Specialized Analysis Systems

| System | Purpose | Documentation |
|---|---|---|
| Transition Detection | Identify SBIR → federal contract transitions (≥85% precision) | docs/transition/ |
| CET Classification | ML-based technology area classification | docs/ml/ |
| PaECTER Embeddings | Patent-award similarity using semantic embeddings | docs/ml/paecter.md |
| Fiscal Returns | Economic impact & ROI analysis using StateIO | docs/fiscal/ |
| Patent Analysis | USPTO patent chains and tech transfer tracking | docs/schemas/patent-neo4j-schema.md |
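As an illustration of embedding-based patent-award similarity, the sketch below ranks patents against an award by cosine similarity. It assumes embeddings are already available as plain lists of floats. The vectors, IDs, and function names are hypothetical, and cosine similarity is a common choice here rather than a confirmed detail of this project's PaECTER integration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def rank_patents(
    award_vec: list[float], patent_vecs: dict[str, list[float]]
) -> list[tuple[str, float]]:
    """Return (patent_id, similarity) pairs, most similar first."""
    scored = [(pid, cosine_similarity(award_vec, vec)) for pid, vec in patent_vecs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy 3-dimensional vectors; real embeddings have far more dimensions.
award = [1.0, 0.0, 1.0]
patents = {
    "P-1": [1.0, 0.0, 1.0],
    "P-2": [0.0, 1.0, 0.0],
    "P-3": [1.0, 1.0, 0.0],
}
ranking = rank_patents(award, patents)
```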

Technology Stack

  • Orchestration: Dagster 1.7+ (asset-based pipeline), GitHub Actions
  • Database: Neo4j 5.x (graph database for relationships)
  • Processing: DuckDB 1.0+ (analytical queries), Pandas 2.2+
  • Configuration: Pydantic 2.8+ (type-safe YAML config)
  • Deployment: Docker, AWS Lambda, GitHub Actions

Documentation

| Topic | Description |
|---|---|
| Getting Started | Detailed setup guides for local, cloud, and ML workflows |
| Architecture | System design, patterns, and technical decisions |
| Deployment | Production deployment options and guides |
| Testing | Testing strategy, guides, and coverage |
| Schemas | Neo4j graph schema and data models |
| API Reference | Code documentation and API reference |

See Documentation Index for complete map.

Project Structure

sbir-analytics/
├── src/                    # Source code
│   ├── assets/            # Dagster asset definitions
│   ├── extractors/        # Data extraction (SBIR, USAspending, USPTO)
│   ├── enrichers/         # External enrichment and fuzzy matching
│   ├── transformers/      # Business logic and normalization
│   ├── loaders/           # Neo4j loading and relationship creation
│   └── ml/                # Machine learning (CET classification)
├── tests/                  # Unit, integration, and E2E tests
├── docs/                   # Documentation
├── config/                 # YAML configuration files
├── .kiro/                  # Kiro specifications
└── infrastructure/         # AWS CDK and deployment configs

See CONTRIBUTING.md for detailed breakdown.
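The fuzzy matching mentioned for enrichers/ can be pictured with a small standard-library sketch. The code in src/enrichers/ is not shown here and may work quite differently. The suffix list, the 0.85 cutoff, and the function names below are assumptions chosen for illustration.

```python
import difflib
import re
from typing import Optional

# Hypothetical list of legal suffixes to ignore when comparing company names.
LEGAL_SUFFIXES = {"inc", "llc", "corp", "co", "ltd"}


def normalize(name: str) -> str:
    """Lowercase, strip punctuation, and drop common legal suffixes."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    tokens = [t for t in cleaned.split() if t not in LEGAL_SUFFIXES]
    return " ".join(tokens)


def best_match(name: str, candidates: list[str], cutoff: float = 0.85) -> Optional[str]:
    """Return the candidate most similar to name, or None if nothing clears the cutoff."""
    by_normalized = {normalize(c): c for c in candidates}
    hits = difflib.get_close_matches(
        normalize(name), list(by_normalized), n=1, cutoff=cutoff
    )
    return by_normalized[hits[0]] if hits else None


known = ["ACME ROBOTICS LLC", "Beta Labs Corp"]
match = best_match("Acme Robotics, Inc.", known)
no_match = best_match("Gamma Systems", known)
```

Normalizing before comparing lets the similarity score reflect the company name itself, not differences in punctuation or legal suffix.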

Common Commands

# Development
make install              # Install dependencies
make dev                  # Start Dagster UI
make test                 # Run tests
make lint                 # Run linters

# Docker (alternative)
make docker-build         # Build Docker image
make docker-up-dev        # Start development stack
make docker-test          # Run tests in container

# Data operations
make transition-mvp-run   # Run transition detection
make cet-pipeline-dev     # Run CET classification

See Makefile for all available commands.

Configuration

Configuration uses YAML files with environment variable overrides:

# Override any config using SBIR_ETL__SECTION__KEY pattern
export SBIR_ETL__NEO4J__URI="neo4j+s://your-instance.databases.neo4j.io"
export SBIR_ETL__ENRICHMENT__BATCH_SIZE=200

See Configuration Guide for details.
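To show how a double-underscore name such as SBIR_ETL__ENRICHMENT__BATCH_SIZE could map onto a nested config, here is a minimal sketch using only the standard library. The project itself lists Pydantic for configuration, so this is not its implementation. The lowercase key convention, the type coercion rules, and the base config values are assumptions.

```python
import copy
import os
from typing import Any, Mapping

PREFIX = "SBIR_ETL__"


def _coerce(raw: str) -> Any:
    """Best-effort conversion of an environment string to bool, int, or float."""
    lowered = raw.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            continue
    return raw


def apply_env_overrides(
    config: dict[str, Any], environ: Mapping[str, str] = os.environ
) -> dict[str, Any]:
    """Return a copy of config with SBIR_ETL__SECTION__KEY variables applied."""
    merged = copy.deepcopy(config)
    for name, raw in environ.items():
        if not name.startswith(PREFIX):
            continue
        # "ENRICHMENT__BATCH_SIZE" -> ["enrichment", "batch_size"]
        path = [part.lower() for part in name[len(PREFIX):].split("__") if part]
        if not path:
            continue
        node = merged
        for key in path[:-1]:
            if not isinstance(node.get(key), dict):
                node[key] = {}
            node = node[key]
        node[path[-1]] = _coerce(raw)
    return merged


# Hypothetical values standing in for what the YAML files would provide.
base = {
    "neo4j": {"uri": "bolt://localhost:7687"},
    "enrichment": {"batch_size": 100},
}
env = {
    "SBIR_ETL__NEO4J__URI": "neo4j+s://your-instance.databases.neo4j.io",
    "SBIR_ETL__ENRICHMENT__BATCH_SIZE": "200",
    "UNRELATED": "ignored",
}
config = apply_env_overrides(base, env)
```

Passing the environment as a parameter keeps the function easy to test without touching the real process environment.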

Contributing

We welcome contributions! See CONTRIBUTING.md for:

  • Development setup and workflow
  • Code quality standards (black, ruff, mypy)
  • Testing requirements (≥80% coverage)
  • Pull request process

Testing

make test                 # Run all tests
make test-unit            # Unit tests only
make test-integration     # Integration tests
make test-e2e             # End-to-end tests

See Testing Guide for details.

License

This project is licensed under the MIT License. Copyright (c) 2025 Conrad Hollomon.

Acknowledgments

This project makes use of and is grateful for the following open-source tools and research:

  • StateIO - State-level economic input-output modeling framework by USEPA
  • Bayesian Mixture-of-Experts - Research on calibration and uncertainty estimation by Albus Yizhuo Li
  • PaECTER - Patent similarity model by Max Planck Institute
  • @SquadronConsult - Help with SAM.gov data integration
