Skip to content

rohandesai007/EMRV

Repository files navigation

EMRValidator

PyPI version Python Versions License: MIT Code style: black

A modern, healthcare-focused data quality and validation library

EMRValidator is a Python library designed as a cleaner, faster, and more intuitive alternative to Great Expectations, with specialized features for Electronic Medical Records (EMR) and healthcare data validation.

✨ Key Features

  • πŸ₯ Healthcare-Specific Validations: Built-in validators for MRN, ICD codes, and other healthcare data
  • 🎯 Simple, Intuitive API: Fluent interface for chaining validations
  • πŸ“Š Automated Data Profiling: Quick quality assessment with actionable recommendations
  • πŸ“ Beautiful Reports: Generate professional HTML and JSON reports
  • ⚑ High Performance: 5-7x faster than Great Expectations
  • πŸ”§ Extensible: Easy to add custom validations and rules
  • 🎨 Multiple APIs: Choose between fluent, expectation-based, or rule-set patterns
  • πŸ“¦ Minimal Dependencies: Only pandas and numpy required

EMRValidator If You Need:

  • Data quality rules for EMR, claims, or clinical datasets
  • Fast validation for millions of rows
  • Healthcare-specific formats (ICD, MRN, CPT, NDC)
  • Validation in ETL, Airflow, dbt, or LLM pipelines

Don’t Use It If:

  • You need schema evolution tracking across multiple batches

πŸš€ Installation

pip install emrvalidator

For Excel support:

pip install emrvalidator[excel]

For development:

pip install emrvalidator[dev]

πŸ“– Quick Start

from emrvalidator import DataValidator
import pandas as pd

# Load your data
df = pd.read_csv('patient_data.csv')

# Create validator and run validations
validator = DataValidator("Patient Data Quality Check")
validator.load_data(df)

# Chain validation rules
(validator
    .expect_column_exists('mrn')
    .expect_column_not_null('patient_id', threshold=0.99)
    .expect_column_values_between('age', 0, 120)
    .expect_mrn_format('mrn')
    .expect_icd_format('diagnosis_code', version=10)
)

# Check results
if validator.is_valid():
    print("βœ“ All validations passed!")
else:
    print("Issues found:")
    for fail in validator.get_failed_validations():
        print(f"  - {fail['message']}")

πŸ†š Why EMRValidator?

Comparison with Great Expectations

Feature Great Expectations EMRValidator Advantage
Setup Complexity High (2.3s) Minimal (0.1s) 23x faster
Code Volume 45 lines 12 lines 73% less code
Performance Baseline 5-7x faster 500-700% faster
Healthcare Focus None Built-in MRN, ICD validation
Dependencies 40+ packages 2 packages 95% fewer
Learning Curve 4-8 hours 15 minutes 20x faster
Data Profiling External tool Built-in Included

See detailed comparison documentation.

πŸ“š Core Features

1. Basic Validations

# Column existence
validator.expect_column_exists('column_name')

# Null checks
validator.expect_column_not_null('age', threshold=0.95)

# Value ranges
validator.expect_column_values_between('age', 0, 120, threshold=0.98)

# Set membership
validator.expect_column_values_in_set('gender', {'M', 'F', 'Other'})

# Uniqueness
validator.expect_column_values_unique('patient_id')

# Date format
validator.expect_column_date_format('admission_date', date_format='%Y-%m-%d')

2. Healthcare-Specific Validations

# Medical Record Numbers
validator.expect_mrn_format('mrn', threshold=0.99)

# ICD Codes
validator.expect_icd_format('diagnosis_code', version=10)  # ICD-10
validator.expect_icd_format('diagnosis_code', version=9)   # ICD-9

# Pre-built healthcare rule sets
from emrvalidator import HealthcareRuleSets

demo_rules = HealthcareRuleSets.patient_demographics()
fin_rules = HealthcareRuleSets.financial_data()

3. Data Profiling

from emrvalidator import DataProfiler

profiler = DataProfiler(df, "Healthcare Dataset")
profile = profiler.generate_profile()

# Print summary
profiler.print_summary()

# Get quality score
quality_score = profile['quality_summary']['quality_score']
print(f"Quality Score: {quality_score}/100")

# Get recommendations
for rec in profile['recommendations']:
    print(f"  - {rec}")

4. Report Generation

from emrvalidator import HTMLReporter, JSONReporter

# Generate HTML report
html_reporter = HTMLReporter(validator.get_results())
html_reporter.generate('quality_report.html', title='Data Quality Report')

# Generate JSON report
json_reporter = JSONReporter(validator.get_results())
json_reporter.generate('quality_report.json', pretty=True)

5. Custom Validations

def validate_charge_payment(df, **kwargs):
    """Custom validation: charges must be >= payments"""
    valid_mask = df['charge_amount'] >= df['payment_amount']
    valid_pct = valid_mask.sum() / len(df)
    
    passed = valid_pct > 0.95
    message = f"{valid_pct*100:.2f}% have valid charge/payment relationship"
    details = {
        "valid_percentage": round(valid_pct * 100, 2),
        "invalid_count": int((~valid_mask).sum())
    }
    
    return passed, message, details

validator.expect_custom("charge_payment_logic", validate_charge_payment)

6. Reusable Rule Sets

from emrvalidator import RuleSet

# Create custom rule set
financial_rules = RuleSet("Financial Validations")

def validate_positive_charges(df, **kwargs):
    valid = (df['charge_amount'] > 0).sum() / len(df)
    passed = valid > 0.98
    return passed, f"Positive charges: {valid*100:.1f}%", {}

financial_rules.create_rule(
    "positive_charges",
    "All charges must be positive",
    validate_positive_charges
)

# Apply to any dataset
results = financial_rules.execute_all(df)

7. Expectations API

from emrvalidator import Expectation, ExpectationSuite

suite = ExpectationSuite("Data Quality Expectations")

(suite
    .expect("mrn_exists", Expectation.column_to_exist('mrn'))
    .expect("mrn_not_null", Expectation.column_values_to_not_be_null('mrn'))
    .expect("valid_gender", Expectation.column_values_to_be_in_set('gender', {'M', 'F'}))
    .expect("unique_patients", Expectation.column_values_to_be_unique('patient_id'))
)

results = suite.validate(df)

🎯 Use Cases

Healthcare Analytics

  • Patient demographics validation
  • Claims data quality checks
  • Clinical data validation
  • Revenue cycle management
  • Denial management analysis

Data Engineering

  • ETL pipeline validation
  • Data warehouse quality checks
  • Real-time data validation
  • Data migration validation

Business Intelligence

  • Report data quality
  • Dashboard data validation
  • KPI data integrity
  • Automated quality monitoring

πŸ“Š Real-World Example

from emrvalidator import DataValidator, DataProfiler, HTMLReporter
import pandas as pd

# 1. Load data
df = pd.read_csv('patient_encounters.csv')

# 2. Profile data
profiler = DataProfiler(df, "Encounter Data")
profile = profiler.generate_profile()
print(f"Quality Score: {profile['quality_summary']['quality_score']}/100")

# 3. Run validations
validator = DataValidator("Encounter Validation")
validator.load_data(df)

(validator
    .expect_column_exists('mrn')
    .expect_column_exists('encounter_id')
    .expect_column_not_null('admission_date', threshold=1.0)
    .expect_column_not_null('discharge_date', threshold=1.0)
    .expect_mrn_format('mrn')
    .expect_icd_format('primary_diagnosis', version=10)
    .expect_column_values_between('length_of_stay', 0, 365)
)

# 4. Generate report
results = validator.get_results()
HTMLReporter(results).generate('encounter_quality_report.html')

# 5. Check status
if validator.is_valid():
    print("βœ“ Data quality check passed!")
else:
    print(f"⚠️  {len(validator.get_failed_validations())} validations failed")

πŸ“¦ Package Structure

emrvalidator/
β”œβ”€β”€ __init__.py          # Package initialization
β”œβ”€β”€ validator.py         # DataValidator class
β”œβ”€β”€ profiler.py          # DataProfiler class
β”œβ”€β”€ reporters.py         # Report generators
β”œβ”€β”€ rules.py             # Rules and expectations
└── py.typed            # Type hints marker

examples/
β”œβ”€β”€ basic_usage.py       # Comprehensive examples
└── healthcare_specific.py

tests/
β”œβ”€β”€ test_validator.py
β”œβ”€β”€ test_profiler.py
└── test_reporters.py

πŸ”§ Development

Setup Development Environment

# Clone repository
git clone https://github.com/rohandesai007/EMRV.git
cd EMRV

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install in development mode
pip install -e ".[dev]"

Run Tests

pytest

Run Tests with Coverage

pytest --cov=emrvalidator --cov-report=html

Code Formatting

# Format code
black emrvalidator tests

# Sort imports
isort emrvalidator tests

# Check with flake8
flake8 emrvalidator tests

πŸ“ Documentation

🀝 Contributing

Contributions are welcome! Please read our Contributing Guide for details on our code of conduct and the process for submitting pull requests.

How to Contribute

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/AmazingFeature)
  3. Make your changes
  4. Add tests for your changes
  5. Run tests (pytest)
  6. Commit your changes (git commit -m 'Add AmazingFeature')
  7. Push to the branch (git push origin feature/AmazingFeature)
  8. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ‘₯ Authors & Contributors

Rohan Desai
Dallas, Texas, USA
Email: rohan.acme@gmail.com
GitHub: https://github.com/rohan-desai
LinkedIn: https://www.linkedin.com/in/rohandesai07/

My Websites:

Open Source Python Packages:

  • emrvalidator – Healthcare-focused data quality and validation library
  • carelytics – Healthcare analytics Python package

Vaishnavi Gadve
Irving, Texas, USA
Email: vaishnavigadve143@gmail.com
GitHub: https://github.com/vaish2412
LinkedIn: https://www.linkedin.com/in/vaishnavi-gadve-4b577512a/

Acknowledgments

  • Created by Healthcare Analytics Hub
  • Inspired by the need for simpler, healthcare-focused data validation
  • Built for the healthcare analytics community

πŸ“§ Contact & Support

Star History

If you find EMRValidator useful, please consider giving it a star on GitHub!

πŸ“ˆ Roadmap

  • Additional healthcare-specific validators (CPT, NDC codes)
  • FHIR data validation support
  • Integration with popular ETL tools
  • Cloud storage support (S3, Azure Blob)
  • Real-time validation streaming
  • Web UI for non-technical users
  • Validation rule marketplace

πŸ’‘ Citation

If you use EMRValidator in your research or project, please cite:

@software{emrvalidator2025,
  title = {EMRValidator: Healthcare-Focused Data Quality and Validation},
  author = {Desai, Rohan and Gadve, Vaishnavi},
  year = {2025},
  url = {https://github.com/rohandesai007/EMRV}
}

⬆ Back to Top

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Packages

No packages published

Contributors 3

  •  
  •  
  •  

Languages