EMRValidator

A modern, healthcare-focused data quality and validation library

EMRValidator is a Python library designed as a cleaner, faster, and more intuitive alternative to Great Expectations, with specialized features for Electronic Medical Records (EMR) and healthcare data validation.

✨ Key Features

🏥 Healthcare-Specific Validations: Built-in validators for MRN, ICD codes, and other healthcare data
🎯 Simple, Intuitive API: Fluent interface for chaining validations
📊 Automated Data Profiling: Quick quality assessment with actionable recommendations
📝 Beautiful Reports: Generate professional HTML and JSON reports
⚡ High Performance: 5-7x faster than Great Expectations
🔧 Extensible: Easy to add custom validations and rules
🎨 Multiple APIs: Choose between fluent, expectation-based, or rule-set patterns
📦 Minimal Dependencies: Only pandas and numpy required

Use EMRValidator If You Need:

Data quality rules for EMR, claims, or clinical datasets
Fast validation for millions of rows
Healthcare-specific formats (MRN and ICD today; CPT and NDC are on the roadmap)
Validation in ETL, Airflow, dbt, or LLM pipelines

Don’t Use It If:

You need schema evolution tracking across multiple batches

🚀 Installation

pip install emrvalidator

For Excel support:

pip install emrvalidator[excel]

For development:

pip install emrvalidator[dev]

📖 Quick Start

from emrvalidator import DataValidator
import pandas as pd

# Load your data
df = pd.read_csv('patient_data.csv')

# Create validator and run validations
validator = DataValidator("Patient Data Quality Check")
validator.load_data(df)

# Chain validation rules
(validator
    .expect_column_exists('mrn')
    .expect_column_not_null('patient_id', threshold=0.99)
    .expect_column_values_between('age', 0, 120)
    .expect_mrn_format('mrn')
    .expect_icd_format('diagnosis_code', version=10)
)

# Check results
if validator.is_valid():
    print("✓ All validations passed!")
else:
    print("Issues found:")
    for fail in validator.get_failed_validations():
        print(f"  - {fail['message']}")

🆚 Why EMRValidator?

Comparison with Great Expectations

Feature	Great Expectations	EMRValidator	Advantage
Setup Complexity	High (2.3s)	Minimal (0.1s)	23x faster
Code Volume	45 lines	12 lines	73% less code
Performance	Baseline	5-7x faster	5-7x speedup
Healthcare Focus	None	Built-in	MRN, ICD validation
Dependencies	40+ packages	2 packages	95% fewer
Learning Curve	4-8 hours	15 minutes	16-32x shorter
Data Profiling	External tool	Built-in	Included

See the detailed comparison documentation.

📚 Core Features

1. Basic Validations

# Column existence
validator.expect_column_exists('column_name')

# Null checks
validator.expect_column_not_null('age', threshold=0.95)

# Value ranges
validator.expect_column_values_between('age', 0, 120, threshold=0.98)

# Set membership
validator.expect_column_values_in_set('gender', {'M', 'F', 'Other'})

# Uniqueness
validator.expect_column_values_unique('patient_id')

# Date format
validator.expect_column_date_format('admission_date', date_format='%Y-%m-%d')
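
The `threshold` argument used above is not defined in this README. A plausible reading, and the one assumed in the sketch below, is that it is the minimum fraction of rows that must satisfy the check for the validation to pass. The helper `passes_threshold` is hypothetical and is not part of EMRValidator; it uses only the standard library so the idea can be tried without installing anything.

```python
from typing import Callable, Iterable, Tuple


def passes_threshold(
    values: Iterable[object],
    predicate: Callable[[object], bool],
    threshold: float = 1.0,
) -> Tuple[bool, float]:
    """Return (passed, valid_fraction) for a row-level check.

    Assumption: `threshold` is the minimum fraction of values that must
    satisfy `predicate`. An empty input is treated as failing.
    """
    items = list(values)
    if not items:
        return False, 0.0
    valid_fraction = sum(1 for v in items if predicate(v)) / len(items)
    return valid_fraction >= threshold, valid_fraction


# Hypothetical data: 19 of 20 ages are present, so 95% are non-null.
ages = [34, 51, None, 27] + [40] * 16
passed, fraction = passes_threshold(ages, lambda v: v is not None, threshold=0.95)
print(passed, fraction)
```

Under this reading, `threshold=1.0` tolerates no failing rows, while `threshold=0.95` tolerates up to 5% of them.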

2. Healthcare-Specific Validations

# Medical Record Numbers
validator.expect_mrn_format('mrn', threshold=0.99)

# ICD Codes
validator.expect_icd_format('diagnosis_code', version=10)  # ICD-10
validator.expect_icd_format('diagnosis_code', version=9)   # ICD-9

# Pre-built healthcare rule sets
from emrvalidator import HealthcareRuleSets

demo_rules = HealthcareRuleSets.patient_demographics()
fin_rules = HealthcareRuleSets.financial_data()
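
This README does not spell out what `expect_icd_format` accepts. To give a sense of what a format-level check can and cannot catch, here is a minimal sketch of a shape test for ICD-10-CM codes. The pattern and the function `looks_like_icd10` are illustrative assumptions, not the library's actual rule, and the check only tests the shape of a code, not whether the code exists in the ICD-10-CM code set.

```python
import re

# Simplified ICD-10-CM shape: a letter, a digit, one alphanumeric character,
# then an optional decimal point followed by 1-4 alphanumeric characters.
# This is an illustrative pattern, not EMRValidator's actual rule.
ICD10_SHAPE = re.compile(r"^[A-Z][0-9][0-9A-Z](?:\.?[0-9A-Z]{1,4})?$")


def looks_like_icd10(code: str) -> bool:
    """Return True if `code` has the general shape of an ICD-10-CM code."""
    return bool(ICD10_SHAPE.match(code.strip().upper()))


# Print each sample with the result of the shape check.
for sample in ["E11.9", "S72.001A", "I10", "250.00", "E11."]:
    print(sample, looks_like_icd10(sample))
```

A shape check like this will reject an ICD-9 style value such as `250.00` in an ICD-10 column, but it will accept a well-formed code that is not actually assigned, so a lookup against a reference code list is still needed for stricter validation.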

3. Data Profiling

from emrvalidator import DataProfiler

profiler = DataProfiler(df, "Healthcare Dataset")
profile = profiler.generate_profile()

# Print summary
profiler.print_summary()

# Get quality score
quality_score = profile['quality_summary']['quality_score']
print(f"Quality Score: {quality_score}/100")

# Get recommendations
for rec in profile['recommendations']:
    print(f"  - {rec}")

4. Report Generation

from emrvalidator import HTMLReporter, JSONReporter

# Generate HTML report
html_reporter = HTMLReporter(validator.get_results())
html_reporter.generate('quality_report.html', title='Data Quality Report')

# Generate JSON report
json_reporter = JSONReporter(validator.get_results())
json_reporter.generate('quality_report.json', pretty=True)

5. Custom Validations

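def validate_charge_payment(df, **kwargs):
    """Custom validation: charges must be >= payments"""
    # Guard against an empty DataFrame to avoid a 0/0 result
    if len(df) == 0:
        return False, "No rows to validate", {"valid_percentage": 0.0, "invalid_count": 0}

    # Rows where the charge is at least as large as the payment
    valid_mask = df['charge_amount'] >= df['payment_amount']
    valid_pct = valid_mask.sum() / len(df)

    # Pass when more than 95% of rows satisfy the rule
    passed = valid_pct > 0.95
    message = f"{valid_pct*100:.2f}% have valid charge/payment relationship"
    details = {
        "valid_percentage": round(valid_pct * 100, 2),
        "invalid_count": int((~valid_mask).sum())
    }

    # Custom validations return a (passed, message, details) tuple
    return passed, message, details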

validator.expect_custom("charge_payment_logic", validate_charge_payment)

6. Reusable Rule Sets

from emrvalidator import RuleSet

# Create custom rule set
financial_rules = RuleSet("Financial Validations")

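def validate_positive_charges(df, **kwargs):
    # Guard against an empty DataFrame to avoid a 0/0 result
    if len(df) == 0:
        return False, "No rows to validate", {}
    # Fraction of rows with a strictly positive charge
    valid = (df['charge_amount'] > 0).sum() / len(df)
    passed = valid > 0.98
    return passed, f"Positive charges: {valid*100:.1f}%", {}

financial_rules.create_rule(
    "positive_charges",
    "More than 98% of charges must be positive",
    validate_positive_charges
)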

# Apply to any dataset
results = financial_rules.execute_all(df)
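
How `execute_all` collects its results is not shown in this README. As a rough mental model only, a rule set can be thought of as an ordered list of named callables that each return a `(passed, message, details)` tuple. `MiniRuleSet` below is a hypothetical stand-in written for illustration; it is not EMRValidator's `RuleSet`, its result dictionary keys are assumptions, and it works on a list of dicts rather than a DataFrame so that it needs no third-party packages.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, float]
RuleResult = Tuple[bool, str, dict]
RuleFunc = Callable[[List[Row]], RuleResult]


class MiniRuleSet:
    """Hypothetical stand-in for a rule set: named rules run in order."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._rules: List[Tuple[str, str, RuleFunc]] = []

    def create_rule(self, rule_name: str, description: str, func: RuleFunc) -> None:
        # Register a rule; nothing is executed until execute_all is called.
        self._rules.append((rule_name, description, func))

    def execute_all(self, rows: List[Row]) -> List[dict]:
        # Run every registered rule against the same rows and collect results.
        results = []
        for rule_name, description, func in self._rules:
            passed, message, details = func(rows)
            results.append({
                "rule": rule_name,
                "description": description,
                "passed": passed,
                "message": message,
                "details": details,
            })
        return results


def positive_charges(rows: List[Row]) -> RuleResult:
    # Empty input fails explicitly instead of dividing by zero.
    if not rows:
        return False, "No rows to validate", {}
    valid = sum(1 for r in rows if r["charge_amount"] > 0) / len(rows)
    return valid > 0.98, f"Positive charges: {valid * 100:.1f}%", {}


rules = MiniRuleSet("Financial Validations")
rules.create_rule("positive_charges", "More than 98% of charges are positive", positive_charges)
results = rules.execute_all([{"charge_amount": 120.0}, {"charge_amount": -5.0}])
print(results[0]["passed"], results[0]["message"])
```

Keeping rules as plain functions with a shared return contract is what makes a rule set reusable: the same rules can be run against any dataset that has the expected columns.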

7. Expectations API

from emrvalidator import Expectation, ExpectationSuite

suite = ExpectationSuite("Data Quality Expectations")

(suite
    .expect("mrn_exists", Expectation.column_to_exist('mrn'))
    .expect("mrn_not_null", Expectation.column_values_to_not_be_null('mrn'))
    .expect("valid_gender", Expectation.column_values_to_be_in_set('gender', {'M', 'F'}))
    .expect("unique_patients", Expectation.column_values_to_be_unique('patient_id'))
)

results = suite.validate(df)

🎯 Use Cases

Healthcare Analytics

Patient demographics validation
Claims data quality checks
Clinical data validation
Revenue cycle management
Denial management analysis

Data Engineering

ETL pipeline validation
Data warehouse quality checks
Real-time data validation
Data migration validation

Business Intelligence

Report data quality
Dashboard data validation
KPI data integrity
Automated quality monitoring

📊 Real-World Example

from emrvalidator import DataValidator, DataProfiler, HTMLReporter
import pandas as pd

# 1. Load data
df = pd.read_csv('patient_encounters.csv')

# 2. Profile data
profiler = DataProfiler(df, "Encounter Data")
profile = profiler.generate_profile()
print(f"Quality Score: {profile['quality_summary']['quality_score']}/100")

# 3. Run validations
validator = DataValidator("Encounter Validation")
validator.load_data(df)

(validator
    .expect_column_exists('mrn')
    .expect_column_exists('encounter_id')
    .expect_column_not_null('admission_date', threshold=1.0)
    .expect_column_not_null('discharge_date', threshold=1.0)
    .expect_mrn_format('mrn')
    .expect_icd_format('primary_diagnosis', version=10)
    .expect_column_values_between('length_of_stay', 0, 365)
)

# 4. Generate report
results = validator.get_results()
HTMLReporter(results).generate('encounter_quality_report.html')

# 5. Check status
if validator.is_valid():
    print("✓ Data quality check passed!")
else:
    print(f"⚠️  {len(validator.get_failed_validations())} validations failed")
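
In an ETL or scheduler task, printing a warning is usually not enough; the step should fail so downstream jobs do not run on bad data. The sketch below shows one way to turn failed validations into an exception. `DataQualityError` and `raise_on_failures` are hypothetical helpers, not part of EMRValidator. The only assumption about the library is that each failed validation is a dict with a `'message'` key, as in the Quick Start loop; the sample messages are made up.

```python
from typing import Dict, List


class DataQualityError(RuntimeError):
    """Raised when one or more validations fail (hypothetical helper)."""


def raise_on_failures(failed: List[Dict[str, str]], max_shown: int = 5) -> None:
    """Raise DataQualityError summarizing failed validations, if any.

    Assumption: each item has a 'message' key, as in the Quick Start loop.
    """
    if not failed:
        return
    shown = "; ".join(item["message"] for item in failed[:max_shown])
    extra = len(failed) - max_shown
    suffix = f" (+{extra} more)" if extra > 0 else ""
    raise DataQualityError(f"{len(failed)} validation(s) failed: {shown}{suffix}")


# Stand-in for validator.get_failed_validations(); the messages are made up.
failed_validations = [
    {"message": "Column 'discharge_date' has null values"},
    {"message": "Column 'mrn' has values in an unexpected format"},
]

try:
    raise_on_failures(failed_validations)
except DataQualityError as exc:
    print(exc)
```

In a real task the `try`/`except` would be dropped so the exception propagates and the scheduler marks the step as failed.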

📦 Package Structure

emrvalidator/
├── __init__.py          # Package initialization
├── validator.py         # DataValidator class
├── profiler.py          # DataProfiler class
├── reporters.py         # Report generators
├── rules.py             # Rules and expectations
└── py.typed            # Type hints marker

examples/
├── basic_usage.py       # Comprehensive examples
└── healthcare_specific.py

tests/
├── test_validator.py
├── test_profiler.py
└── test_reporters.py

🔧 Development

Setup Development Environment

# Clone repository
git clone https://github.com/rohandesai007/EMRV.git
cd EMRV

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install in development mode
pip install -e ".[dev]"

Run Tests

pytest

Run Tests with Coverage

pytest --cov=emrvalidator --cov-report=html

Code Formatting

# Format code
black emrvalidator tests

# Sort imports
isort emrvalidator tests

# Check with flake8
flake8 emrvalidator tests

🤝 Contributing

Contributions are welcome! Please read our Contributing Guide for details on our code of conduct and the process for submitting pull requests.

How to Contribute

Fork the repository
Create a feature branch (git checkout -b feature/AmazingFeature)
Make your changes
Add tests for your changes
Run tests (pytest)
Commit your changes (git commit -m 'Add AmazingFeature')
Push to the branch (git push origin feature/AmazingFeature)
Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

👥 Authors & Contributors

Rohan Desai
Dallas, Texas, USA
Email: rohan.acme@gmail.com
GitHub: https://github.com/rohan-desai
LinkedIn: https://www.linkedin.com/in/rohandesai07/

Open Source Python Packages:

emrvalidator – Healthcare-focused data quality and validation library
carelytics – Healthcare analytics Python package

Vaishnavi Gadve
Irving, Texas, USA
Email: vaishnavigadve143@gmail.com
GitHub: https://github.com/vaish2412
LinkedIn: https://www.linkedin.com/in/vaishnavi-gadve-4b577512a/

Acknowledgments

Created by Healthcare Analytics Hub
Inspired by the need for simpler, healthcare-focused data validation
Built for the healthcare analytics community

📧 Contact & Support

Issues: GitHub Issues
Discussions: GitHub Discussions
Email: rohan.acme@gmail.com

Star History

If you find EMRValidator useful, please consider giving it a star on GitHub!

📈 Roadmap

Additional healthcare-specific validators (CPT, NDC codes)
FHIR data validation support
Integration with popular ETL tools
Cloud storage support (S3, Azure Blob)
Real-time validation streaming
Web UI for non-technical users
Validation rule marketplace

💡 Citation

If you use EMRValidator in your research or project, please cite:

@software{emrvalidator2025,
  title = {EMRValidator: Healthcare-Focused Data Quality and Validation},
  author = {Desai, Rohan and Gadve, Vaishnavi},
  year = {2025},
  url = {https://github.com/rohandesai007/EMRV}
}
