A modern, healthcare-focused data quality and validation library
EMRValidator is a Python library designed as a cleaner, faster, and more intuitive alternative to Great Expectations, with specialized features for Electronic Medical Records (EMR) and healthcare data validation.
- π₯ Healthcare-Specific Validations: Built-in validators for MRN, ICD codes, and other healthcare data
- π― Simple, Intuitive API: Fluent interface for chaining validations
- π Automated Data Profiling: Quick quality assessment with actionable recommendations
- π Beautiful Reports: Generate professional HTML and JSON reports
- β‘ High Performance: 5-7x faster than Great Expectations
- π§ Extensible: Easy to add custom validations and rules
- π¨ Multiple APIs: Choose between fluent, expectation-based, or rule-set patterns
- π¦ Minimal Dependencies: Only pandas and numpy required
- Data quality rules for EMR, claims, or clinical datasets
- Fast validation for millions of rows
- Healthcare-specific formats (ICD, MRN, CPT, NDC)
- Validation in ETL, Airflow, dbt, or LLM pipelines
- You need schema evolution tracking across multiple batches
pip install emrvalidatorFor Excel support:
pip install emrvalidator[excel]For development:
pip install emrvalidator[dev]from emrvalidator import DataValidator
import pandas as pd
# Load your data
df = pd.read_csv('patient_data.csv')
# Create validator and run validations
validator = DataValidator("Patient Data Quality Check")
validator.load_data(df)
# Chain validation rules
(validator
.expect_column_exists('mrn')
.expect_column_not_null('patient_id', threshold=0.99)
.expect_column_values_between('age', 0, 120)
.expect_mrn_format('mrn')
.expect_icd_format('diagnosis_code', version=10)
)
# Check results
if validator.is_valid():
print("β All validations passed!")
else:
print("Issues found:")
for fail in validator.get_failed_validations():
print(f" - {fail['message']}")| Feature | Great Expectations | EMRValidator | Advantage |
|---|---|---|---|
| Setup Complexity | High (2.3s) | Minimal (0.1s) | 23x faster |
| Code Volume | 45 lines | 12 lines | 73% less code |
| Performance | Baseline | 5-7x faster | 500-700% faster |
| Healthcare Focus | None | Built-in | MRN, ICD validation |
| Dependencies | 40+ packages | 2 packages | 95% fewer |
| Learning Curve | 4-8 hours | 15 minutes | 20x faster |
| Data Profiling | External tool | Built-in | Included |
See detailed comparison documentation.
# Column existence
validator.expect_column_exists('column_name')
# Null checks
validator.expect_column_not_null('age', threshold=0.95)
# Value ranges
validator.expect_column_values_between('age', 0, 120, threshold=0.98)
# Set membership
validator.expect_column_values_in_set('gender', {'M', 'F', 'Other'})
# Uniqueness
validator.expect_column_values_unique('patient_id')
# Date format
validator.expect_column_date_format('admission_date', date_format='%Y-%m-%d')# Medical Record Numbers
validator.expect_mrn_format('mrn', threshold=0.99)
# ICD Codes
validator.expect_icd_format('diagnosis_code', version=10) # ICD-10
validator.expect_icd_format('diagnosis_code', version=9) # ICD-9
# Pre-built healthcare rule sets
from emrvalidator import HealthcareRuleSets
demo_rules = HealthcareRuleSets.patient_demographics()
fin_rules = HealthcareRuleSets.financial_data()from emrvalidator import DataProfiler
profiler = DataProfiler(df, "Healthcare Dataset")
profile = profiler.generate_profile()
# Print summary
profiler.print_summary()
# Get quality score
quality_score = profile['quality_summary']['quality_score']
print(f"Quality Score: {quality_score}/100")
# Get recommendations
for rec in profile['recommendations']:
print(f" - {rec}")from emrvalidator import HTMLReporter, JSONReporter
# Generate HTML report
html_reporter = HTMLReporter(validator.get_results())
html_reporter.generate('quality_report.html', title='Data Quality Report')
# Generate JSON report
json_reporter = JSONReporter(validator.get_results())
json_reporter.generate('quality_report.json', pretty=True)def validate_charge_payment(df, **kwargs):
"""Custom validation: charges must be >= payments"""
valid_mask = df['charge_amount'] >= df['payment_amount']
valid_pct = valid_mask.sum() / len(df)
passed = valid_pct > 0.95
message = f"{valid_pct*100:.2f}% have valid charge/payment relationship"
details = {
"valid_percentage": round(valid_pct * 100, 2),
"invalid_count": int((~valid_mask).sum())
}
return passed, message, details
validator.expect_custom("charge_payment_logic", validate_charge_payment)from emrvalidator import RuleSet
# Create custom rule set
financial_rules = RuleSet("Financial Validations")
def validate_positive_charges(df, **kwargs):
valid = (df['charge_amount'] > 0).sum() / len(df)
passed = valid > 0.98
return passed, f"Positive charges: {valid*100:.1f}%", {}
financial_rules.create_rule(
"positive_charges",
"All charges must be positive",
validate_positive_charges
)
# Apply to any dataset
results = financial_rules.execute_all(df)from emrvalidator import Expectation, ExpectationSuite
suite = ExpectationSuite("Data Quality Expectations")
(suite
.expect("mrn_exists", Expectation.column_to_exist('mrn'))
.expect("mrn_not_null", Expectation.column_values_to_not_be_null('mrn'))
.expect("valid_gender", Expectation.column_values_to_be_in_set('gender', {'M', 'F'}))
.expect("unique_patients", Expectation.column_values_to_be_unique('patient_id'))
)
results = suite.validate(df)- Patient demographics validation
- Claims data quality checks
- Clinical data validation
- Revenue cycle management
- Denial management analysis
- ETL pipeline validation
- Data warehouse quality checks
- Real-time data validation
- Data migration validation
- Report data quality
- Dashboard data validation
- KPI data integrity
- Automated quality monitoring
from emrvalidator import DataValidator, DataProfiler, HTMLReporter
import pandas as pd
# 1. Load data
df = pd.read_csv('patient_encounters.csv')
# 2. Profile data
profiler = DataProfiler(df, "Encounter Data")
profile = profiler.generate_profile()
print(f"Quality Score: {profile['quality_summary']['quality_score']}/100")
# 3. Run validations
validator = DataValidator("Encounter Validation")
validator.load_data(df)
(validator
.expect_column_exists('mrn')
.expect_column_exists('encounter_id')
.expect_column_not_null('admission_date', threshold=1.0)
.expect_column_not_null('discharge_date', threshold=1.0)
.expect_mrn_format('mrn')
.expect_icd_format('primary_diagnosis', version=10)
.expect_column_values_between('length_of_stay', 0, 365)
)
# 4. Generate report
results = validator.get_results()
HTMLReporter(results).generate('encounter_quality_report.html')
# 5. Check status
if validator.is_valid():
print("β Data quality check passed!")
else:
print(f"β οΈ {len(validator.get_failed_validations())} validations failed")emrvalidator/
βββ __init__.py # Package initialization
βββ validator.py # DataValidator class
βββ profiler.py # DataProfiler class
βββ reporters.py # Report generators
βββ rules.py # Rules and expectations
βββ py.typed # Type hints marker
examples/
βββ basic_usage.py # Comprehensive examples
βββ healthcare_specific.py
tests/
βββ test_validator.py
βββ test_profiler.py
βββ test_reporters.py
# Clone repository
git clone https://github.com/rohandesai007/EMRV.git
cd EMRV
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install in development mode
pip install -e ".[dev]"pytestpytest --cov=emrvalidator --cov-report=html# Format code
black emrvalidator tests
# Sort imports
isort emrvalidator tests
# Check with flake8
flake8 emrvalidator tests- Quick Start Guide
- API Reference
- Comparison with Great Expectations
- Healthcare Validations
- Custom Validations Guide
- Examples
Contributions are welcome! Please read our Contributing Guide for details on our code of conduct and the process for submitting pull requests.
- Fork the repository
- Create a feature branch (
git checkout -b feature/AmazingFeature) - Make your changes
- Add tests for your changes
- Run tests (
pytest) - Commit your changes (
git commit -m 'Add AmazingFeature') - Push to the branch (
git push origin feature/AmazingFeature) - Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
Rohan Desai
Dallas, Texas, USA
Email: rohan.acme@gmail.com
GitHub: https://github.com/rohan-desai
LinkedIn: https://www.linkedin.com/in/rohandesai07/
My Websites:
Open Source Python Packages:
- emrvalidator β Healthcare-focused data quality and validation library
- carelytics β Healthcare analytics Python package
Vaishnavi Gadve
Irving, Texas, USA
Email: vaishnavigadve143@gmail.com
GitHub: https://github.com/vaish2412
LinkedIn: https://www.linkedin.com/in/vaishnavi-gadve-4b577512a/
- Created by Healthcare Analytics Hub
- Inspired by the need for simpler, healthcare-focused data validation
- Built for the healthcare analytics community
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: rohan.acme@gmail.com
If you find EMRValidator useful, please consider giving it a star on GitHub!
- Additional healthcare-specific validators (CPT, NDC codes)
- FHIR data validation support
- Integration with popular ETL tools
- Cloud storage support (S3, Azure Blob)
- Real-time validation streaming
- Web UI for non-technical users
- Validation rule marketplace
If you use EMRValidator in your research or project, please cite:
@software{emrvalidator2025,
title = {EMRValidator: Healthcare-Focused Data Quality and Validation},
author = {Desai, Rohan and Gadve, Vaishnavi},
year = {2025},
url = {https://github.com/rohandesai007/EMRV}
}