
DataCheck — Data Validation Engine


DataCheck is a data quality validation engine for data engineers. Define validation rules in a YAML config and data sources in a separate sources file, then automatically validate data across files, databases, and cloud warehouses.

DataCheck provides the datacheck Command-Line Interface (CLI) and a Python API, which you can use to validate data, profile quality, and detect schema changes. These operations can be executed locally during development, embedded programmatically within your data pipelines (Airflow, Dagster, Prefect, etc.), or integrated into CI/CD workflows. View the Documentation for more details.

Highlights

  • Define validation rules in YAML config and data sources inline (files) or in a sources.yaml (databases, cloud)
  • Run checks on CSV, Parquet, Delta Lake, Avro, PostgreSQL, MySQL, Snowflake, BigQuery, Redshift, and more
  • Use 27+ built-in data quality rules for null checks, numeric ranges, patterns, timestamps, email/phone/URL validation, and cross-column checks
  • Profile data quality with automatic scoring, outlier detection, and rule suggestions
  • Detect schema evolution with compatibility levels (COMPATIBLE, WARNING, BREAKING)
  • Extend with custom rules using the @custom_rule plugin decorator

Demo

DataCheck Quickstart Demo
Install DataCheck, generate an ecommerce config with sample data, and run validation — all in one go.

Setup

Requirements

To use DataCheck, you must have the following installed on your system:

Python 3.10, 3.11, or 3.12

To check your existing version, use the CLI command: python --version or python3 --version.

Pip 21.0 or greater

To check your pip version: pip --version

Installation

DataCheck is available on public PyPI as datacheck-cli.

pip install datacheck-cli

To install with support for a specific data source, use extras:

pip install datacheck-cli[postgresql]    # PostgreSQL
pip install datacheck-cli[mysql]         # MySQL
pip install datacheck-cli[snowflake]     # Snowflake
pip install datacheck-cli[bigquery]      # BigQuery
pip install datacheck-cli[redshift]      # Redshift
pip install datacheck-cli[cloud]         # S3, GCS, Azure Blob
pip install datacheck-cli[all]           # All data sources

Quickstart

The examples below show minimal configurations. To see detailed logs, add --verbose or -v to any command.

Create a config

Use datacheck config init to generate a config from a template. Add --with-sample-data to also generate a sample CSV file so you can test validation immediately:

datacheck config init --with-sample-data
datacheck config init --template ecommerce --with-sample-data

Or create a .datacheck.yaml file manually. The config defines both the data source and the validation rules.

# .datacheck.yaml

data_source:
  type: csv
  path: ./data/orders.csv

checks:
  - name: id_check
    column: id
    rules:
      not_null: true
      unique: true

  - name: amount_check
    column: amount
    rules:
      not_null: true
      min: 0
      max: 10000

  - name: email_check
    column: email
    rules:
      email_valid: true

DataCheck auto-discovers config files in this order: .datacheck.yaml, .datacheck.yml, datacheck.yaml, datacheck.yml. To specify a config explicitly, use the --config flag.

Run validation

datacheck validate
datacheck validate --config checks.yaml

Parameter Required Description
-c, --config No Path to config file (auto-discovered if not specified)
--source No Named source from sources.yaml
--sources-file No Path to sources YAML file
-t, --table No Database table name (for database sources)
-w, --where No WHERE clause for filtering (for database sources)
-q, --query No Custom SQL query (alternative to --table)
-o, --output No Save results to a JSON file (terminal output is always shown)
--csv-export No Export failure details as CSV
--sample-rate No Random sample fraction (0.0–1.0)
--sample-count No Fixed sample size
--parallel No Enable multi-core execution
--verbose, -v No Enable detailed logging

Data Source Configuration

File-based data sources are defined inline under data_source in your config. For databases and cloud storage, define named sources in a separate sources.yaml file and reference them.

CSV / Parquet / Avro

data_source:
  type: csv
  path: ./data/orders.csv
  options:
    delimiter: ","
    encoding: utf-8

data_source:
  type: parquet
  path: ./data/orders.parquet

SQLite / DuckDB

data_source:
  type: sqlite
  path: ./data/analytics.db

Delta Lake

data_source:
  type: delta
  path: ./data/delta-table

Databases (PostgreSQL, Snowflake, BigQuery, etc.)

For database connections, use named sources in a sources.yaml file. The inline data_source config only supports file-based sources (csv, parquet, avro, delta, duckdb, sqlite).

# sources.yaml
sources:
  production_db:
    type: postgresql
    host: ${DB_HOST}
    port: ${DB_PORT:-5432}
    database: ${DB_NAME}
    user: ${DB_USER}
    password: ${DB_PASSWORD}

  snowflake_wh:
    type: snowflake
    account: ${SF_ACCOUNT}
    user: ${SF_USER}
    password: ${SF_PASSWORD}
    warehouse: ${SF_WAREHOUSE:-COMPUTE_WH}
    database: ${SF_DATABASE}
    schema: ${SF_SCHEMA:-PUBLIC}

  bigquery_ds:
    type: bigquery
    project_id: ${GCP_PROJECT}
    dataset_id: ${GCP_DATASET}
    credentials_path: /path/to/service-account.json
    location: US

  redshift_db:
    type: redshift
    host: ${REDSHIFT_HOST}
    port: ${REDSHIFT_PORT:-5439}
    database: ${REDSHIFT_DB}
    user: ${REDSHIFT_USER}
    password: ${REDSHIFT_PASSWORD}
    schema: public

Then reference the source in your config:

# .datacheck.yaml
sources_file: sources.yaml
source: production_db
table: orders

checks:
  - name: id_check
    column: id
    rules:
      not_null: true

Cloud Storage (S3, GCS, Azure)

Access cloud files via named sources in sources.yaml:

# sources.yaml
sources:
  s3_data:
    type: s3
    bucket: my-bucket
    path: data/orders.csv
    region: us-east-1
    access_key: ${AWS_ACCESS_KEY_ID}
    secret_key: ${AWS_SECRET_ACCESS_KEY}

datacheck validate --source s3_data --sources-file sources.yaml

Named Sources

Switch sources at runtime:

datacheck validate --source snowflake_wh --config checks.yaml

Individual checks can also override the default source:

sources_file: sources.yaml
source: production_db
table: customers

checks:
  - name: customer_email
    column: email
    rules:
      not_null: true

  - name: order_total
    column: total
    source: snowflake_wh      # Override source for this check
    table: orders
    rules:
      min: 0

Environment Variables

Config files support environment variable substitution:

# In sources.yaml
sources:
  production_db:
    type: postgresql
    host: ${DB_HOST}                    # Required variable
    port: ${DB_PORT:-5432}              # Variable with default value
    database: ${DB_NAME}
    user: ${DB_USER}
    password: ${DB_PASSWORD}

Use datacheck config env to list all variables referenced in a config and their current values.
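The ${VAR} and ${VAR:-default} forms follow shell-style substitution: a bare variable must be set, while the :- form falls back to the default when the variable is absent. A rough sketch of that expansion (illustrative only, not DataCheck's actual parser):

```python
import os
import re

_VAR = re.compile(r"\$\{(\w+)(?::-([^}]*))?\}")

def expand_env(text: str) -> str:
    """Expand ${VAR} and ${VAR:-default} references using os.environ."""
    def replace(match: re.Match) -> str:
        name, default = match.group(1), match.group(2)
        if name in os.environ:
            return os.environ[name]
        if default is not None:
            return default
        raise KeyError(f"required environment variable {name} is not set")
    return _VAR.sub(replace, text)
```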

Profile Data Quality

Generate a data quality profile with summary statistics, quality scores, and automatic rule suggestions. The data source can be provided directly, read from your config, or loaded from a named source.

# Direct file path
datacheck profile data.csv

# Auto-discover config (looks for .datacheck.yaml, datacheck.yaml, etc.)
datacheck profile

# Explicit config file
datacheck profile --config checks.yaml

# Named source from sources file
datacheck profile --source production_db --sources-file sources.yaml

Parameter Required Description
DATA_SOURCE No Data source: file path, connection string, or omit when using config/sources
-c, --config No Path to config file with data_source or sources_file defined
--source No Named source from sources.yaml
--sources-file No Path to sources YAML file
--outlier-method No Outlier detection method: zscore (default) or iqr
--format No Output format: terminal, json, or markdown
-o, --output No Write output to a file

datacheck profile                                              # Full profile
datacheck profile --format json -o profile.json                # Export as JSON
datacheck profile --source analytics_wh --sources-file sources.yaml  # Profile a named source
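The two outlier methods differ in robustness: zscore flags values far from the mean measured in standard deviations, while iqr flags values outside a multiple of the interquartile range, which is less sensitive to extreme points. A toy z-score version to illustrate the idea (the threshold and function name are assumptions, not DataCheck's implementation):

```python
from statistics import mean, stdev

def zscore_outliers(values: list[float], threshold: float = 3.0) -> list[float]:
    """Return values whose z-score magnitude exceeds `threshold`."""
    if len(values) < 2:
        return []
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []  # constant column: no outliers
    return [v for v in values if abs((v - mu) / sigma) > threshold]
```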

Detect Schema Changes

Capture a baseline schema and compare future data against it to detect column additions, removals, type changes, and nullable changes. The data source can be provided directly, read from your config, or loaded from a named source.

# Auto-discover config or use named source
datacheck schema capture                   # Save current schema as baseline
datacheck schema compare                   # Compare against baseline

# Direct file path
datacheck schema capture data.csv
datacheck schema compare data.csv

# Named source
datacheck schema capture --source production_db --sources-file sources.yaml

# Other schema commands
datacheck schema show                      # Display detected schema
datacheck schema list                      # List saved baselines
datacheck schema history                   # View capture history
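Conceptually, a schema comparison diffs the baseline's column-to-type mapping against the current one and reports the most severe change found. A simplified sketch of that classification (the levels match the three listed earlier; the mapping of change types to levels is an assumption here):

```python
def compare_schemas(baseline: dict[str, str], current: dict[str, str]) -> str:
    """Classify schema drift as COMPATIBLE, WARNING, or BREAKING.

    Assumed policy: removed columns or type changes are BREAKING,
    added columns are a WARNING, and no changes are COMPATIBLE.
    """
    removed = baseline.keys() - current.keys()
    added = current.keys() - baseline.keys()
    changed = [c for c in baseline.keys() & current.keys() if baseline[c] != current[c]]
    if removed or changed:
        return "BREAKING"
    if added:
        return "WARNING"
    return "COMPATIBLE"
```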

Python API

Use DataCheck programmatically within your pipelines:

from datacheck import ValidationEngine

engine = ValidationEngine(config_path=".datacheck.yaml")
summary = engine.validate()

print(f"Records: {summary.total_rows:,} rows, {summary.total_columns} columns")
print(f"Passed: {summary.passed_rules}/{summary.total_rules}")

for result in summary.get_failed_results():
    print(f"  FAIL: {result.rule_name} on {result.column} ({result.failed_rows} rows)")

CI/CD Integration

DataCheck uses standard exit codes for automation:

Code Meaning
0 All rules passed (or only warning/info severity failures)
1 Some error-severity rules failed
2 Configuration error
3 Data loading error
4 Unexpected error

Rules can have severity: error (default), severity: warning, or severity: info. Only error-severity failures cause exit code 1.
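In script terms, the validation exit code depends only on whether any error-severity rule failed, never on warnings or infos. A minimal sketch of that policy (the result shape is illustrative; codes 2-4 signal config, load, and unexpected errors raised before any results exist):

```python
def exit_code_for(failures: list[dict]) -> int:
    """Return 1 if any failure has error severity, else 0.

    `failures` is an illustrative list of {"rule": ..., "severity": ...}
    dicts standing in for DataCheck's failed-rule results.
    """
    return 1 if any(f["severity"] == "error" for f in failures) else 0
```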

# GitHub Actions
- name: Validate Data
  run: |
    pip install datacheck-cli
    datacheck validate --output results.json

Available Rules

Category Rules
Null & Uniqueness not_null, unique, unique_combination
Numeric min, max, mean_between, std_dev_less_than, percentile_range, z_score_outliers, distribution_type
String & Pattern regex, allowed_values, length, min_length, max_length, type
Temporal max_age, timestamp_range (or date_range), no_future_timestamps, date_format_valid (or date_format), business_days_only
Semantic email_valid, phone_valid, url_valid, json_valid
Cross-Column unique_combination, foreign_key_exists (Python API), sum_equals
Custom custom — user-defined functions via @custom_rule decorator

Custom Rules

Create a plugin file with custom validation functions using the @custom_rule decorator. The function receives a pd.Series and optional parameters, and returns a boolean pd.Series (True = valid).

# custom_rules.py
from datacheck.plugins.decorators import custom_rule
import pandas as pd

@custom_rule
def is_business_email(column: pd.Series, allowed_domains: list) -> pd.Series:
    domains = column.dropna().str.split("@").str[1]
    return domains.isin(allowed_domains)

Reference the plugin in your config:

plugins:
  - ./custom_rules.py

checks:
  - name: email_check
    column: email
    rules:
      custom:
        rule: is_business_email
        params:
          allowed_domains: ["company.com"]
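Decorator aside, a custom rule is a plain function of a pandas Series, so you can sanity-check its logic directly before wiring the plugin into a config:

```python
import pandas as pd

# Same logic as is_business_email above, minus the @custom_rule decorator.
def is_business_email(column: pd.Series, allowed_domains: list) -> pd.Series:
    domains = column.dropna().str.split("@").str[1]
    return domains.isin(allowed_domains)

emails = pd.Series(["alice@company.com", "bob@gmail.com"])
result = is_business_email(emails, ["company.com"])
```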

Roadmap

DataCheck v2.0.1 includes smart config auto-generation (regex pattern inference, cross-column sum_equals detection, semantic rule suggestion, type-aware profiling), batch error reporting for config validation, connection pre-validation for database sources, and international phone number support. Here's what's next:

  • Enhanced CI/CD examples — Starter workflows for GitHub Actions, GitLab CI, and Jenkins.
  • Streaming validation — Validate large datasets without loading everything into memory.
  • Notification integrations — Slack, email, and webhook alerts on validation failures.

Development

git clone https://github.com/squrtech/datacheck.git
cd datacheck
poetry install
poetry run pytest

See CONTRIBUTING.md for guidelines.

License

Apache License 2.0 — see LICENSE for details.