
DataCheck — Data Validation Engine


DataCheck is a data quality validation engine for data engineers. Define validation rules in a YAML config and data sources in a separate sources file, then automatically validate data across files, databases, and cloud warehouses.

DataCheck provides the datacheck Command-Line Interface (CLI) and a Python API, which you can use to validate data, profile quality, and detect schema changes. These operations can be executed locally during development, embedded programmatically within your data pipelines (Airflow, Dagster, Prefect, etc.), or integrated into CI/CD workflows. View the Documentation for more details.

Highlights

  • Define validation rules in YAML config and data sources inline (files) or in a sources.yaml (databases, cloud)
  • Run checks on CSV, Parquet, Delta Lake, Avro, PostgreSQL, MySQL, Snowflake, BigQuery, Redshift, and more
  • Use 27+ built-in data quality rules for null checks, numeric ranges, patterns, timestamps, email/phone/URL validation, and cross-column checks
  • Profile data quality with automatic scoring, outlier detection, and rule suggestions
  • Detect schema evolution with compatibility levels (COMPATIBLE, WARNING, BREAKING)
  • Extend with custom rules using the @custom_rule plugin decorator

Demo

DataCheck Quickstart Demo
Install DataCheck, generate an ecommerce config with sample data, and run validation — all in one go.

Setup

Requirements

To use DataCheck, you must have the following installed on your system:

Python 3.10, 3.11, or 3.12

To check your existing version, use the CLI command: python --version or python3 --version.

Pip 21.0 or greater

To check your pip version: pip --version

Installation

DataCheck is available on public PyPI as datacheck-cli.

pip install datacheck-cli

To install with support for a specific data source, use extras:

pip install datacheck-cli[postgresql]    # PostgreSQL
pip install datacheck-cli[mysql]         # MySQL
pip install datacheck-cli[snowflake]     # Snowflake
pip install datacheck-cli[bigquery]      # BigQuery
pip install datacheck-cli[redshift]      # Redshift
pip install datacheck-cli[cloud]         # S3, GCS, Azure Blob
pip install datacheck-cli[all]           # All data sources

Quickstart

The examples below show minimal configurations. To see detailed logs, add --verbose or -v to any command.

Create a config

Use datacheck config init to generate a config from a template. Add --with-sample-data to also generate a sample CSV file so you can test validation immediately:

datacheck config init --with-sample-data
datacheck config init --template ecommerce --with-sample-data

Or create a .datacheck.yaml file manually. The config defines both the data source and the validation rules.

# .datacheck.yaml

data_source:
  type: csv
  path: ./data/orders.csv

checks:
  - name: id_check
    column: id
    rules:
      not_null: true
      unique: true

  - name: amount_check
    column: amount
    rules:
      not_null: true
      min: 0
      max: 10000

  - name: email_check
    column: email
    rules:
      email_valid: true

DataCheck auto-discovers config files in this order: .datacheck.yaml, .datacheck.yml, datacheck.yaml, datacheck.yml. To specify a config explicitly, use the --config flag.

Run validation

datacheck validate
datacheck validate --config checks.yaml

Parameter Required Description
-c, --config No Path to config file (auto-discovered if not specified)
--source No Named source from sources.yaml
--sources-file No Path to sources YAML file
-t, --table No Database table name (for database sources)
-w, --where No WHERE clause for filtering (for database sources)
-q, --query No Custom SQL query (alternative to --table)
-o, --output No Save results to a JSON file (terminal output is always shown)
--csv-export No Export failure details as CSV
--sample-rate No Random sample fraction (0.0–1.0)
--sample-count No Fixed sample size
--parallel No Enable multi-core execution
--verbose, -v No Enable detailed logging

Data Source Configuration

File-based data sources are defined inline under data_source in your config. For databases and cloud storage, define named sources in a separate sources.yaml file and reference them.

CSV / Parquet / Avro

data_source:
  type: csv
  path: ./data/orders.csv
  options:
    delimiter: ","
    encoding: utf-8

data_source:
  type: parquet
  path: ./data/orders.parquet

SQLite / DuckDB

data_source:
  type: sqlite
  path: ./data/analytics.db

Delta Lake

data_source:
  type: delta
  path: ./data/delta-table

Databases (PostgreSQL, Snowflake, BigQuery, etc.)

For database connections, use named sources in a sources.yaml file. The inline data_source config only supports file-based sources (csv, parquet, avro, delta, duckdb, sqlite).

# sources.yaml
sources:
  production_db:
    type: postgresql
    host: ${DB_HOST}
    port: ${DB_PORT:-5432}
    database: ${DB_NAME}
    user: ${DB_USER}
    password: ${DB_PASSWORD}

  snowflake_wh:
    type: snowflake
    account: ${SF_ACCOUNT}
    user: ${SF_USER}
    password: ${SF_PASSWORD}
    warehouse: ${SF_WAREHOUSE:-COMPUTE_WH}
    database: ${SF_DATABASE}
    schema: ${SF_SCHEMA:-PUBLIC}

  bigquery_ds:
    type: bigquery
    project_id: ${GCP_PROJECT}
    dataset_id: ${GCP_DATASET}
    credentials_path: /path/to/service-account.json
    location: US

  redshift_db:
    type: redshift
    host: ${REDSHIFT_HOST}
    port: ${REDSHIFT_PORT:-5439}
    database: ${REDSHIFT_DB}
    user: ${REDSHIFT_USER}
    password: ${REDSHIFT_PASSWORD}
    schema: public

Then reference the source in your config:

# .datacheck.yaml
sources_file: sources.yaml
source: production_db
table: orders

checks:
  - name: id_check
    column: id
    rules:
      not_null: true

Cloud Storage (S3, GCS, Azure)

Access cloud files via named sources in sources.yaml:

# sources.yaml
sources:
  s3_data:
    type: s3
    bucket: my-bucket
    path: data/orders.csv
    region: us-east-1
    access_key: ${AWS_ACCESS_KEY_ID}
    secret_key: ${AWS_SECRET_ACCESS_KEY}

datacheck validate --source s3_data --sources-file sources.yaml

Named Sources

Switch sources at runtime:

datacheck validate --source snowflake_wh --config checks.yaml

Individual checks can also override the default source:

sources_file: sources.yaml
source: production_db
table: customers

checks:
  - name: customer_email
    column: email
    rules:
      not_null: true

  - name: order_total
    column: total
    source: snowflake_wh      # Override source for this check
    table: orders
    rules:
      min: 0

Environment Variables

Config files support environment variable substitution:

# In sources.yaml
sources:
  production_db:
    type: postgresql
    host: ${DB_HOST}                    # Required variable
    port: ${DB_PORT:-5432}              # Variable with default value
    database: ${DB_NAME}
    user: ${DB_USER}
    password: ${DB_PASSWORD}

Use datacheck config env to list all variables referenced in a config and their current values.
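The ${VAR} and ${VAR:-default} forms follow shell-style substitution: a bare variable must be set, while the :- form falls back to the default when the variable is absent. A rough sketch of that expansion (illustrative only, not DataCheck's actual parser):

```python
import os
import re

_VAR = re.compile(r"\$\{(\w+)(?::-([^}]*))?\}")

def expand_env(text: str) -> str:
    """Expand ${VAR} and ${VAR:-default} references using os.environ."""
    def replace(match: re.Match) -> str:
        name, default = match.group(1), match.group(2)
        if name in os.environ:
            return os.environ[name]
        if default is not None:
            return default
        raise KeyError(f"required environment variable {name} is not set")
    return _VAR.sub(replace, text)
```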

Profile Data Quality

Generate a data quality profile with summary statistics, quality scores, and automatic rule suggestions. The data source can be provided directly, read from your config, or loaded from a named source.

# Direct file path
datacheck profile data.csv

# Auto-discover config (looks for .datacheck.yaml, datacheck.yaml, etc.)
datacheck profile

# Explicit config file
datacheck profile --config checks.yaml

# Named source from sources file
datacheck profile --source production_db --sources-file sources.yaml

Parameter Required Description
DATA_SOURCE No Data source: file path, connection string, or omit when using config/sources
-c, --config No Path to config file with data_source or sources_file defined
--source No Named source from sources.yaml
--sources-file No Path to sources YAML file
--outlier-method No Outlier detection method: zscore (default) or iqr
--format No Output format: terminal, json, or markdown
-o, --output No Write output to a file

datacheck profile                                              # Full profile
datacheck profile --format json -o profile.json                # Export as JSON
datacheck profile --source analytics_wh --sources-file sources.yaml  # Profile a named source
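The two outlier methods differ in robustness: zscore flags values far from the mean measured in standard deviations, while iqr flags values outside a multiple of the interquartile range, which is less sensitive to extreme points. A toy z-score version to illustrate the idea (the threshold and function name are assumptions, not DataCheck's implementation):

```python
from statistics import mean, stdev

def zscore_outliers(values: list[float], threshold: float = 3.0) -> list[float]:
    """Return values whose z-score magnitude exceeds `threshold`."""
    if len(values) < 2:
        return []
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []  # constant column: no outliers
    return [v for v in values if abs((v - mu) / sigma) > threshold]
```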

Detect Schema Changes

Capture a baseline schema and compare future data against it to detect column additions, removals, type changes, and nullable changes. The data source can be provided directly, read from your config, or loaded from a named source.

# Auto-discover config or use named source
datacheck schema capture                   # Save current schema as baseline
datacheck schema compare                   # Compare against baseline

# Direct file path
datacheck schema capture data.csv
datacheck schema compare data.csv

# Named source
datacheck schema capture --source production_db --sources-file sources.yaml

# Other schema commands
datacheck schema show                      # Display detected schema
datacheck schema list                      # List saved baselines
datacheck schema history                   # View capture history
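Conceptually, a schema comparison diffs the baseline's column-to-type mapping against the current one and reports the most severe change found. A simplified sketch of that classification (the levels match the three listed earlier; the mapping of change types to levels is an assumption here):

```python
def compare_schemas(baseline: dict[str, str], current: dict[str, str]) -> str:
    """Classify schema drift as COMPATIBLE, WARNING, or BREAKING.

    Assumed policy: removed columns or type changes are BREAKING,
    added columns are a WARNING, and no changes are COMPATIBLE.
    """
    removed = baseline.keys() - current.keys()
    added = current.keys() - baseline.keys()
    changed = [c for c in baseline.keys() & current.keys() if baseline[c] != current[c]]
    if removed or changed:
        return "BREAKING"
    if added:
        return "WARNING"
    return "COMPATIBLE"
```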

Python API

Use DataCheck programmatically within your pipelines:

from datacheck import ValidationEngine

engine = ValidationEngine(config_path=".datacheck.yaml")
summary = engine.validate()

print(f"Records: {summary.total_rows:,} rows, {summary.total_columns} columns")
print(f"Passed: {summary.passed_rules}/{summary.total_rules}")

for result in summary.get_failed_results():
    print(f"  FAIL: {result.rule_name} on {result.column} ({result.failed_rows} rows)")

CI/CD Integration

DataCheck uses standard exit codes for automation:

Code Meaning
0 All rules passed (or only warning/info severity failures)
1 Some error-severity rules failed
2 Configuration error
3 Data loading error
4 Unexpected error

Rules can have severity: error (default), severity: warning, or severity: info. Only error-severity failures cause exit code 1.
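In script terms, the validation exit code depends only on whether any error-severity rule failed, never on warnings or infos. A minimal sketch of that policy (the result shape is illustrative; codes 2-4 signal config, load, and unexpected errors raised before any results exist):

```python
def exit_code_for(failures: list[dict]) -> int:
    """Return 1 if any failure has error severity, else 0.

    `failures` is an illustrative list of {"rule": ..., "severity": ...}
    dicts standing in for DataCheck's failed-rule results.
    """
    return 1 if any(f["severity"] == "error" for f in failures) else 0
```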

# GitHub Actions
- name: Validate Data
  run: |
    pip install datacheck-cli
    datacheck validate --output results.json

Available Rules

Category Rules
Null & Uniqueness not_null, unique, unique_combination
Numeric min, max, mean_between, std_dev_less_than, percentile_range, z_score_outliers, distribution_type
String & Pattern regex, allowed_values, length, min_length, max_length, type
Temporal max_age, timestamp_range (or date_range), no_future_timestamps, date_format_valid (or date_format), business_days_only
Semantic email_valid, phone_valid, url_valid, json_valid
Cross-Column unique_combination, foreign_key_exists (Python API), sum_equals
Custom custom — user-defined functions via @custom_rule decorator

Custom Rules

Create a plugin file with custom validation functions using the @custom_rule decorator. The function receives a pd.Series and optional parameters, and returns a boolean pd.Series (True = valid).

# custom_rules.py
from datacheck.plugins.decorators import custom_rule
import pandas as pd

@custom_rule
def is_business_email(column: pd.Series, allowed_domains: list) -> pd.Series:
    domains = column.dropna().str.split("@").str[1]
    return domains.isin(allowed_domains)

Reference the plugin in your config:

plugins:
  - ./custom_rules.py

checks:
  - name: email_check
    column: email
    rules:
      custom:
        rule: is_business_email
        params:
          allowed_domains: ["company.com"]
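Decorator aside, a custom rule is a plain function of a pandas Series, so you can sanity-check its logic directly before wiring the plugin into a config:

```python
import pandas as pd

# Same logic as is_business_email above, minus the @custom_rule decorator.
def is_business_email(column: pd.Series, allowed_domains: list) -> pd.Series:
    domains = column.dropna().str.split("@").str[1]
    return domains.isin(allowed_domains)

emails = pd.Series(["alice@company.com", "bob@gmail.com"])
result = is_business_email(emails, ["company.com"])
```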

Roadmap

DataCheck v2.0.1 includes smart config auto-generation (regex pattern inference, cross-column sum_equals detection, semantic rule suggestion, type-aware profiling), batch error reporting for config validation, connection pre-validation for database sources, and international phone number support. Here's what's next:

  • Enhanced CI/CD examples — Starter workflows for GitHub Actions, GitLab CI, and Jenkins.
  • Streaming validation — Validate large datasets without loading everything into memory.
  • Notification integrations — Slack, email, and webhook alerts on validation failures.

Development

git clone https://github.com/squrtech/datacheck.git
cd datacheck
poetry install
poetry run pytest

See CONTRIBUTING.md for guidelines.

License

Apache License 2.0 — see LICENSE for details.