---
README.md | 44 +++++++++++++++++++++++++++++++++++++-------
README_PYPI.md | 12 ++++++------
2 files changed, 43 insertions(+), 13 deletions(-)
diff --git a/README.md b/README.md
index 7cb67cb..2201220 100644
--- a/README.md
+++ b/README.md
@@ -11,7 +11,7 @@
-**DataCheck enforces data quality rules at the pipeline boundary.** Define rules in YAML. Run in CI. Fail fast on bad data. No servers, no dashboards, no infrastructure.
+**DataCheck enforces deterministic validation rules at the pipeline boundary.** Define rules in YAML. Run in CI. Fail fast on bad data. No servers, no dashboards, no infrastructure.
```
Your data source → [DataCheck rules] → exit 0: pipeline continues
@@ -20,20 +20,48 @@ Your data source → [DataCheck rules] → exit 0: pipeline continues
View the [Documentation](https://squrtech.github.io/datacheck/) for full details.
+## Mental Model
+
+Code has linters.
+Infrastructure has policy enforcement.
+Data pipelines need gates.
+
+DataCheck is that gate.
+
### Why DataCheck?
-Most teams detect bad data after the fact - broken dashboards, wrong reports, angry stakeholders. DataCheck enforces data quality *before* bad data moves downstream, the same way a linter enforces code quality before bad code ships.
+Most teams detect bad data after the fact - broken dashboards, wrong reports, angry stakeholders. DataCheck enforces validation rules *before* bad data moves downstream, the same way a linter enforces code quality before bad code ships.
- **Fail fast** - structured exit codes stop pipelines at the gate, not after the damage is done
-- **Deterministic** - rules are explicit and binary: pass or fail, not "this looks unusual"
+- **Deterministic** - rules are explicit and binary. No heuristics. No anomaly scoring. No statistical guessing.
- **SQL pushdown** - database checks run as a single aggregate `SELECT`; no data leaves your warehouse
- **Zero infrastructure** - one `pip install`, one YAML file, runs anywhere
- **CI-native** - SARIF output to GitHub Security tab, GitHub Action, Apache Airflow operators
+### Validate Where Data Lives
+
+For databases, DataCheck executes validation as aggregate SQL inside your warehouse.
+
+- No data pulled into pandas
+- No row transfer
+- No separate compute layer
+- Single aggregate `SELECT` per rule set
+
+Validation happens where the data already lives.
+
### Why not observability?
DataCheck is **not** a data observability platform. It does not provide dashboards, trend analysis, anomaly detection, or SaaS backends. Those tools answer "what happened?" - DataCheck answers "does this data meet our rules right now?" Enforcement happens at the gate; investigation happens after.
+### What DataCheck Is Not
+
+- Not a monitoring dashboard
+- Not anomaly detection
+- Not a SaaS platform
+- Not a data catalog
+
+It is an enforcement layer.
+
### Demo
@@ -119,6 +147,8 @@ DataCheck auto-discovers config files in this order: `.datacheck.yaml` → `.dat
```bash
datacheck validate
+echo $? # 1 if any error-severity rule fails
+
datacheck validate --config checks.yaml
```
@@ -366,14 +396,14 @@ datacheck validate -c .datacheck.yaml
# exits 1 if any error-severity rule fails
```
-## Detect Schema Changes
+## Enforce Schema Contracts
-Capture a baseline schema and compare future data against it to detect column additions, removals, type changes, and nullable changes. The data source can be provided directly, read from your config, or loaded from a named source.
+Capture a schema baseline and compare future data against it - breaking changes fail validation. Detects column additions, removals, type changes, and nullable changes. The data source can be provided directly, read from your config, or loaded from a named source.
```bash
# Auto-discover config or use named source
datacheck schema capture # Save current schema as baseline
-datacheck schema compare # Compare against baseline
+datacheck schema compare # Compare against baseline - fails if schema changed
# Direct file path
datacheck schema capture data.csv
@@ -420,7 +450,7 @@ if not summary.all_passed:
## Roadmap
-DataCheck v2.1.0 is stable and production-ready. What's coming next:
+What's coming next:
- **Data Contracts format** - `--format datacontract` aligned with the [datacontract.com](https://datacontract.com) open spec.
- **dbt integration** - generate DataCheck rules directly from your dbt schema YAML.
diff --git a/README_PYPI.md b/README_PYPI.md
index 99ed303..4cd3ef2 100644
--- a/README_PYPI.md
+++ b/README_PYPI.md
@@ -5,17 +5,17 @@
[](https://opensource.org/licenses/Apache-2.0)
[](https://pypi.org/project/datacheck-cli/)
-**DataCheck enforces data quality rules at the pipeline boundary.** Define rules in YAML. Run in CI. Fail fast on bad data. No servers, no dashboards, no infrastructure.
+**DataCheck enforces deterministic validation rules at the pipeline boundary.** Define rules in YAML. Run in CI. Fail fast on bad data. No servers, no dashboards, no infrastructure.
```
Your data source → [DataCheck rules] → exit 0: pipeline continues
→ exit 1: pipeline stops
```
-Most teams detect bad data after the fact - broken reports, wrong numbers, angry stakeholders. DataCheck enforces quality *before* bad data moves downstream, the same way a linter enforces code quality before bad code ships.
+Most teams detect bad data after the fact - broken reports, wrong numbers, angry stakeholders. DataCheck enforces validation rules *before* bad data moves downstream, the same way a linter enforces code quality before bad code ships.
- **Fail fast** - structured exit codes stop pipelines at the gate, not after the damage is done
-- **Deterministic** - rules are explicit and binary: pass or fail, not "this looks unusual"
+- **Deterministic** - rules are explicit and binary. No heuristics. No anomaly scoring. No statistical guessing.
- **SQL pushdown** - database checks run as a single aggregate `SELECT`; no data leaves your warehouse
- **Zero infrastructure** - one `pip install`, one YAML file, runs anywhere
- **CI-native** - SARIF output to GitHub Security tab, GitHub Action, Apache Airflow operators
@@ -171,13 +171,13 @@ source: production_db
table: orders
```
-## Detect Schema Changes
+## Enforce Schema Contracts
```bash
-datacheck schema capture # Auto-discover config
+datacheck schema capture # Save current schema as baseline
datacheck schema capture data.csv # Direct file path
datacheck schema capture --source production_db --sources-file sources.yaml # Named source
-datacheck schema compare # Compare against baseline
+datacheck schema compare # Compare against baseline - fails if schema changed
```
## Python API
From d01583387db153943191f53f7db6b7ef302001e5 Mon Sep 17 00:00:00 2001
From: yash-chauhan-dev
Date: Mon, 23 Feb 2026 19:45:09 +0530
Subject: [PATCH 17/25] blogs updated
---
README.md | 3 -
...2-23-designing-fail-fast-data-pipelines.md | 458 ++++++++++++++++++
...dation-vs-statistical-anomaly-detection.md | 251 ++++++++++
...as-contracts-in-cicd-for-data-pipelines.md | 388 +++++++++++++++
...23-if-code-must-pass-ci-data-should-too.md | 305 ++++++++++++
...-schema-contracts-vs-semantic-contracts.md | 342 +++++++++++++
...3-sql-pushdown-as-a-validation-strategy.md | 289 +++++++++++
...lity-is-not-enough-for-data-enforcement.md | 244 ++++++++++
8 files changed, 2277 insertions(+), 3 deletions(-)
create mode 100644 blog/2026-02-23-designing-fail-fast-data-pipelines.md
create mode 100644 blog/2026-02-23-deterministic-validation-vs-statistical-anomaly-detection.md
create mode 100644 blog/2026-02-23-exit-codes-as-contracts-in-cicd-for-data-pipelines.md
create mode 100644 blog/2026-02-23-if-code-must-pass-ci-data-should-too.md
create mode 100644 blog/2026-02-23-schema-contracts-vs-semantic-contracts.md
create mode 100644 blog/2026-02-23-sql-pushdown-as-a-validation-strategy.md
create mode 100644 blog/2026-02-23-why-observability-is-not-enough-for-data-enforcement.md
diff --git a/README.md b/README.md
index 2201220..5bf0ac1 100644
--- a/README.md
+++ b/README.md
@@ -146,9 +146,6 @@ DataCheck auto-discovers config files in this order: `.datacheck.yaml` → `.dat
### Run validation
```bash
-datacheck validate
-echo $? # 1 if any error-severity rule fails
-
datacheck validate --config checks.yaml
```
diff --git a/blog/2026-02-23-designing-fail-fast-data-pipelines.md b/blog/2026-02-23-designing-fail-fast-data-pipelines.md
new file mode 100644
index 0000000..0870cd6
--- /dev/null
+++ b/blog/2026-02-23-designing-fail-fast-data-pipelines.md
@@ -0,0 +1,458 @@
+# Designing Fail-Fast Data Pipelines in GitHub Actions and Airflow
+
+*Published 2026-02-23 by the DataCheck engineering team*
+
+---
+
+The Airflow DAG had a validation task. It had been in production for six months. The team believed it was gating their pipeline. It was not.
+
+A developer had added a `trigger_rule='all_done'` to the transform task four months earlier to handle an optional upstream branch. One line. It meant the transform task would run regardless of whether the validation task upstream had succeeded or failed. The gate was structurally present but functionally bypassed. Every validation failure for four months had been reported to the task logs, counted in the DAG run summary, and silently ignored.
+
+This class of problem is not rare. It appears consistently across teams that have added validation steps to their pipelines without designing the surrounding pipeline topology for fail-fast behavior. The validation tool does its job. The orchestration configuration undoes it.
+
+Fail-fast pipeline design is not about adding validation steps. It is about the structural decisions in GitHub Actions and Airflow that determine whether a gate actually holds when it fires.
+
+---
+
+## The Three Structural Requirements of a Real Gate
+
+Before examining the two environments, it is worth being precise about what "fail-fast" requires structurally:
+
+**Propagation.** A validation failure must cause the pipeline to stop. This means the failure must be visible to the orchestrator as a task failure (Airflow) or step failure (GitHub Actions), and subsequent tasks/steps must be configured to require the validation task's success.
+
+**Non-bypass.** No configuration path should allow downstream work to proceed when validation has failed. Every `trigger_rule`, `continue-on-error`, `if:` condition, and `needs:` dependency that touches validation must be reviewed for whether it can create a bypass.
+
+**Correct retry semantics.** A data quality failure (`exit 1`) should not be retried. The data is bad. Retrying the validation task against the same bad data produces the same failure. Retries are appropriate for transient infrastructure failures (`exit 3`) — not for content violations. Misconfigured retry policies can give the appearance of enforcement while actually delaying and eventually swallowing failures.
+
+With these requirements in mind, the specific failure modes in each environment become clear.
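These retry semantics can be made concrete with a small sketch. The `should_retry` helper and the exit-code set below are illustrative, not part of DataCheck or any orchestrator; they simply encode the rule that content failures are terminal and only transient failures earn bounded retries:

```python
# Illustrative mapping from DataCheck exit codes to retry decisions.
# Exit-code meanings follow the conventions described in this post:
# 1 = rule failure (content), 3 = data loading / infrastructure error.
RETRYABLE_EXIT_CODES = {3}

def should_retry(exit_code: int, attempt: int, max_retries: int) -> bool:
    """Retry only transient infrastructure failures, never content failures."""
    if exit_code == 0:
        return False              # validation passed; nothing to retry
    if exit_code not in RETRYABLE_EXIT_CODES:
        return False              # content/config failure: same data, same result
    return attempt < max_retries  # transient failure: bounded retries

assert not should_retry(1, attempt=0, max_retries=2)  # bad data: fail immediately
assert should_retry(3, attempt=0, max_retries=2)      # transient: retry
assert not should_retry(3, attempt=2, max_retries=2)  # retries exhausted
```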
+
+---
+
+## GitHub Actions: The Structural Bypass Patterns
+
+**`continue-on-error: true`** is the most common bypass in GitHub Actions pipelines:
+
+```yaml
+# DO NOT DO THIS
+- name: Validate data
+ continue-on-error: true # ← This step can fail without failing the job
+ run: datacheck validate -c .datacheck.yaml
+```
+
+`continue-on-error: true` allows the step to exit non-zero without marking the job as failed. Subsequent steps run. The gate is gone. This setting is sometimes added for debug visibility — "I want to see the output even if it fails" — and never removed. The result is a validation step that is structurally present and functionally inert.
+
+The correct configuration: omit `continue-on-error` entirely, or set it explicitly to `false`. The default behavior in GitHub Actions is correct — a non-zero step exit fails the step, which fails the job.
+
+**Missing `needs:` on the deployment job** is the second common bypass:
+
+```yaml
+# Two jobs — but load runs regardless of validate result
+jobs:
+ validate:
+ runs-on: ubuntu-latest
+ steps:
+ - run: datacheck validate -c .datacheck.yaml
+
+ load:
+ runs-on: ubuntu-latest # ← No needs: — runs in parallel, not after validate
+ steps:
+ - run: python load_to_warehouse.py
+```
+
+Without `needs: [validate]`, the `load` job runs in parallel with `validate`. It does not wait for validation to succeed. The correct structure requires the dependency to be explicit:
+
+```yaml
+jobs:
+ validate:
+ runs-on: ubuntu-latest
+ steps:
+ - uses: actions/checkout@v4
+ - name: Install DataCheck
+ run: pip install datacheck-cli[postgresql]
+ - name: Validate
+ env:
+ DB_HOST: ${{ secrets.DB_HOST }}
+ DB_PASSWORD: ${{ secrets.DB_PASSWORD }}
+ run: datacheck validate -c checks/orders.yaml
+
+ load:
+ needs: [validate] # ← load only runs if validate job succeeds
+ runs-on: ubuntu-latest
+ steps:
+ - run: python load_to_warehouse.py
+```
+
+`needs: [validate]` creates the dependency. GitHub Actions will not start the `load` job if the `validate` job failed. This is the correct gate topology.
+
+---
+
+## GitHub Actions: Multi-Stage Pipeline Design
+
+For pipelines with multiple validation phases — validate raw, transform, validate mart — the job dependency graph must encode the sequence:
+
+```yaml
+jobs:
+ validate-raw:
+ runs-on: ubuntu-latest
+ steps:
+ - uses: actions/checkout@v4
+ - run: pip install -q datacheck-cli[postgresql]
+ - name: Validate raw layer
+ env:
+ DB_HOST: ${{ secrets.STAGING_DB_HOST }}
+ DB_PASSWORD: ${{ secrets.STAGING_DB_PASSWORD }}
+ run: |
+ datacheck schema compare --source staging_db --table orders_raw
+ datacheck validate -c checks/orders-raw.yaml --source staging_db
+
+ transform:
+ needs: [validate-raw] # Only runs if raw validation passed
+ runs-on: ubuntu-latest
+ steps:
+ - run: dbt run --models staging
+
+ validate-mart:
+ needs: [transform] # Only runs if transform completed
+ runs-on: ubuntu-latest
+ steps:
+ - name: Validate mart layer
+ run: datacheck validate -c checks/orders-mart.yaml --source staging_db
+
+ promote-to-production:
+ needs: [validate-mart] # Only runs if mart validation passed
+ environment: production # ← GitHub Environment protection rules apply here
+ runs-on: ubuntu-latest
+ steps:
+ - run: python promote_to_prod.py
+```
+
+The `environment: production` on the `promote-to-production` job enables GitHub Environment protection rules: required reviewers, deployment branch restrictions, and environment-specific secrets. The combination of the `needs:` dependency chain and the environment gate means production promotion requires all validation layers to succeed and may require manual approval.
+
+---
+
+## GitHub Actions: Matrix Validation for Multi-Table Pipelines
+
+When a pipeline validates multiple independent tables, a matrix strategy runs them in parallel and fails fast if any table fails:
+
+```yaml
+jobs:
+ validate-tables:
+ runs-on: ubuntu-latest
+ strategy:
+ fail-fast: true # ← Cancel remaining matrix jobs if any fails
+ matrix:
+ table:
+ - name: orders
+ config: checks/orders.yaml
+ - name: customers
+ config: checks/customers.yaml
+ - name: products
+ config: checks/products.yaml
+ - name: events
+ config: checks/events.yaml
+
+ steps:
+ - uses: actions/checkout@v4
+ - run: pip install -q datacheck-cli[postgresql]
+ - name: Validate ${{ matrix.table.name }}
+ env:
+ DB_HOST: ${{ secrets.DB_HOST }}
+ DB_PASSWORD: ${{ secrets.DB_PASSWORD }}
+ run: |
+ datacheck validate \
+ -c ${{ matrix.table.config }} \
+ --source production_db \
+ --format sarif \
+ --output ${{ matrix.table.name }}-results.sarif
+ - uses: github/codeql-action/upload-sarif@v3
+ if: always()
+ with:
+ sarif_file: ${{ matrix.table.name }}-results.sarif
+
+ load-all:
+ needs: [validate-tables] # Requires ALL matrix jobs to succeed
+ runs-on: ubuntu-latest
+ steps:
+ - run: python load_pipeline.py
+```
+
+`fail-fast: true` on the matrix strategy means: if the `orders` validation fails, cancel the `customers`, `products`, and `events` validations that are still running. The job fails. `load-all` never starts. This is GitHub's default for matrix strategies, so the explicit setting documents intent rather than changes behavior. Setting `fail-fast: false` instead lets all matrix jobs run to completion even when one has failed — useful for collecting all failure reports, but potentially misleading about whether the full pipeline gate has failed.
+
+The choice between `fail-fast: true` and `fail-fast: false` in a matrix depends on whether you want to stop all parallel work on first failure (to conserve resources) or collect all failures across all tables before surfacing the result. Both are valid; the choice should be deliberate.
+
+---
+
+## Airflow: The `trigger_rule` Failure Mode
+
+Airflow's `trigger_rule` parameter determines when a task becomes eligible to run relative to its upstream dependencies. The default is `all_success` — a task runs only when all upstream tasks have succeeded. This is the correct behavior for a gate.
+
+Any `trigger_rule` other than `all_success` on a task downstream of a validation gate is a potential bypass:
+
+```python
+# SILENT GATE BYPASS
+transform = PythonOperator(
+ task_id="transform_orders",
+ python_callable=transform_orders,
+ trigger_rule="all_done", # ← Runs regardless of validate_raw success or failure
+)
+```
+
+`all_done` means "run when all upstream tasks have finished, regardless of their outcome." This is appropriate for cleanup tasks, notification tasks, and tasks that must run even when upstream work fails. It is never appropriate for a task that should be gated on validation success.
+
+Common `trigger_rule` values and their implications for gate design:
+
+| `trigger_rule` | Behavior | Appropriate for gate? |
+|---|---|---|
+| `all_success` (default) | Run only when ALL upstream tasks succeeded | Yes — this is the gate behavior |
+| `all_done` | Run when ALL upstream tasks finished (any outcome) | No — bypasses failed validation |
+| `one_success` | Run when ANY one upstream task succeeded | No — bypasses if validation fails but another task succeeds |
+| `all_failed` | Run only when ALL upstream tasks failed | No — only for failure handling |
+| `none_failed` | Run when no upstream tasks failed (success OR skipped) | Conditional — valid if skip is intentional |
+
+The correct pattern:
+
+```python
+from airflow_provider_datacheck.operators.datacheck import DataCheckOperator
+
+validate_raw = DataCheckOperator(
+ task_id="validate_raw_orders",
+ config_path="/config/orders-raw.yaml",
+ source_name="staging_db",
+ table="orders_raw",
+ fail_on_error=True,
+ retries=0, # ← No retries — data quality failures are not transient
+)
+
+transform = PythonOperator(
+ task_id="transform_orders",
+ python_callable=transform_orders,
+ # trigger_rule is omitted — defaults to all_success
+ # transform only runs when validate_raw has succeeded
+)
+```
+
+---
+
+## Airflow: Retry Policy for Validation Tasks
+
+A validation task configured with `retries=2` and `retry_delay=timedelta(minutes=5)` will retry a data quality failure twice before marking the task as failed. Each retry runs DataCheck against the same data. Each retry produces the same failure. The pipeline is delayed ten minutes and reaches the same dead end.
+
+Retries are appropriate for tasks that fail due to transient conditions: network timeouts, warehouse connection drops, temporary unavailability. DataCheck's exit codes distinguish these: exit `3` (data loading error) may indicate a transient infrastructure failure worth retrying. Exit `1` (rule failure) indicates a data content violation that retrying will not fix.
+
+```python
+from datetime import timedelta
+
+validate_raw = DataCheckOperator(
+ task_id="validate_raw_orders",
+ config_path="/config/orders-raw.yaml",
+ source_name="staging_db",
+ table="orders_raw",
+ fail_on_error=True,
+ retries=0, # Content failures: no retries
+ execution_timeout=timedelta(minutes=15), # Fail if validation hangs
+)
+```
+
+If the underlying infrastructure commonly has transient failures, a retry policy can be configured with a short delay and a low count. Note, however, that Airflow's retry policy does not inspect exit codes: a `BashOperator` that exits `1` is retried like any other failed task. To fail immediately on content violations while still allowing retries for infrastructure errors, one option is a thin Python wrapper that raises `AirflowFailException` (which fails the task without retrying) on exit `1` and an ordinary exception (eligible for retry) on anything else:
+
+```python
+import subprocess
+
+from airflow.exceptions import AirflowFailException
+
+def run_datacheck(config_path: str) -> None:
+    result = subprocess.run(["datacheck", "validate", "-c", config_path])
+    if result.returncode == 1:
+        # Data quality failure: retrying the same data cannot change the outcome
+        raise AirflowFailException("Data quality failure - not retrying")
+    if result.returncode >= 2:
+        # Infrastructure or config error: let the task's retry policy apply
+        raise RuntimeError(f"DataCheck exited {result.returncode} - may retry")
+```
+
+---
+
+## Airflow: `on_failure_callback` for Operational Visibility
+
+When a validation gate fires in production, the failure needs to be surfaced immediately and with enough context to route it correctly. Airflow's `on_failure_callback` runs a Python callable when the task fails, allowing the failure to trigger notifications, log structured context, or initiate remediation workflows.
+
+```python
+import os
+import requests
+
+def validation_failure_callback(context):
+ task_id = context["task_instance"].task_id
+ dag_id = context["dag"].dag_id
+ run_id = context["run_id"]
+ log_url = context["task_instance"].log_url
+
+ message = (
+ f"Data quality gate failed\n"
+ f"DAG: {dag_id}\n"
+ f"Task: {task_id}\n"
+ f"Run: {run_id}\n"
+ f"Logs: {log_url}"
+ )
+
+ # Post to Slack, PagerDuty, or internal alerting
+ requests.post(
+ os.environ["SLACK_WEBHOOK_URL"],
+ json={"text": message, "channel": "#data-quality-alerts"},
+ )
+
+
+validate_raw = DataCheckOperator(
+ task_id="validate_raw_orders",
+ config_path="/config/orders-raw.yaml",
+ source_name="staging_db",
+ table="orders_raw",
+ fail_on_error=True,
+ retries=0,
+ on_failure_callback=validation_failure_callback,
+)
+```
+
+The `on_failure_callback` does not affect the gate behavior — the task is still failed, the downstream tasks still do not run. It adds operational visibility: the failure is loud, attributed, and routable. The team sees the failure immediately without polling the Airflow UI.
+
+---
+
+## Airflow: Branching After Validation
+
+Some pipelines need to route differently based on validation outcome rather than simply stopping. A quarantine pattern routes failing batches to an error table for investigation while allowing the pipeline to continue with clean data:
+
+```python
+from airflow.operators.python import BranchPythonOperator
+
+def route_by_validation(**context):
+ ti = context["task_instance"]
+ # DataCheckOperator pushes summary JSON to XCom
+    validation_result = ti.xcom_pull(task_ids="validate_raw_orders", key="summary")
+ if validation_result and validation_result.get("all_passed"):
+ return "transform_orders"
+ else:
+ return "quarantine_failed_batch"
+
+validate_raw = DataCheckOperator(
+ task_id="validate_raw_orders",
+ config_path="/config/orders-raw.yaml",
+ source_name="staging_db",
+ table="orders_raw",
+ fail_on_error=False, # ← Don't fail the task; let branch routing handle it
+)
+
+route = BranchPythonOperator(
+ task_id="route_by_validation",
+ python_callable=route_by_validation,
+)
+
+transform = PythonOperator(task_id="transform_orders", ...)
+quarantine = PythonOperator(task_id="quarantine_failed_batch", ...)
+
+validate_raw >> route >> [transform, quarantine]
+```
+
+Note the `fail_on_error=False` here. When using `BranchPythonOperator` routing, the validation task should not fail the DAG — it should surface the result via XCom for the branch to read. This is the appropriate pattern when the downstream intent is quarantine-and-continue rather than halt-and-fix.
+
+---
+
+## Environment-Specific Validation Configs
+
+Development and staging environments often have different data characteristics — smaller volumes, synthetic data, incomplete referential integrity. Enforcing production-level rules in staging blocks development work on legitimate data that does not meet production constraints.
+
+```yaml
+# checks/orders.dev.yaml — permissive, unblocking
+checks:
+ - name: order_id_not_null
+ column: order_id
+ rules:
+ not_null: true
+ severity: warning # Advisory in dev
+
+ - name: amount_valid
+ column: amount
+ rules:
+ min: 0
+ severity: warning
+```
+
+```yaml
+# checks/orders.prod.yaml — strict, enforcing
+checks:
+ - name: order_id_not_null
+ column: order_id
+ rules:
+ not_null: true
+ severity: error # Blocks in production
+
+ - name: amount_valid
+ column: amount
+ rules:
+ not_null: true
+ type: float
+ min: 0.01
+ max: 1000000.00
+ severity: error
+```
+
+In GitHub Actions, the environment-specific config is selected by the workflow:
+
+```yaml
+- name: Validate
+ run: |
+ CONFIG="checks/orders.${{ vars.ENVIRONMENT }}.yaml"
+ datacheck validate -c "$CONFIG" --source production_db
+```
+
+In Airflow, the config path is parameterized via environment variable or Airflow Variable:
+
+```python
+import os
+
+validate_raw = DataCheckOperator(
+ task_id="validate_raw_orders",
+ config_path=f"/config/orders.{os.getenv('PIPELINE_ENV', 'prod')}.yaml",
+ source_name="staging_db",
+ table="orders_raw",
+    fail_on_error=(os.getenv("PIPELINE_ENV", "prod") == "prod"),
+)
+```
+
+Production is strict. Staging catches regressions without blocking on data that is legitimately absent or different in the non-production environment.
+
+---
+
+## Engineering Takeaways
+
+- **`trigger_rule='all_done'` silently bypasses validation gates.** Any Airflow task downstream of a validation gate with `trigger_rule` set to anything other than `all_success` (the default) will run regardless of validation outcome. Audit every `trigger_rule` in DAGs that contain validation tasks.
+
+- **`continue-on-error: true` in GitHub Actions removes the gate entirely.** A step with `continue-on-error: true` can exit non-zero without failing the job. This configuration should never appear on a validation step. The correct behavior is the default: omit the option.
+
+- **Missing `needs:` on deployment jobs creates parallel execution, not sequential gating.** A `load` job without `needs: [validate]` runs in parallel with validation, not after it. The dependency must be explicit. Every job that should be blocked by a validation failure must declare that dependency.
+
+- **Validation tasks should have `retries=0`.** Data quality failures (`exit 1`) are not transient. Retrying validation against the same bad data wastes time and reaches the same conclusion. Infrastructure failures (`exit 3`) may warrant retries — but this requires distinguishing exit codes at the retry policy level, not applying a blanket retry count.
+
+- **`fail-fast: true` on matrix validation jobs stops all parallel validations when any one fails.** This conserves compute and surfaces the first failure quickly. `fail-fast: false` collects all failures before reporting — useful when you want to see the full picture across all tables. Both are deliberate choices; the default matters.
+
+- **`on_failure_callback` makes the gate loud.** A failed validation task that quietly marks the DAG run as failed is not operationally visible enough in production. The callback routes the failure to the right alerting channel with task context, log URLs, and attribution — immediately, not when someone next opens the Airflow UI.
+
+- **`BranchPythonOperator` enables quarantine-and-continue as an alternative to halt-and-fix.** When the correct response to a validation failure is routing the bad batch to an error table rather than halting the pipeline, the branch pattern gives the DAG explicit routing logic based on validation outcome. This is a different policy decision from `fail_on_error=True` — both are correct for different scenarios.
+
+- **Environment-specific configs express different enforcement postures per stage.** Production rules enforce at `severity: error`. Staging rules may enforce at `severity: warning` for constraints that production data satisfies but synthetic staging data does not. The enforcement posture is a policy decision encoded in config, not in orchestration logic.
+
+---
+
+*DataCheck is an open-source deterministic validation engine. Install: `pip install datacheck-cli`. Source: [github.com/squrtech/datacheck](https://github.com/squrtech/datacheck).*
+
+
+
+LinkedIn Post
+
+An Airflow DAG had a validation task. It had been in production for six months. The team believed it was gating their pipeline.
+
+Someone had added trigger_rule='all_done' to the transform task four months earlier to handle an optional upstream branch. One line. It meant the transform task would run regardless of whether validation succeeded or failed. Every validation failure for four months had been logged, counted in the DAG summary, and silently ignored.
+
+The validation tool did its job. The orchestration configuration undid it.
+
+"A gate that doesn't propagate isn't a gate. It's a log entry."
+
+The structural failure modes come up consistently across both GitHub Actions and Airflow:
+
+- continue-on-error: true on the validation step removes the gate entirely
+- Missing needs: on downstream jobs creates parallel execution, not sequential gating
+- retries=2 on a validation task retries bad data against itself — three identical failures, ten minutes later
+
+Fail-fast pipeline design is not about adding validation steps. It's about auditing every trigger_rule, continue-on-error, needs:, and retry policy that touches a validation task — and confirming that none of them create a path where downstream work proceeds after the gate has fired.
\ No newline at end of file
diff --git a/blog/2026-02-23-deterministic-validation-vs-statistical-anomaly-detection.md b/blog/2026-02-23-deterministic-validation-vs-statistical-anomaly-detection.md
new file mode 100644
index 0000000..ba23941
--- /dev/null
+++ b/blog/2026-02-23-deterministic-validation-vs-statistical-anomaly-detection.md
@@ -0,0 +1,251 @@
+# Deterministic Validation vs Statistical Anomaly Detection
+
+*Published 2026-02-23 by the DataCheck engineering team*
+
+---
+
+Consider a column `payment_amount` with a business constraint: values must be between $0.01 and $50,000. A refund processing bug produces a batch containing negative values. You have two mechanisms that might catch it.
+
+The first executes:
+
+```sql
+SELECT COUNT(*) AS failed_count
+FROM payments
+WHERE payment_amount < 0.01 OR payment_amount > 50000
+```
+
+The second computes a z-score of the current batch's `payment_amount` distribution against a 30-day rolling historical baseline, and fires if the z-score exceeds a configured threshold.
+
+The first approach catches every negative value in the batch, with certainty, on every run, from day one of the pipeline's existence.
+
+The second approach may or may not catch it — depending on whether negative values have appeared before, what the variance in the historical distribution looks like, how the threshold is calibrated, and whether the current batch is large enough to shift the distribution signal.
+
+The first approach is not "simpler" in a pejorative sense. It knows something the second approach does not: that the valid range is explicitly $0.01 to $50,000. That knowledge came from an engineer who wrote it down. The second approach does not have that knowledge — it only knows what the historical data looked like.
+
+This is the core distinction between deterministic validation and statistical anomaly detection. It is not a question of sophistication. It is a question of which errors each approach is structurally capable of catching, and why.
+
+---
+
+## What Each Approach Actually Computes
+
+To understand the trade-offs, start with the algorithm.
+
+**Deterministic validation** evaluates a predicate against the data. For a `not_null` rule:
+
+```sql
+SELECT COUNT(*) AS failed_count
+FROM orders
+WHERE customer_id IS NULL
+```
+
+For an `allowed_values` rule:
+
+```sql
+SELECT COUNT(*) AS failed_count
+FROM orders
+WHERE status NOT IN ('pending', 'confirmed', 'shipped', 'cancelled')
+ AND status IS NOT NULL
+```
+
+For a `range` rule:
+
+```sql
+SELECT COUNT(*) AS failed_count
+FROM orders
+WHERE amount < 0 OR amount > 100000
+```
+
+The output of each query is an integer: the number of rows that violated the constraint. The mathematical operation is set membership or inequality evaluation — there is no statistical component. The result depends only on the current data and the rule definition.
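The same three checks, written as plain predicates over rows (a language-agnostic sketch, not DataCheck's implementation):

```python
# Each check reduces to counting rows that fail a predicate:
# null-ness, set membership, or an inequality. No history, no model.
rows = [
    {"customer_id": "C1", "status": "pending", "amount": 50.0},
    {"customer_id": None, "status": "shipped", "amount": 120.0},
    {"customer_id": "C3", "status": "unknown", "amount": -5.0},
]

ALLOWED = {"pending", "confirmed", "shipped", "cancelled"}

not_null_failed = sum(1 for r in rows if r["customer_id"] is None)
allowed_failed = sum(1 for r in rows
                     if r["status"] is not None and r["status"] not in ALLOWED)
range_failed = sum(1 for r in rows if r["amount"] < 0 or r["amount"] > 100_000)

print(not_null_failed, allowed_failed, range_failed)  # 1 1 1
```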
+
+**Statistical anomaly detection** computes deviation from a learned distribution. A common implementation using z-scores in SQL looks like this:
+
+```sql
+WITH historical_stats AS (
+ SELECT
+ AVG(daily_null_rate) AS mean_rate,
+ STDDEV(daily_null_rate) AS stddev_rate
+ FROM (
+ SELECT
+ run_date,
+ SUM(CASE WHEN customer_id IS NULL THEN 1.0 ELSE 0.0 END) / COUNT(*) AS daily_null_rate
+ FROM orders_history
+ GROUP BY run_date
+ ) daily
+),
+current_batch AS (
+ SELECT
+ SUM(CASE WHEN customer_id IS NULL THEN 1.0 ELSE 0.0 END) / COUNT(*) AS current_null_rate
+ FROM orders
+)
+SELECT
+ current_null_rate,
+ mean_rate,
+ stddev_rate,
+ (current_null_rate - mean_rate) / NULLIF(stddev_rate, 0) AS z_score
+FROM current_batch
+CROSS JOIN historical_stats
+```
+
+The output is a continuous score: the number of standard deviations by which the current batch's null rate deviates from the historical mean. Whether this constitutes a failure depends on a separately configured threshold — commonly 2.5 or 3.0 standard deviations. The result depends on the current data, the historical data used to compute the baseline, and the threshold value.
+
+Note the structural difference: deterministic validation requires the current data and a rule. Anomaly detection requires the current data, a historical dataset, a model or formula, and a threshold. Each of these additional inputs is a source of variance in the output.
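A small illustration of that variance, using toy numbers and the z-score formulation from the SQL above: the same current null rate gets a different verdict depending on which historical window feeds the baseline.

```python
import statistics

# Same current value, two candidate baselines: the verdict depends on
# an input (the history window) that deterministic rules do not have.
def z_score(current, history):
    sd = statistics.stdev(history)
    return (current - statistics.mean(history)) / sd if sd else float("inf")

current_null_rate = 0.035                           # today's null rate: 3.5%
calm_window = [0.030, 0.029, 0.031, 0.030, 0.032]   # low-variance history
noisy_window = [0.010, 0.030, 0.050, 0.020, 0.045]  # high-variance history

print(round(z_score(current_null_rate, calm_window), 1))   # 4.0: fires at 3.0
print(round(z_score(current_null_rate, noisy_window), 1))  # 0.2: silent
```

Nothing about the current data changed between the two calls; only the historical input did.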
+
+---
+
+## The Error Class Taxonomy
+
+These two approaches are suited to different categories of data error. Conflating the categories is where most teams go wrong.
+
+**Known constraint violations** are errors against explicit, pre-defined business rules. A value is null when the schema says it must not be. A status value is outside the allowed set. A foreign key references a record that does not exist. A timestamp is in the future when the domain requires it to be in the past. An ID does not match the expected UUID format.
+
+These errors are binary by definition. A customer ID either exists or it does not. An order status either belongs to the defined enum or it does not. There is no statistical ambiguity — only an engineer who knows the constraint.
+
+Deterministic validation is the correct tool for known constraint violations. It catches them with perfect sensitivity: if the constraint is violated, the count is non-zero. It catches them with perfect specificity: a count of zero means the constraint holds for every row, not that the violation was too subtle to detect.
+
+**Unknown pattern deviations** are anomalies against a dataset's expected statistical behavior — deviations you did not know to express as an explicit constraint. A volume that is 60% lower than typical. A cardinality explosion in a column that normally has stable cardinality. A correlation between two columns that has broken. A p99 latency value that is statistically inconsistent with recent history.
+
+These anomalies cannot be caught by deterministic rules because they require comparison against historical context. You do not know the "normal" volume of your pipeline without observing it over time. You cannot write a deterministic rule for "row count consistent with historical distribution" — that rule is, by definition, a statistical computation.
+
+Statistical anomaly detection is the correct tool for unknown pattern deviations. It surfaces signals you did not know to check for. It provides coverage for a class of problems that is simply not expressible as explicit predicates.
+
+The critical insight: **these domains are largely non-overlapping.** Applying anomaly detection to known constraint violations does not give you better coverage — it gives you a noisier, less reliable version of coverage you could have gotten with a three-line predicate.
+
+---
+
+## Reproducibility as a First-Class Guarantee
+
+Deterministic validation is reproducible by construction: the same input data with the same rules produces the same output, always. This property has concrete engineering implications that become visible at scale.
+
+**Debugging.** When a validation run fails, you can reproduce the failure exactly by re-running the same command against the same data. The failure is not a product of model state, threshold calibration, or historical distribution — it is a direct consequence of the data and the rule. You can trace from the failure to the offending rows without understanding any model internals.
+
+**Environment parity.** Deterministic rules work identically in development, staging, and production. The predicate `amount < 0` produces the same result in every environment for the same input data. An anomaly model trained on production traffic does not have a valid baseline in a staging environment — it will fire on staging data that looks anomalous relative to production patterns, and miss production-scale anomalies that are too subtle to appear in lower-traffic staging runs.
+
+**Auditability.** A validation result of "3,412 rows violated the range constraint on `payment_amount`" is a fully auditable finding. An auditor can understand it, reproduce it, and verify it without any knowledge of the validation tool's internals. "The anomaly model returned a score of 0.73 which exceeded the threshold of 0.60 based on a 30-day rolling IQR baseline for the `payment_amount` distribution" is not auditable in the same sense — reproducing it requires access to the same 30 days of historical data, the same model implementation, and the same threshold configuration.
+
+In regulated industries — payments, healthcare, financial reporting — auditability is not optional. Deterministic rules with explicit, versioned YAML configs produce an audit trail that is a diff in a source control system. The rule was added in commit `a3f8c2d` on a specific date. The rule changed in commit `f91b447` three months later. The reviewer approved it in a pull request. That is an auditable history.
+
+---
+
+## The Confounder Problem
+
+Statistical models are susceptible to confounders in the data — external factors that correlate with the metric being measured and produce systematic patterns that look anomalous relative to an unconditional baseline.
+
+**Seasonality.** Null rates, row counts, and value distributions often follow weekly or monthly cycles. A null rate of 0.1% on a Tuesday might be normal, while the same rate on a Saturday might be two standard deviations above the weekend mean. An unconditional anomaly model fires on the Saturday rate. An engineer investigates and finds nothing wrong. The model is tuned. The useful signal is reduced.
+
+**Promotional events.** A marketing campaign that drives 4x normal order volume creates a distribution shift across dozens of metrics. The anomaly model fires on volume, on cardinality, on value distributions. None of these are quality failures. They are expected consequences of the event. The model requires either explicit event calendars fed as features or manual suppression during known events.
+
+**Data migrations.** When an upstream source migrates to a new system with different ID formats, different precision, or different encoding, the historical baseline becomes structurally invalid. The new data distribution is entirely consistent and correct — but it is inconsistent with the old distribution that the model was trained on.
+
+Deterministic rules are immune to these confounders. The predicate `status NOT IN ('pending', 'confirmed', 'shipped', 'cancelled')` evaluates identically on a Tuesday and a Saturday, during a campaign and outside one, before and after an upstream migration. The constraint is about what values are valid, not about what values are typical.
+
+```yaml
+checks:
+ - name: order_status_valid
+ column: status
+ rules:
+ not_null: true
+ allowed_values: [pending, confirmed, shipped, cancelled, refunded]
+ severity: error
+
+ - name: order_amount_valid
+ column: amount
+ rules:
+ not_null: true
+ type: float
+ min: 0.01
+ max: 50000.00
+ severity: error
+
+ - name: customer_id_format
+ column: customer_id
+ rules:
+ not_null: true
+ regex: '^CUST-[0-9]{8}$'
+ severity: error
+
+ - name: created_at_valid
+ column: created_at
+ rules:
+ not_null: true
+ no_future_timestamps: true
+ severity: error
+```
+
+These rules do not care about the day of the week. They do not care about seasonal patterns, campaign volume, or upstream migrations. They care about whether each row satisfies the constraint.
+
+---
+
+## The Legibility Gap
+
+Deterministic validation rules are, by design, human-readable.
+
+A rule that reads `min: 0.01` is legible to every engineer on the team. It can be reviewed in a pull request. It can be explained to a new team member in seconds. Its behavior in every possible input scenario is fully predictable from the rule definition alone. Changing the rule requires a code review. Deleting the rule leaves a record in source control.
+
+An anomaly detection model's decision boundary is not legible in the same sense. "This batch failed because the z-score of the null rate was 2.83 standard deviations above the 90-day mean" requires understanding the historical baseline, the z-score calculation, and the threshold calibration to evaluate. You cannot review that decision in a pull request. You cannot reproduce it without the same historical data.
+
+This is not an indictment of anomaly detection — it is a description of an inherent trade-off. Models are powerful precisely because they can capture patterns that are too complex to express as explicit rules. That power comes at the cost of legibility. For the subset of quality checks that can be expressed as explicit rules — which is the majority of enforcement-critical checks — the legibility cost is unnecessary.
+
+```bash
+# The entire enforcement decision is visible in one command:
+datacheck validate -c checks/orders.datacheck.yaml
+
+ PASS order_status_valid (0 failures / 2,341,887 rows)
+ FAIL order_amount_valid (1,203 failures / 2,341,887 rows)
+ PASS customer_id_format (0 failures / 2,341,887 rows)
+ PASS created_at_valid (0 failures / 2,341,887 rows)
+
+ Rules: 4 total 3 passed 1 failed
+ Exit code: 1
+```
+
+The output is self-explanatory. 1,203 rows in the current batch have an `amount` value that falls outside the valid range. No model internals. No threshold explanation. No historical context required to understand the failure.
+
+---
+
+## Where Anomaly Detection Genuinely Wins
+
+Anomaly detection has a domain where it provides coverage that deterministic validation cannot: error classes you did not know to check for.
+
+A volume anomaly that drops row count by 40% is not expressible as a deterministic rule unless you know the expected row count range in advance. An unexpected correlation between `discount_rate` and `customer_segment` breaking is not expressible as a constraint rule. A p95 value for processing latency trending 3x higher than typical is not a constraint violation — it is a statistical signal.
+
+These are real data quality signals. They represent value. The correct response is to use anomaly detection for this class of problem and deterministic validation for the class of problems described above — not to use anomaly detection as a universal substitute for explicit rules.
+
+A mature data platform uses both. Deterministic rules at the enforcement gate enforce what you know. Anomaly detection in the monitoring layer surfaces what you did not know to check. The division of responsibility maps cleanly to the error taxonomy: known constraint violations belong to deterministic validation; unknown pattern deviations belong to statistical analysis.
+
+---
+
+## Engineering Takeaways
+
+- **Deterministic validation and anomaly detection compute fundamentally different things.** Validation evaluates a predicate — a binary function of the current data against an explicit rule. Anomaly detection estimates deviation from a learned distribution — a continuous function of the current data against historical state. The outputs have different properties and are suited to different purposes.
+
+- **Known constraint violations are always better caught by deterministic rules.** If you know a column should not be null, write `not_null: true`. The rule catches it with 100% sensitivity on every run. Using anomaly detection for known constraints trades guaranteed coverage for probabilistic coverage — with no upside.
+
+- **Reproducibility is a first-class guarantee of deterministic validation.** The same data with the same rules produces the same result everywhere — in development, staging, and production. Anomaly models depend on external state (training data, weights, thresholds) that varies across environments and over time.
+
+- **Deterministic rules are auditable by design; model decisions are not.** A YAML rule reviewed in a pull request produces an audit trail in source control. A model decision that depends on a 30-day rolling baseline does not produce a comparable audit record. For regulated environments, this difference is operationally significant.
+
+- **Statistical confounders systematically degrade anomaly models on data that follows expected business patterns.** Seasonality, campaigns, and migrations produce distribution shifts that are legitimate and expected. Deterministic predicates are immune to confounders — a constraint violation is a constraint violation regardless of the day or the traffic profile.
+
+- **Anomaly detection is the correct tool for unknown pattern deviations, not for known constraint enforcement.** Volume anomalies, unexpected correlation breaks, and cardinality explosions are legitimately hard to express as explicit rules. These are the use cases anomaly detection was built for. Applying it to null checks and format validation is using a tool outside its designed domain.
+
+- **The combined architecture is not a compromise — it is the precise mapping of each tool to its error class.** Deterministic rules at the gate for constraints you can express. Statistical analysis in the monitoring layer for patterns you cannot predict. The boundary between them follows the boundary between known and unknown — which is the most natural division available.
+
+---
+
+*DataCheck is an open-source deterministic validation engine. Install: `pip install datacheck-cli`. Source: [github.com/squrtech/datacheck](https://github.com/squrtech/datacheck).*
+
+
+
+LinkedIn Post
+
+Most teams think of deterministic validation and anomaly detection as competing approaches to data quality. They're not competing — they operate on different error classes and produce different guarantees.
+
+Deterministic validation evaluates a predicate: is this value in the allowed set, is this column null, does this ID match the expected format. The output is a count. Same data, same rule, same result — every time, in every environment.
+
+Anomaly detection estimates deviation from a learned distribution. It requires historical state — training data, model weights, a calibrated threshold. Change the training window, get a different answer for the same input data.
+
+"If you know a column should never be null, write not_null: true. Using a statistical model to catch a known constraint is trading a guaranteed predicate for a probabilistic approximation — with no upside."
+
+The domain split is clean: deterministic rules for constraints you can express, statistical analysis for patterns you cannot predict. Volume anomalies and unexpected correlation breaks are legitimately hard to encode as explicit rules — that's what anomaly detection was built for. Null checks and format validation are not.
+
+The mistake isn't using anomaly detection. It's using it for the wrong error class.
\ No newline at end of file
diff --git a/blog/2026-02-23-exit-codes-as-contracts-in-cicd-for-data-pipelines.md b/blog/2026-02-23-exit-codes-as-contracts-in-cicd-for-data-pipelines.md
new file mode 100644
index 0000000..5426b11
--- /dev/null
+++ b/blog/2026-02-23-exit-codes-as-contracts-in-cicd-for-data-pipelines.md
@@ -0,0 +1,388 @@
+# Exit Codes as Contracts in CI/CD for Data Pipelines
+
+*Published 2026-02-23 by the DataCheck engineering team*
+
+---
+
+A process exits. One integer is written to the process table. The parent process reads it. If it is zero, the next command runs. If it is non-zero, the next command does not.
+
+This mechanism is fifty years old. It works on every operating system that has ever run a CI pipeline. It requires no network connection, no shared database, no external service, no registration. It is the lowest-common-denominator interface for process orchestration — which is precisely why it is the most reliable one.
+
+Every CI system ever built — Jenkins, GitHub Actions, CircleCI, Buildkite, GitLab CI — treats a non-zero exit code as a job failure. Every shell ever written interprets `&&` as "run the next command only if the previous one exited zero." Every orchestrator that runs DAG tasks reads the exit code of the process it spawned to determine whether the task succeeded.
+
+Data pipeline enforcement built on exit codes inherits all of this — for free, without integration work, across every environment the pipeline runs in. The exit code is not just a return value. It is a contract between the validation tool and every system that might ever invoke it.
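The mechanism itself fits in a few lines. A sketch of what every parent process does, with `sh -c 'exit 1'` standing in for a failing `datacheck validate`:

```python
import subprocess

# The parent spawns a child, waits for it, and reads one integer.
gate = subprocess.run(["sh", "-c", "exit 1"])

if gate.returncode == 0:
    print("gate open: run the next step")
else:
    print(f"gate closed: exit code {gate.returncode}")
```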
+
+---
+
+## What the Contract Says
+
+A contract has terms. The exit code contract for a data validation tool has specific terms that matter for how a pipeline is designed around it.
+
+DataCheck exits with one of five codes, each carrying a distinct semantic meaning:
+
+| Code | Meaning | Downstream implication |
+|------|---------|----------------------|
+| `0` | All error-severity rules passed | Pipeline may proceed |
+| `1` | One or more error-severity rules failed | Pipeline must stop; data does not meet its contract |
+| `2` | Configuration error | Pipeline must stop; the validation config itself is malformed |
+| `3` | Data loading error | Pipeline must stop; the data source was unreachable |
+| `4` | Unexpected internal error | Pipeline must stop; investigate the validator |
+
+Exit code `1` and exit code `3` carry different information. Both fail the pipeline, but they indicate different problems requiring different responses. Exit `1` means the data arrived but is bad. A retry of the same pipeline run will produce the same bad data until the upstream issue is fixed. Exit `3` means the data source was unavailable. A retry may succeed if the outage was transient.
+
+A pipeline that inspects exit codes can route these cases differently:
+
+```bash
+#!/bin/bash
+# Deliberately no `set -e` here: it would abort the script on a non-zero
+# exit before $VALIDATION_EXIT could be captured below.
+
+datacheck validate -c checks/orders.datacheck.yaml
+VALIDATION_EXIT=$?
+
+case $VALIDATION_EXIT in
+ 0)
+ echo "Validation passed — proceeding to load"
+ python load_to_warehouse.py
+ ;;
+ 1)
+ echo "Data quality failure — data does not meet contract" >&2
+ # Do not retry — the data is bad, not the infrastructure
+ exit 1
+ ;;
+ 2)
+ echo "Configuration error — alerting platform team" >&2
+ curl -X POST "$SLACK_WEBHOOK" \
+ -d '{"text": "DataCheck config error on orders pipeline"}'
+ exit 2
+ ;;
+ 3)
+ echo "Data source unavailable — scheduling retry" >&2
+ # Signal to the scheduler that this is a transient failure
+ exit 75 # EX_TEMPFAIL — conventional retry signal on some systems
+ ;;
+  *)
+    echo "Unexpected validator error (exit $VALIDATION_EXIT)" >&2
+    exit "$VALIDATION_EXIT"
+    ;;
+esac
+```
+
+Most pipelines will not need this level of exit code inspection. The common case — `set -e` and let any non-zero exit halt the script — is correct for most enforcement gates. But the semantic distinction exists and is available when the pipeline design requires it.
+
+---
+
+## Severity as the Policy Layer
+
+The exit code contract has one configurable dimension: which rule failures count as exit `1` and which do not. This is controlled by the `severity` field on each rule.
+
+```yaml
+checks:
+ - name: order_id_not_null
+ column: order_id
+ rules:
+ not_null: true
+ severity: error # Failure → exit 1. Pipeline stops.
+
+ - name: email_format
+ column: email
+ rules:
+ regex: '^[^@]+@[^@]+\.[^@]+$'
+ severity: warning # Failure → reported, but exit remains 0.
+
+ - name: phone_populated
+ column: phone
+ rules:
+ not_null: true
+ severity: info # Failure → logged only. No output to stderr.
+```
+
+DataCheck computes the exit code after evaluating all rules:
+
+- Any `error`-severity rule failure → exit `1`
+- Only `warning` or `info` failures, no `error` failures → exit `0`
+- All rules pass → exit `0`
+
+Severity is not a quality judgment about the rule. It is a policy decision about whether a given rule failure should stop the pipeline. Both the `error` rule and the `warning` rule are deterministic — they either pass or fail, with the same certainty. The difference is what the pipeline does when they fail.
+
+This makes severity the mechanism for expressing enforcement policy in config, not in orchestration logic. You do not need a separate decision layer that reads validation results and decides whether to stop the pipeline — the exit code already encodes that decision, and the severity settings on each rule are the policy that drives it.
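That computation is small enough to state directly. A sketch of the mapping as described above (not DataCheck's actual source):

```python
# The exit code is a pure function of per-rule outcomes and severities.
def exit_code(results):
    """results: iterable of (passed: bool, severity: str) pairs."""
    if any(not passed and severity == "error" for passed, severity in results):
        return 1
    return 0  # warning/info failures are reported but do not gate

print(exit_code([(True, "error"), (False, "warning"), (False, "info")]))  # 0
print(exit_code([(True, "error"), (False, "error")]))                     # 1
```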
+
+---
+
+## Shell Composability
+
+The exit code is what makes a validation tool composable with any Unix toolchain. Because DataCheck is a subprocess that exits with a standard code, it can participate in any shell composition pattern without modification.
+
+**Sequential enforcement with `&&`:**
+
+```bash
+# Each step only runs if the previous step exited 0
+datacheck validate -c checks/raw.yaml \
+ && dbt run --models staging \
+ && datacheck validate -c checks/staging.yaml \
+ && dbt run --models marts \
+ && datacheck validate -c checks/marts.yaml \
+ && python publish.py
+```
+
+If any DataCheck validation fails, the chain stops at that point. The subsequent dbt run and all downstream steps are skipped. The pipeline halts exactly where the contract was violated.
+
+**Script-level enforcement with `set -e`:**
+
+```bash
+#!/bin/bash
+set -e # Exit the script on any non-zero exit code
+set -o pipefail # Extend to pipeline failures
+
+datacheck validate -c checks/orders.yaml
+python transform_orders.py
+datacheck validate -c checks/orders_mart.yaml
+python publish_orders_mart.py
+```
+
+`set -e` means every DataCheck invocation in the script is implicitly an enforcement gate. If any validation fails, the script exits immediately — the remaining commands do not run. This is equivalent to explicit `&&` chaining but removes the repetition.
+
+**Makefile dependency chains:**
+
+```makefile
+validate-raw:
+ datacheck validate -c checks/raw.yaml
+
+transform: validate-raw
+ dbt run --models staging
+
+validate-staging: transform
+ datacheck validate -c checks/staging.yaml
+
+publish: validate-staging
+ python publish.py
+```
+
+Make enforces the dependency graph: `publish` depends on `validate-staging`, which depends on `transform`, which depends on `validate-raw`. A non-zero exit from any target prevents dependent targets from running. This is the same sequential enforcement, expressed as a declarative graph rather than an imperative script.
+
+All three patterns work without any DataCheck-specific integration. The shell, Make, and any other Unix toolchain that respects exit codes enforce the gate automatically.
+
+---
+
+## How Orchestrators Consume Exit Codes
+
+Each orchestration environment reads exit codes through its own abstraction, but the underlying mechanism is the same.
+
+**Apache Airflow BashOperator:**
+
+```python
+from airflow.operators.bash import BashOperator
+
+validate_orders = BashOperator(
+ task_id="validate_orders",
+ bash_command="datacheck validate -c /config/orders.yaml",
+)
+```
+
+Airflow's `BashOperator` spawns a subprocess and waits for it to exit. If the subprocess exits with a non-zero code, the operator raises `AirflowException`. Airflow marks the task as failed. Downstream tasks that depend on `validate_orders` are skipped. The DAG run is marked failed. Airflow's retry policy applies if configured.
+
+The DataCheck Airflow operator wraps this pattern with validation-specific options:
+
+```python
+from airflow_provider_datacheck.operators.datacheck import DataCheckOperator
+
+validate_orders = DataCheckOperator(
+ task_id="validate_orders",
+ config_path="/config/orders.yaml",
+ source_name="production_db",
+ table="orders",
+ fail_on_error=True,
+)
+```
+
+`fail_on_error=True` maps to the exit code contract: if DataCheck exits `1`, raise `AirflowException`. `fail_on_error=False` maps to running DataCheck in warning-only mode — the task reports results but does not fail the DAG regardless of exit code.
+
+**Kubernetes Jobs:**
+
+```yaml
+apiVersion: batch/v1
+kind: Job
+metadata:
+ name: validate-orders
+spec:
+ template:
+ spec:
+ containers:
+ - name: datacheck
+ image: python:3.12-slim
+ command:
+ - sh
+ - -c
+ - |
+ pip install -q datacheck-cli[postgresql] && \
+ datacheck validate -c /config/orders.yaml
+ env:
+ - name: DB_HOST
+ valueFrom:
+ secretKeyRef:
+ name: db-credentials
+ key: host
+ restartPolicy: Never # Don't retry data quality failures
+ backoffLimit: 0 # No retries — a failed validation is not a transient error
+```
+
+Kubernetes reads the container's exit code. Exit non-zero marks the Job as failed. `restartPolicy: Never` with `backoffLimit: 0` is the correct policy for data quality failures — unlike infrastructure failures, a bad dataset does not fix itself on retry. The exit code semantics of DataCheck directly inform the Kubernetes Job policy.
+
+**Argo Workflows:**
+
+```yaml
+- name: validate-and-load
+ dag:
+ tasks:
+ - name: validate-raw
+ template: datacheck-validate
+ arguments:
+ parameters:
+ - name: config
+ value: checks/raw.yaml
+
+ - name: transform
+ dependencies: [validate-raw]
+ template: dbt-run
+
+ - name: validate-mart
+ dependencies: [transform]
+ template: datacheck-validate
+ arguments:
+ parameters:
+ - name: config
+ value: checks/mart.yaml
+
+ - name: publish
+ dependencies: [validate-mart]
+ template: publish-data
+```
+
+Argo's DAG task dependencies resolve based on task success status — which is derived from exit codes. A failed `validate-raw` task prevents `transform` from starting. This is the same DAG enforcement pattern as Airflow, but driven through Argo's declarative workflow definition.
+
+---
+
+## The Canary Approach to Rule Introduction
+
+One practical implication of the severity-to-exit-code mapping: you can introduce new validation rules without immediately blocking the pipeline, observe their behavior in production, and promote them to enforcement when confident they are correct.
+
+The pattern is deliberate severity staging:
+
+**Stage 1 — Observe.** Add the rule at `severity: warning`. The pipeline runs unchanged. The rule reports failures to the output, but the exit code remains `0` when only warnings fire. You learn whether the rule produces false positives against live production data.
+
+```yaml
+# Stage 1: observe behavior without blocking
+- name: amount_precision
+ column: amount
+ rules:
+ regex: '^\d+\.\d{2}$' # Require exactly 2 decimal places
+ severity: warning
+```
+
+**Stage 2 — Evaluate.** Run the pipeline for several cycles. If the rule never fires, it is either correct or the constraint is already consistently met. If it fires frequently on legitimate data, the rule is wrong and needs adjustment.
+
+**Stage 3 — Enforce.** Change `severity: error`. The next pipeline run where this rule fails will exit `1`.
+
+```yaml
+# Stage 3: enforce after validation in production
+- name: amount_precision
+ column: amount
+ rules:
+ regex: '^\d+\.\d{2}$'
+ severity: error
+```
+
+This severity promotion is a single-line diff in the config file. It produces a clear audit trail in source control: the rule was introduced as a warning on one date, promoted to error on another, with the reason visible in the commit message and PR review.
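In review, the promotion looks like this (rule name taken from the example above):

```diff
 - name: amount_precision
   column: amount
   rules:
     regex: '^\d+\.\d{2}$'
-  severity: warning
+  severity: error
```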
+
+The exit code contract makes this migration path safe: the validation tool never accidentally enforces a rule that is not yet at `severity: error`. The mapping from severity to exit code is explicit and stable.
+
+---
+
+## The Dual-Channel Output Pattern
+
+When DataCheck runs with SARIF output, two independent channels carry information:
+
+```bash
+datacheck validate -c checks/orders.yaml \
+ --format sarif \
+ --output results.sarif
+# exit code: 0 or 1 (enforcement)
+# results.sarif: human-readable annotation data (reporting)
+```
+
+The exit code drives pipeline enforcement. The SARIF file drives annotation — GitHub Security tab, IDE integration, human-readable reports. These channels are independent. The enforcement decision (proceed or stop) is made by the parent process reading the exit code. The annotation experience is driven by the SARIF file being uploaded to GitHub's security endpoint.
+
+This separation matters: you can configure DataCheck to write SARIF and still let the exit code gate the pipeline. The reporting output does not influence the enforcement output. A pipeline that generates a SARIF report and still exits `1` blocks the merge and annotates the PR simultaneously.
+
+```yaml
+# .github/workflows/data-quality.yml
+- name: Validate
+ run: |
+ datacheck validate -c .datacheck.yaml \
+ --format sarif --output results.sarif
+ # Non-zero exit fails this step → fails the job → blocks merge
+
+- name: Upload annotations
+ uses: github/codeql-action/upload-sarif@v3
+ if: always() # Upload even if the previous step failed
+ with:
+ sarif_file: results.sarif
+```
+
+`if: always()` ensures the SARIF upload runs even when DataCheck exits `1`. The enforcement (blocked merge) and the annotation (inline PR comments showing which rules failed) both happen. Neither depends on the other.
+
+---
+
+## Why Not Exceptions?
+
+The DataCheck Python API surfaces a failed validation as a Python exception, which is the correct interface for Python callers:
+
+```python
+from datacheck import ValidationEngine
+
+engine = ValidationEngine(config_path=".datacheck.yaml")
+summary = engine.validate()
+
+if not summary.all_passed:
+ raise ValueError(f"Validation failed: {summary.failed_rules} rules failed")
+```
+
+The CLI translates validation outcomes to exit codes instead of exceptions — not because exceptions are wrong, but because exceptions are a Python-specific mechanism that does not exist in the environments where pipeline enforcement operates.
+
+A Bash script cannot catch a Python exception. GitHub Actions cannot route on a Python exception. Kubernetes cannot set job completion policy based on a Python exception. Airflow can catch exceptions from Python operators, but the `BashOperator` — and the DataCheck CLI it invokes — uses exit codes.
+
+The exit code is the interface that all of these environments share. It is the result of fifty years of convergence on a universal process communication mechanism. A data validation CLI that exits with meaningful codes is a tool that works everywhere pipeline execution happens — without any environment-specific integration.
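A sketch of what that translation layer looks like — all names here are hypothetical, and the codes follow the table earlier in this post:

```python
from dataclasses import dataclass

# Hypothetical CLI entry point: Python-level outcomes become exit codes
# that shells, CI systems, and orchestrators can all consume.

class ConfigError(Exception): ...
class DataLoadError(Exception): ...

@dataclass
class Summary:
    all_passed: bool

def run_validation() -> Summary:
    # Stand-in for the real engine; imagine config + data loading here.
    return Summary(all_passed=False)

def main() -> int:
    try:
        summary = run_validation()
    except ConfigError:
        return 2   # malformed validation config
    except DataLoadError:
        return 3   # data source unreachable
    except Exception:
        return 4   # unexpected internal error
    return 0 if summary.all_passed else 1

print(main())  # a real CLI would call sys.exit(main())
```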
+
+---
+
+## Engineering Takeaways
+
+- **Exit codes are a bilateral contract.** The validation tool commits to specific exit code semantics; the caller commits to enforcement based on those semantics. The contract requires no network, no shared state, and no registration — it is carried in the process table.
+
+- **Exit code `1` and exit code `3` are distinct failure modes requiring different pipeline responses.** Exit `1` means data failed its contract — a retry will not fix it, the data is bad. Exit `3` means the data source was unreachable — a retry may succeed. Pipelines that inspect exit codes can route these cases to different recovery paths.
+
+- **Severity is the policy layer that maps rule failures to exit code determination.** `severity: error` rules contribute to exit `1`. `severity: warning` rules do not. Severity is not a quality judgment — it is an explicit policy decision about what should stop the pipeline, encoded in config and reviewed in PRs.
+
+- **Shell composability (`&&`, `set -e`, Makefile dependencies) requires no integration code.** Because DataCheck exits with standard POSIX codes, it participates correctly in any Unix pipeline composition pattern without modification. The shell is the orchestrator for the simple cases.
+
+- **Different orchestrators consume exit codes through different abstractions, but the underlying mechanism is identical.** Airflow raises `AirflowException`. Kubernetes marks the Job failed. Argo marks the task failed. All of them read the process exit code. The exit code contract works across all of them without DataCheck needing to know which one is running it.
+
+- **The canary pattern — warning before error — is the safe way to introduce new rules to a production pipeline.** Add the rule at `severity: warning`, observe its behavior over multiple pipeline runs, then promote to `severity: error`. The severity promotion is a single-line diff that produces an auditable trail in source control.
+
+- **Enforcement and reporting are independent channels.** The exit code gates the pipeline. The SARIF file annotates the PR. Using `if: always()` on the upload step ensures annotations appear even when the gate fails. Neither channel depends on the other.
+
+- **Exit codes outlast any specific orchestration technology.** Airflow gets replaced. CI providers change. Kubernetes versions deprecate APIs. The POSIX exit code contract has been stable since the 1970s and is honored by effectively every process scheduler in use. A tool that uses exit codes as its enforcement interface will work with whatever runs it, now and in the future.
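The canary promotion in the takeaway above is a single-field change. A sketch, with a hypothetical rule body and the `severity` semantics described in this post:

```yaml
checks:
  - name: discount_rate_bounded   # hypothetical new rule
    column: discount_rate
    rules:
      min: 0
      max: 1
    severity: warning   # canary phase: the rule fires and reports, exit stays 0
    # after observing production runs, the promotion is this one line:
    # severity: error   # now contributes to exit 1 and gates the pipeline
```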
+
+---
+
+*DataCheck is an open-source deterministic validation engine. Install: `pip install datacheck-cli`. Source: [github.com/squrtech/datacheck](https://github.com/squrtech/datacheck).*
+
+
+**LinkedIn Post**
+
+A process exits. One integer is written to the process table. The parent process reads it. Everything else — GitHub Actions failing a job, Airflow marking a task failed, Kubernetes marking a Job failed, Bash skipping the next command — flows from that one integer.
+
+This mechanism is fifty years old. It works identically on every CI system ever built, every shell ever written, every orchestrator ever deployed. And it is exactly the right interface for data pipeline enforcement.
+
+The interesting design decision is what different exit codes mean. Exit 1 means the data failed its validation contract — a retry won't fix it, the data is bad. Exit 3 means the data source was unreachable — a retry might succeed. These are different failure modes requiring different pipeline responses, and a tool that expresses them as different codes lets orchestrators route them correctly.
+
+"An exit code is a contract. The tool commits to specific semantics; the caller commits to enforcement. No network, no shared state, no registration required."
+
+There's also a migration pattern here: introducing new validation rules at severity: warning before promoting them to severity: error. The rule fires and reports, but exit remains 0. You observe behavior over production pipeline runs, then flip one YAML field to enforce. The promotion is a single-line diff with a clear audit trail.
+
+The universality is the feature. Tools that depend on webhooks, APIs, or platform-specific plugins work with one orchestrator. Exit codes work with all of them.
\ No newline at end of file
diff --git a/blog/2026-02-23-if-code-must-pass-ci-data-should-too.md b/blog/2026-02-23-if-code-must-pass-ci-data-should-too.md
new file mode 100644
index 0000000..0af09fb
--- /dev/null
+++ b/blog/2026-02-23-if-code-must-pass-ci-data-should-too.md
@@ -0,0 +1,305 @@
+# If Code Must Pass CI, Data Should Too
+
+*Published 2026-02-23 by the DataCheck engineering team*
+
+---
+
+Every software team runs CI on code. No one debates whether a PR needs to pass tests and linting before it merges. The gate is structural — the deployment step does not run if tests fail. That constraint is enforced by the pipeline, not by convention.
+
+The data those systems produce and consume? In most engineering organizations, it ships with zero gates.
+
+This asymmetry is the source of a specific class of incident that is hard to attribute, slow to debug, and completely preventable.
+
+---
+
+## The Structural Gap
+
+When an engineer changes an API response schema, the client tests catch it. When an engineer removes a function, the type checker fails. When an engineer introduces a calculation error, unit tests flag it. The feedback loop is tight: push, CI runs, fail fast, fix before merge.
+
+The data equivalent rarely exists. A data engineer updates an ETL job and a column that was `float` is now emitted as `string` — because a library changed, because implicit coercion happened somewhere, because the upstream source system changed its format. The ETL code change ships. The downstream consumer parses the column numerically. A sum aggregation returns `NaN`. A financial report silently zeroes out. The incident surfaces two days later when a user notices the numbers don't match.
+
+The code tests passed. Nothing in CI caught it. Because data was not part of CI.
+
+The problem is not that engineers do not care about data quality. It is structural: there is no gate. Data contracts change silently alongside code changes because nothing enforces them at the point of change.
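The coercion path is easy to reproduce. A minimal sketch with hypothetical values shows how one silently coerced string poisons a numeric aggregate:

```python
import math

# Upstream used to emit floats; after the ETL change the same
# column arrives as strings, including a stray non-numeric value.
raw_column = ["19.99", "42.50", "n/a"]

def coerce(value):
    """Silent numeric coercion: the seed of the incident."""
    try:
        return float(value)
    except ValueError:
        return float("nan")

total = sum(coerce(v) for v in raw_column)
print(math.isnan(total))  # the single bad value poisons the whole sum
```

Nothing raised an exception along the way, which is exactly why the incident surfaces days later instead of at the point of change.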
+
+---
+
+## What Code CI Actually Does
+
+Strip away the tooling and CI does one thing: it runs a set of rules against an artifact, reports whether the rules passed or failed, and returns an exit code. The exit code is the contract. `0` means proceed. Non-zero means stop.
+
+Everything else — PR annotations, build badges, Slack notifications — is UI around that exit code.
+
+Linters enforce style and correctness rules before code merges. Static analysis enforces type safety. Tests enforce behavioral contracts. Each tool runs as a subprocess, writes output to stdout, and exits with a code. The CI orchestrator does not care which tool ran — it only reads the exit code.
+
+This mechanism is simple, composable, and universal. It works with any tool that respects POSIX exit conventions. And it is exactly the mechanism that data validation needs. The tooling gap is not conceptual — it is that data validation tools have historically been built for dashboards, not for pipelines. The primary interface was a UI, not an exit code.
+
+---
+
+## Where the Gate Lives
+
+The critical question is not whether to add a data gate — it is where in the pipeline to place it.
+
+**Before ingestion.** Validation runs against source data before it is loaded into the warehouse. If the source fails, the load does not happen. This is the cleanest position for catching upstream schema drift, but it requires access to the source at pipeline time.
+
+**After load, before transformation.** The gate runs against the raw table after ingestion but before dbt, before aggregations, before any downstream consumer touches it. This is the most common production-viable position and catches both source issues and load bugs.
+
+**After transformation, before serving.** The gate runs against the final artifact — the mart table, the parquet export, the API dataset — immediately before it is published. This catches transformation bugs that gating on raw data would miss.
+
+Production pipelines typically need both of the last two gates: after load, and before serving. In either case the key property is that the gate runs before the next stage can proceed — not as a parallel monitor that fires alerts after the fact, but as a sequential dependency that blocks progress.
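A minimal shell sketch of those two gates (script names hypothetical; the `datacheck validate -c` invocation matches the CLI usage shown later in this post):

```shell
#!/bin/bash
set -e  # any non-zero exit stops the pipeline at that line

python load_raw.py                                        # ingest into the raw table
datacheck validate -c checks/orders-raw.datacheck.yaml    # gate: post-load
python transform.py                                       # dbt models / aggregations
datacheck validate -c checks/orders-mart.datacheck.yaml   # gate: pre-serving
python publish.py                                         # reached only if both gates pass
```

The ordering is the enforcement: each gate is a line the script cannot get past with bad data.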
+
+---
+
+## Config as a Versioned Contract
+
+The second requirement is treating validation rules as code. Not as dashboard configurations, not as UI settings, not as metadata in a catalog — as text files that live in the repository, get reviewed in pull requests, and evolve alongside the code that produces the data.
+
+```yaml
+# checks/orders.datacheck.yaml
+
+data_source:
+ type: postgresql
+ # connection via sources.yaml
+
+checks:
+ - name: order_id_integrity
+ column: order_id
+ rules:
+ not_null: true
+ unique: true
+
+ - name: order_total_valid
+ column: order_total
+ rules:
+ not_null: true
+ type: float
+ min: 0
+ max: 1000000
+
+ - name: status_constrained
+ column: status
+ rules:
+ not_null: true
+ allowed_values: [pending, confirmed, shipped, cancelled, refunded]
+
+ - name: created_at_fresh
+ column: created_at
+ rules:
+ not_null: true
+ max_age: 7d
+```
+
+This file belongs in the same repository as the ETL code that produces the `orders` table. When an engineer changes the ETL, the diff includes any validation config change. The PR reviewer sees both. If the ETL now emits `order_total` as a string, the reviewer sees that the `type: float` rule was removed or modified — that is a conversation before merge, not an incident two days later.
+
+Without the config in the repository, there is no signal that the data contract changed. It is an invisible breaking change dressed as a code change.
+
+---
+
+## Schema Contracts as Breaking Change Detection
+
+One pattern deserves explicit treatment: schema contracts.
+
+A schema contract captures the shape of a dataset at a known-good state — column names, types, nullability — and compares every subsequent run against that baseline. Any structural deviation fails the gate.
+
+```bash
+# Run once against a known-good state. Commit the output file.
+datacheck schema capture --source production_db --sources-file sources.yaml
+
+# In CI on every subsequent run:
+datacheck schema compare --source production_db --sources-file sources.yaml
+```
+
+When the comparison fails:
+
+```
+FAIL schema_compare: Column 'order_total' type changed: float -> varchar
+FAIL schema_compare: Column 'tax_rate' removed
+PASS schema_compare: Column 'order_id' unchanged (integer, NOT NULL)
+PASS schema_compare: Column 'status' unchanged (varchar, NOT NULL)
+
+Exit code: 1
+```
+
+The pipeline stops. The engineer who triggered the upstream change gets a clear failure at the point of change — not after downstream consumers have already ingested corrupted data.
+
+This is the data equivalent of API breaking change detection. In code, you use a type system or a schema registry. For data artifacts, the equivalent is a captured baseline compared on every pipeline run. The baseline file lives in the repository alongside the validation config — both are versioned, both are reviewed, both encode what the contract is expected to be.
+
+---
+
+## The GitHub Actions Integration
+
+Wiring this into GitHub Actions is direct. The key design choice: trigger on every push that touches ETL code, dbt models, or validation config — not just on a schedule.
+
+```yaml
+# .github/workflows/data-quality.yml
+name: Data Quality Gate
+
+on:
+ push:
+ paths:
+ - 'etl/**'
+ - 'dbt/**'
+ - 'checks/**'
+ - 'schema-baselines/**'
+ pull_request:
+ paths:
+ - 'etl/**'
+ - 'checks/**'
+
+permissions:
+ contents: read
+ security-events: write
+
+jobs:
+ validate:
+ runs-on: ubuntu-latest
+ steps:
+ - uses: actions/checkout@v4
+
+ - name: Install DataCheck
+ run: pip install datacheck-cli[postgresql]
+
+ - name: Run data quality gate
+ env:
+ DB_HOST: ${{ secrets.DB_HOST }}
+ DB_NAME: ${{ secrets.DB_NAME }}
+ DB_USER: ${{ secrets.DB_USER }}
+ DB_PASSWORD: ${{ secrets.DB_PASSWORD }}
+ run: |
+ datacheck validate -c checks/orders.datacheck.yaml \
+ --format sarif --output results.sarif
+
+ - name: Upload SARIF to GitHub Security tab
+ uses: github/codeql-action/upload-sarif@v3
+ if: always()
+ with:
+ sarif_file: results.sarif
+```
+
+The SARIF upload produces annotations directly on the pull request diff in the GitHub Security tab. Not in a separate dashboard, not in an email. On the PR, at the point of change. Exit code `1` from DataCheck causes the validation step to fail, which fails the job, which blocks the merge. That is the gate.
+
+---
+
+## The Airflow DAG Pattern
+
+For orchestrated pipelines, the gate is structural: the validation task must succeed before the downstream task is eligible to run. The DAG encodes the dependency explicitly.
+
+```python
+from airflow import DAG
+from airflow.operators.python import PythonOperator
+from airflow_provider_datacheck.operators.datacheck import DataCheckOperator
+
+with DAG("orders_pipeline", schedule="@hourly", ...) as dag:
+
+ extract = PythonOperator(task_id="extract_orders", ...)
+
+ validate_raw = DataCheckOperator(
+ task_id="validate_raw_orders",
+ config_path="/config/orders-raw.datacheck.yaml",
+ source_name="staging_db",
+ table="orders_raw",
+ fail_on_error=True,
+ )
+
+ transform = PythonOperator(task_id="transform_orders", ...)
+
+ validate_final = DataCheckOperator(
+ task_id="validate_orders_mart",
+ config_path="/config/orders-mart.datacheck.yaml",
+ source_name="production_db",
+ table="orders_mart",
+ fail_on_error=True,
+ )
+
+ serve = PythonOperator(task_id="publish_to_consumers", ...)
+
+ extract >> validate_raw >> transform >> validate_final >> serve
+```
+
+If `validate_raw_orders` raises `AirflowException`, Airflow marks it failed and `transform_orders` never starts. If `validate_orders_mart` fails, `publish_to_consumers` never runs. The DAG graph is the enforcement mechanism — bad data cannot reach the next stage because the task that would move it there is blocked by a failed upstream dependency.
+
+This is structurally identical to how CI gates a deployment. The deployment step only runs if tests pass. The downstream task only runs if validation passes.
+
+---
+
+## What Runs Against the Database
+
+For database sources, it is worth being explicit about what executes. DataCheck does not pull rows into the validation process. For a `not_null` rule on a million-row table:
+
+```sql
+SELECT COUNT(*) AS failed_count
+FROM orders
+WHERE order_total IS NULL
+```
+
+One row comes back. The validation host never sees the actual data. For a `type: float` check:
+
+```sql
+SELECT COUNT(*) AS failed_count
+FROM orders
+WHERE order_total IS NOT NULL
+ AND order_total::text !~ '^-?[0-9]+(\.[0-9]+)?$'
+```
+
+Again, a single integer. No rows transferred. No egress cost. Validation runs at warehouse speed, inside your existing compute, without a separate validation tier.
+
+This has security and compliance implications beyond performance. In regulated environments — finance, healthcare, PII-heavy infrastructure — data leaving the warehouse boundary is an audit event. Aggregate `COUNT` queries never expose row-level data. The validation result tells you how many rows failed a rule, not which rows or what values they contained.
+
+---
+
+## What This Gate Catches
+
+Gating data in CI catches a specific set of failure modes that no other mechanism addresses before data moves downstream:
+
+**Type regressions.** An ETL update changes a column from `float` to `string` through implicit coercion. The `type: float` rule fails on the first post-deploy pipeline run, before any downstream aggregation processes the column.
+
+**Schema drift from upstream sources.** A source system removes a column, renames a field, or changes nullability. Schema comparison catches the structural deviation before downstream consumers encounter an unexpected shape.
+
+**Value set violations.** A source system adds a new enum value — `"on_hold"` — that the downstream status mapping does not handle. The `allowed_values` rule catches it before the unmapped value produces a silent `NULL` in downstream joins.
+
+**Referential breaks.** `orders.customer_id` references customer IDs that were deleted from the customers table. A `foreign_key_exists` rule catches the orphaned references before the broken join propagates into aggregations.
+
+**Temporal staleness.** A scheduled extraction job fails silently and the table stops updating. A `max_age: 24h` rule fails the next morning's pipeline run before downstream consumers serve stale data.
+
+None of these require statistical anomaly detection. None require training data or ML inference. They require explicit rules, written by engineers, enforced at the gate. The rules are deterministic — they produce the same result on every run, in every environment, at any scale.
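Mapped to config, the failure modes above each reduce to one explicit rule. A sketch (table and column names hypothetical; the nested shape of `foreign_key_exists` is an assumption here):

```yaml
checks:
  - name: amount_type_pinned        # catches type regressions
    column: order_total
    rules:
      type: float

  - name: status_values_pinned      # catches value set violations
    column: status
    rules:
      allowed_values: [pending, confirmed, shipped, cancelled, refunded]

  - name: customer_fk_intact        # catches referential breaks
    column: customer_id
    rules:
      foreign_key_exists:
        table: customers
        column: id

  - name: table_is_fresh            # catches temporal staleness
    column: created_at
    rules:
      max_age: 24h
```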
+
+---
+
+## Engineering Takeaways
+
+- **The exit code is the contract.** `0` means data passed its rules. `1` means at least one error-severity rule failed. Any CI system, orchestrator, or shell script that reads POSIX exit codes can enforce a data gate without custom integration code.
+
+- **Validation config belongs in the repository.** Rules that live outside the codebase are invisible to reviewers. A data contract change that is not in the PR diff is an invisible breaking change. Versioning the config alongside the ETL makes contract changes reviewable before they ship.
+
+- **Gate position determines what you catch.** Gating before ingestion catches source schema drift. Gating after load catches ETL bugs. Gating after transformation catches modeling errors. A single gate is not sufficient for a production pipeline.
+
+- **Schema comparison is breaking change detection for data.** Capture a baseline against a known-good state, commit the file, compare on every run. Structural deviations fail the gate. This should be a standard practice, not an advanced configuration.
+
+- **SQL pushdown keeps validation inside the warehouse boundary.** A single aggregate `SELECT` returns counts, not rows. No data leaves the warehouse. No egress cost, no PII exposure, no additional compute tier. The validation runs where the data already lives.
+
+- **Severity is an explicit engineering decision, not a tunable threshold.** `severity: error` stops the pipeline. `severity: warning` surfaces the issue without blocking. Both are written in config and reviewed in PRs. Neither is controlled by a model that learned what is "normal" from historical distributions.
+
+- **The DAG dependency graph is the enforcement mechanism.** A validation task that must succeed before a downstream task can run is not a monitoring job — it is a sequential dependency that blocks the next stage. The enforcement is in the topology, not in alerting logic.
+
+- **Data CI and data observability are complementary, not competing.** Data CI answers "does this batch meet its rules right now?" and enforces the contract at the gate. Observability answers "what happened over time?" — trend visibility, anomaly alerting, historical context for investigation. Both matter after the gate passes; they are separate tools solving separate problems.
+
+---
+
+*DataCheck is an open-source deterministic validation engine. Install: `pip install datacheck-cli`. Source: [github.com/squrtech/datacheck](https://github.com/squrtech/datacheck).*
+
+
+
+
+**LinkedIn Post**
+
+Code can't ship without passing CI. Data ships constantly with no gate at all.
+
+That asymmetry explains a specific class of incident: an ETL job changes a float column to string, the code tests pass, and a financial aggregation silently returns NaN two days before anyone notices.
+
+The mechanism CI uses is not complicated — run rules against an artifact, return an exit code, block if non-zero. That's it. Every linter, static analyzer, and test runner works this way. The same mechanism applies directly to data validation.
+
+What's missing is the config living in the repository alongside the ETL code that produces the data. When validation rules are versioned with the pipeline, a data contract change shows up in the PR diff. The reviewer sees it. It's a conversation before merge, not an incident after.
+
+"A gate you can't trust is not a gate. A gate that doesn't exist isn't even that."
+
+DataCheck: deterministic data validation with POSIX exit codes, designed to run inside your existing pipeline compute. pip install datacheck-cli
\ No newline at end of file
diff --git a/blog/2026-02-23-schema-contracts-vs-semantic-contracts.md b/blog/2026-02-23-schema-contracts-vs-semantic-contracts.md
new file mode 100644
index 0000000..1d9b50c
--- /dev/null
+++ b/blog/2026-02-23-schema-contracts-vs-semantic-contracts.md
@@ -0,0 +1,342 @@
+# Schema Contracts vs Semantic Contracts in Modern Data Systems
+
+*Published 2026-02-23 by the DataCheck engineering team*
+
+---
+
+A payments pipeline ran cleanly for two weeks after a source system migration. Schema validation passed on every run. Column names matched. Types matched. Nullability matched. The monitoring showed green.
+
+The problem was that the upstream system had changed its convention for `transaction_amount`: it now emitted values in cents instead of dollars. The column was still `NUMERIC NOT NULL`. The values were still positive numbers. Every structural check passed. Every semantic contract was broken.
+
+Revenue figures were off by two orders of magnitude. The error was discovered when a finance reconciliation showed a 100x discrepancy between the warehouse aggregation and the source system's ledger. Two weeks of pipeline runs. Clean schema validation throughout.
+
+Schema contracts and semantic contracts are different things. Conflating them — or worse, treating schema compliance as the complete contract — is how this class of incident happens.
+
+---
+
+## Two Different Questions
+
+A data contract answers one of two fundamentally different questions:
+
+**"Is the shape of this data what I expect?"** This is the schema contract. It covers column names, data types, nullability, cardinality hints, and structural organization. A schema contract violation means the data's container has changed — a column was removed, renamed, or retyped. The consumer may not be able to parse the data at all.
+
+**"Are the values inside this data what I expect?"** This is the semantic contract. It covers the meaning of values: valid ranges, format conventions, allowed value sets, temporal constraints, cross-column relationships, and population-level invariants. A semantic contract violation means the data arrived in a parseable form but contains incorrect or invalid content.
+
+These questions require different mechanisms to answer. Schema comparison detects structural drift. Validation rules enforce semantic correctness. A pipeline that only runs schema comparison is answering the first question and ignoring the second. Most data quality incidents live in the second category.
+
+---
+
+## The Contract Hierarchy
+
+There are five levels of data contract, each expressing a different class of expectation:
+
+**Level 1 — Structural.** Column names, data types, nullability. The container. Schema comparison tools operate at this level. DataCheck's `schema capture/compare` operates here: it detects column additions, removals, type changes, and nullability changes against a versioned baseline.
+
+**Level 2 — Syntactic.** Format constraints on values within their containers. A `varchar` column might store email addresses, UUIDs, ISO 8601 dates, E.164 phone numbers, or free text. The type says nothing about the format. Syntactic contracts express what the value looks like: `regex`, `allowed_values`, `date_format_valid`, `type` (for string-typed numeric fields). These are detectable by inspecting individual values without any cross-row context.
+
+**Level 3 — Semantic.** Value constraints that express domain meaning. A transaction amount must be positive. A timestamp cannot be in the future. A percentage must be between 0 and 1. A record cannot be more than 48 hours old. These constraints require knowledge of the domain — they cannot be derived from the data type alone.
+
+**Level 4 — Referential.** Cross-table relationships. `orders.customer_id` must reference a valid row in `customers`. A `product_id` in a sales fact table must exist in the products dimension. These constraints require evaluating the current table against another table — the relationship must hold, not just the individual values.
+
+**Level 5 — Aggregate.** Population-level invariants. The sum of `line_item_total` across all rows for an `order_id` must equal the `order_total` on the order header. Each `(user_id, subscription_plan)` combination must be unique in the active subscriptions table. These constraints are invisible at the row level — they only exist as properties of the full dataset.
+
+Database schema DDL enforces Level 1 structurally and can partially enforce Level 4 via `FOREIGN KEY` constraints. Levels 2, 3, and 5 are entirely outside what the database schema can express concisely and reliably in production systems.
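One way to make the hierarchy concrete is a config carrying one rule from each non-structural level (column names hypothetical; the nested shape of `foreign_key_exists` is an assumption; `unique_combination` as shown later in this post):

```yaml
checks:
  - name: email_format             # Level 2 — syntactic
    column: email
    rules:
      regex: '^[^@\s]+@[^@\s]+\.[^@\s]+$'

  - name: amount_in_range          # Level 3 — semantic
    column: transaction_amount
    rules:
      min: 0.01
      max: 1000000

  - name: product_exists           # Level 4 — referential
    column: product_id
    rules:
      foreign_key_exists:
        table: products
        column: id

  - name: one_active_plan          # Level 5 — aggregate
    column: customer_id
    rules:
      unique_combination:
        columns: [customer_id, subscription_plan]
        where: "status = 'active'"
```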
+
+---
+
+## Encoding Drift: The Failure Mode Schema Checking Cannot Catch
+
+Encoding drift is the class of data quality failure where the schema is preserved but the convention for filling it is changed. The container is intact; the meaning of its contents has shifted.
+
+**Units.** The source system changes `transaction_amount` from dollars to cents. Schema: `NUMERIC(18,2) NOT NULL`. Values are still numeric, still positive, still non-null. Downstream aggregations produce values 100x too large. Schema comparison returns clean.
+
+**Timezone convention.** The upstream ETL shifts from emitting UTC timestamps to emitting local-time timestamps without modifying the column. Schema: `TIMESTAMP NOT NULL`. Values are still timestamps, still parseable, still within plausible date ranges. Consumers that assume UTC are now off by a timezone offset — 4 to 9 hours depending on geography and DST. Every timestamp-based join and time-series aggregation is wrong.
+
+**ID format convention.** The source system migrates from legacy prefixed customer IDs (`CUST-001234`) to UUID-based IDs. The warehouse column was already `VARCHAR(64) NOT NULL`, so the new UUIDs are just more strings. Schema passes. But downstream joins that parse the legacy format fail silently for all new records, producing NULLs where matches should exist.
+
+In all three cases, a semantic rule would have caught the drift immediately:
+
+```yaml
+checks:
+ - name: transaction_amount_valid
+ column: transaction_amount
+ rules:
+ type: float
+ min: 0.01
+ max: 1000000.00 # $1M maximum — if values suddenly exceed this, investigate
+ severity: error
+
+ - name: created_at_is_recent_utc
+ column: created_at
+ rules:
+ not_null: true
+ no_future_timestamps: true
+      max_age: 48h  # Staleness bound (too-old fails); ahead-of-clock drift is caught by no_future_timestamps
+ severity: error
+
+ - name: customer_id_format
+ column: customer_id
+ rules:
+ not_null: true
+ regex: '^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$'
+ severity: error
+```
+
+If the source system had been emitting dollar amounts and the semantic contract said `max: 1000000.00`, a batch where amounts suddenly jumped 100x would fail the rule on the first run after the migration. If the timestamp convention drifted from UTC, the `no_future_timestamps` rule would catch local-time values running ahead of the clock. If the customer ID format was pinned to a UUID regex, legacy-format IDs would fail the constraint the first time they appeared.
+
+Schema comparison would have caught a type change from `NUMERIC` to `BIGINT` or `VARCHAR`. It would not have caught any of these convention changes.
+
+---
+
+## Semantic Drift: When Value Sets Expand Without Notice
+
+A subtler failure mode is semantic drift — the gradual expansion of a value set that is constrained in the consumer but not enforced at the source.
+
+An `order_status` column starts with three values: `pending`, `confirmed`, `shipped`. The consumer's ETL has a `CASE` statement that maps these to internal status codes. The source system is under active development. Six months later, the source team adds `on_hold` and `partially_fulfilled` to support new fulfillment workflows. They document it in an internal wiki. The change does not trigger a schema change — the column is still `VARCHAR NOT NULL`. No schema alert fires.
+
+The new values flow downstream. The consumer's `CASE` statement has no branch for `on_hold` or `partially_fulfilled`. It returns `NULL` for all rows with these statuses. Aggregations that depend on status counts are now undercounting certain states. The error is silent.
+
+An `allowed_values` rule catches this on the first pipeline run containing the new status values:
+
+```yaml
+checks:
+ - name: order_status_constrained
+ column: order_status
+ rules:
+ not_null: true
+ allowed_values: [pending, confirmed, shipped]
+ severity: error
+```
+
+```
+FAIL order_status_constrained (847 failures / 2,341,887 rows)
+Exit code: 1
+```
+
+The pipeline stops. The 847 rows carrying the new status values are counted and flagged. The consumer team learns that new status values have appeared before they are incorporated into the downstream mapping. The `allowed_values` rule becomes a communication mechanism between producer and consumer, enforced at the gate rather than discovered in a post-incident review.
+
+This is the key property of semantic contract enforcement: the producer's change to an undeclared value triggers an immediate, visible failure in the consumer's pipeline — not a silent downstream corruption days later.
+
+---
+
+## Cross-Column Constraints: The Purely Semantic Layer
+
+Some of the most important data quality constraints have no representation in database schema DDL. They are purely semantic — properties of the relationship between values across columns, or across rows within a dataset.
+
+**Temporal ordering constraints.** An order cannot ship before it is confirmed. `shipped_at` must be greater than `confirmed_at`.
+
+```sql
+SELECT COUNT(*) AS failed_count
+FROM orders
+WHERE shipped_at IS NOT NULL
+ AND confirmed_at IS NOT NULL
+ AND shipped_at < confirmed_at
+```
+
+No database schema primitive expresses this. A `CHECK` constraint could in theory, but cross-column `CHECK` constraints on timestamps are rarely used in distributed warehouse environments: they are not portable across engines, some warehouses do not support or enforce them at all, and they add migration overhead.
+
+**Financial integrity constraints.** The sum of line item totals must equal the order total.
+
+```yaml
+checks:
+ - name: order_total_integrity
+ column: order_total
+ rules:
+ sum_equals:
+ group_by: order_id
+ sum_column: line_item_total
+ tolerance: 0.01 # Allow 1-cent floating-point rounding
+ severity: error
+```
+
+The generated SQL:
+
+```sql
+SELECT COUNT(*) AS failed_count
+FROM (
+  -- Assumes order_total is denormalized onto each line row, so
+  -- MAX(order_total) per order_id recovers the header value.
+  SELECT order_id
+  FROM orders
+  GROUP BY order_id
+  HAVING ABS(SUM(line_item_total) - MAX(order_total)) > 0.01
+) AS violations
+```
+
+**Uniqueness across combinations.** A customer should not have two active subscriptions for the same plan simultaneously.
+
+```yaml
+checks:
+ - name: no_duplicate_active_subscriptions
+ column: customer_id
+ rules:
+ unique_combination:
+ columns: [customer_id, subscription_plan]
+ where: "status = 'active'"
+ severity: error
+```
+
+These constraints represent business invariants — properties that must hold true for the data to be correct, independent of what any schema specification says. They are not derivable from column types, not expressible in DDL, and not catchable by schema comparison. They require explicit semantic rule specification.
+
+---
+
+## The Inadequacy of Database Schema as a Contract
+
+Database DDL is a contract with the storage engine, not a contract with consumers. It enforces:
+
+- That columns exist with the declared type
+- That NOT NULL columns contain values
+- That UNIQUE columns contain distinct values (within the table)
+- That FOREIGN KEY columns reference existing rows in another table (within the database)
+
+It does not enforce:
+
+- That values within the declared type are in a valid range
+- That string values follow a particular format convention
+- That the encoding convention for numeric values has not changed
+- That enum-like string columns contain only the values the consumer expects
+- That cross-column temporal ordering holds
+- That financial aggregations are internally consistent
+
+`CHECK` constraints can express some of these, but in practice they are rarely used in production distributed data systems. They offer limited cross-column support and no cross-row support, behave inconsistently across warehouse engines, and add migration complexity. The industry standard for data teams working with Snowflake, BigQuery, and Redshift is: use DDL for structural definition, and enforce value contracts separately.
+
+DataCheck externalizes this enforcement into a YAML file that lives in the repository alongside the pipeline code. The contract is readable, reviewable, versionable, and portable — it works the same way against PostgreSQL, Snowflake, BigQuery, a CSV file, or a Parquet extract.
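To make the gap concrete, a toy sketch with hypothetical values and a hypothetical range rule: a cents-encoded amount satisfies everything the DDL can express, and only a semantic rule that knows what valid dollars look like catches it.

```python
# A value that passes every structural check DDL can express...
amount = 4999  # upstream silently switched from dollars to cents ($49.99 → 4999)

structurally_valid = amount is not None and isinstance(amount, (int, float))
assert structurally_valid  # NUMERIC NOT NULL is satisfied; the container is intact

# ...but fails a semantic range rule for this column (hypothetical dollar bounds)
rule = {"min": 0.01, "max": 1000.00}
semantically_valid = rule["min"] <= amount <= rule["max"]
print(semantically_valid)  # → False: the contents are wrong
```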
+
+---
+
+## The Layered Validation Pattern
+
+In practice, both contract levels should be enforced, in sequence:
+
+```bash
+# Layer 1: structural contract check
+# Detects: added/removed columns, type changes, nullability changes
+datacheck schema compare \
+ --source production_db \
+ --sources-file sources.yaml
+
+# Layer 2: semantic contract enforcement (only runs if Layer 1 passes)
+# Enforces: format constraints, value ranges, cross-column invariants, aggregate rules
+datacheck validate \
+ -c checks/orders.datacheck.yaml \
+ --source production_db
+```
+
+The sequence matters. If schema comparison fails — a column was removed, a type changed — the semantic validation results may be misleading. A rule expecting `amount` as a `float` will behave differently if the column is now `varchar`. Running semantic validation only on structurally sound data avoids validating against a corrupt schema baseline.
+
+```bash
+#!/bin/bash
+set -e
+
+echo "Layer 1: structural contract check"
+datacheck schema compare --source production_db --sources-file sources.yaml
+
+echo "Layer 2: semantic contract enforcement"
+datacheck validate -c checks/orders.datacheck.yaml --source production_db
+
+echo "Both contracts satisfied — proceeding to load"
+python transform_and_load.py
+```
+
+The `set -e` means a failure at either layer halts the script. Layer 1 failures indicate structural breaks that need platform team attention. Layer 2 failures indicate value contract violations that need data investigation.
+
+A complete semantic contract for the same orders table:
+
+```yaml
+# checks/orders.datacheck.yaml
+
+checks:
+ # Level 2: Syntactic
+ - name: order_id_format
+ column: order_id
+ rules:
+ not_null: true
+ regex: '^ORD-[0-9]{10}$'
+
+ # Level 3: Semantic
+ - name: order_amount_valid
+ column: order_total
+ rules:
+ not_null: true
+ type: float
+ min: 0.01
+ max: 1000000.00
+
+ - name: status_constrained
+ column: order_status
+ rules:
+ not_null: true
+ allowed_values: [pending, confirmed, shipped, cancelled, refunded]
+
+ - name: timestamps_valid
+ column: created_at
+ rules:
+ not_null: true
+ no_future_timestamps: true
+ max_age: 7d
+
+ # Level 4: Referential
+ - name: customer_exists
+ column: customer_id
+ rules:
+ not_null: true
+ foreign_key_exists:
+ table: customers
+ column: id
+
+ # Level 5: Aggregate
+ - name: line_items_sum_to_total
+ column: order_total
+ rules:
+ sum_equals:
+ group_by: order_id
+ sum_column: line_item_total
+ tolerance: 0.01
+
+ - name: no_duplicate_orders
+ column: order_id
+ rules:
+ unique_combination:
+ columns: [customer_id, order_id]
+```
+
+Schema comparison covers Level 1. This config covers Levels 2 through 5. Together they answer both questions: is the shape correct, and are the values correct?
+
+---
+
+## Engineering Takeaways
+
+- **Schema compliance is a necessary condition for data correctness, not a sufficient one.** A table that passes schema validation can contain completely incorrect values — wrong units, wrong format conventions, out-of-range numbers, broken referential relationships. Schema compliance means the container is intact. It says nothing about the contents.
+
+- **Encoding drift is the failure mode that schema checking structurally cannot catch.** A units change, a timezone convention change, or an ID format convention change preserves the column type while breaking the semantics. The only mechanism that catches encoding drift is an explicit semantic rule that knows what valid values look like — a range, a regex, or a temporal constraint.
+
+- **Semantic drift is the silent failure mode of missing `allowed_values` rules.** Upstream enum sets expand. New status values appear. Consumer mappings produce NULLs for unmapped values. Schema comparison returns clean because no structural change occurred. An `allowed_values` rule catches the new value on its first appearance.
+
+- **Cross-column constraints are purely semantic and have no schema representation.** Temporal ordering between columns (`shipped_at > confirmed_at`), financial integrity (`sum of line items = order total`), and combination uniqueness cannot be expressed in database DDL in a portable, production-viable way. They require explicit semantic rule specification.
+
+- **The five contract levels require two enforcement mechanisms.** `datacheck schema compare` answers "has the structural shape changed?" `datacheck validate` answers "do the values meet their semantic contract?" These are different commands answering different questions. Both need to run.
+
+- **Run schema comparison before semantic validation.** If the structural contract is broken — a column was removed or retyped — semantic validation against the changed structure may produce misleading results. Layer 1 passes before Layer 2 runs.
+
+- **Database DDL is a contract with the storage engine, not with consumers.** It enforces type and nullability. It does not enforce value ranges, format conventions, or cross-column invariants. Externalizing the semantic contract into versioned YAML that lives in the repository makes it reviewable, portable, and enforceable across environments — including environments where you do not control the DDL.
+
+- **Semantic contracts capture producer-consumer agreements that live outside the schema.** When a source team adds a new enum value, it should surface as a validation failure in the consumer's pipeline — not as a silent NULL in a downstream join. The `allowed_values` rule is the communication interface between producer and consumer, enforced at the gate.
+
+---
+
+*DataCheck is an open-source deterministic validation engine. Install: `pip install datacheck-cli`. Source: [github.com/squrtech/datacheck](https://github.com/squrtech/datacheck).*
+
+
+LinkedIn Post
+
+A payments pipeline ran clean for two weeks after a source system migration. Schema validation passed every run. Column names matched, types matched, nullability matched. Green across the board.
+
+The upstream system had changed its transaction_amount convention from dollars to cents. The column was still NUMERIC NOT NULL. Values were still positive numbers. Every schema check passed. Revenue figures were off by 100x.
+
+Schema contracts and semantic contracts are different questions. "Is the shape of this data what I expect?" is a schema question. "Are the values inside correct?" is a semantic question. Most pipelines only answer the first one.
+
+"Schema compliance means the container is intact. It says nothing about the contents."
+
+The failure modes that live in the gap: encoding drift (units, timezone conventions, ID format conventions that change without a type change), semantic drift (upstream enum sets expanding without notifying consumers), and cross-column invariants that have no representation in database schema DDL at all.
+
+A CASE statement that maps order_status to internal codes silently returns NULL for on_hold status values that the source team added last month. The column is still VARCHAR NOT NULL. Nothing in schema validation catches it. An allowed_values rule catches it on the first run.
+
+Schema comparison catches structural breaks. Semantic rules catch value contract violations. Both questions need to be answered, by different mechanisms, in sequence.
\ No newline at end of file
diff --git a/blog/2026-02-23-sql-pushdown-as-a-validation-strategy.md b/blog/2026-02-23-sql-pushdown-as-a-validation-strategy.md
new file mode 100644
index 0000000..b5324bc
--- /dev/null
+++ b/blog/2026-02-23-sql-pushdown-as-a-validation-strategy.md
@@ -0,0 +1,289 @@
+# SQL Pushdown as a Validation Strategy: Architecture and Trade-offs
+
+*Published 2026-02-23 by the DataCheck engineering team*
+
+---
+
+A data engineering team ran their quality checks against a Snowflake table containing 800 million rows. The check was whether the `transaction_id` column contained NULL values. Their tool fetched all 800 million rows to an EC2 validation host, loaded them into a pandas DataFrame, ran `.isnull().sum()`, and returned a number.
+
+The fetch took 11 minutes. The instance required 64 GB of RAM. The egress cost was approximately $6 per run. They ran it twice a day.
+
+The query that answers the same question:
+
+```sql
+SELECT COUNT(*) AS null_count
+FROM transactions
+WHERE transaction_id IS NULL
+```
+
+Execution time in Snowflake: under one second. Cost: a fraction of one credit. The answer is identical. The architectural decision is not.
+
+---
+
+## The Principle: Move Computation to Data
+
+SQL pushdown is a specific instance of a general principle in distributed systems: move computation to the data, not data to the computation.
+
+When data lives in a warehouse — Snowflake, BigQuery, Redshift, PostgreSQL — it is already co-located with a query engine that is specifically designed and optimized for aggregate computation on that data. Column-oriented warehouses are built around the assumption that you will be running aggregate queries on large datasets. Their storage formats, compression strategies, and MPP execution engines are optimized for exactly this pattern.
+
+Pulling data out of the warehouse to validate it in an external Python process is an anti-pattern: it moves gigabytes across a network to use a slower compute environment for an operation the warehouse could have executed in seconds.
+
+SQL pushdown for validation means: express the validation logic as a SQL predicate, execute the predicate inside the warehouse, and receive only the aggregate result. The validator never sees the rows. It receives an integer — the count of rows that violated the constraint.
+
+---
+
+## Three Execution Architectures
+
+There are three common approaches to running data validation against a warehouse table. They differ in what moves across the network and what the validator actually executes.
+
+**Architecture 1: Pull-to-Python**
+
+```python
+import pandas as pd  # `engine` below is an existing SQLAlchemy engine for the warehouse
+
+# Validation tool fetches every row to an external process
+df = pd.read_sql("SELECT * FROM orders", engine)
+null_count = df["customer_id"].isnull().sum()
+amount_violations = ((df["amount"] < 0) | (df["amount"] > 100000)).sum()
+```
+
+The entire table transfers across the network. Memory requirement on the validation host scales linearly with table size. Data leaves the warehouse boundary on every run. Validation performance is bottlenecked by the network and the external host's compute, not the warehouse's query engine.
+
+**Architecture 2: Push-Aggregate (SQL Pushdown)**
+
+```sql
+SELECT COUNT(*) AS null_count
+FROM orders
+WHERE customer_id IS NULL
+```
+
+Zero row transfer. One integer returned. Runs at warehouse speed. Data never leaves the warehouse boundary. Validation performance is bottlenecked by query execution — which is what the warehouse is built for.
+
+**Architecture 3: Sampling**
+
+```python
+df = pd.read_sql(
+ "SELECT * FROM orders TABLESAMPLE BERNOULLI (1)", engine
+)
+null_count = df["customer_id"].isnull().sum()
+```
+
+Transfers a fraction of the table. Reduces the cost of pull-to-Python. But it introduces sampling error: the probability that a 1% sample contains none of the violating rows is (0.99)^k for k violating rows, so a violation confined to 100 rows goes entirely undetected roughly 37% of the time, regardless of table size. For an enforcement gate where the contract is binary — pass or fail — sampling is the wrong execution model. The gate may report pass on data that is failing.
+
+For enforcement, Architecture 2 is the only correct choice. It returns exact counts, not estimates, and costs orders of magnitude less.
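The miss probability is easy to quantify: a Bernoulli sample at rate r contains none of k violating rows with probability (1 − r)^k.

```python
def miss_probability(sample_rate: float, violating_rows: int) -> float:
    """Probability that a Bernoulli sample contains zero violating rows."""
    return (1 - sample_rate) ** violating_rows

# At a 1% sample, a violation confined to 100 rows is missed ~37% of the time;
# the miss probability only becomes negligible at thousands of violating rows
print(round(miss_probability(0.01, 100), 3))   # → 0.366
print(round(miss_probability(0.01, 1000), 6))  # → 4.3e-05
```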
+
+---
+
+## The Single Aggregate SELECT Optimization
+
+The most important optimization in SQL pushdown validation is batching multiple rule checks into one query — one table scan, multiple results.
+
+A naive implementation runs one query per rule:
+
+```sql
+-- Query 1
+SELECT COUNT(*) FROM orders WHERE customer_id IS NULL;
+
+-- Query 2
+SELECT COUNT(*) FROM orders WHERE amount < 0 OR amount > 100000;
+
+-- Query 3
+SELECT COUNT(*) FROM orders
+WHERE status NOT IN ('pending', 'confirmed', 'shipped', 'cancelled');
+
+-- Query 4
+SELECT COUNT(*) FROM orders WHERE created_at > NOW();
+```
+
+This scans the table four times and pays query startup overhead four times. The same result is available from a single pass:
+
+```sql
+SELECT
+ SUM(CASE WHEN customer_id IS NULL
+ THEN 1 ELSE 0 END) AS customer_id_nulls,
+ SUM(CASE WHEN amount < 0 OR amount > 100000
+ THEN 1 ELSE 0 END) AS amount_violations,
+ SUM(CASE WHEN status NOT IN (
+ 'pending', 'confirmed', 'shipped', 'cancelled')
+ THEN 1 ELSE 0 END) AS status_violations,
+ SUM(CASE WHEN created_at > NOW()
+ THEN 1 ELSE 0 END) AS future_timestamps
+FROM orders
+```
+
+One table scan. Four rule results. On a column-oriented warehouse, the efficiency gains come from two properties:
+
+**Column projection.** Columnar storage reads only the columns referenced in the query. On a 200-column `orders` table where the validation config checks 4 columns, the warehouse reads roughly 2% of the stored data (4 of 200 columns). The remaining 196 columns are never touched. This applies equally to the single-query and multi-query approaches — but the single query pays the per-query overhead once.
+
+**Query startup cost.** Each query on a warehouse cluster requires parsing, planning, and worker allocation. For an MPP system managing compute concurrency, a validation run with 20 rules that executes as 1 query consumes one query slot. The same run as 20 queries can hit concurrency limits, queue behind other workloads, and pay the startup cost twenty times. At scale, this difference shows up in pipeline latency.
+
+On BigQuery, billing is based on bytes scanned. Multiple `CASE WHEN` expressions on the same columns do not increase bytes scanned — the columns are read once regardless of how many expressions reference them. A 20-rule validation config over 4 columns costs the same as a 1-rule validation, provided they reference the same columns.
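A sketch of how a rule list might compile into the batched form. This is simplified string templating to show the shape of the transformation; DataCheck's actual code generation is not shown here.

```python
def build_batched_query(table: str, rules: dict) -> str:
    """Compile {result_name: violation_predicate} into one aggregate SELECT."""
    cases = ",\n  ".join(
        f"SUM(CASE WHEN {predicate} THEN 1 ELSE 0 END) AS {name}"
        for name, predicate in rules.items()
    )
    return f"SELECT\n  {cases}\nFROM {table}"

sql = build_batched_query("orders", {
    "customer_id_nulls": "customer_id IS NULL",
    "amount_violations": "amount < 0 OR amount > 100000",
})
print(sql)  # one table scan, one result row, two rule counts
```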
+
+---
+
+## WHERE Clauses and Partition Pruning
+
+Validation runs against large historical tables benefit substantially from partition targeting. A daily-partitioned table with three years of history contains over 1,000 partitions. Without a partition filter, a COUNT query scans all of them. With one, it scans one.
+
+DataCheck supports this via the `--where` flag:
+
+```bash
+datacheck validate \
+ -c checks/orders.datacheck.yaml \
+ --source production_db \
+ --table orders \
+ --where "DATE(created_at) = CURRENT_DATE"
+```
+
+The generated queries become:
+
+```sql
+SELECT
+ SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) AS customer_id_nulls,
+ SUM(CASE WHEN amount < 0 OR amount > 100000 THEN 1 ELSE 0 END) AS amount_violations
+FROM orders
+WHERE DATE(created_at) = CURRENT_DATE
+```
+
+The warehouse query planner prunes all partitions except today's. The validation run scans one day's data instead of three years. On Snowflake, this is the difference between consuming one credit and consuming a hundred. On BigQuery, it is the difference between scanning 10 MB and scanning 20 GB.
+
+This optimization is only available because validation runs as SQL inside the warehouse. A pull-to-Python approach that adds a WHERE clause to its fetch query reduces data transfer — but it still transfers matching rows rather than returning a single aggregate. The optimization is partial. SQL pushdown makes it complete.
+
+---
+
+## Egress Cost and Security Arithmetic
+
+The cost difference between pull-to-Python and SQL pushdown is not academic. For a mid-sized production table:
+
+- 100 million rows, 20 columns, ~200 bytes per row = **20 GB**
+- AWS inter-region data transfer: $0.09/GB
+- Pull-to-Python cost per run: **$1.80**
+- At 4 validation runs per day: **$2,628 per year**, for one table
+
+SQL pushdown returns a single row of integers — approximately 100 bytes. The cost is negligible. For a data platform validating a dozen production tables multiple times per day, the egress arithmetic at scale favors pushdown by several orders of magnitude.
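The arithmetic above, spelled out (same assumed row size and the AWS inter-region rate used in the list; your numbers will differ):

```python
rows, bytes_per_row = 100_000_000, 200
table_gb = rows * bytes_per_row / 1e9   # 20 GB per full-table fetch
egress_per_gb = 0.09                    # USD, AWS inter-region transfer
runs_per_day = 4

pull_cost_per_run = table_gb * egress_per_gb
pull_cost_per_year = pull_cost_per_run * runs_per_day * 365

# Pushdown returns ~100 bytes of aggregate counts; its egress cost rounds to zero
print(f"${pull_cost_per_run:.2f} per run, ${pull_cost_per_year:,.0f} per year")
# → $1.80 per run, $2,628 per year
```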
+
+The security implication has a different character. In a pull-to-Python approach, every validation run transmits the raw contents of the table to the validation host — including PII, financial fields, and any regulated data. That host is now in scope for your data security policy, your SOC 2 audit, and any HIPAA or PCI-DSS compliance review. Its memory, logs, and temporary storage become potential exposure surfaces for the data it processed.
+
+In a SQL pushdown approach, the validation host receives integers. It never sees column values. The warehouse boundary is the data security boundary. The validator knows that 3,412 rows violated a constraint — it does not know which rows or what their `customer_id` values were. This is not a limitation; it is a deliberate architectural property.
+
+---
+
+## What SQL Pushdown Cannot Do
+
+Being precise about the limitations:
+
+**Cross-database foreign key checks.** A referential integrity constraint where the parent table lives in a different database — or a different warehouse system entirely — cannot be evaluated as a single pushed query. DataCheck handles single-warehouse foreign key validation via SQL pushdown. Cross-system referential integrity requires fetching one side, which is a partial pull.
+
+**Row-level failure details.** Pushdown returns counts, not rows. When a validation run reports 3,412 violations, you know how many — not which ones. Retrieving the offending rows is a separate warehouse query:
+
+```sql
+SELECT order_id, customer_id, amount, status
+FROM orders
+WHERE amount < 0 OR amount > 100000
+ORDER BY created_at DESC
+LIMIT 100
+```
+
+This is the correct separation: the enforcement gate uses counts to make a pass/fail decision; investigation of failures uses direct warehouse queries. The validation tool's job is the gate, not the forensics.
+
+**Regex portability.** PostgreSQL supports `~` for regex matching. MySQL uses `REGEXP`. Snowflake uses `REGEXP_LIKE`. SQL Server has limited native regex support. DataCheck normalizes regex rules to the correct dialect per backend, but complex patterns — lookaheads, named groups, possessive quantifiers — may not be portable across all supported warehouse types. For maximum portability, keep regex patterns simple or use the Python API for regex-heavy validation on problematic backends.
+
+**Warehouse credit consumption.** Validation queries run on warehouse compute. On Snowflake and BigQuery, high-frequency validation pipelines will register in your credit and billing usage. The cost per run is typically small — a single aggregate SELECT against a partitioned table costs fractions of a credit — but it is not zero and should appear in capacity planning. This is a real cost, not a reason to avoid pushdown, but it should be quantified rather than ignored.
+
+---
+
+## Custom SQL for Constraints YAML Cannot Express
+
+Some constraints require SQL expressiveness beyond what predicate-per-column rules can capture. DataCheck accepts a custom SQL query via `--query`. The query should return the rows that constitute a failure. DataCheck wraps it in a COUNT:
+
+```bash
+# No customer should have more than 5 active subscriptions
+datacheck validate \
+ --source production_db \
+ --check-name max_active_subscriptions \
+ --query "
+ SELECT customer_id
+ FROM subscriptions
+ WHERE status = 'active'
+ GROUP BY customer_id
+ HAVING COUNT(*) > 5
+ "
+```
+
+DataCheck executes:
+
+```sql
+SELECT COUNT(*) AS failed_count
+FROM (
+ SELECT customer_id
+ FROM subscriptions
+ WHERE status = 'active'
+ GROUP BY customer_id
+ HAVING COUNT(*) > 5
+) AS violations
+```
+
+The result is still a single integer. The execution is still inside the warehouse. The exit code is still `1` if the count is non-zero. The pushdown property holds regardless of the complexity of the user-supplied query.
+
+This pattern handles aggregation-based constraints, cross-column join conditions, and any validation logic that requires SQL expressiveness beyond the YAML rule vocabulary.
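The wrapping itself is mechanical. A sketch, assumed to mirror what the tool does rather than taken from its source:

```python
def wrap_in_count(user_query: str) -> str:
    """Wrap a user-supplied violations query so only an integer leaves the warehouse."""
    return (
        "SELECT COUNT(*) AS failed_count\n"
        f"FROM (\n{user_query.strip()}\n) AS violations"
    )

wrapped = wrap_in_count("""
    SELECT customer_id
    FROM subscriptions
    WHERE status = 'active'
    GROUP BY customer_id
    HAVING COUNT(*) > 5
""")
print(wrapped.splitlines()[0])  # → SELECT COUNT(*) AS failed_count
```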
+
+---
+
+## File-Based Validation: The Equivalent Pattern
+
+For CSV and Parquet files, there is no warehouse to push to. DataCheck loads the file into an in-process engine and applies predicate logic. The architectural principle remains: push computation into the optimized engine, not a Python loop.
+
+The difference between vectorized evaluation and row iteration matters at scale:
+
+```python
+# Row iteration: O(n) Python overhead per row, per rule
+failed = 0
+for _, row in df.iterrows():
+ if row["amount"] < 0 or row["amount"] > 100000:
+ failed += 1
+
+# Vectorized: C-level SIMD execution, near-constant Python overhead
+failed = ((df["amount"] < 0) | (df["amount"] > 100000)).sum()
+```
+
+For a 10 million row CSV, row iteration takes seconds per rule. Vectorized evaluation takes milliseconds. For a 20-rule validation config, the difference is a 60-second run versus a 3-second run.
+
+The underlying mechanism — expressing the validation as a predicate over a column rather than a condition on each row — is the same principle as SQL pushdown. The optimization target is the same: avoid Python-level loop overhead by delegating computation to the engine that is built to do it.
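A self-contained version of the comparison on synthetic data. Timings vary by machine, so only the equivalence of the two counts is shown:

```python
import pandas as pd

# 10,000 synthetic rows; half violate the [0, 100000] bound
df = pd.DataFrame({"amount": [50.0, -10.0, 250000.0, 99.99] * 2_500})

# Row iteration: Python-level overhead on every row, for every rule
failed_loop = 0
for _, row in df.iterrows():
    if row["amount"] < 0 or row["amount"] > 100000:
        failed_loop += 1

# Vectorized: the same predicate evaluated as a column operation in C
failed_vec = int(((df["amount"] < 0) | (df["amount"] > 100000)).sum())

print(failed_loop, failed_vec)  # identical counts; only the execution model differs
```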
+
+---
+
+## Engineering Takeaways
+
+- **Move computation to data, not data to computation.** SQL pushdown is a direct instantiation of this distributed systems principle. A COUNT predicate runs at warehouse speed. Pulling rows to validate externally fights against the warehouse's architecture rather than working with it.
+
+- **Batch multiple rules into a single aggregate SELECT.** `CASE WHEN` expressions inside a single query execute in one table scan. Twenty rules as twenty separate queries pays query startup overhead twenty times. On partitioned tables with high rule counts, this difference is measurable in both latency and warehouse credit consumption.
+
+- **Use WHERE clauses to enable partition pruning.** `--where "DATE(created_at) = CURRENT_DATE"` limits the warehouse scan to today's partition on a daily-partitioned table. The cost of validating a three-year historical table drops to the cost of validating one day's data. This only works because validation runs inside the warehouse's query planner.
+
+- **The egress cost is real and compounds with table count and frequency.** At $0.09/GB, a 20 GB table validated four times per day costs over $2,600 per year in transfer alone. SQL pushdown returns approximately 100 bytes per run. The cost difference is not marginal — it is structural.
+
+- **The warehouse boundary is the data security boundary.** SQL pushdown means the validation host receives integers, not rows. PII, financial data, and regulated fields never leave the warehouse. The validator cannot leak what it never received. This is an architectural property, not a configuration option.
+
+- **Pushdown returns counts, not rows.** The gate knows 3,412 rows failed — not which rows or what their values were. Investigation of failures is a separate warehouse query. This separation is correct: enforcement and forensics are different jobs, and conflating them by pulling rows into the validation host compromises the security property without improving the enforcement.
+
+- **Custom SQL extends pushdown to constraints YAML cannot express.** Aggregation-based constraints, cross-column join conditions, and HAVING clauses are outside the predicate-per-column rule model. Wrapping a user-supplied query in a COUNT preserves the pushdown property — the complex logic runs inside the warehouse, and the validator still receives only an integer.
+
+- **Sampling is not a valid substitute for pushdown in enforcement contexts.** The probability that a 1% sample contains none of the violating rows is (0.99)^k for k violating rows: roughly 37% when only 100 rows violate. For a binary enforcement gate, the only acceptable false negative rate is zero. Pushdown provides exact counts at the cost of zero additional data transfer.
+
+---
+
+*DataCheck is an open-source deterministic validation engine. Install: `pip install datacheck-cli`. Source: [github.com/squrtech/datacheck](https://github.com/squrtech/datacheck).*
+
+
+
+LinkedIn Post
+
+A team was running data quality checks against an 800M-row Snowflake table. Their tool fetched all 800 million rows to an EC2 host to check whether a column was null. 11 minutes. 64 GB RAM. $6 in egress per run, twice a day.
+
+The query that answers the same question takes under a second and returns one integer.
+
+This is not a subtle optimization. It is the difference between working with a warehouse's architecture and fighting against it. MPP systems like Snowflake and BigQuery are built to execute aggregate queries at scale. Pulling rows out to validate them externally moves computation away from the engine that's optimized for it.
+
+The less obvious optimization is batching: twenty validation rules don't need twenty queries. A single SELECT with CASE WHEN expressions executes in one table scan, one round trip, one query slot. On columnar warehouses, it scans only the columns referenced — regardless of how many CASE WHEN branches reference them.
+
+"The validation host should receive integers, not rows. What it never received, it cannot leak."
+
+There's also a security property here that often goes unnoticed. When validation runs inside the warehouse, PII and regulated data never leave the warehouse boundary. The validator knows 3,412 rows violated a constraint — not which rows or what their values were. That's an architectural property, not a configuration option.
\ No newline at end of file
diff --git a/blog/2026-02-23-why-observability-is-not-enough-for-data-enforcement.md b/blog/2026-02-23-why-observability-is-not-enough-for-data-enforcement.md
new file mode 100644
index 0000000..195a4a3
--- /dev/null
+++ b/blog/2026-02-23-why-observability-is-not-enough-for-data-enforcement.md
@@ -0,0 +1,244 @@
+# Why Observability Is Not Enough for Data Enforcement
+
+*Published 2026-02-23 by the DataCheck engineering team*
+
+---
+
+The alert fired at 11:47 PM. The pipeline had run at 8:30 PM. For three hours, the downstream mart had been serving aggregations built on a column that had silently coerced from `numeric` to `varchar` in an upstream ETL change. The anomaly detection model flagged the null rate deviation. The alert went to PagerDuty. The on-call engineer woke up, confirmed the issue, and started the rollback at 12:15 AM.
+
+The observability platform worked exactly as designed. The problem is that "working as designed" still meant three hours of bad data in production, a manual rollback, and a downstream reprocessing job that ran until 4 AM.
+
+Observability is not enforcement. This distinction is architectural, not philosophical. Understanding why requires looking at how observability systems are built, what they optimize for, and what they fundamentally cannot do.
+
+---
+
+## The Telemetry Chain Is Asynchronous by Design
+
+Observability systems are built around an asynchronous data collection architecture. Your pipeline runs, emits metrics or events to a collection endpoint, those events are buffered and ingested by the observability platform, processing and aggregation happen in the background, alert conditions are evaluated on the stored data, and notifications are dispatched if thresholds are breached.
+
+This chain is deliberately async. Async collection means your pipeline does not wait for the observability backend to process each event before proceeding. Async processing means the platform can absorb bursts. Async alerting means alert evaluation can run on aggregated windows rather than per-event. These are correct architectural choices for an observability system.
+
+But they mean the system is structurally incapable of sitting in the critical path of your pipeline. By the time the alert fires, your pipeline has already finished running. The data is already wherever your pipeline put it.
+
+An enforcement gate has the opposite requirement. It must be synchronous. It must be in the critical path. Its entire purpose is to prevent the next stage from running if the current stage produced bad data. You cannot retrofit a synchronous enforcement point onto an asynchronous observation system — they are solving different problems at different points in time.
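The difference in code, as a toy sketch (both functions are hypothetical stand-ins, not real APIs): the observability path never blocks the pipeline, the gate exists only to block it.

```python
import sys

def emit_metric(name: str, value: float) -> None:
    """Observability path: fire-and-forget; the pipeline never waits on it."""
    pass  # buffered, shipped, aggregated, alerted on — minutes later

def gate(violation_count: int) -> None:
    """Enforcement path: synchronous, in the critical path, before the next stage."""
    if violation_count > 0:
        sys.exit(1)  # the next stage never runs

emit_metric("orders.null_rate", 0.07)  # pipeline continues regardless of the value
gate(violation_count=0)                # pipeline continues only because the count is 0
print("next stage runs")
```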
+
+---
+
+## Probabilistic Outputs Cannot Make Binary Routing Decisions
+
+The second architectural incompatibility is in the nature of the output.
+
+Anomaly detection produces a signal that is inherently probabilistic. "This value is 3.2 standard deviations from the 30-day rolling mean." "The null rate is elevated at 94th percentile of historical distribution." "Row count is anomalous with 87% confidence." These are useful signals for investigation — they tell you something changed, and they give you a severity indication.
+
+A pipeline gate needs a different kind of answer entirely. It needs a binary predicate: pass or fail, proceed or stop. You cannot route a pipeline on a probability score without converting it to a threshold, and every threshold is a judgment call that must be made in advance, maintained over time, and tuned when it produces noise.
+
+Consider what happens to that threshold in practice. The anomaly model fires on Monday mornings because weekend data has a different volume profile. The on-call engineer adjusts the threshold. The model fires when a marketing campaign runs and order volume spikes 4x. The threshold gets widened. Six months in, the threshold that was set to catch a 10% null rate increase now lets through a 35% increase before alerting because the band was widened incrementally to suppress noise.
+
+This is not a failure of the observability platform. It is the expected behavior of a probabilistic system operated by engineers who rationally respond to false positives by tuning them away. The problem is that each tuning decision slightly weakens the gate, and the degradation is invisible until a bad batch slips through.
+
+A deterministic rule has no threshold to tune. The rule `not_null: true` either passes or fails. The rule `allowed_values: [pending, confirmed, shipped]` either passes or fails. There is no sensitivity dial, no window size to adjust, no training data to go stale. The predicate is the same on every run.
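The contrast in code, as a deliberately simplified sketch (hypothetical numbers; the anomaly side is a caricature of a z-score detector):

```python
# Probabilistic signal: needs a threshold, and the threshold needs maintenance
def anomaly_flag(value: float, mean: float, std: float, z_threshold: float) -> bool:
    return abs(value - mean) / std > z_threshold  # tune z_threshold forever

# Deterministic rule: the predicate is the contract, identical on every run
ALLOWED = {"pending", "confirmed", "shipped"}

def rule_passes(status: str) -> bool:
    return status in ALLOWED

print(anomaly_flag(0.35, mean=0.10, std=0.08, z_threshold=3.0))  # → True, until tuned
print(rule_passes("on_hold"))  # → False, on the first run and on every run after
```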
+
+---
+
+## The Baseline Cold-Start Problem
+
+Statistical anomaly detection requires a baseline. It needs to know what "normal" looks like before it can identify what is "abnormal." This requirement creates a specific class of blind spots that occur exactly when bad data is most likely to enter.
+
+**New pipelines.** A new pipeline has no history. You cannot train an anomaly model on data that does not exist yet. From day one through the first weeks or months of operation, the model is either not running or operating on an insufficient baseline. These early runs are often the highest-risk period — the pipeline is new, the data sources are not fully understood, the transformation logic has not been battle-tested.
+
+**After schema changes.** When a column is added, removed, renamed, or changes type, the historical baseline for that column is no longer valid. A model trained on a `float` distribution does not have meaningful anomaly thresholds for `varchar`. After the change, you need to wait for the model to establish a new baseline — during which time the column has no effective anomaly coverage.
+
+**After data migrations.** When a source system migrates and the data characteristics change structurally — different ID formats, different value ranges, different cardinality — the old baseline misrepresents the new normal. Every value that is now legitimately different from the old distribution looks anomalous. The model fires constantly. Engineers tune down the sensitivity. The gate degrades.
+
+Deterministic rules do not have a cold-start period. A validation config written on day one of a pipeline enforces exactly the same rules as one written eighteen months in. A `regex` rule that validates UUID format enforces UUID format whether the column contains 100 rows or 100 million rows, whether it is one day old or three years old.
+
+```yaml
+checks:
+ - name: event_id_format
+ column: event_id
+ rules:
+ not_null: true
+ regex: '^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$'
+
+ - name: event_type_constrained
+ column: event_type
+ rules:
+ not_null: true
+ allowed_values: [click, view, purchase, refund, signup]
+
+ - name: amount_bounds
+ column: amount_usd
+ rules:
+ type: float
+ min: 0.01
+ max: 99999.99
+```
+
+These rules work on the first run. They work the same way on every subsequent run. There is no warm-up period, no baseline to rebuild after a schema change, no sensitivity to tune after a traffic spike.
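The day-one property is easy to verify locally. A quick Python check of the same UUIDv4 pattern (the `re` module is used here for illustration; DataCheck evaluates the `regex` rule for you):

```python
import re

# Same UUIDv4 pattern as the `event_id_format` rule above.
UUID_V4 = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$"
)

assert UUID_V4.match("550e8400-e29b-41d4-a716-446655440000")  # valid v4
assert not UUID_V4.match("not-a-uuid")  # rejected on run 1 and on run 1,000
```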
+
+---
+
+## Alert Fatigue as Systemic Gate Degradation
+
+There is a well-documented operational failure mode in alert-driven systems: when alerts fire frequently enough, engineers adapt by treating them as lower-priority signals.
+
+This adaptation is individually rational. If a data quality alert fires 40 times per week and 38 of those firings are benign — expected seasonality, acceptable distribution shifts, pipeline restarts — a skilled engineer will quickly learn to evaluate context before acting. The alert has trained them to distinguish signal from noise. The consequence is that when a real quality issue fires, it enters a queue of other alerts that are also being evaluated for context. Response time increases. The alert is acknowledged rather than acted on immediately.
+
+Pipeline enforcement does not have this failure mode. When a validation task in an Airflow DAG fails, the downstream task does not run. There is no alert to acknowledge, no threshold to tune, no on-call queue to compete with. The pipeline is stopped. The engineer sees a failed task. The only path forward is to fix the underlying issue.
+
+```python
+# The task dependency IS the enforcement mechanism.
+# No alert routing, no on-call, no threshold tuning.
+extract >> validate_raw >> transform >> validate_final >> serve
+```
+
+If `validate_raw` fails, `transform` does not run. The pipeline is in a known, visible failed state. There is no ambiguity about whether this is signal or noise — the pipeline is down. That visibility is a feature, not a limitation.
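The same dependency chain can be written with the provider operator instead of raw task names. A hedged sketch — the operator and its `config_path`/`file_path` arguments follow the DataCheck Airflow docs, while the DAG id, paths, and schedule are placeholders:

```python
from datetime import datetime

from airflow import DAG
from datacheck.airflow.operators import DataCheckOperator

with DAG(
    dag_id="orders_pipeline",        # placeholder
    start_date=datetime(2026, 1, 1),
    schedule="@daily",
) as dag:
    # Fails the task, and therefore blocks all downstream tasks,
    # if any error-severity rule fails.
    validate = DataCheckOperator(
        task_id="validate_orders",
        config_path="/opt/airflow/config/order_checks.yaml",
        file_path="/data/orders_{{ ds }}.parquet",
    )
```

The operator is the gate; no alert routing sits between a failed rule and a stopped pipeline.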
+
+Compare this to the observability-as-enforcement pattern some teams use: an alert fires, which triggers a webhook, which pauses the pipeline, which notifies the on-call engineer, who then reviews the alert, confirms it is actionable, and manually resumes or kills the pipeline. That is four systems and three human decisions in the critical path of what should be a binary gate. Each link adds latency and failure surface.
+
+---
+
+## Out-of-Band Observers and In-Path Gates
+
+The architectural distinction generalizes: observability is out-of-band; enforcement is in-path.
+
+An out-of-band observer receives copies of data and events. It runs alongside the system being observed. It can fail without affecting the pipeline. If your observability backend goes down, your pipelines continue running — which is the correct behavior for a monitoring system.
+
+An in-path gate sits in the critical path of execution. It must run. Its success or failure determines whether the next step proceeds. If the gate fails to execute, that itself is a failure — the pipeline should not proceed when its quality check is unavailable.
+
+This is not a subtle distinction. A system that can be bypassed without affecting pipeline execution is not a gate. It is a monitor. Calling it a gate is a category error that produces a false sense of security: teams believe their pipeline is protected by the quality check, but the check runs in a lane that does not intersect the pipeline's execution path.
+
+DataCheck is in-path by construction. It runs as a step in your pipeline. Its exit code is what the orchestrator reads to decide whether to proceed.
+
+```bash
+# In a shell pipeline: the next command only runs if datacheck exits 0
+datacheck validate -c checks/events.datacheck.yaml && load_to_warehouse.sh
+```
+
+```make
+# In a Makefile: explicit dependency chain
+validate:
+ datacheck validate -c checks/events.datacheck.yaml
+
+load: validate
+ python load_to_warehouse.py
+```
+
+If `datacheck validate` exits `1`, `load_to_warehouse.sh` does not execute. The shell `&&` operator is the enforcement mechanism. No additional tooling required.
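The same in-path property holds in CI. A sketch of the GitHub Action usage — the `uses:` reference below is hypothetical (check the action's README for the exact ref); the `config` input appears in the action's documentation:

```yaml
- name: Validate events
  id: datacheck
  uses: squrtech/datacheck-action@v1   # hypothetical ref - see the action README
  with:
    config: checks/events.datacheck.yaml

# Later steps in this job only run if validation passed:
# a non-zero exit code fails the job at this step.
```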
+
+Exit code semantics:
+
+| Code | Meaning |
+|------|---------|
+| `0` | All rules passed — pipeline may proceed |
+| `1` | One or more error-severity rules failed — pipeline must stop |
+| `2` | Configuration error — cannot proceed |
+| `3` | Data loading error — cannot proceed |
+
+---
+
+## What Observability Is Actually For
+
+None of this is an argument against observability. Observability tools solve problems that enforcement tools cannot.
+
+**Trend analysis.** A null rate that is 0.2% today and was 0.1% last week is not a failure — it might not even be notable. But a null rate that has climbed from 0.05% to 0.8% over six months is a signal worth investigating. Trend data requires time-series storage and historical context that a per-run validation tool does not provide.
+
+**Unknown unknowns.** Deterministic rules enforce what you know to check. An anomaly detection system can surface patterns you did not know to look for — unusual distributions, unexpected correlations, cardinality explosions. These are valuable discovery signals.
+
+**Post-incident investigation.** After a quality issue, you want to know when it started, how many runs were affected, and how the metrics evolved. Observability platforms are built for this kind of historical query. A validation tool records pass/fail per run — not the history needed for detailed incident analysis.
+
+**SLA monitoring.** "This pipeline has run successfully within the last 4 hours" is a monitoring question, not a validation question. DataCheck can check `max_age` on a timestamp column to catch stale data within a run — but SLA-level uptime monitoring belongs in your observability layer.
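For the within-run staleness case, the rule sits alongside the others in the same config. The value syntax shown for `max_age` is illustrative — consult the rule reference for the exact format:

```yaml
checks:
  - name: created_at_fresh
    column: created_at
    rules:
      not_null: true
      max_age: 4h   # illustrative value syntax
```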
+
+The cleaner architecture uses both: deterministic validation rules as the synchronous enforcement gate, observability for trend analysis and post-hoc investigation. They operate at different points in the pipeline lifecycle and answer different questions.
+
+---
+
+## The Enforcement Layer
+
+What the enforcement layer needs to provide is conceptually simple: explicit rules, evaluated deterministically, with a binary output that can be consumed by any pipeline orchestrator.
+
+In practice, for a database-backed pipeline, a `not_null` check executes as:
+
+```sql
+SELECT COUNT(*) AS failed_count
+FROM events
+WHERE event_id IS NULL
+```
+
+An `allowed_values` check:
+
+```sql
+SELECT COUNT(*) AS failed_count
+FROM events
+WHERE event_type NOT IN ('click', 'view', 'purchase', 'refund', 'signup')
+ AND event_type IS NOT NULL
+```
+
+A `max_age` check, confirming the table has been updated within the last 4 hours, aggregates over the newest timestamp rather than counting old rows (counting every row older than 4 hours would fail on any table that retains history):
+
+```sql
+SELECT CASE
+         WHEN MAX(created_at) < NOW() - INTERVAL '4 hours' THEN 1
+         ELSE 0
+       END AS failed_count
+FROM events
+```
+
+Each query returns a single integer: for row-level rules, the number of rows that violated the rule; for the freshness check, 1 or 0. Zero means pass. Non-zero means fail. The entire validation run produces a single exit code. No metrics to store, no baselines to maintain, no training data to refresh.
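The roll-up from per-rule counts to a process exit code is trivial by design. A hedged sketch of the logic (not DataCheck's actual implementation):

```python
def run_exit_code(failed_counts):
    """Collapse per-rule violation counts into DataCheck-style semantics:
    0 if every rule passed, 1 if any rule recorded violations."""
    return 0 if all(count == 0 for count in failed_counts) else 1

# Mirrors the sample run below: three clean rules, one with 3,412 violations.
print(run_exit_code([0, 0, 3412, 0]))  # 1 -> the orchestrator stops the pipeline
```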
+
+Running validation:
+
+```bash
+datacheck validate -c checks/events.datacheck.yaml
+
+ PASS event_id_format (0 failures / 1,847,293 rows)
+ PASS event_type_constrained (0 failures / 1,847,293 rows)
+ FAIL amount_bounds (3,412 failures / 1,847,293 rows)
+ PASS created_at_fresh (0 failures / 1,847,293 rows)
+
+Rules: 4 total 3 passed 1 failed
+Exit code: 1
+```
+
+The pipeline stops. The 3,412 rows that violated the `amount_bounds` rule are identified by count. The next stage does not run. No three-hour detection window. No on-call page at midnight. No rollback at 4 AM.
+
+---
+
+## Engineering Takeaways
+
+- **Observability is asynchronous by design; enforcement requires synchronous in-path execution.** These are architectural incompatibilities. An async telemetry chain cannot be retrofitted into a synchronous pipeline gate — the data has already moved by the time the alert evaluates.
+
+- **Probabilistic anomaly scores cannot make binary routing decisions reliably.** Any threshold applied to a probability output will be tuned over time in response to false positives. That tuning incrementally weakens the gate. Deterministic predicates have no threshold to erode.
+
+- **Anomaly detection is least reliable precisely when you need enforcement most.** New pipelines, post-schema-change periods, and post-migration states have no valid historical baseline. Deterministic rules work from run one with no warm-up period.
+
+- **Alert fatigue degrades the gate over time; pipeline failure does not.** Engineers rationally learn to evaluate alert context before acting, which increases response time for real issues. A failed pipeline task produces a clear, unambiguous blocked state with no equivalent path toward being ignored.
+
+- **An out-of-band observer that can be bypassed is a monitor, not a gate.** If the quality check runs in a lane that does not intersect pipeline execution, it provides no enforcement guarantee. Enforcement requires being in the critical path.
+
+- **SQL pushdown means enforcement is a COUNT query, not a client-side data scan.** A single aggregate SELECT returns one row — the number of violations. No data leaves the warehouse. No rows are transferred. The scan runs where the data lives; the transfer cost is a single row regardless of table size.
+
+- **Observability and enforcement answer different questions at different points in time.** "Does this batch meet its rules right now?" is an enforcement question. "How has this metric trended over the last 90 days?" is an observability question. Treating the second system as a substitute for the first leaves the pipeline unprotected during the window between ingestion and alert evaluation.
+
+- **The correct architecture uses both.** Deterministic validation at the gate answers the binary question before data moves. Observability downstream answers the trend and anomaly questions after data has passed the gate. Neither is a substitute for the other.
+
+---
+
+*DataCheck is an open-source deterministic validation engine. Install: `pip install datacheck-cli`. Source: [github.com/squrtech/datacheck](https://github.com/squrtech/datacheck).*
+
+
+
+
+LinkedIn Post
+
+Data observability platforms are built to be asynchronous by design. Your pipeline runs, telemetry is emitted, it's collected in the background, processed, aggregated, and then an alert fires.
+
+By the time that alert fires, your data is already wherever the pipeline sent it.
+
+That's not a flaw in observability — it's the correct architecture for trend analysis, anomaly discovery, and historical investigation. But it means observability cannot be your quality gate. An async system cannot sit synchronously in the critical path of pipeline execution.
+
+The second problem is probabilistic output. Anomaly detection gives you scores and distributions. A pipeline gate needs a binary predicate. Every threshold you apply to that score will be tuned over time in response to false positives — and each tuning decision slightly weakens the gate. After six months, the threshold that was meant to catch a 10% null rate increase is letting through 35%.
+
+"An out-of-band observer that can be bypassed is a monitor, not a gate."
+
+The enforcement layer is a different architectural slot: deterministic rules, evaluated in-path, with a POSIX exit code that blocks the next stage if any rule fails. No alert to acknowledge. No threshold to tune. The pipeline either proceeds or it does not.
+
+Both layers belong in a mature data platform. They just answer different questions at different points in time.
\ No newline at end of file
From f8f208ec365833ae39b051dc4dce0e904ea0d5f0 Mon Sep 17 00:00:00 2001
From: yash-chauhan-dev
Date: Tue, 24 Feb 2026 11:16:41 +0530
Subject: [PATCH 18/25] Fix positioning discrepancies across all surfaces
- CLI help text: "Lightweight data quality validation tool" -> "A linter for data pipelines"
- CLI schema command: "Schema evolution detection" -> "Enforce schema contracts"
- datacheck/__init__.py: update module docstring
- pyproject.toml: "data quality rules" -> "deterministic validation rules"; remove data-quality/data-observability keywords, add data-linter/schema-contracts
- airflow/operators.py + __init__.py: "data quality checks" -> "validation rules"; "Detect schema" -> "Enforce schema contracts"
- airflow-provider: pyproject.toml description/keywords cleaned; provider __init__ docstring; example DAG docstring
- airflow-provider/README.md: "Detects schema changes" -> "Enforces schema contracts"
- github-action/README.md: "Validate data quality" -> "Enforce deterministic validation rules"
- guides/cli-guide.md: schema section heading + command table + code comment
- guides/python-api.md: schema operator description + Airflow example comment
- guides/guide-who-uses-datacheck.md: "schema evolution detection" -> enforcement framing; pipeline diagram comments
- docs/index.md: "detect schema changes" -> "enforce schema contracts" (3 occurrences)
Co-Authored-By: Claude Sonnet 4.6
---
airflow-provider/README.md | 2 +-
.../airflow_provider_datacheck/__init__.py | 2 +-
.../example_dags/example_validate_dag.py | 2 +-
airflow-provider/pyproject.toml | 6 +++---
datacheck/__init__.py | 2 +-
datacheck/airflow/__init__.py | 8 ++++----
datacheck/airflow/operators.py | 6 +++---
datacheck/cli/__init__.py | 12 ++++++------
datacheck/cli/schema.py | 4 ++--
docs/index.md | 6 +++---
github-action/README.md | 8 ++++----
guides/cli-guide.md | 8 ++++----
guides/guide-who-uses-datacheck.md | 10 +++++-----
guides/python-api.md | 4 ++--
pyproject.toml | 6 +++---
15 files changed, 43 insertions(+), 43 deletions(-)
diff --git a/airflow-provider/README.md b/airflow-provider/README.md
index 947a8ef..2617b69 100644
--- a/airflow-provider/README.md
+++ b/airflow-provider/README.md
@@ -76,7 +76,7 @@ validate = DataCheckOperator(
### `DataCheckSchemaOperator`
-Detects schema changes against a saved baseline. On first run, captures the baseline automatically.
+Enforces schema contracts against a saved baseline - fails if breaking changes are detected. On first run, captures the baseline automatically.
```python
from airflow_provider_datacheck.operators.datacheck import DataCheckSchemaOperator
diff --git a/airflow-provider/airflow_provider_datacheck/__init__.py b/airflow-provider/airflow_provider_datacheck/__init__.py
index 2dfc7a7..e6b7e62 100644
--- a/airflow-provider/airflow_provider_datacheck/__init__.py
+++ b/airflow-provider/airflow_provider_datacheck/__init__.py
@@ -1,4 +1,4 @@
-"""Apache Airflow provider for DataCheck data quality validation."""
+"""Apache Airflow provider for DataCheck - enforce validation rules in Airflow DAGs."""
def get_provider_info() -> dict:
diff --git a/airflow-provider/example_dags/example_validate_dag.py b/airflow-provider/example_dags/example_validate_dag.py
index ab05817..562dceb 100644
--- a/airflow-provider/example_dags/example_validate_dag.py
+++ b/airflow-provider/example_dags/example_validate_dag.py
@@ -1,4 +1,4 @@
-"""Example DAG: daily data quality validation with DataCheckOperator.
+"""Example DAG: daily validation gate with DataCheckOperator.
Demonstrates:
- Validating a date-partitioned Parquet file using Jinja templating
diff --git a/airflow-provider/pyproject.toml b/airflow-provider/pyproject.toml
index 865770c..a642060 100644
--- a/airflow-provider/pyproject.toml
+++ b/airflow-provider/pyproject.toml
@@ -1,15 +1,15 @@
[tool.poetry]
name = "apache-airflow-provider-datacheck"
version = "1.0.0"
-description = "Data quality validation operators for Apache Airflow. Validate files, databases, Snowflake, BigQuery, and more."
+description = "Enforce DataCheck validation rules in Apache Airflow. Gate pipelines on data quality for files, databases, Snowflake, BigQuery, and more."
authors = ["Squrtech "]
readme = "README.md"
license = "Apache-2.0"
homepage = "https://github.com/squrtech/datacheck"
repository = "https://github.com/squrtech/datacheck"
keywords = [
- "airflow", "data-quality", "data-validation", "data-engineering",
- "pipeline", "etl", "snowflake", "bigquery", "postgresql", "data-observability",
+ "airflow", "data-linter", "data-validation", "data-engineering",
+ "pipeline", "etl", "snowflake", "bigquery", "postgresql", "schema-contracts",
]
classifiers = [
"Development Status :: 5 - Production/Stable",
diff --git a/datacheck/__init__.py b/datacheck/__init__.py
index b21d33e..4bf1840 100644
--- a/datacheck/__init__.py
+++ b/datacheck/__init__.py
@@ -1,4 +1,4 @@
-"""DataCheck - Lightweight data quality validation CLI tool."""
+"""DataCheck - A linter for data pipelines."""
from datacheck.engine import ValidationEngine
from datacheck.exceptions import (
diff --git a/datacheck/airflow/__init__.py b/datacheck/airflow/__init__.py
index 4dce2b7..0fd7f34 100644
--- a/datacheck/airflow/__init__.py
+++ b/datacheck/airflow/__init__.py
@@ -1,10 +1,10 @@
"""Airflow integration for DataCheck.
-Provides two operators for integrating DataCheck data quality
-validation into Airflow pipelines:
+Provides two operators for enforcing DataCheck validation rules
+in Airflow pipelines:
-- DataCheckOperator: Validate data against configured rules
-- DataCheckSchemaOperator: Detect schema changes against baselines
+- DataCheckOperator: Enforce validation rules against configured data sources
+- DataCheckSchemaOperator: Enforce schema contracts against saved baselines
For complex workflows, you can also use the CLI via BashOperator.
"""
diff --git a/datacheck/airflow/operators.py b/datacheck/airflow/operators.py
index 711f5a6..0c91b79 100644
--- a/datacheck/airflow/operators.py
+++ b/datacheck/airflow/operators.py
@@ -1,9 +1,9 @@
"""Airflow operators for DataCheck validation.
-Provides two operators for running data quality checks in Airflow DAGs:
+Provides two operators for enforcing validation rules in Airflow DAGs:
-- DataCheckOperator: Validate data against configured rules
-- DataCheckSchemaOperator: Detect schema changes against baselines
+- DataCheckOperator: Enforce validation rules against configured data sources
+- DataCheckSchemaOperator: Enforce schema contracts against saved baselines
"""
from __future__ import annotations
diff --git a/datacheck/cli/__init__.py b/datacheck/cli/__init__.py
index 169b6bc..b658810 100644
--- a/datacheck/cli/__init__.py
+++ b/datacheck/cli/__init__.py
@@ -7,7 +7,7 @@
app = typer.Typer(
name="datacheck",
- help="Lightweight data quality validation CLI tool",
+ help="A linter for data pipelines. Enforce validation rules in CI, Airflow, and beyond.",
add_completion=False,
)
@@ -22,21 +22,21 @@ def version() -> None:
@app.callback(invoke_without_command=True)
def main(ctx: typer.Context) -> None:
- """DataCheck - Lightweight data quality validation CLI tool.
+ """DataCheck - A linter for data pipelines.
- Run 'datacheck validate ' to validate a data file.
+ Run 'datacheck validate' to enforce validation rules against a data source.
Run 'datacheck --help' for more information.
"""
if ctx.invoked_subcommand is None:
- console.print("[bold]DataCheck[/bold] - Data Quality Validation")
+ console.print("[bold]DataCheck[/bold] - A Linter for Data Pipelines")
console.print(f"Version: {__version__}")
console.print()
console.print("Usage: datacheck [COMMAND] [OPTIONS]")
console.print()
console.print("Commands:")
- console.print(" validate Validate data file against configured rules")
+ console.print(" validate Enforce validation rules against a data source")
console.print(" config Configuration management commands")
- console.print(" schema Schema evolution detection commands")
+ console.print(" schema Enforce schema contracts against a baseline")
console.print(" version Display version information")
console.print()
console.print("Run 'datacheck [COMMAND] --help' for more information on a command.")
diff --git a/datacheck/cli/schema.py b/datacheck/cli/schema.py
index 7404f08..d20c582 100644
--- a/datacheck/cli/schema.py
+++ b/datacheck/cli/schema.py
@@ -20,10 +20,10 @@ def _safe_encoding() -> bool:
_TICK = "✓" if _safe_encoding() else "v"
from datacheck.exceptions import DataLoadError
-# Schema sub-app for schema evolution commands
+# Schema sub-app for schema contract enforcement commands
schema_app = typer.Typer(
name="schema",
- help="Schema evolution detection commands",
+ help="Enforce schema contracts - capture baselines and fail on breaking changes",
)
diff --git a/docs/index.md b/docs/index.md
index fcb9735..c47bf8f 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -11,7 +11,7 @@ Your data source -> [DataCheck rules] -> exit 0: pipeline continues
-> exit 1: pipeline stops
```
-DataCheck provides the `datacheck` CLI and a Python API to enforce validation rules and detect schema changes. Embed it in pipelines (Airflow, Dagster, Prefect), run it in CI/CD, or use it locally before pushing data. Rules are binary, deterministic, and config-driven - no statistical anomaly detection, no dashboards, no infrastructure required.
+DataCheck provides the `datacheck` CLI and a Python API to enforce validation rules and schema contracts. Embed it in pipelines (Airflow, Dagster, Prefect), run it in CI/CD, or use it locally before pushing data. Rules are binary, deterministic, and config-driven - no statistical anomaly detection, no dashboards, no infrastructure required.
---
@@ -679,7 +679,7 @@ Configuration management commands.
### `datacheck schema`
-Schema evolution detection and management.
+Schema contract enforcement - capture baselines and fail on breaking changes.
| Subcommand | Description |
|------------|-------------|
@@ -891,7 +891,7 @@ validate_orders = DataCheckOperator(
### DataCheckSchemaOperator
-Detect schema changes inside Airflow DAGs:
+Enforce schema contracts inside Airflow DAGs:
```python
from datacheck.airflow.operators import DataCheckSchemaOperator
diff --git a/github-action/README.md b/github-action/README.md
index 0ce9a69..8c39dc6 100644
--- a/github-action/README.md
+++ b/github-action/README.md
@@ -10,8 +10,8 @@
-Validate data quality in CI/CD with [DataCheck](https://github.com/squrtech/datacheck).
-Define rules in YAML, catch bad data before it breaks pipelines.
+Enforce deterministic validation rules in CI/CD with [DataCheck](https://github.com/squrtech/datacheck).
+Define rules in YAML. Fail fast on bad data. Stop pipelines at the gate.
Results appear in the **GitHub Security tab** via SARIF upload.
---
@@ -174,9 +174,9 @@ checks:
if: always()
run: |
if [ "${{ steps.datacheck.outputs.passed }}" == "true" ]; then
- echo "All data quality checks passed!"
+ echo "All validation rules passed!"
else
- echo "Data quality checks failed — see the Security tab for details."
+ echo "Validation failed - see the Security tab for details."
fi
```
diff --git a/guides/cli-guide.md b/guides/cli-guide.md
index 98bf86c..7862773 100644
--- a/guides/cli-guide.md
+++ b/guides/cli-guide.md
@@ -74,7 +74,7 @@ pip install datacheck-cli[all] # All data sources
```
datacheck validate Validate data against configured rules
-datacheck schema Schema evolution detection commands
+datacheck schema Enforce schema contracts against a baseline
datacheck config Configuration management commands
datacheck version Display version information
```
@@ -258,9 +258,9 @@ datacheck validate --csv-export failures.csv
---
-## Schema
+## Schema Contract Enforcement
-Track schema changes over time. Capture a baseline, then compare future data to detect column additions, removals, type changes, renames, and nullable changes. The data source can be provided directly, read from your config, or loaded from a named source.
+Capture a schema baseline, then enforce it - breaking changes (column additions, removals, type changes, nullable changes) fail the pipeline. The data source can be provided directly, read from your config, or loaded from a named source.
### Capture a Baseline
@@ -995,7 +995,7 @@ validate = DataCheckOperator(
file_path="/data/orders_{{ ds }}.parquet",
)
-# Detect schema changes
+# Enforce schema contracts
schema_check = DataCheckSchemaOperator(
task_id="schema_check",
file_path="/data/orders_{{ ds }}.parquet",
diff --git a/guides/guide-who-uses-datacheck.md b/guides/guide-who-uses-datacheck.md
index 9000429..12f2261 100644
--- a/guides/guide-who-uses-datacheck.md
+++ b/guides/guide-who-uses-datacheck.md
@@ -226,7 +226,7 @@ You maintain a `dim_customers` table that joins data from three sources. A schem
### How DataCheck Helps
-DataCheck combines rule-based validation with schema evolution detection. You define rules for data quality and track schema changes separately, so you catch both content issues and structural changes.
+DataCheck enforces both validation rules and schema contracts. You define rules for content correctness and baseline schema separately - both are gates that fail the pipeline when violated.
### Setup
@@ -268,7 +268,7 @@ Compatibility: BREAKING (2 breaking changes)
With `--fail-on-breaking`, exit code 1 stops your pipeline.
-**Step 3: Validate data quality rules**
+**Step 3: Enforce validation rules**
```yaml
# dim_customers_checks.yaml
@@ -446,8 +446,8 @@ Developer pushes PR
[ CI Pipeline ]
|
+---> datacheck config validate (config syntax OK?)
- +---> datacheck validate (data quality OK?)
- +---> datacheck schema compare (schema unchanged?)
+ +---> datacheck validate (validation rules passed?)
+ +---> datacheck schema compare (schema contract enforced?)
|
all pass?
/ \
@@ -585,7 +585,7 @@ with DAG(
fail_on_breaking=True,
)
- # 2. Validate data quality
+ # 2. Enforce validation rules
validate = DataCheckOperator(
task_id="validate_orders",
config_path="/opt/airflow/config/order_checks.yaml",
diff --git a/guides/python-api.md b/guides/python-api.md
index c551741..20b0c24 100644
--- a/guides/python-api.md
+++ b/guides/python-api.md
@@ -935,7 +935,7 @@ validate = DataCheckOperator(
### DataCheckSchemaOperator
-Detects schema changes by comparing current data against a saved baseline. If no baseline exists, it captures one automatically. Uses DataCheck's `SchemaDetector`, `SchemaComparator`, and `BaselineManager`.
+Enforces schema contracts by comparing current data against a saved baseline - fails if breaking changes are detected. If no baseline exists, it captures one automatically. Uses DataCheck's `SchemaDetector`, `SchemaComparator`, and `BaselineManager`.
```python
DataCheckSchemaOperator(
@@ -1093,7 +1093,7 @@ with DAG(
fail_on_breaking=True,
)
- # Step 2: Validate data quality
+ # Step 2: Enforce validation rules
validate = DataCheckOperator(
task_id="validate_orders",
config_path="/opt/airflow/config/checks.yaml",
diff --git a/pyproject.toml b/pyproject.toml
index 8a69fd8..c0286c1 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -1,19 +1,19 @@
[tool.poetry]
name = "datacheck-cli"
version = "2.1.0"
-description = "A linter for data pipelines. Enforce data quality rules in CI/CD, Airflow, and beyond."
+description = "A linter for data pipelines. Enforce deterministic validation rules in CI/CD, Airflow, and beyond."
authors = ["Squrtech "]
readme = "README_PYPI.md"
license = "Apache-2.0"
homepage = "https://github.com/squrtech/datacheck"
repository = "https://github.com/squrtech/datacheck"
keywords = [
- "data-validation", "data-quality", "cli", "data-engineering",
+ "data-validation", "data-linter", "cli", "data-engineering",
"pipeline", "ci-cd", "yaml", "testing", "csv", "parquet",
"postgres", "data-testing", "great-expectations-alternative",
"soda-alternative", "dbt-testing", "data-contracts",
"airflow", "dagster", "prefect", "snowflake", "bigquery",
- "redshift", "data-observability", "schema-validation",
+ "redshift", "schema-contracts", "schema-validation",
"data-pipeline", "etl-testing",
]
classifiers = [
From 7ab7a23b802c82843148949922bb99dbd1c9230f Mon Sep 17 00:00:00 2001
From: yash-chauhan-dev
Date: Tue, 24 Feb 2026 11:29:18 +0530
Subject: [PATCH 19/25] Fix CSVLoader double-kwarg bug when columns passed via
LoaderFactory
LoaderFactory.create_loader extracted 'columns' explicitly but also
left it in file_kwargs, causing CSVLoader to receive it twice.
Added 'columns' to the exclusion list in file_kwargs.
Co-Authored-By: Claude Sonnet 4.6
---
datacheck/loader.py | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/datacheck/loader.py b/datacheck/loader.py
index 9251298..842acfd 100644
--- a/datacheck/loader.py
+++ b/datacheck/loader.py
@@ -301,9 +301,9 @@ def create_loader(source: str | Path, **kwargs: Any) -> DataLoader:
ext = source_path.suffix.lower()
- # Filter out database-specific kwargs for file loaders
+ # Filter out non-file-loader kwargs
file_kwargs = {k: v for k, v in kwargs.items()
- if k not in ["table", "where", "query"]}
+ if k not in ["table", "where", "query", "columns"]}
if ext == ".csv":
csv_columns = kwargs.get("columns")
From 50d5b96079d2e408dc515f4310f91d01b15b52c9 Mon Sep 17 00:00:00 2001
From: yash-chauhan-dev
Date: Tue, 24 Feb 2026 11:32:53 +0530
Subject: [PATCH 20/25] Remove references to unsupported sources and
integrations
- docs/index.md: remove Dagster and Prefect (no integrations exist)
- pyproject.toml: remove dagster/prefect keywords
- github-action/README.md: remove gcs/azure from extras list; CSV/Parquet only for data-source input
- SECURITY.md: remove GCS and Azure from optional dependencies
Co-Authored-By: Claude Sonnet 4.6
---
SECURITY.md | 2 --
docs/index.md | 2 +-
github-action/README.md | 4 ++--
pyproject.toml | 2 +-
4 files changed, 4 insertions(+), 6 deletions(-)
diff --git a/SECURITY.md b/SECURITY.md
index b40a786..54879c5 100644
--- a/SECURITY.md
+++ b/SECURITY.md
@@ -235,8 +235,6 @@ Core dependencies:
- pyarrow (Parquet support)
- pyyaml (configuration)
- boto3 (AWS S3) - optional
-- google-cloud-storage (GCS) - optional
-- azure-storage-blob (Azure) - optional
## Updates and Patches
diff --git a/docs/index.md b/docs/index.md
index c47bf8f..b2479a2 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -11,7 +11,7 @@ Your data source -> [DataCheck rules] -> exit 0: pipeline continues
-> exit 1: pipeline stops
```
-DataCheck provides the `datacheck` CLI and a Python API to enforce validation rules and schema contracts. Embed it in pipelines (Airflow, Dagster, Prefect), run it in CI/CD, or use it locally before pushing data. Rules are binary, deterministic, and config-driven - no statistical anomaly detection, no dashboards, no infrastructure required.
+DataCheck provides the `datacheck` CLI and a Python API to enforce validation rules and schema contracts. Embed it in CI/CD pipelines, run it in Airflow DAGs, or use it locally before pushing data. Rules are binary, deterministic, and config-driven - no statistical anomaly detection, no dashboards, no infrastructure required.
---
diff --git a/github-action/README.md b/github-action/README.md
index 8c39dc6..adc74de 100644
--- a/github-action/README.md
+++ b/github-action/README.md
@@ -50,9 +50,9 @@ results to the GitHub Security tab. The job fails (exit 1) if any `error`-severi
| Input | Required | Default | Description |
|-------|----------|---------|-------------|
| `config` | No | `.datacheck.yaml` | Path to your validation config |
-| `data-source` | No | _(empty)_ | Path to a data file (CSV, Parquet, JSON) to validate. Overrides the source defined in the config — useful for validating a freshly generated file. |
+| `data-source` | No | _(empty)_ | Path to a data file (CSV or Parquet) to validate. Overrides the source defined in the config — useful for validating a freshly generated file. |
| `sources-file` | No | _(empty)_ | Path to `sources.yaml` — only needed for database/cloud sources |
-| `extras` | No | _(empty)_ | Connector extras to install: `postgresql`, `mysql`, `snowflake`, `bigquery`, `redshift`, `s3`, `gcs`, `azure`, `cloud`, `databases`, `warehouses`, `all`. Comma-separated for multiple. |
+| `extras` | No | _(empty)_ | Connector extras to install: `postgresql`, `mysql`, `mssql`, `snowflake`, `bigquery`, `redshift`, `s3`, `databases`, `warehouses`, `all`. Comma-separated for multiple. |
| `output-format` | No | `sarif` | Output format: `sarif`, `json`, `markdown`, `csv` |
| `output-file` | No | `datacheck-results.sarif` | Path to save the results file |
| `upload-sarif` | No | `true` | Auto-upload SARIF to GitHub Security tab |
diff --git a/pyproject.toml b/pyproject.toml
index c0286c1..63896f6 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -12,7 +12,7 @@ keywords = [
"pipeline", "ci-cd", "yaml", "testing", "csv", "parquet",
"postgres", "data-testing", "great-expectations-alternative",
"soda-alternative", "dbt-testing", "data-contracts",
- "airflow", "dagster", "prefect", "snowflake", "bigquery",
+ "airflow", "snowflake", "bigquery",
"redshift", "schema-contracts", "schema-validation",
"data-pipeline", "etl-testing",
]
From bd4a5920d9c21c697450e4eb2927983edec3c3e6 Mon Sep 17 00:00:00 2001
From: yash-chauhan-dev
Date: Tue, 24 Feb 2026 11:57:42 +0530
Subject: [PATCH 21/25] Complete validate and schema command option tables in
README
validate:
- Add all 20+ options in grouped tables (data source / output / execution / logging)
- Add positional [DATA_SOURCE] argument and direct file example
- Add echo $? to reinforce gating behavior
schema compare:
- Fix incorrect comment: compare does NOT fail by default - only with --fail-on-breaking
- Add --fail-on-breaking to examples
- Add full schema compare options table
README_PYPI.md: add direct file and echo $? examples to validate quickstart
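The gating behavior the `echo $?` examples reinforce can be sketched as below; `validate_ok` and `validate_bad` are hypothetical stand-ins for `datacheck validate` runs that pass or fail (the real command exits 1 when any error-severity rule fails):

```shell
# Stand-ins for a passing / failing `datacheck validate` run (hypothetical).
validate_ok()  { return 0; }
validate_bad() { return 1; }

# The gate: downstream steps run only when validation exits 0.
gate() {
  if "$1"; then
    echo "pipeline continues"
  else
    echo "pipeline stops"
  fi
}

gate validate_ok    # prints: pipeline continues
gate validate_bad   # prints: pipeline stops
```

In a real CI job the `if` is usually implicit: the shell step fails as soon as `datacheck validate` returns nonzero.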
Co-Authored-By: Claude Sonnet 4.6
---
README.md | 91 +++++++++++++++++++++++++++++++++++++++-----------
README_PYPI.md | 6 ++--
2 files changed, 75 insertions(+), 22 deletions(-)
diff --git a/README.md b/README.md
index 5bf0ac1..b628774 100644
--- a/README.md
+++ b/README.md
@@ -146,22 +146,57 @@ DataCheck auto-discovers config files in this order: `.datacheck.yaml` → `.dat
### Run validation
```bash
+datacheck validate # auto-discover config
+datacheck validate data.csv # direct file
datacheck validate --config checks.yaml
+echo $? # 1 if any error-severity rule fails
```
-| Parameter | Required | Description |
-|-----------|----------|-------------|
-| `-c, --config` | No | Path to config file (auto-discovered if not specified) |
-| `--source` | No | Named source from `sources.yaml` |
-| `--sources-file` | No | Path to sources YAML file |
-| `-t, --table` | No | Database table name (for database sources) |
-| `-w, --where` | No | WHERE clause for filtering (for database sources) |
-| `-q, --query` | No | Custom SQL query (alternative to --table) |
-| `-o, --output` | No | Save results to a file (format controlled by `--format`) |
-| `--format` | No | Output format: `json` (default), `sarif`, `markdown`, `csv` |
-| `--csv-export` | No | Export failure details as CSV |
-| `--parallel` | No | Enable multi-core execution |
-| `--verbose, -v` | No | Enable detailed logging |
+**Data source**
+
+| Option | Short | Description |
+|--------|-------|-------------|
+| `[DATA_SOURCE]` | | Positional: file path or connection string |
+| `--config` | `-c` | Path to config file (auto-discovered if not set) |
+| `--source` | | Named source from `sources.yaml` |
+| `--sources-file` | | Path to sources YAML file |
+| `--table` | `-t` | Database table name |
+| `--where` | `-w` | WHERE clause for filtering |
+| `--query` | `-q` | Custom SQL query (alternative to `--table`) |
+| `--schema` | `-s` | Schema/dataset name (databases and warehouses) |
+| `--warehouse` | | Snowflake warehouse name |
+| `--credentials` | | Path to credentials file (e.g., BigQuery service account JSON) |
+| `--region` | | Cloud region (Redshift IAM auth) |
+| `--cluster` | | Cluster identifier (Redshift IAM auth) |
+| `--iam-auth` | | Use IAM authentication (Redshift) |
+
+**Output**
+
+| Option | Short | Description |
+|--------|-------|-------------|
+| `--output` | `-o` | Save results to file |
+| `--format` | `-f` | Output format: `json` (default), `sarif`, `markdown`, `csv` |
+| `--csv-export` | | Export failure details as CSV |
+| `--suggestions` / `--no-suggestions` | | Show actionable fix suggestions (default: on) |
+
+**Execution**
+
+| Option | Short | Description |
+|--------|-------|-------------|
+| `--parallel` | | Enable multi-core execution |
+| `--workers` | | Number of worker processes (default: CPU count) |
+| `--chunk-size` | | Rows per chunk for parallel processing (default: 100000) |
+| `--progress` / `--no-progress` | | Show progress bar (default: on) |
+| `--slack-webhook` | | Slack webhook URL for result notifications |
+
+**Logging**
+
+| Option | Short | Description |
+|--------|-------|-------------|
+| `--verbose` | `-v` | Set log level to DEBUG |
+| `--log-level` | | `DEBUG`, `INFO`, `WARNING`, `ERROR`, `CRITICAL` |
+| `--log-format` | | `console` (default) or `json` |
+| `--log-file` | | Path to log file (enables rotation) |
## Data Source Configuration
@@ -395,26 +430,42 @@ datacheck validate -c .datacheck.yaml
## Enforce Schema Contracts
-Capture a schema baseline and compare future data against it - breaking changes fail validation. Detects column additions, removals, type changes, and nullable changes. The data source can be provided directly, read from your config, or loaded from a named source.
+Capture a schema baseline and compare future data against it. Detects column additions, removals, type changes, and nullable changes. Use `--fail-on-breaking` to exit 1 on breaking changes. The data source can be provided directly, read from your config, or loaded from a named source.
```bash
# Auto-discover config or use named source
-datacheck schema capture # Save current schema as baseline
-datacheck schema compare # Compare against baseline - fails if schema changed
+datacheck schema capture # Save current schema as baseline
+datacheck schema compare # Compare - reports changes, exit 0
+datacheck schema compare --fail-on-breaking # Compare - exit 1 on breaking changes
# Direct file path
datacheck schema capture data.csv
-datacheck schema compare data.csv
+datacheck schema compare data.csv --fail-on-breaking
# Named source
datacheck schema capture --source production_db --sources-file sources.yaml
# Other schema commands
-datacheck schema show # Display detected schema
-datacheck schema list # List saved baselines
-datacheck schema history # View capture history
+datacheck schema show # Display saved baseline
+datacheck schema list # List saved baselines
+datacheck schema history # View capture history
```
+`schema compare` options:
+
+| Option | Short | Description |
+|--------|-------|-------------|
+| `[DATA_SOURCE]` | | Positional: file path or connection string |
+| `--config` | `-c` | Path to config file |
+| `--source` | | Named source from `sources.yaml` |
+| `--sources-file` | | Path to sources YAML file |
+| `--table` | `-t` | Database table name |
+| `--baseline` | `-b` | Name of baseline to compare against (default: `baseline`) |
+| `--baseline-dir` | | Directory containing baselines (default: `.datacheck/schemas`) |
+| `--rename-threshold` | | Similarity threshold for rename detection (default: 0.8) |
+| `--fail-on-breaking` | | Exit 1 if breaking changes are detected |
+| `--format` | `-f` | Output format: `terminal` (default) or `json` |
+
## Python API
Use DataCheck programmatically within your pipelines:
diff --git a/README_PYPI.md b/README_PYPI.md
index 4cd3ef2..9485db0 100644
--- a/README_PYPI.md
+++ b/README_PYPI.md
@@ -73,8 +73,10 @@ checks:
Run validation:
```bash
-datacheck validate
-# exits 1 if any error-severity rule fails
+datacheck validate # auto-discover config
+datacheck validate data.csv # direct file
+datacheck validate --config checks.yaml
+echo $? # 1 if any error-severity rule fails
```
## CI/CD Integration
From 45e77afb5fecf056e490b5c7423f94293f5438d8 Mon Sep 17 00:00:00 2001
From: yash-chauhan-dev
Date: Tue, 24 Feb 2026 12:10:10 +0530
Subject: [PATCH 22/25] Fix CI action versions, linting errors, and validate
command description
- Fix all GitHub Actions using non-existent @v6 versions across ci.yml,
security.yml, auto-release.yml, release.yml, pr-version-check.yml
(checkout@v4, setup-python@v5, upload-artifact@v4)
- Remove data-quality.yml from this repo's CI - it is a user template,
not a workflow for the DataCheck repo itself (no .datacheck.yaml here)
- Fix validate command one-line description to enforcement language
- Fix 30 ruff linting errors: unused imports, a dead variable, an unused
  loop variable, Optional[X] -> X | None modernisation, and redundant quoted type annotations
Co-Authored-By: Claude Sonnet 4.6
---
.github/workflows/auto-release.yml | 22 ++++-----
.github/workflows/ci.yml | 10 ++--
.github/workflows/data-quality.yml | 65 --------------------------
.github/workflows/pr-version-check.yml | 4 +-
.github/workflows/release.yml | 18 +++----
.github/workflows/security.yml | 18 +++----
datacheck/cli/validate.py | 3 +-
datacheck/config/sample_data.py | 5 +-
datacheck/reporting/csv_exporter.py | 2 -
datacheck/rules/base.py | 2 +-
datacheck/rules/numeric_rules.py | 3 +-
datacheck/sql_pushdown/builder.py | 6 +--
datacheck/sql_pushdown/dialects.py | 37 +++++++--------
datacheck/validation/rules.py | 1 -
14 files changed, 62 insertions(+), 134 deletions(-)
delete mode 100644 .github/workflows/data-quality.yml
diff --git a/.github/workflows/auto-release.yml b/.github/workflows/auto-release.yml
index 8c34a51..924a0d3 100644
--- a/.github/workflows/auto-release.yml
+++ b/.github/workflows/auto-release.yml
@@ -20,12 +20,12 @@ jobs:
should-release: ${{ steps.check.outputs.should_release }}
steps:
- - uses: actions/checkout@v6
+ - uses: actions/checkout@v4
with:
fetch-depth: 0 # Need full history to compare versions
- name: Set up Python
- uses: actions/setup-python@v6
+ uses: actions/setup-python@v5
with:
python-version: "3.12"
@@ -85,10 +85,10 @@ jobs:
runs-on: ubuntu-latest
steps:
- - uses: actions/checkout@v6
+ - uses: actions/checkout@v4
- name: Set up Python
- uses: actions/setup-python@v6
+ uses: actions/setup-python@v5
with:
python-version: "3.12"
@@ -112,10 +112,10 @@ jobs:
runs-on: ubuntu-latest
steps:
- - uses: actions/checkout@v6
+ - uses: actions/checkout@v4
- name: Set up Python
- uses: actions/setup-python@v6
+ uses: actions/setup-python@v5
with:
python-version: "3.12"
@@ -128,7 +128,7 @@ jobs:
run: poetry build
- name: Upload build artifacts
- uses: actions/upload-artifact@v6
+ uses: actions/upload-artifact@v4
with:
name: dist
path: dist/
@@ -139,7 +139,7 @@ jobs:
runs-on: ubuntu-latest
steps:
- - uses: actions/checkout@v6
+ - uses: actions/checkout@v4
- name: Create and push tag
run: |
@@ -159,10 +159,10 @@ jobs:
url: https://pypi.org/project/datacheck-cli/
steps:
- - uses: actions/checkout@v6
+ - uses: actions/checkout@v4
- name: Set up Python
- uses: actions/setup-python@v6
+ uses: actions/setup-python@v5
with:
python-version: "3.12"
@@ -185,7 +185,7 @@ jobs:
runs-on: ubuntu-latest
steps:
- - uses: actions/checkout@v6
+ - uses: actions/checkout@v4
with:
fetch-depth: 0
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
index 3456cea..7049878 100644
--- a/.github/workflows/ci.yml
+++ b/.github/workflows/ci.yml
@@ -11,10 +11,10 @@ jobs:
runs-on: ubuntu-latest
steps:
- - uses: actions/checkout@v6
+ - uses: actions/checkout@v4
- name: Set up Python
- uses: actions/setup-python@v6
+ uses: actions/setup-python@v5
with:
python-version: "3.12"
@@ -37,10 +37,10 @@ jobs:
needs: [lint]
steps:
- - uses: actions/checkout@v6
+ - uses: actions/checkout@v4
- name: Set up Python
- uses: actions/setup-python@v6
+ uses: actions/setup-python@v5
with:
python-version: "3.10"
@@ -53,7 +53,7 @@ jobs:
run: poetry build
- name: Upload artifacts
- uses: actions/upload-artifact@v6
+ uses: actions/upload-artifact@v4
with:
name: dist
path: dist/
diff --git a/.github/workflows/data-quality.yml b/.github/workflows/data-quality.yml
deleted file mode 100644
index c813f7f..0000000
--- a/.github/workflows/data-quality.yml
+++ /dev/null
@@ -1,65 +0,0 @@
-# DataCheck - Data Quality Gate
-#
-# Runs DataCheck on every push and pull request.
-# Failed error-severity rules cause exit code 1 and fail the workflow.
-# Results are uploaded to the GitHub Security tab as SARIF annotations.
-#
-# Requirements:
-# - A .datacheck.yaml config file in the repo root (or set 'config' below)
-# - For databases/cloud: a sources.yaml with credentials via secrets
-#
-# Minimal setup:
-# 1. Add this file to .github/workflows/
-# 2. Add a .datacheck.yaml to your repo
-# 3. Push - results appear in the Security tab on PRs
-
-name: Data Quality Gate
-
-on:
- push:
- branches: [main, master]
- pull_request:
- branches: [main, master]
-
-permissions:
- contents: read
- security-events: write # Required for SARIF upload to Security tab
-
-jobs:
- validate:
- name: Validate data quality
- runs-on: ubuntu-latest
-
- steps:
- - name: Checkout
- uses: actions/checkout@v4
-
- - name: Set up Python
- uses: actions/setup-python@v5
- with:
- python-version: "3.12"
-
- - name: Install DataCheck
- run: pip install datacheck-cli
- # For database sources, install the relevant extra:
- # pip install datacheck-cli[postgresql]
- # pip install datacheck-cli[snowflake]
- # pip install datacheck-cli[bigquery]
-
- - name: Run data quality gate
- run: |
- datacheck validate \
- --config .datacheck.yaml \
- --format sarif \
- --output results.sarif
- # For database sources, pass credentials via env vars:
- # env:
- # DB_HOST: ${{ secrets.DB_HOST }}
- # DB_PASSWORD: ${{ secrets.DB_PASSWORD }}
-
- - name: Upload SARIF to GitHub Security tab
- uses: github/codeql-action/upload-sarif@v3
- if: always() # Upload even on failure so violations appear in the PR
- with:
- sarif_file: results.sarif
- category: data-quality
diff --git a/.github/workflows/pr-version-check.yml b/.github/workflows/pr-version-check.yml
index 761d2fc..d4c697e 100644
--- a/.github/workflows/pr-version-check.yml
+++ b/.github/workflows/pr-version-check.yml
@@ -11,12 +11,12 @@ jobs:
runs-on: ubuntu-latest
steps:
- - uses: actions/checkout@v6
+ - uses: actions/checkout@v4
with:
fetch-depth: 0 # Need full history to compare with base branch
- name: Set up Python
- uses: actions/setup-python@v6
+ uses: actions/setup-python@v5
with:
python-version: "3.12"
diff --git a/.github/workflows/release.yml b/.github/workflows/release.yml
index 3ea7251..a7c2559 100644
--- a/.github/workflows/release.yml
+++ b/.github/workflows/release.yml
@@ -14,10 +14,10 @@ jobs:
runs-on: ubuntu-latest
steps:
- - uses: actions/checkout@v6
+ - uses: actions/checkout@v4
- name: Set up Python
- uses: actions/setup-python@v6
+ uses: actions/setup-python@v5
with:
python-version: "3.12"
@@ -40,10 +40,10 @@ jobs:
runs-on: ubuntu-latest
steps:
- - uses: actions/checkout@v6
+ - uses: actions/checkout@v4
- name: Set up Python
- uses: actions/setup-python@v6
+ uses: actions/setup-python@v5
with:
python-version: "3.12"
@@ -56,7 +56,7 @@ jobs:
run: poetry build
- name: Upload build artifacts
- uses: actions/upload-artifact@v6
+ uses: actions/upload-artifact@v4
with:
name: dist
path: dist/
@@ -69,10 +69,10 @@ jobs:
url: https://pypi.org/project/datacheck/
steps:
- - uses: actions/checkout@v6
+ - uses: actions/checkout@v4
- name: Set up Python
- uses: actions/setup-python@v6
+ uses: actions/setup-python@v5
with:
python-version: "3.12"
@@ -94,7 +94,7 @@ jobs:
runs-on: ubuntu-latest
steps:
- - uses: actions/checkout@v6
+ - uses: actions/checkout@v4
with:
fetch-depth: 0 # Fetch all history for changelog
@@ -134,7 +134,7 @@ jobs:
runs-on: ubuntu-latest
steps:
- - uses: actions/checkout@v6
+ - uses: actions/checkout@v4
- name: Extract version from tag
id: version
diff --git a/.github/workflows/security.yml b/.github/workflows/security.yml
index fa98c53..03dde64 100644
--- a/.github/workflows/security.yml
+++ b/.github/workflows/security.yml
@@ -20,10 +20,10 @@ jobs:
steps:
- name: Checkout code
- uses: actions/checkout@v6
+ uses: actions/checkout@v4
- name: Set up Python
- uses: actions/setup-python@v6
+ uses: actions/setup-python@v5
with:
python-version: "3.11"
@@ -44,7 +44,7 @@ jobs:
- name: Upload Bandit report
if: always()
- uses: actions/upload-artifact@v6
+ uses: actions/upload-artifact@v4
with:
name: bandit-report
path: bandit-report.json
@@ -56,10 +56,10 @@ jobs:
steps:
- name: Checkout code
- uses: actions/checkout@v6
+ uses: actions/checkout@v4
- name: Set up Python
- uses: actions/setup-python@v6
+ uses: actions/setup-python@v5
with:
python-version: "3.11"
@@ -86,7 +86,7 @@ jobs:
- name: Upload Safety report
if: always()
- uses: actions/upload-artifact@v6
+ uses: actions/upload-artifact@v4
with:
name: safety-report
path: safety-report.json
@@ -94,7 +94,7 @@ jobs:
- name: Upload pip-audit report
if: always()
- uses: actions/upload-artifact@v6
+ uses: actions/upload-artifact@v4
with:
name: pip-audit-report
path: pip-audit-report.json
@@ -106,7 +106,7 @@ jobs:
steps:
- name: Checkout code
- uses: actions/checkout@v6
+ uses: actions/checkout@v4
with:
fetch-depth: 0
@@ -129,7 +129,7 @@ jobs:
steps:
- name: Checkout repository
- uses: actions/checkout@v6
+ uses: actions/checkout@v4
- name: Initialize CodeQL
uses: github/codeql-action/init@v3
diff --git a/datacheck/cli/validate.py b/datacheck/cli/validate.py
index 8f0f2a7..e41e640 100644
--- a/datacheck/cli/validate.py
+++ b/datacheck/cli/validate.py
@@ -13,7 +13,6 @@
from datacheck.engine import ValidationEngine
from datacheck.exceptions import ConfigurationError, DataCheckError, DataLoadError, ValidationError
from datacheck.logging import configure_logging, get_logger, set_trace_id, generate_trace_id
-from datacheck.output import JSONExporter
def _load_from_warehouse(
@@ -368,7 +367,7 @@ def validate(
help="Enable verbose logging (sets log level to DEBUG)",
),
) -> None:
- """Validate data using specified rules.
+ """Enforce validation rules against a configured data source.
Supports both file-based and database sources.
diff --git a/datacheck/config/sample_data.py b/datacheck/config/sample_data.py
index 21f8770..beb87c3 100644
--- a/datacheck/config/sample_data.py
+++ b/datacheck/config/sample_data.py
@@ -11,10 +11,9 @@
"""
import csv
-import math
import random
import string
-from datetime import date, datetime, timedelta
+from datetime import date, timedelta
from pathlib import Path
from typing import Any
@@ -278,7 +277,7 @@ def generate_saas_data(num_rows: int = 1000) -> list[dict[str, Any]]:
last_login_lo = today - timedelta(days=364) # within past year
data = []
- for i in range(1, num_rows + 1):
+ for _i in range(1, num_rows + 1):
plan = random.choice(plans)
# mrr: 0 for free, else gauss(300, 200) clamped to [1, 5000]
mrr = 0 if plan == "free" else round(_gauss(300, 200, 1, 5000), 2)
diff --git a/datacheck/reporting/csv_exporter.py b/datacheck/reporting/csv_exporter.py
index e00be61..05e0858 100644
--- a/datacheck/reporting/csv_exporter.py
+++ b/datacheck/reporting/csv_exporter.py
@@ -255,8 +255,6 @@ def _get_suggestion_for_value(value: Any, rule_type: str) -> str:
if value is None:
return "Replace NULL with default value"
- value_str = str(value)
-
if rule_type == "not_null":
return "Replace with default value"
diff --git a/datacheck/rules/base.py b/datacheck/rules/base.py
index cb33694..6e7bdc1 100644
--- a/datacheck/rules/base.py
+++ b/datacheck/rules/base.py
@@ -5,7 +5,7 @@
import pandas as pd
-from datacheck.exceptions import ColumnNotFoundError, RuleDefinitionError
+from datacheck.exceptions import ColumnNotFoundError
from datacheck.results import FailureDetail, RuleResult
diff --git a/datacheck/rules/numeric_rules.py b/datacheck/rules/numeric_rules.py
index 7f5cda6..09e1c89 100644
--- a/datacheck/rules/numeric_rules.py
+++ b/datacheck/rules/numeric_rules.py
@@ -1,10 +1,9 @@
"""Numeric validation rules."""
-import numpy as np
import pandas as pd
from datacheck.exceptions import ColumnNotFoundError, RuleDefinitionError
-from datacheck.results import FailureDetail, RuleResult
+from datacheck.results import RuleResult
from datacheck.rules.base import Rule
diff --git a/datacheck/sql_pushdown/builder.py b/datacheck/sql_pushdown/builder.py
index 850c063..23cdd88 100644
--- a/datacheck/sql_pushdown/builder.py
+++ b/datacheck/sql_pushdown/builder.py
@@ -68,7 +68,7 @@ def __init__(self) -> None:
# ── Public API ──────────────────────────────────────────────────────────
def partition_checks(
- self, checks: list[Any], dialect: "Dialect"
+ self, checks: list[Any], dialect: Dialect
) -> tuple[list[Any], list[Any]]:
"""Split checks into (pushable, non_pushable) for the given *dialect*.
@@ -91,7 +91,7 @@ def build_query(
table: str,
where: str | None,
pushable_checks: list[Any],
- dialect: "Dialect",
+ dialect: Dialect,
) -> str:
"""Build a single aggregate SELECT for all pushable checks.
@@ -142,7 +142,7 @@ def _rule_to_sql(
rule_type: str,
params: Any,
alias_prefix: str,
- dialect: "Dialect",
+ dialect: Dialect,
) -> list[tuple[str, str]]:
"""Return (alias, SQL_expression) pairs for one rule."""
diff --git a/datacheck/sql_pushdown/dialects.py b/datacheck/sql_pushdown/dialects.py
index 5306c6b..f58d0e4 100644
--- a/datacheck/sql_pushdown/dialects.py
+++ b/datacheck/sql_pushdown/dialects.py
@@ -15,7 +15,6 @@
from __future__ import annotations
-from typing import Optional
# ── Base pushable-rule set (supported by every dialect) ───────────────────────
# Rules that rely on dialect-specific functions (regex, percentile, max_age)
@@ -71,7 +70,7 @@ def current_timestamp(self) -> str:
"""SQL expression for the current wall-clock timestamp."""
return "CURRENT_TIMESTAMP"
- def age_violation_expr(self, col: str, duration: str) -> Optional[str]:
+ def age_violation_expr(self, col: str, duration: str) -> str | None:
"""Inner CASE condition that is TRUE when *col* is older than *duration*.
Returns *None* if the dialect cannot express this in SQL (the rule then
@@ -85,7 +84,7 @@ def age_violation_expr(self, col: str, duration: str) -> Optional[str]:
ts = self.current_timestamp()
return f"{col} < {ts} - INTERVAL '{interval}'"
- def _duration_to_interval_str(self, duration: str) -> Optional[str]:
+ def _duration_to_interval_str(self, duration: str) -> str | None:
"""Convert a duration token (e.g. ``'24h'``) to a standard interval string."""
s = str(duration).strip().lower()
unit_map = {"m": "minutes", "h": "hours", "d": "days", "w": "weeks"}
@@ -95,7 +94,7 @@ def _duration_to_interval_str(self, duration: str) -> Optional[str]:
# ── Regex ──────────────────────────────────────────────────────────────────
- def regex_violation_expr(self, col: str, pattern: str) -> Optional[str]:
+ def regex_violation_expr(self, col: str, pattern: str) -> str | None:
"""Inner CASE condition that is TRUE when *col* does NOT match *pattern*.
Returns *None* if the dialect has no native regex operator.
@@ -112,11 +111,11 @@ def sep1(self) -> str:
# ── LIMIT / TOP ────────────────────────────────────────────────────────────
- def top_clause(self, n: Optional[int]) -> str:
+ def top_clause(self, n: int | None) -> str:
"""Token inserted after SELECT (SQL Server ``TOP n``). Empty for most DBs."""
return ""
- def limit_clause(self, n: Optional[int]) -> str:
+ def limit_clause(self, n: int | None) -> str:
"""Trailing ``LIMIT n`` clause. Empty for SQL Server (uses TOP instead)."""
return f" LIMIT {n}" if n is not None else ""
@@ -148,13 +147,13 @@ def str_length(self, col: str) -> str:
def current_timestamp(self) -> str:
return "NOW()"
- def age_violation_expr(self, col: str, duration: str) -> Optional[str]:
+ def age_violation_expr(self, col: str, duration: str) -> str | None:
interval = self._duration_to_interval_str(duration)
if interval is None:
return None
return f"{col} < NOW() - INTERVAL '{interval}'"
- def regex_violation_expr(self, col: str, pattern: str) -> Optional[str]:
+ def regex_violation_expr(self, col: str, pattern: str) -> str | None:
# !~ is the case-sensitive "does not match regex" operator in PostgreSQL.
# Cast to text so non-text columns (enums, UUIDs) are handled correctly.
p = pattern.replace("'", "''")
@@ -191,7 +190,7 @@ def str_length(self, col: str) -> str:
def current_timestamp(self) -> str:
return "NOW()"
- def age_violation_expr(self, col: str, duration: str) -> Optional[str]:
+ def age_violation_expr(self, col: str, duration: str) -> str | None:
# MySQL INTERVAL syntax: NOW() - INTERVAL 24 HOUR (no quotes, unit unquoted)
s = str(duration).strip().lower()
unit_map = {"m": "MINUTE", "h": "HOUR", "d": "DAY", "w": "WEEK"}
@@ -199,7 +198,7 @@ def age_violation_expr(self, col: str, duration: str) -> Optional[str]:
return f"{col} < NOW() - INTERVAL {s[:-1]} {unit_map[s[-1]]}"
return None
- def regex_violation_expr(self, col: str, pattern: str) -> Optional[str]:
+ def regex_violation_expr(self, col: str, pattern: str) -> str | None:
# MySQL REGEXP operator performs case-insensitive matching by default.
p = pattern.replace("'", "''")
return f"{col} NOT REGEXP '{p}'"
@@ -233,7 +232,7 @@ def sep1(self) -> str:
def current_timestamp(self) -> str:
return "GETDATE()"
- def age_violation_expr(self, col: str, duration: str) -> Optional[str]:
+ def age_violation_expr(self, col: str, duration: str) -> str | None:
# T-SQL: DATEADD(unit, -n, GETDATE())
s = str(duration).strip().lower()
unit_map = {"m": "minute", "h": "hour", "d": "day", "w": "week"}
@@ -241,15 +240,15 @@ def age_violation_expr(self, col: str, duration: str) -> Optional[str]:
return f"{col} < DATEADD({unit_map[s[-1]]}, -{s[:-1]}, GETDATE())"
return None
- def regex_violation_expr(self, col: str, pattern: str) -> Optional[str]:
+ def regex_violation_expr(self, col: str, pattern: str) -> str | None:
# SQL Server has no native regex operator.
return None
- def top_clause(self, n: Optional[int]) -> str:
+ def top_clause(self, n: int | None) -> str:
# SQL Server uses SELECT TOP N instead of LIMIT.
return f"TOP {n} " if n is not None else ""
- def limit_clause(self, n: Optional[int]) -> str:
+ def limit_clause(self, n: int | None) -> str:
# No LIMIT in T-SQL — rows are bounded by TOP in the SELECT clause.
return ""
@@ -277,14 +276,14 @@ def str_length(self, col: str) -> str:
def current_timestamp(self) -> str:
return "CURRENT_TIMESTAMP()"
- def age_violation_expr(self, col: str, duration: str) -> Optional[str]:
+ def age_violation_expr(self, col: str, duration: str) -> str | None:
# Snowflake supports standard INTERVAL '…' syntax.
interval = self._duration_to_interval_str(duration)
if interval is None:
return None
return f"{col} < CURRENT_TIMESTAMP() - INTERVAL '{interval}'"
- def regex_violation_expr(self, col: str, pattern: str) -> Optional[str]:
+ def regex_violation_expr(self, col: str, pattern: str) -> str | None:
# Snowflake REGEXP_LIKE(subject, pattern) — negate for violations.
p = pattern.replace("'", "''")
return f"NOT REGEXP_LIKE({col}, '{p}')"
@@ -312,7 +311,7 @@ def str_length(self, col: str) -> str:
def current_timestamp(self) -> str:
return "CURRENT_TIMESTAMP()"
- def age_violation_expr(self, col: str, duration: str) -> Optional[str]:
+ def age_violation_expr(self, col: str, duration: str) -> str | None:
# BigQuery: TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL n UNIT)
s = str(duration).strip().lower()
unit_map = {"m": "MINUTE", "h": "HOUR", "d": "DAY", "w": "WEEK"}
@@ -323,7 +322,7 @@ def age_violation_expr(self, col: str, duration: str) -> Optional[str]:
)
return None
- def regex_violation_expr(self, col: str, pattern: str) -> Optional[str]:
+ def regex_violation_expr(self, col: str, pattern: str) -> str | None:
# BigQuery REGEXP_CONTAINS(value, regexp) — negate for violations.
# The r'' prefix is cosmetic in the generated SQL string.
p = pattern.replace("'", "''")
@@ -349,7 +348,7 @@ def pushable_rules(self) -> frozenset[str]:
PUSHDOWN_CAPABLE_TYPES: frozenset[str] = frozenset(_DIALECT_MAP)
-def get_dialect(source_type: str) -> Optional[Dialect]:
+def get_dialect(source_type: str) -> Dialect | None:
"""Return the SQL dialect for *source_type*, or ``None`` if pushdown is not supported."""
return _DIALECT_MAP.get(source_type)
diff --git a/datacheck/validation/rules.py b/datacheck/validation/rules.py
index 0f6b9ec..2a50b9b 100644
--- a/datacheck/validation/rules.py
+++ b/datacheck/validation/rules.py
@@ -10,7 +10,6 @@
"""
import re
from abc import ABC, abstractmethod
-from collections.abc import Callable
from dataclasses import dataclass, field
from enum import Enum
from typing import Any
From f735ca487355fcc118eaeda64aa1b77a8bf9d5c2 Mon Sep 17 00:00:00 2001
From: yash-chauhan-dev
Date: Tue, 24 Feb 2026 12:15:39 +0530
Subject: [PATCH 23/25] Regenerate poetry.lock to match pyproject.toml
Co-Authored-By: Claude Sonnet 4.6
---
poetry.lock | 576 ++++------------------------------------------------
1 file changed, 39 insertions(+), 537 deletions(-)
diff --git a/poetry.lock b/poetry.lock
index f95ddb4..c8eaf3f 100644
--- a/poetry.lock
+++ b/poetry.lock
@@ -1,84 +1,5 @@
# This file is automatically @generated by Poetry 2.3.2 and should not be changed by hand.
-[[package]]
-name = "arro3-core"
-version = "0.6.5"
-description = ""
-optional = false
-python-versions = ">=3.9"
-groups = ["main", "dev"]
-files = [
- {file = "arro3_core-0.6.5-cp310-cp310-macosx_10_12_x86_64.whl", hash = "sha256:da193dc2fb8c2005d0b3887b09d1a90d42cec1f59f17a8a1a5791f0de90946ae"},
- {file = "arro3_core-0.6.5-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:ed1a760ec39fe19c65e98f45515582408002d0212df5db227a5959ffeb07ad4a"},
- {file = "arro3_core-0.6.5-cp310-cp310-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:6584a3d28007740afcef1e301332876e2b785bd8edd59a458a6bc9b051bce052"},
- {file = "arro3_core-0.6.5-cp310-cp310-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:8e0af4789618f02bead4a0cd4d0a54abd9c8aa4fcedf9872b4891d2e3e984161"},
- {file = "arro3_core-0.6.5-cp310-cp310-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:c73f212e549e9b6d11cfe3f14bbf3fba9d0891426afb5916688d16d0df724085"},
- {file = "arro3_core-0.6.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:89f88f62e4e276a9e84f250722d2e5ffc078af9a3f67ac691f572a0e05dd6095"},
- {file = "arro3_core-0.6.5-cp310-cp310-manylinux_2_24_aarch64.whl", hash = "sha256:b2635e4c227f25ff8784dc8efb38cb7c1674646cfdc68ded53f2426289885f0e"},
- {file = "arro3_core-0.6.5-cp310-cp310-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:a5f3e936686bcd8542fafc94c68fdb23ec42d1d51a4777967ae815c90aff7296"},
- {file = "arro3_core-0.6.5-cp310-cp310-musllinux_1_2_aarch64.whl", hash = "sha256:705c32fec03dadc08f807d69ce557882005d43eb20ec62699f7036340f0d580f"},
- {file = "arro3_core-0.6.5-cp310-cp310-musllinux_1_2_armv7l.whl", hash = "sha256:56d8166235a4c54e4f7ba082ec76890c820fa8c1b6c995ec59cead62a9698e59"},
- {file = "arro3_core-0.6.5-cp310-cp310-musllinux_1_2_i686.whl", hash = "sha256:1ba43ba9081c00767083195222b6be74913de668296f55599658c4b0bb7cd327"},
- {file = "arro3_core-0.6.5-cp310-cp310-musllinux_1_2_x86_64.whl", hash = "sha256:4f5df13c6742e3f0b494cfe9025dccdc8426a74cc9e3e5a1239311e07a4b24e0"},
- {file = "arro3_core-0.6.5-cp310-cp310-win_amd64.whl", hash = "sha256:34676b728178236df63c9ea10b21432392d4b5bb51e2030e77c68eed4dede2ad"},
- {file = "arro3_core-0.6.5-cp311-abi3-macosx_10_12_x86_64.whl", hash = "sha256:9d5999506daec1ab31096b3deb1e3573041d6ecadb4ca99c96f7ab26720c592c"},
- {file = "arro3_core-0.6.5-cp311-abi3-macosx_11_0_arm64.whl", hash = "sha256:bd3e251184c2dd6ade81c5613256b6d85ab3ddbd5af838b1de657e0ddec017f8"},
- {file = "arro3_core-0.6.5-cp311-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:7cadb29349960d3821b0515d9df80f2725cea155ad966c699f6084de32e313cb"},
- {file = "arro3_core-0.6.5-cp311-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:a922e560ed2ccee3293d51b39e013b51cc233895d25ddafcacfb83c540a19e6f"},
- {file = "arro3_core-0.6.5-cp311-abi3-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:68fe6672bf51f039b12046a209cba0a9405e10ae44e5a0d557f091b356a62051"},
- {file = "arro3_core-0.6.5-cp311-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:5c3ee95603e375401a58ff763ce2c8aa858e0c4f757c1fb719f48fb070f540b2"},
- {file = "arro3_core-0.6.5-cp311-abi3-manylinux_2_24_aarch64.whl", hash = "sha256:fbaf6b65213630007b798b565e0701c2092a330deeba16bd3d896d401f7e9f28"},
- {file = "arro3_core-0.6.5-cp311-abi3-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:20679f874558bb2113e96325522625ec64a72687000b7a9578031a4d082c6ef5"},
- {file = "arro3_core-0.6.5-cp311-abi3-musllinux_1_2_aarch64.whl", hash = "sha256:d82d6ec32d5c7c73057fb9c528390289fd5bc94b8d8f28fca9c56fc8e41c412c"},
- {file = "arro3_core-0.6.5-cp311-abi3-musllinux_1_2_armv7l.whl", hash = "sha256:4cba4db0a4203a3ccf131c3fb7804d77f0740d6165ec9efa3aa3acbca87c43a3"},
- {file = "arro3_core-0.6.5-cp311-abi3-musllinux_1_2_i686.whl", hash = "sha256:e358affc4a0fe5c1b5dccf4f92c43a836aaa4c4eab0906c83b00b60275de3b6d"},
- {file = "arro3_core-0.6.5-cp311-abi3-musllinux_1_2_x86_64.whl", hash = "sha256:324e43f07b7681846d00a8995b78bdc4b4a719047aa0d34426b462b8f208ee98"},
- {file = "arro3_core-0.6.5-cp311-abi3-win_amd64.whl", hash = "sha256:285f802c8a42fe29ecb84584d1700bc4c4f974552b75f805e1f4362d28b97080"},
- {file = "arro3_core-0.6.5-cp313-cp313t-macosx_10_12_x86_64.whl", hash = "sha256:8c20e69c3b3411fd6ed56091f388e699072651e880e682be5bd14f3a392ed3e8"},
- {file = "arro3_core-0.6.5-cp313-cp313t-macosx_11_0_arm64.whl", hash = "sha256:92211f1d03221ff74d0b535a576b39601083d8e98e9d47228314573f9d4f9ae2"},
- {file = "arro3_core-0.6.5-cp313-cp313t-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:280d933b75f2649779d76e32a07f91d2352a952f2c97ddf7b320e267f440cd42"},
- {file = "arro3_core-0.6.5-cp313-cp313t-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:bfc3f6b93b924f43fb7985b06202343c30b43da6bd5055ba8b84eda431e494d4"},
- {file = "arro3_core-0.6.5-cp313-cp313t-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:a5963635eb698ebc7da689e641f68b3998864bab894cf0ca84bd058b8c60d97f"},
- {file = "arro3_core-0.6.5-cp313-cp313t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:ac291b3e74b57e56e03373d57530540cbbbfd92e4219fe2778ea531006673fe9"},
- {file = "arro3_core-0.6.5-cp313-cp313t-manylinux_2_24_aarch64.whl", hash = "sha256:5d3f4cc58a654037d61f61ba230419da2c8f88a0ac82b9d41fe307f7cf9fda97"},
- {file = "arro3_core-0.6.5-cp313-cp313t-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:93cddac90238d64451f5e66c630ded89d0b5fd6d2c099bf3a5151dde2c1ddf1d"},
- {file = "arro3_core-0.6.5-cp313-cp313t-musllinux_1_2_aarch64.whl", hash = "sha256:1fa7ac10db5846c33f4e8b66a6eaa705d84998e38575a835acac9a6a6649933d"},
- {file = "arro3_core-0.6.5-cp313-cp313t-musllinux_1_2_armv7l.whl", hash = "sha256:ca69f698a065cdbf845d59d412bc204e8f8af12f93737d82e6a18f3cff812349"},
- {file = "arro3_core-0.6.5-cp313-cp313t-musllinux_1_2_i686.whl", hash = "sha256:de74a2512e2e2366d4b064c498c38672bf6ddea38acec8b1999b4e66182dd001"},
- {file = "arro3_core-0.6.5-cp313-cp313t-musllinux_1_2_x86_64.whl", hash = "sha256:806ca8e20507675b2de68b3d009f76e898cc3c3e441c834ea5220866f68aac50"},
- {file = "arro3_core-0.6.5-cp313-cp313t-win_amd64.whl", hash = "sha256:8f6f0cc78877ade7ad6e678a4671b191406547e7b407bc9637436869c017ed47"},
- {file = "arro3_core-0.6.5-cp39-cp39-macosx_10_12_x86_64.whl", hash = "sha256:dfac7fac3c6a302399d94644d48682a19488a5b67bd1ccbdf6c560a7ffabde6d"},
- {file = "arro3_core-0.6.5-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:9fc70042e558d1cd5fbe917b58e8ef52701441e38ff30b1912858050f796a62c"},
- {file = "arro3_core-0.6.5-cp39-cp39-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:1583b29b2ba83927a33e5435e5d9d134114c45a6360a8bb4db4beda13dab4fd8"},
- {file = "arro3_core-0.6.5-cp39-cp39-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:6a170fe53f18dda4a4647fd3b8b4a9373fc11ac42c41a4b65f55d79ad531a33e"},
- {file = "arro3_core-0.6.5-cp39-cp39-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:83047b4e6e18835c91c8d12c5494e6ababc7c185c5a772d3429e8f9b0c185894"},
- {file = "arro3_core-0.6.5-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:d3d4393d281d1ef18927915a11187da27287d279f99d5325bc9afb417f76084f"},
- {file = "arro3_core-0.6.5-cp39-cp39-manylinux_2_24_aarch64.whl", hash = "sha256:f0c88d8babcf51affdd69390882e2f0ecb1890a1b8a5abfc087d003e7181eb6e"},
- {file = "arro3_core-0.6.5-cp39-cp39-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:36424e1d62212466a5cacdc27d414e99bf0fdab1544cc2b7e5b81e41437e5970"},
- {file = "arro3_core-0.6.5-cp39-cp39-musllinux_1_2_aarch64.whl", hash = "sha256:4eb4d96f7db618f100758a8b7ec1b221c8737d543073701b7ffee74bc5019d46"},
- {file = "arro3_core-0.6.5-cp39-cp39-musllinux_1_2_armv7l.whl", hash = "sha256:2cfe9b4b1dd663d256754f1aa7aae783a1cddd3eb5698892b9caf381431f0af7"},
- {file = "arro3_core-0.6.5-cp39-cp39-musllinux_1_2_i686.whl", hash = "sha256:a3b2621505f97eb5ce80f1c6fa8c77d18d757ab48d1f11d33a805e9ccbcd6fb6"},
- {file = "arro3_core-0.6.5-cp39-cp39-musllinux_1_2_x86_64.whl", hash = "sha256:6c1becbb96ceba0b20f3d4318dd35f3417ee9a49065813d99f52b0fa285fc569"},
- {file = "arro3_core-0.6.5-cp39-cp39-win_amd64.whl", hash = "sha256:5459e7bd39bb9dd8c57aa06856d2bebc5c1ca782cbccab0e186c6c89530e4ca9"},
- {file = "arro3_core-0.6.5-pp310-pypy310_pp73-macosx_10_12_x86_64.whl", hash = "sha256:26d5b50139f1a96727fa1760b4d70393acf5ee0fba45346ad2d4f69824d3bdc2"},
- {file = "arro3_core-0.6.5-pp310-pypy310_pp73-macosx_11_0_arm64.whl", hash = "sha256:b65b3d8d7f65f2f3c36002dc467380d7a31ea771132986dddc6341c5a9dc726f"},
- {file = "arro3_core-0.6.5-pp310-pypy310_pp73-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:6c3442a79a757ed3fbd7793de180019ae3201f04237537c2e2e3f1e3dd99b31c"},
- {file = "arro3_core-0.6.5-pp310-pypy310_pp73-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:def7b0065a684d6f903a658d2567da47e2fcecde716e0b34eff4d899c6468c8d"},
- {file = "arro3_core-0.6.5-pp310-pypy310_pp73-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:cbfe2f2d4d0d393833cd6a4bd9c15266a02307a3028f159155a1c536469c3ae7"},
- {file = "arro3_core-0.6.5-pp310-pypy310_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:a191a3e4f72c34f7ace7724a94f2d90b06c804a6cbece4ae0f18d36325479cf3"},
- {file = "arro3_core-0.6.5-pp310-pypy310_pp73-manylinux_2_24_aarch64.whl", hash = "sha256:e3f6ab4c6ea96c451eff72aa6c5b9835a0ea8a9847cfe3995c88cce0c7701fb5"},
- {file = "arro3_core-0.6.5-pp310-pypy310_pp73-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:27df5239835330299636a02977f2cb34d5c460cc03b2ae1d6ab6a03d28051b08"},
- {file = "arro3_core-0.6.5-pp310-pypy310_pp73-musllinux_1_2_aarch64.whl", hash = "sha256:71dce89c0e91be4cfb42591f03809235bbc374c396e08acdf93c4d85b09e40f5"},
- {file = "arro3_core-0.6.5-pp310-pypy310_pp73-musllinux_1_2_armv7l.whl", hash = "sha256:d380c28f85568ed99c1686fb9d64b5a811d76d569f367cbec8ef7e58f6e2fdf9"},
- {file = "arro3_core-0.6.5-pp310-pypy310_pp73-musllinux_1_2_i686.whl", hash = "sha256:8e359c0c4fe9992f5a863a4a31502ea58eb2f92988fc2e501850540b3eff0328"},
- {file = "arro3_core-0.6.5-pp310-pypy310_pp73-musllinux_1_2_x86_64.whl", hash = "sha256:9a58acbc61480b533aa84d735db04b1e68fc7f6807ab694d606c03b5e694d83d"},
- {file = "arro3_core-0.6.5.tar.gz", hash = "sha256:768078887cd7ac82de4736f94bbd91f6d660f10779848bd5b019f511badd9d75"},
-]
-markers = {main = "extra == \"deltalake\" or extra == \"formats\" or extra == \"all\""}
-
-[package.dependencies]
-typing-extensions = {version = "*", markers = "python_full_version < \"3.12.0\""}
-
[[package]]
name = "asn1crypto"
version = "1.5.1"
@@ -90,7 +11,7 @@ files = [
{file = "asn1crypto-1.5.1-py2.py3-none-any.whl", hash = "sha256:db4e40728b728508912cbb3d44f19ce188f218e9eba635821bb4b68564f8fd67"},
{file = "asn1crypto-1.5.1.tar.gz", hash = "sha256:13ae38502be632115abf8a24cbe5f4da52e3b5231990aff31123c805306ccb9c"},
]
-markers = {main = "python_version <= \"3.13\" and (extra == \"snowflake\" or extra == \"warehouses\" or extra == \"all\")", dev = "python_version <= \"3.13\""}
+markers = {main = "(extra == \"snowflake\" or extra == \"warehouses\" or extra == \"all\") and python_version <= \"3.13\"", dev = "python_version <= \"3.13\""}
[[package]]
name = "attrs"
@@ -105,49 +26,6 @@ files = [
]
markers = {main = "extra == \"validation\" or extra == \"all\""}
-[[package]]
-name = "azure-core"
-version = "1.38.0"
-description = "Microsoft Azure Core Library for Python"
-optional = false
-python-versions = ">=3.9"
-groups = ["main", "dev"]
-files = [
- {file = "azure_core-1.38.0-py3-none-any.whl", hash = "sha256:ab0c9b2cd71fecb1842d52c965c95285d3cfb38902f6766e4a471f1cd8905335"},
- {file = "azure_core-1.38.0.tar.gz", hash = "sha256:8194d2682245a3e4e3151a667c686464c3786fed7918b394d035bdcd61bb5993"},
-]
-markers = {main = "extra == \"azure\" or extra == \"cloud\" or extra == \"all\""}
-
-[package.dependencies]
-requests = ">=2.21.0"
-typing-extensions = ">=4.6.0"
-
-[package.extras]
-aio = ["aiohttp (>=3.0)"]
-tracing = ["opentelemetry-api (>=1.26,<2.0)"]
-
-[[package]]
-name = "azure-storage-blob"
-version = "12.28.0"
-description = "Microsoft Azure Blob Storage Client Library for Python"
-optional = false
-python-versions = ">=3.9"
-groups = ["main", "dev"]
-files = [
- {file = "azure_storage_blob-12.28.0-py3-none-any.whl", hash = "sha256:00fb1db28bf6a7b7ecaa48e3b1d5c83bfadacc5a678b77826081304bd87d6461"},
- {file = "azure_storage_blob-12.28.0.tar.gz", hash = "sha256:e7d98ea108258d29aa0efbfd591b2e2075fa1722a2fae8699f0b3c9de11eff41"},
-]
-markers = {main = "extra == \"azure\" or extra == \"cloud\" or extra == \"all\""}
-
-[package.dependencies]
-azure-core = ">=1.30.0"
-cryptography = ">=2.1.4"
-isodate = ">=0.6.1"
-typing-extensions = ">=4.6.0"
-
-[package.extras]
-aio = ["azure-core[aio] (>=1.30.0)"]
-
[[package]]
name = "boto3"
version = "1.42.30"
@@ -159,7 +37,7 @@ files = [
{file = "boto3-1.42.30-py3-none-any.whl", hash = "sha256:d7e548bea65e0ae2c465c77de937bc686b591aee6a352d5a19a16bc751e591c1"},
{file = "boto3-1.42.30.tar.gz", hash = "sha256:ba9cd2f7819637d15bfbeb63af4c567fcc8a7dcd7b93dd12734ec58601169538"},
]
-markers = {main = "(python_version <= \"3.13\" or extra == \"s3\" or extra == \"cloud\" or extra == \"redshift\" or extra == \"warehouses\" or extra == \"all\") and (extra == \"s3\" or extra == \"cloud\" or extra == \"redshift\" or extra == \"warehouses\" or extra == \"all\" or extra == \"snowflake\")"}
+markers = {main = "(extra == \"s3\" or extra == \"cloud\" or extra == \"redshift\" or extra == \"warehouses\" or extra == \"all\" or extra == \"snowflake\") and (python_version <= \"3.13\" or extra == \"s3\" or extra == \"cloud\" or extra == \"redshift\" or extra == \"warehouses\" or extra == \"all\")"}
[package.dependencies]
botocore = ">=1.42.30,<1.43.0"
@@ -180,7 +58,7 @@ files = [
{file = "botocore-1.42.30-py3-none-any.whl", hash = "sha256:97070a438cac92430bb7b65f8ebd7075224f4a289719da4ee293d22d1e98db02"},
{file = "botocore-1.42.30.tar.gz", hash = "sha256:9bf1662b8273d5cc3828a49f71ca85abf4e021011c1f0a71f41a2ea5769a5116"},
]
-markers = {main = "(python_version <= \"3.13\" or extra == \"s3\" or extra == \"cloud\" or extra == \"redshift\" or extra == \"warehouses\" or extra == \"all\") and (extra == \"s3\" or extra == \"cloud\" or extra == \"redshift\" or extra == \"warehouses\" or extra == \"all\" or extra == \"snowflake\")"}
+markers = {main = "(extra == \"s3\" or extra == \"cloud\" or extra == \"redshift\" or extra == \"warehouses\" or extra == \"all\" or extra == \"snowflake\") and (python_version <= \"3.13\" or extra == \"s3\" or extra == \"cloud\" or extra == \"redshift\" or extra == \"warehouses\" or extra == \"all\")"}
[package.dependencies]
jmespath = ">=0.7.1,<2.0.0"
@@ -201,7 +79,7 @@ files = [
{file = "certifi-2026.1.4-py3-none-any.whl", hash = "sha256:9943707519e4add1115f44c2bc244f782c0249876bf51b6599fee1ffbedd685c"},
{file = "certifi-2026.1.4.tar.gz", hash = "sha256:ac726dd470482006e014ad384921ed6438c457018f4b3d204aea4281258b2120"},
]
-markers = {main = "(python_version <= \"3.13\" or extra == \"gcs\" or extra == \"cloud\" or extra == \"all\" or extra == \"bigquery\" or extra == \"warehouses\" or extra == \"azure\") and (extra == \"gcs\" or extra == \"cloud\" or extra == \"all\" or extra == \"bigquery\" or extra == \"warehouses\" or extra == \"azure\" or extra == \"snowflake\")"}
+markers = {main = "(extra == \"snowflake\" or extra == \"warehouses\" or extra == \"all\" or extra == \"bigquery\") and (python_version <= \"3.13\" or extra == \"bigquery\" or extra == \"warehouses\" or extra == \"all\")"}
[[package]]
name = "cffi"
@@ -279,7 +157,7 @@ files = [
{file = "cffi-1.17.1-cp39-cp39-win_amd64.whl", hash = "sha256:d016c76bdd850f3c626af19b0542c9677ba156e4ee4fccfdd7848803533ef662"},
{file = "cffi-1.17.1.tar.gz", hash = "sha256:1c39c6016c32bc48dd54561950ebd6836e1670f2ae46128f67cf49e789c52824"},
]
-markers = {main = "python_version <= \"3.13\" and (platform_python_implementation != \"PyPy\" or extra == \"snowflake\" or extra == \"warehouses\" or extra == \"all\") and (extra == \"azure\" or extra == \"cloud\" or extra == \"all\" or extra == \"snowflake\" or extra == \"warehouses\")", dev = "python_version <= \"3.13\""}
+markers = {main = "(extra == \"snowflake\" or extra == \"warehouses\" or extra == \"all\") and python_version <= \"3.13\"", dev = "python_version <= \"3.13\""}
[package.dependencies]
pycparser = "*"
@@ -290,7 +168,8 @@ version = "2.0.0"
description = "Foreign Function Interface for Python calling C code."
optional = false
python-versions = ">=3.9"
-groups = ["main", "dev"]
+groups = ["dev"]
+markers = "python_version >= \"3.14\" and platform_python_implementation != \"PyPy\""
files = [
{file = "cffi-2.0.0-cp310-cp310-macosx_10_13_x86_64.whl", hash = "sha256:0cf2d91ecc3fcc0625c2c530fe004f82c110405f101548512cce44322fa8ac44"},
{file = "cffi-2.0.0-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:f73b96c41e3b2adedc34a7356e64c8eb96e03a3782b535e043a986276ce12a49"},
@@ -377,7 +256,6 @@ files = [
{file = "cffi-2.0.0-cp39-cp39-win_amd64.whl", hash = "sha256:b882b3df248017dba09d6b16defe9b5c407fe32fc7c65a9c69798e6175601be9"},
{file = "cffi-2.0.0.tar.gz", hash = "sha256:44d1b5909021139fe36001ae048dbdde8214afa20200eda0f64c068cac5d5529"},
]
-markers = {main = "python_version >= \"3.14\" and platform_python_implementation != \"PyPy\" and (extra == \"azure\" or extra == \"cloud\" or extra == \"all\")", dev = "python_version >= \"3.14\" and platform_python_implementation != \"PyPy\""}
[package.dependencies]
pycparser = {version = "*", markers = "implementation_name != \"PyPy\""}
@@ -516,7 +394,7 @@ files = [
{file = "charset_normalizer-3.4.4-py3-none-any.whl", hash = "sha256:7a32c560861a02ff789ad905a2fe94e3f840803362c84fecf1851cb4cf3dc37f"},
{file = "charset_normalizer-3.4.4.tar.gz", hash = "sha256:94537985111c35f28720e43603b8e7b43a6ecfb2ce1d3058bbe955b73404e21a"},
]
-markers = {main = "(python_version <= \"3.13\" or extra == \"gcs\" or extra == \"cloud\" or extra == \"all\" or extra == \"bigquery\" or extra == \"warehouses\" or extra == \"azure\") and (extra == \"gcs\" or extra == \"cloud\" or extra == \"all\" or extra == \"bigquery\" or extra == \"warehouses\" or extra == \"azure\" or extra == \"snowflake\")"}
+markers = {main = "(extra == \"snowflake\" or extra == \"warehouses\" or extra == \"all\" or extra == \"bigquery\") and (python_version <= \"3.13\" or extra == \"bigquery\" or extra == \"warehouses\" or extra == \"all\")"}
[[package]]
name = "click"
@@ -778,7 +656,7 @@ files = [
{file = "cryptography-46.0.0-pp311-pypy311_pp73-win_amd64.whl", hash = "sha256:bf1961037309ee0bdf874ccba9820b1c2f720c2016895c44d8eb2316226c1ad5"},
{file = "cryptography-46.0.0.tar.gz", hash = "sha256:99f64a6d15f19f3afd78720ad2978f6d8d4c68cd4eb600fab82ab1a7c2071dca"},
]
-markers = {main = "(python_version <= \"3.13\" or extra == \"azure\" or extra == \"cloud\" or extra == \"all\") and (extra == \"azure\" or extra == \"cloud\" or extra == \"all\" or extra == \"snowflake\" or extra == \"warehouses\")"}
+markers = {main = "(extra == \"snowflake\" or extra == \"warehouses\" or extra == \"all\") and python_version <= \"3.13\""}
[package.dependencies]
cffi = [
@@ -797,51 +675,6 @@ ssh = ["bcrypt (>=3.1.5)"]
test = ["certifi (>=2024)", "cryptography-vectors (==46.0.0)", "pretend (>=0.7)", "pytest (>=7.4.0)", "pytest-benchmark (>=4.0)", "pytest-cov (>=2.10.1)", "pytest-xdist (>=3.5.0)"]
test-randomorder = ["pytest-randomly"]
-[[package]]
-name = "deltalake"
-version = "1.4.1"
-description = "Native Delta Lake Python binding based on delta-rs with Pandas integration"
-optional = false
-python-versions = ">=3.10"
-groups = ["main", "dev"]
-files = [
- {file = "deltalake-1.4.1-cp310-abi3-macosx_10_12_x86_64.whl", hash = "sha256:dc7b6b99bd8a8d4949645b8f6073d0ced9dd06109fa9669b7802ddf8207291e1"},
- {file = "deltalake-1.4.1-cp310-abi3-macosx_11_0_arm64.whl", hash = "sha256:0d4c460a4fab802cf051ca66a49583d93a3490842eb849bd1aae7176b12b8030"},
- {file = "deltalake-1.4.1-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:bcd312abe5928f0da3217901431f6f537da4d51162d23cd81fc3849559c5f5cc"},
- {file = "deltalake-1.4.1-cp310-abi3-manylinux_2_28_aarch64.whl", hash = "sha256:cc98a04918d0acd0a425ecaf33b6fbbbe458d395da31139554b97b7a62a045f9"},
- {file = "deltalake-1.4.1-cp310-abi3-musllinux_1_2_x86_64.whl", hash = "sha256:9a3af28f14535122395b89ff146c5acca4c00db63191c59c39a0ae30356030b6"},
- {file = "deltalake-1.4.1-cp310-abi3-win_amd64.whl", hash = "sha256:4dd4648be88375b5dadd119cd2a45f481e3df6007da6a519d49646f202b036da"},
- {file = "deltalake-1.4.1.tar.gz", hash = "sha256:0a7e7f2f0f60edab087087f0144e539428c8d5a0e6f80f86fe49db82499a50ab"},
-]
-markers = {main = "extra == \"deltalake\" or extra == \"formats\" or extra == \"all\""}
-
-[package.dependencies]
-arro3-core = ">=0.5.0"
-deprecated = ">=1.2.18"
-
-[package.extras]
-pandas = ["pandas"]
-pyarrow = ["pyarrow (>=16)"]
-
-[[package]]
-name = "deprecated"
-version = "1.3.1"
-description = "Python @deprecated decorator to deprecate old python classes, functions or methods."
-optional = false
-python-versions = "!=3.0.*,!=3.1.*,!=3.2.*,!=3.3.*,>=2.7"
-groups = ["main", "dev"]
-files = [
- {file = "deprecated-1.3.1-py2.py3-none-any.whl", hash = "sha256:597bfef186b6f60181535a29fbe44865ce137a5079f295b479886c82729d5f3f"},
- {file = "deprecated-1.3.1.tar.gz", hash = "sha256:b1b50e0ff0c1fddaa5708a2c6b0a6588bb09b892825ab2b214ac9ea9d92a5223"},
-]
-markers = {main = "extra == \"deltalake\" or extra == \"formats\" or extra == \"all\""}
-
-[package.dependencies]
-wrapt = ">=1.10,<3"
-
-[package.extras]
-dev = ["PyTest", "PyTest-Cov", "bump2version (<1)", "setuptools ; python_version >= \"3.12\"", "tox"]
-
[[package]]
name = "distlib"
version = "0.4.0"
@@ -854,98 +687,6 @@ files = [
{file = "distlib-0.4.0.tar.gz", hash = "sha256:feec40075be03a04501a973d81f633735b4b69f98b05450592310c0f401a4e0d"},
]
-[[package]]
-name = "dnspython"
-version = "2.8.0"
-description = "DNS toolkit"
-optional = false
-python-versions = ">=3.10"
-groups = ["main"]
-files = [
- {file = "dnspython-2.8.0-py3-none-any.whl", hash = "sha256:01d9bbc4a2d76bf0db7c1f729812ded6d912bd318d3b1cf81d30c0f845dbf3af"},
- {file = "dnspython-2.8.0.tar.gz", hash = "sha256:181d3c6996452cb1189c4046c61599b84a5a86e099562ffde77d26984ff26d0f"},
-]
-
-[package.extras]
-dev = ["black (>=25.1.0)", "coverage (>=7.0)", "flake8 (>=7)", "hypercorn (>=0.17.0)", "mypy (>=1.17)", "pylint (>=3)", "pytest (>=8.4)", "pytest-cov (>=6.2.0)", "quart-trio (>=0.12.0)", "sphinx (>=8.2.0)", "sphinx-rtd-theme (>=3.0.0)", "twine (>=6.1.0)", "wheel (>=0.45.0)"]
-dnssec = ["cryptography (>=45)"]
-doh = ["h2 (>=4.2.0)", "httpcore (>=1.0.0)", "httpx (>=0.28.0)"]
-doq = ["aioquic (>=1.2.0)"]
-idna = ["idna (>=3.10)"]
-trio = ["trio (>=0.30)"]
-wmi = ["wmi (>=1.5.1) ; platform_system == \"Windows\""]
-
-[[package]]
-name = "duckdb"
-version = "1.4.3"
-description = "DuckDB in-process database"
-optional = false
-python-versions = ">=3.9.0"
-groups = ["main", "dev"]
-files = [
- {file = "duckdb-1.4.3-cp310-cp310-macosx_10_9_universal2.whl", hash = "sha256:efa7f1191c59e34b688fcd4e588c1b903a4e4e1f4804945902cf0b20e08a9001"},
- {file = "duckdb-1.4.3-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:4fef6a053a1c485292000bf0c338bba60f89d334f6a06fc76ba4085a5a322b76"},
- {file = "duckdb-1.4.3-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:702dabbc22b27dc5b73e7599c60deef3d8c59968527c36b391773efddd8f4cf1"},
- {file = "duckdb-1.4.3-cp310-cp310-manylinux_2_26_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:854b79375fa618f6ffa8d84fb45cbc9db887f6c4834076ea10d20bc106f1fd90"},
- {file = "duckdb-1.4.3-cp310-cp310-manylinux_2_26_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:1bb8bd5a3dd205983726185b280a211eacc9f5bc0c4d4505bec8c87ac33a8ccb"},
- {file = "duckdb-1.4.3-cp310-cp310-win_amd64.whl", hash = "sha256:d0ff08388ef8b1d1a4c95c321d6c5fa11201b241036b1ee740f9d841df3d6ba2"},
- {file = "duckdb-1.4.3-cp311-cp311-macosx_10_9_universal2.whl", hash = "sha256:366bf607088053dce845c9d24c202c04d78022436cc5d8e4c9f0492de04afbe7"},
- {file = "duckdb-1.4.3-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:8d080e8d1bf2d226423ec781f539c8f6b6ef3fd42a9a58a7160de0a00877a21f"},
- {file = "duckdb-1.4.3-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:9dc049ba7e906cb49ca2b6d4fbf7b6615ec3883193e8abb93f0bef2652e42dda"},
- {file = "duckdb-1.4.3-cp311-cp311-manylinux_2_26_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:2b30245375ea94ab528c87c61fc3ab3e36331180b16af92ee3a37b810a745d24"},
- {file = "duckdb-1.4.3-cp311-cp311-manylinux_2_26_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:a7c864df027da1ee95f0c32def67e15d02cd4a906c9c1cbae82c09c5112f526b"},
- {file = "duckdb-1.4.3-cp311-cp311-win_amd64.whl", hash = "sha256:813f189039b46877b5517f1909c7b94a8fe01b4bde2640ab217537ea0fe9b59b"},
- {file = "duckdb-1.4.3-cp311-cp311-win_arm64.whl", hash = "sha256:fbc63ffdd03835f660155b37a1b6db2005bcd46e5ad398b8cac141eb305d2a3d"},
- {file = "duckdb-1.4.3-cp312-cp312-macosx_10_13_universal2.whl", hash = "sha256:6302452e57aef29aae3977063810ed7b2927967b97912947b9cca45c1c21955f"},
- {file = "duckdb-1.4.3-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:deab351ac43b6282a3270e3d40e3d57b3b50f472d9fd8c30975d88a31be41231"},
- {file = "duckdb-1.4.3-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:5634e40e1e2d972e4f75bced1fbdd9e9e90faa26445c1052b27de97ee546944a"},
- {file = "duckdb-1.4.3-cp312-cp312-manylinux_2_26_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:274d4a31aba63115f23e7e7b401e3e3a937f3626dc9dea820a9c7d3073f450d2"},
- {file = "duckdb-1.4.3-cp312-cp312-manylinux_2_26_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:4f868a7e6d9b37274a1aa34849ea92aa964e9bd59a5237d6c17e8540533a1e4f"},
- {file = "duckdb-1.4.3-cp312-cp312-win_amd64.whl", hash = "sha256:ef7ef15347ce97201b1b5182a5697682679b04c3374d5a01ac10ba31cf791b95"},
- {file = "duckdb-1.4.3-cp312-cp312-win_arm64.whl", hash = "sha256:1b9b445970fd18274d5ac07a0b24c032e228f967332fb5ebab3d7db27738c0e4"},
- {file = "duckdb-1.4.3-cp313-cp313-macosx_10_13_universal2.whl", hash = "sha256:16952ac05bd7e7b39946695452bf450db1ebbe387e1e7178e10f593f2ea7b9a8"},
- {file = "duckdb-1.4.3-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:de984cd24a6cbefdd6d4a349f7b9a46e583ca3e58ce10d8def0b20a6e5fcbe78"},
- {file = "duckdb-1.4.3-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:1e5457dda91b67258aae30fb1a0df84183a9f6cd27abac1d5536c0d876c6dfa1"},
- {file = "duckdb-1.4.3-cp313-cp313-manylinux_2_26_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:006aca6a6d6736c441b02ff5c7600b099bb8b7f4de094b8b062137efddce42df"},
- {file = "duckdb-1.4.3-cp313-cp313-manylinux_2_26_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:a2813f4635f4d6681cc3304020374c46aca82758c6740d7edbc237fe3aae2744"},
- {file = "duckdb-1.4.3-cp313-cp313-win_amd64.whl", hash = "sha256:6db124f53a3edcb32b0a896ad3519e37477f7e67bf4811cb41ab60c1ef74e4c8"},
- {file = "duckdb-1.4.3-cp313-cp313-win_arm64.whl", hash = "sha256:a8b0a8764e1b5dd043d168c8f749314f7a1252b5a260fa415adaa26fa3b958fd"},
- {file = "duckdb-1.4.3-cp314-cp314-macosx_10_15_universal2.whl", hash = "sha256:316711a9e852bcfe1ed6241a5f654983f67e909e290495f3562cccdf43be8180"},
- {file = "duckdb-1.4.3-cp314-cp314-macosx_10_15_x86_64.whl", hash = "sha256:9e625b2b4d52bafa1fd0ebdb0990c3961dac8bb00e30d327185de95b68202131"},
- {file = "duckdb-1.4.3-cp314-cp314-macosx_11_0_arm64.whl", hash = "sha256:130c6760f6c573f9c9fe9aba56adba0fab48811a4871b7b8fd667318b4a3e8da"},
- {file = "duckdb-1.4.3-cp314-cp314-manylinux_2_26_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:20c88effaa557a11267706b01419c542fe42f893dee66e5a6daa5974ea2d4a46"},
- {file = "duckdb-1.4.3-cp314-cp314-manylinux_2_26_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:1b35491db98ccd11d151165497c084a9d29d3dc42fc80abea2715a6c861ca43d"},
- {file = "duckdb-1.4.3-cp314-cp314-win_amd64.whl", hash = "sha256:23b12854032c1a58d0452e2b212afa908d4ce64171862f3792ba9a596ba7c765"},
- {file = "duckdb-1.4.3-cp314-cp314-win_arm64.whl", hash = "sha256:90f241f25cffe7241bf9f376754a5845c74775e00e1c5731119dc88cd71e0cb2"},
- {file = "duckdb-1.4.3-cp39-cp39-macosx_10_9_universal2.whl", hash = "sha256:aa26a7406205bc1426cee28bdfdf084f669a5686977dafa4c3ec65873989593c"},
- {file = "duckdb-1.4.3-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:caa2164c91f7e91befb1ffb081b3cd97a137117533aef7abe1538b03ad72e3a9"},
- {file = "duckdb-1.4.3-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:8d53b217698a76c4957e2c807dd9295d409146f9d3d7932f372883201ba9d25a"},
- {file = "duckdb-1.4.3-cp39-cp39-manylinux_2_26_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:8afba22c370f06b7314aa46bfed052509269e482bcfb3f7b1ea0fa17ae49ce42"},
- {file = "duckdb-1.4.3-cp39-cp39-manylinux_2_26_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:2b195270ff1a661f22cbd547a215baff265b7d4469a76a215c8992b5994107c3"},
- {file = "duckdb-1.4.3-cp39-cp39-win_amd64.whl", hash = "sha256:23a3a077821bed1768a84ac9cbf6b6487ead33e28e62cb118bda5fb8f9e53dea"},
- {file = "duckdb-1.4.3.tar.gz", hash = "sha256:fea43e03604c713e25a25211ada87d30cd2a044d8f27afab5deba26ac49e5268"},
-]
-markers = {main = "(extra == \"duckdb\" or extra == \"databases\" or extra == \"formats\" or extra == \"all\") and platform_system != \"Windows\"", dev = "platform_system != \"Windows\""}
-
-[package.extras]
-all = ["adbc-driver-manager", "fsspec", "ipython", "numpy", "pandas", "pyarrow"]
-
-[[package]]
-name = "email-validator"
-version = "2.3.0"
-description = "A robust email address syntax and deliverability validation library."
-optional = false
-python-versions = ">=3.8"
-groups = ["main"]
-files = [
- {file = "email_validator-2.3.0-py3-none-any.whl", hash = "sha256:80f13f623413e6b197ae73bb10bf4eb0908faf509ad8362c5edeb0be7fd450b4"},
- {file = "email_validator-2.3.0.tar.gz", hash = "sha256:9fc05c37f2f6cf439ff414f8fc46d917929974a82244c20eb10231ba60c54426"},
-]
-
-[package.dependencies]
-dnspython = ">=2.0.0"
-idna = ">=2.0.0"
-
[[package]]
name = "exceptiongroup"
version = "1.3.1"
@@ -965,69 +706,6 @@ typing-extensions = {version = ">=4.6.0", markers = "python_version < \"3.13\""}
[package.extras]
test = ["pytest (>=6)"]
-[[package]]
-name = "fastavro"
-version = "1.12.1"
-description = "Fast read/write of AVRO files"
-optional = false
-python-versions = ">=3.9"
-groups = ["main", "dev"]
-files = [
- {file = "fastavro-1.12.1-cp310-cp310-macosx_10_9_universal2.whl", hash = "sha256:00650ca533907361edda22e6ffe8cf87ab2091c5d8aee5c8000b0f2dcdda7ed3"},
- {file = "fastavro-1.12.1-cp310-cp310-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:ac76d6d95f909c72ee70d314b460b7e711d928845771531d823eb96a10952d26"},
- {file = "fastavro-1.12.1-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:1f55eef18c41d4476bd32a82ed5dd86aabc3f614e1b66bdb09ffa291612e1670"},
- {file = "fastavro-1.12.1-cp310-cp310-musllinux_1_2_aarch64.whl", hash = "sha256:81563e1f93570e6565487cdb01ba241a36a00e58cff9c5a0614af819d1155d8f"},
- {file = "fastavro-1.12.1-cp310-cp310-musllinux_1_2_x86_64.whl", hash = "sha256:bec207360f76f0b3de540758a297193c5390e8e081c43c3317f610b1414d8c8f"},
- {file = "fastavro-1.12.1-cp310-cp310-win_amd64.whl", hash = "sha256:c0390bfe4a9f8056a75ac6785fbbff8f5e317f5356481d2e29ec980877d2314b"},
- {file = "fastavro-1.12.1-cp311-cp311-macosx_10_9_universal2.whl", hash = "sha256:6b632b713bc5d03928a87d811fa4a11d5f25cd43e79c161e291c7d3f7aa740fd"},
- {file = "fastavro-1.12.1-cp311-cp311-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:eaa7ab3769beadcebb60f0539054c7755f63bd9cf7666e2c15e615ab605f89a8"},
- {file = "fastavro-1.12.1-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:123fb221df3164abd93f2d042c82f538a1d5a43ce41375f12c91ce1355a9141e"},
- {file = "fastavro-1.12.1-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:632a4e3ff223f834ddb746baae0cc7cee1068eb12c32e4d982c2fee8a5b483d0"},
- {file = "fastavro-1.12.1-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:83e6caf4e7a8717d932a3b1ff31595ad169289bbe1128a216be070d3a8391671"},
- {file = "fastavro-1.12.1-cp311-cp311-win_amd64.whl", hash = "sha256:b91a0fe5a173679a6c02d53ca22dcaad0a2c726b74507e0c1c2e71a7c3f79ef9"},
- {file = "fastavro-1.12.1-cp312-cp312-macosx_10_13_universal2.whl", hash = "sha256:509818cb24b98a804fc80be9c5fed90f660310ae3d59382fc811bfa187122167"},
- {file = "fastavro-1.12.1-cp312-cp312-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:089e155c0c76e0d418d7e79144ce000524dd345eab3bc1e9c5ae69d500f71b14"},
- {file = "fastavro-1.12.1-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:44cbff7518901c91a82aab476fcab13d102e4999499df219d481b9e15f61af34"},
- {file = "fastavro-1.12.1-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:a275e48df0b1701bb764b18a8a21900b24cf882263cb03d35ecdba636bbc830b"},
- {file = "fastavro-1.12.1-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:2de72d786eb38be6b16d556b27232b1bf1b2797ea09599507938cdb7a9fe3e7c"},
- {file = "fastavro-1.12.1-cp312-cp312-win_amd64.whl", hash = "sha256:9090f0dee63fe022ee9cc5147483366cc4171c821644c22da020d6b48f576b4f"},
- {file = "fastavro-1.12.1-cp313-cp313-macosx_10_13_universal2.whl", hash = "sha256:78df838351e4dff9edd10a1c41d1324131ffecbadefb9c297d612ef5363c049a"},
- {file = "fastavro-1.12.1-cp313-cp313-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:780476c23175d2ae457c52f45b9ffa9d504593499a36cd3c1929662bf5b7b14b"},
- {file = "fastavro-1.12.1-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:0714b285160fcd515eb0455540f40dd6dac93bdeacdb03f24e8eac3d8aa51f8d"},
- {file = "fastavro-1.12.1-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:a8bc2dcec5843d499f2489bfe0747999108f78c5b29295d877379f1972a3d41a"},
- {file = "fastavro-1.12.1-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:3b1921ac35f3d89090a5816b626cf46e67dbecf3f054131f84d56b4e70496f45"},
- {file = "fastavro-1.12.1-cp313-cp313-win_amd64.whl", hash = "sha256:5aa777b8ee595b50aa084104cd70670bf25a7bbb9fd8bb5d07524b0785ee1699"},
- {file = "fastavro-1.12.1-cp313-cp313t-macosx_10_13_universal2.whl", hash = "sha256:c3d67c47f177e486640404a56f2f50b165fe892cc343ac3a34673b80cc7f1dd6"},
- {file = "fastavro-1.12.1-cp313-cp313t-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:5217f773492bac43dae15ff2931432bce2d7a80be7039685a78d3fab7df910bd"},
- {file = "fastavro-1.12.1-cp313-cp313t-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:469fecb25cba07f2e1bfa4c8d008477cd6b5b34a59d48715e1b1a73f6160097d"},
- {file = "fastavro-1.12.1-cp313-cp313t-musllinux_1_2_aarch64.whl", hash = "sha256:d71c8aa841ef65cfab709a22bb887955f42934bced3ddb571e98fdbdade4c609"},
- {file = "fastavro-1.12.1-cp313-cp313t-musllinux_1_2_x86_64.whl", hash = "sha256:b81fc04e85dfccf7c028e0580c606e33aa8472370b767ef058aae2c674a90746"},
- {file = "fastavro-1.12.1-cp314-cp314-macosx_10_15_universal2.whl", hash = "sha256:9445da127751ba65975d8e4bdabf36bfcfdad70fc35b2d988e3950cce0ec0e7c"},
- {file = "fastavro-1.12.1-cp314-cp314-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:ed924233272719b5d5a6a0b4d80ef3345fc7e84fc7a382b6232192a9112d38a6"},
- {file = "fastavro-1.12.1-cp314-cp314-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:3616e2f0e1c9265e92954fa099db79c6e7817356d3ff34f4bcc92699ae99697c"},
- {file = "fastavro-1.12.1-cp314-cp314-musllinux_1_2_aarch64.whl", hash = "sha256:cb0337b42fd3c047fcf0e9b7597bd6ad25868de719f29da81eabb6343f08d399"},
- {file = "fastavro-1.12.1-cp314-cp314-musllinux_1_2_x86_64.whl", hash = "sha256:64961ab15b74b7c168717bbece5660e0f3d457837c3cc9d9145181d011199fa7"},
- {file = "fastavro-1.12.1-cp314-cp314-win_amd64.whl", hash = "sha256:792356d320f6e757e89f7ac9c22f481e546c886454a6709247f43c0dd7058004"},
- {file = "fastavro-1.12.1-cp314-cp314t-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:120aaf82ac19d60a1016afe410935fe94728752d9c2d684e267e5b7f0e70f6d9"},
- {file = "fastavro-1.12.1-cp314-cp314t-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:b6a3462934b20a74f9ece1daa49c2e4e749bd9a35fa2657b53bf62898fba80f5"},
- {file = "fastavro-1.12.1-cp314-cp314t-musllinux_1_2_aarch64.whl", hash = "sha256:1f81011d54dd47b12437b51dd93a70a9aa17b61307abf26542fc3c13efbc6c51"},
- {file = "fastavro-1.12.1-cp314-cp314t-musllinux_1_2_x86_64.whl", hash = "sha256:43ded16b3f4a9f1a42f5970c2aa618acb23ea59c4fcaa06680bdf470b255e5a8"},
- {file = "fastavro-1.12.1-cp39-cp39-macosx_10_9_universal2.whl", hash = "sha256:02281432dcb11c78b3280da996eff61ee0eff39c5de06c6e0fbf19275093e6d4"},
- {file = "fastavro-1.12.1-cp39-cp39-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:4128978b930aaf930332db4b3acc290783183f3be06a241ae4a482f3ed8ce892"},
- {file = "fastavro-1.12.1-cp39-cp39-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:546ffffda6610fca672f0ed41149808e106d8272bb246aa7539fa8bb6f117f17"},
- {file = "fastavro-1.12.1-cp39-cp39-musllinux_1_2_aarch64.whl", hash = "sha256:a7d840ccd9aacada3ddc80fbcc4ea079b658107fe62e9d289a0de9d54e95d366"},
- {file = "fastavro-1.12.1-cp39-cp39-musllinux_1_2_x86_64.whl", hash = "sha256:3100ad643e7fa658469a2a2db229981c1a000ff16b8037c0b58ce3ec4d2107e8"},
- {file = "fastavro-1.12.1-cp39-cp39-win_amd64.whl", hash = "sha256:a38607444281619eda3a9c1be9f5397634012d1b237142eee1540e810b30ac8b"},
- {file = "fastavro-1.12.1.tar.gz", hash = "sha256:2f285be49e45bc047ab2f6bed040bb349da85db3f3c87880e4b92595ea093b2b"},
-]
-markers = {main = "extra == \"avro\" or extra == \"formats\" or extra == \"all\""}
-
-[package.extras]
-codecs = ["cramjam", "lz4", "zstandard"]
-lz4 = ["lz4"]
-snappy = ["cramjam"]
-zstandard = ["zstandard"]
-
[[package]]
name = "filelock"
version = "3.20.3"
@@ -1039,7 +717,7 @@ files = [
{file = "filelock-3.20.3-py3-none-any.whl", hash = "sha256:4b0dda527ee31078689fc205ec4f1c1bf7d56cf88b6dc9426c4f230e46c2dce1"},
{file = "filelock-3.20.3.tar.gz", hash = "sha256:18c57ee915c7ec61cff0ecf7f0f869936c7c30191bb0cf406f1341778d0834e1"},
]
-markers = {main = "python_version <= \"3.13\" and (extra == \"snowflake\" or extra == \"warehouses\" or extra == \"all\")"}
+markers = {main = "(extra == \"snowflake\" or extra == \"warehouses\" or extra == \"all\") and python_version <= \"3.13\""}
[[package]]
name = "google-api-core"
@@ -1052,7 +730,7 @@ files = [
{file = "google_api_core-2.29.0-py3-none-any.whl", hash = "sha256:d30bc60980daa36e314b5d5a3e5958b0200cb44ca8fa1be2b614e932b75a3ea9"},
{file = "google_api_core-2.29.0.tar.gz", hash = "sha256:84181be0f8e6b04006df75ddfe728f24489f0af57c96a529ff7cf45bc28797f7"},
]
-markers = {main = "extra == \"gcs\" or extra == \"cloud\" or extra == \"all\" or extra == \"bigquery\" or extra == \"warehouses\""}
+markers = {main = "extra == \"bigquery\" or extra == \"warehouses\" or extra == \"all\""}
[package.dependencies]
google-auth = ">=2.14.1,<3.0.0"
@@ -1091,7 +769,7 @@ files = [
{file = "google_auth-2.47.0-py3-none-any.whl", hash = "sha256:c516d68336bfde7cf0da26aab674a36fedcf04b37ac4edd59c597178760c3498"},
{file = "google_auth-2.47.0.tar.gz", hash = "sha256:833229070a9dfee1a353ae9877dcd2dec069a8281a4e72e72f77d4a70ff945da"},
]
-markers = {main = "extra == \"gcs\" or extra == \"cloud\" or extra == \"all\" or extra == \"bigquery\" or extra == \"warehouses\""}
+markers = {main = "extra == \"bigquery\" or extra == \"warehouses\" or extra == \"all\""}
[package.dependencies]
pyasn1-modules = ">=0.2.1"
@@ -1153,7 +831,7 @@ files = [
{file = "google_cloud_core-2.5.0-py3-none-any.whl", hash = "sha256:67d977b41ae6c7211ee830c7912e41003ea8194bff15ae7d72fd6f51e57acabc"},
{file = "google_cloud_core-2.5.0.tar.gz", hash = "sha256:7c1b7ef5c92311717bd05301aa1a91ffbc565673d3b0b4163a52d8413a186963"},
]
-markers = {main = "extra == \"gcs\" or extra == \"cloud\" or extra == \"all\" or extra == \"bigquery\" or extra == \"warehouses\""}
+markers = {main = "extra == \"bigquery\" or extra == \"warehouses\" or extra == \"all\""}
[package.dependencies]
google-api-core = ">=1.31.6,<2.0.dev0 || >2.3.0,<3.0.0"
@@ -1162,31 +840,6 @@ google-auth = ">=1.25.0,<3.0.0"
[package.extras]
grpc = ["grpcio (>=1.38.0,<2.0.0) ; python_version < \"3.14\"", "grpcio (>=1.75.1,<2.0.0) ; python_version >= \"3.14\"", "grpcio-status (>=1.38.0,<2.0.0)"]
-[[package]]
-name = "google-cloud-storage"
-version = "2.19.0"
-description = "Google Cloud Storage API client library"
-optional = false
-python-versions = ">=3.7"
-groups = ["main", "dev"]
-files = [
- {file = "google_cloud_storage-2.19.0-py2.py3-none-any.whl", hash = "sha256:aeb971b5c29cf8ab98445082cbfe7b161a1f48ed275822f59ed3f1524ea54fba"},
- {file = "google_cloud_storage-2.19.0.tar.gz", hash = "sha256:cd05e9e7191ba6cb68934d8eb76054d9be4562aa89dbc4236feee4d7d51342b2"},
-]
-markers = {main = "extra == \"gcs\" or extra == \"cloud\" or extra == \"all\""}
-
-[package.dependencies]
-google-api-core = ">=2.15.0,<3.0.0.dev0"
-google-auth = ">=2.26.1,<3.0.dev0"
-google-cloud-core = ">=2.3.0,<3.0.dev0"
-google-crc32c = ">=1.0,<2.0.dev0"
-google-resumable-media = ">=2.7.2"
-requests = ">=2.18.0,<3.0.0.dev0"
-
-[package.extras]
-protobuf = ["protobuf (<6.0.0.dev0)"]
-tracing = ["opentelemetry-api (>=1.1.0)"]
-
[[package]]
name = "google-crc32c"
version = "1.8.0"
@@ -1229,7 +882,7 @@ files = [
{file = "google_crc32c-1.8.0-pp311-pypy311_pp73-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:f639065ea2042d5c034bf258a9f085eaa7af0cd250667c0635a3118e8f92c69c"},
{file = "google_crc32c-1.8.0.tar.gz", hash = "sha256:a428e25fb7691024de47fecfbff7ff957214da51eddded0da0ae0e0f03a2cf79"},
]
-markers = {main = "extra == \"gcs\" or extra == \"cloud\" or extra == \"all\" or extra == \"bigquery\" or extra == \"warehouses\""}
+markers = {main = "extra == \"bigquery\" or extra == \"warehouses\" or extra == \"all\""}
[[package]]
name = "google-resumable-media"
@@ -1242,7 +895,7 @@ files = [
{file = "google_resumable_media-2.8.0-py3-none-any.whl", hash = "sha256:dd14a116af303845a8d932ddae161a26e86cc229645bc98b39f026f9b1717582"},
{file = "google_resumable_media-2.8.0.tar.gz", hash = "sha256:f1157ed8b46994d60a1bc432544db62352043113684d4e030ee02e77ebe9a1ae"},
]
-markers = {main = "extra == \"gcs\" or extra == \"cloud\" or extra == \"all\" or extra == \"bigquery\" or extra == \"warehouses\""}
+markers = {main = "extra == \"bigquery\" or extra == \"warehouses\" or extra == \"all\""}
[package.dependencies]
google-crc32c = ">=1.0.0,<2.0.0"
@@ -1262,7 +915,7 @@ files = [
{file = "googleapis_common_protos-1.72.0-py3-none-any.whl", hash = "sha256:4299c5a82d5ae1a9702ada957347726b167f9f8d1fc352477702a1e851ff4038"},
{file = "googleapis_common_protos-1.72.0.tar.gz", hash = "sha256:e55a601c1b32b52d7a3e65f43563e2aa61bcd737998ee672ac9b951cd49319f5"},
]
-markers = {main = "extra == \"bigquery\" or extra == \"warehouses\" or extra == \"all\" or extra == \"gcs\" or extra == \"cloud\""}
+markers = {main = "extra == \"bigquery\" or extra == \"warehouses\" or extra == \"all\""}
[package.dependencies]
protobuf = ">=3.20.2,<4.21.1 || >4.21.1,<4.21.2 || >4.21.2,<4.21.3 || >4.21.3,<4.21.4 || >4.21.4,<4.21.5 || >4.21.5,<7.0.0"
@@ -1455,6 +1108,7 @@ files = [
{file = "idna-3.11-py3-none-any.whl", hash = "sha256:771a87f49d9defaf64091e6e6fe9c18d4833f140bd19464795bc32d966ca37ea"},
{file = "idna-3.11.tar.gz", hash = "sha256:795dafcc9c04ed0c1fb032c2aa73654d8e8c5023a7df64a53f39190ada629902"},
]
+markers = {main = "(extra == \"snowflake\" or extra == \"warehouses\" or extra == \"all\" or extra == \"bigquery\") and (python_version <= \"3.13\" or extra == \"bigquery\" or extra == \"warehouses\" or extra == \"all\")"}
[package.extras]
all = ["flake8 (>=7.1.1)", "mypy (>=1.11.2)", "pytest (>=8.3.2)", "ruff (>=0.6.2)"]
@@ -1471,19 +1125,6 @@ files = [
{file = "iniconfig-2.3.0.tar.gz", hash = "sha256:c76315c77db068650d49c5b56314774a7804df16fee4402c1f19d6d15d8c4730"},
]
-[[package]]
-name = "isodate"
-version = "0.7.2"
-description = "An ISO 8601 date/time/duration parser and formatter"
-optional = false
-python-versions = ">=3.7"
-groups = ["main", "dev"]
-files = [
- {file = "isodate-0.7.2-py3-none-any.whl", hash = "sha256:28009937d8031054830160fce6d409ed342816b543597cece116d966c6d99e15"},
- {file = "isodate-0.7.2.tar.gz", hash = "sha256:4cd1aa0f43ca76f4a6c6c0292a85f40b35ec2e43e315b59f06e6d32171a953e6"},
-]
-markers = {main = "extra == \"azure\" or extra == \"cloud\" or extra == \"all\""}
-
[[package]]
name = "jinja2"
version = "3.1.6"
@@ -1513,7 +1154,7 @@ files = [
{file = "jmespath-1.0.1-py3-none-any.whl", hash = "sha256:02e2e4cc71b5bcab88332eebf907519190dd9e6e82107fa7f83b1003a6252980"},
{file = "jmespath-1.0.1.tar.gz", hash = "sha256:90261b206d6defd58fdd5e85f478bf633a2901798906be2ad389150c5c60edbe"},
]
-markers = {main = "(python_version <= \"3.13\" or extra == \"s3\" or extra == \"cloud\" or extra == \"redshift\" or extra == \"warehouses\" or extra == \"all\") and (extra == \"s3\" or extra == \"cloud\" or extra == \"redshift\" or extra == \"warehouses\" or extra == \"all\" or extra == \"snowflake\")"}
+markers = {main = "(extra == \"s3\" or extra == \"cloud\" or extra == \"redshift\" or extra == \"warehouses\" or extra == \"all\" or extra == \"snowflake\") and (python_version <= \"3.13\" or extra == \"s3\" or extra == \"cloud\" or extra == \"redshift\" or extra == \"warehouses\" or extra == \"all\")"}
[[package]]
name = "jsonschema"
@@ -2114,7 +1755,7 @@ files = [
{file = "packaging-25.0-py3-none-any.whl", hash = "sha256:29572ef2b1f17581046b3a2227d5c611fb25ec70ca1ba8554b24b0e69331a484"},
{file = "packaging-25.0.tar.gz", hash = "sha256:d443872c98d677bf60f6a1f2f8c1cb748e8fe762d2bf9d3148b5599295b0fc4f"},
]
-markers = {main = "(python_version <= \"3.13\" or extra == \"bigquery\" or extra == \"warehouses\" or extra == \"all\") and (extra == \"snowflake\" or extra == \"warehouses\" or extra == \"all\" or extra == \"bigquery\")"}
+markers = {main = "(extra == \"snowflake\" or extra == \"warehouses\" or extra == \"all\" or extra == \"bigquery\") and (python_version <= \"3.13\" or extra == \"bigquery\" or extra == \"warehouses\" or extra == \"all\")"}
[[package]]
name = "pandas"
@@ -2250,18 +1891,6 @@ optional = ["typing-extensions (>=4)"]
re2 = ["google-re2 (>=1.1)"]
tests = ["pytest (>=9)", "typing-extensions (>=4.15)"]
-[[package]]
-name = "phonenumbers"
-version = "9.0.22"
-description = "Python version of Google's common library for parsing, formatting, storing and validating international phone numbers."
-optional = false
-python-versions = ">=2.5"
-groups = ["main"]
-files = [
- {file = "phonenumbers-9.0.22-py2.py3-none-any.whl", hash = "sha256:645e66cd9a136b3b257b5f941fa97d324124114d31ad3c9f2488682f47ad7ee1"},
- {file = "phonenumbers-9.0.22.tar.gz", hash = "sha256:eff985c65575749d1d54e07c56c3517d5243e03e08e4a6191761df9aab2278f2"},
-]
-
[[package]]
name = "platformdirs"
version = "4.5.1"
@@ -2273,7 +1902,7 @@ files = [
{file = "platformdirs-4.5.1-py3-none-any.whl", hash = "sha256:d03afa3963c806a9bed9d5125c8f4cb2fdaf74a55ab60e5d59b3fde758104d31"},
{file = "platformdirs-4.5.1.tar.gz", hash = "sha256:61d5cdcc6065745cdd94f0f878977f8de9437be93de97c1c12f853c9c0cdcbda"},
]
-markers = {main = "python_version <= \"3.13\" and (extra == \"snowflake\" or extra == \"warehouses\" or extra == \"all\")"}
+markers = {main = "(extra == \"snowflake\" or extra == \"warehouses\" or extra == \"all\") and python_version <= \"3.13\""}
[package.extras]
docs = ["furo (>=2025.9.25)", "proselint (>=0.14)", "sphinx (>=8.2.3)", "sphinx-autodoc-typehints (>=3.2)"]
@@ -2326,7 +1955,7 @@ files = [
{file = "proto_plus-1.27.0-py3-none-any.whl", hash = "sha256:1baa7f81cf0f8acb8bc1f6d085008ba4171eaf669629d1b6d1673b21ed1c0a82"},
{file = "proto_plus-1.27.0.tar.gz", hash = "sha256:873af56dd0d7e91836aee871e5799e1c6f1bda86ac9a983e0bb9f0c266a568c4"},
]
-markers = {main = "extra == \"gcs\" or extra == \"cloud\" or extra == \"all\" or extra == \"bigquery\" or extra == \"warehouses\""}
+markers = {main = "extra == \"bigquery\" or extra == \"warehouses\" or extra == \"all\""}
[package.dependencies]
protobuf = ">=3.19.0,<7.0.0"
@@ -2353,7 +1982,7 @@ files = [
{file = "protobuf-6.33.2-py3-none-any.whl", hash = "sha256:7636aad9bb01768870266de5dc009de2d1b936771b38a793f73cbbf279c91c5c"},
{file = "protobuf-6.33.2.tar.gz", hash = "sha256:56dc370c91fbb8ac85bc13582c9e373569668a290aa2e66a590c2a0d35ddb9e4"},
]
-markers = {main = "extra == \"bigquery\" or extra == \"warehouses\" or extra == \"all\" or extra == \"gcs\" or extra == \"cloud\""}
+markers = {main = "extra == \"bigquery\" or extra == \"warehouses\" or extra == \"all\""}
[[package]]
name = "psycopg2-binary"
@@ -2504,7 +2133,7 @@ files = [
{file = "pyasn1-0.6.2-py3-none-any.whl", hash = "sha256:1eb26d860996a18e9b6ed05e7aae0e9fc21619fcee6af91cca9bad4fbea224bf"},
{file = "pyasn1-0.6.2.tar.gz", hash = "sha256:9b59a2b25ba7e4f8197db7686c09fb33e658b98339fadb826e9512629017833b"},
]
-markers = {main = "extra == \"gcs\" or extra == \"cloud\" or extra == \"all\" or extra == \"bigquery\" or extra == \"warehouses\""}
+markers = {main = "extra == \"bigquery\" or extra == \"warehouses\" or extra == \"all\""}
[[package]]
name = "pyasn1-modules"
@@ -2517,7 +2146,7 @@ files = [
{file = "pyasn1_modules-0.4.2-py3-none-any.whl", hash = "sha256:29253a9207ce32b64c3ac6600edc75368f98473906e8fd1043bd6b5b1de2c14a"},
{file = "pyasn1_modules-0.4.2.tar.gz", hash = "sha256:677091de870a80aae844b1ca6134f54652fa2c8c5a52aa396440ac3106e941e6"},
]
-markers = {main = "extra == \"gcs\" or extra == \"cloud\" or extra == \"all\" or extra == \"bigquery\" or extra == \"warehouses\""}
+markers = {main = "extra == \"bigquery\" or extra == \"warehouses\" or extra == \"all\""}
[package.dependencies]
pyasn1 = ">=0.6.1,<0.7.0"
@@ -2533,7 +2162,7 @@ files = [
{file = "pycparser-2.23-py3-none-any.whl", hash = "sha256:e5c6e8d3fbad53479cab09ac03729e0a9faf2bee3db8208a550daf5af81a5934"},
{file = "pycparser-2.23.tar.gz", hash = "sha256:78816d4f24add8f10a06d6f05b4d424ad9e96cfebf68a4ddc99c65c0720d00c2"},
]
-markers = {main = "(python_version <= \"3.13\" or platform_python_implementation != \"PyPy\") and (python_version <= \"3.13\" or extra == \"azure\" or extra == \"cloud\" or extra == \"all\") and (python_version <= \"3.13\" or implementation_name != \"PyPy\") and (platform_python_implementation != \"PyPy\" or extra == \"snowflake\" or extra == \"warehouses\" or extra == \"all\") and (extra == \"azure\" or extra == \"cloud\" or extra == \"all\" or extra == \"snowflake\" or extra == \"warehouses\")", dev = "(python_version <= \"3.13\" or platform_python_implementation != \"PyPy\") and (python_version <= \"3.13\" or implementation_name != \"PyPy\")"}
+markers = {main = "(extra == \"snowflake\" or extra == \"warehouses\" or extra == \"all\") and python_version <= \"3.13\"", dev = "python_version <= \"3.13\" or platform_python_implementation != \"PyPy\" and implementation_name != \"PyPy\""}
[[package]]
name = "pygments"
@@ -2561,7 +2190,7 @@ files = [
{file = "PyJWT-2.10.1-py3-none-any.whl", hash = "sha256:dcdd193e30abefd5debf142f9adfcdd2b58004e644f25406ffaebd50bd98dacb"},
{file = "pyjwt-2.10.1.tar.gz", hash = "sha256:3cc5772eb20009233caf06e9d8a0577824723b44e6648ee0a2aedb6cf9381953"},
]
-markers = {main = "python_version <= \"3.13\" and (extra == \"snowflake\" or extra == \"warehouses\" or extra == \"all\")", dev = "python_version <= \"3.13\""}
+markers = {main = "(extra == \"snowflake\" or extra == \"warehouses\" or extra == \"all\") and python_version <= \"3.13\"", dev = "python_version <= \"3.13\""}
[package.extras]
crypto = ["cryptography (>=3.4.0)"]
@@ -2655,7 +2284,7 @@ files = [
{file = "pyopenssl-25.3.0-py3-none-any.whl", hash = "sha256:1fda6fc034d5e3d179d39e59c1895c9faeaf40a79de5fc4cbbfbe0d36f4a77b6"},
{file = "pyopenssl-25.3.0.tar.gz", hash = "sha256:c981cb0a3fd84e8602d7afc209522773b94c1c2446a3c710a75b06fe1beae329"},
]
-markers = {main = "python_version <= \"3.13\" and (extra == \"snowflake\" or extra == \"warehouses\" or extra == \"all\")", dev = "python_version <= \"3.13\""}
+markers = {main = "(extra == \"snowflake\" or extra == \"warehouses\" or extra == \"all\") and python_version <= \"3.13\"", dev = "python_version <= \"3.13\""}
[package.dependencies]
cryptography = ">=45.0.7,<47"
@@ -2882,7 +2511,7 @@ files = [
{file = "requests-2.32.5-py3-none-any.whl", hash = "sha256:2462f94637a34fd532264295e186976db0f5d453d1cdd31473c85a6a161affb6"},
{file = "requests-2.32.5.tar.gz", hash = "sha256:dbba0bac56e100853db0ea71b82b4dfd5fe2bf6d3754a8893c3af500cec7d7cf"},
]
-markers = {main = "(python_version <= \"3.13\" or extra == \"gcs\" or extra == \"cloud\" or extra == \"all\" or extra == \"bigquery\" or extra == \"warehouses\" or extra == \"azure\") and (extra == \"gcs\" or extra == \"cloud\" or extra == \"all\" or extra == \"bigquery\" or extra == \"warehouses\" or extra == \"azure\" or extra == \"snowflake\")"}
+markers = {main = "(extra == \"snowflake\" or extra == \"warehouses\" or extra == \"all\" or extra == \"bigquery\") and (python_version <= \"3.13\" or extra == \"bigquery\" or extra == \"warehouses\" or extra == \"all\")"}
[package.dependencies]
certifi = ">=2017.4.17"
@@ -3070,7 +2699,7 @@ files = [
{file = "rsa-4.9.1-py3-none-any.whl", hash = "sha256:68635866661c6836b8d39430f97a996acbd61bfa49406748ea243539fe239762"},
{file = "rsa-4.9.1.tar.gz", hash = "sha256:e7bdbfdb5497da4c07dfd35530e1a902659db6ff241e39d9953cad06ebd0ae75"},
]
-markers = {main = "extra == \"gcs\" or extra == \"cloud\" or extra == \"all\" or extra == \"bigquery\" or extra == \"warehouses\""}
+markers = {main = "extra == \"bigquery\" or extra == \"warehouses\" or extra == \"all\""}
[package.dependencies]
pyasn1 = ">=0.1.3"
@@ -3115,7 +2744,7 @@ files = [
{file = "s3transfer-0.16.0-py3-none-any.whl", hash = "sha256:18e25d66fed509e3868dc1572b3f427ff947dd2c56f844a5bf09481ad3f3b2fe"},
{file = "s3transfer-0.16.0.tar.gz", hash = "sha256:8e990f13268025792229cd52fa10cb7163744bf56e719e0b9cb925ab79abf920"},
]
-markers = {main = "(python_version <= \"3.13\" or extra == \"s3\" or extra == \"cloud\" or extra == \"redshift\" or extra == \"warehouses\" or extra == \"all\") and (extra == \"s3\" or extra == \"cloud\" or extra == \"redshift\" or extra == \"warehouses\" or extra == \"all\" or extra == \"snowflake\")"}
+markers = {main = "(extra == \"s3\" or extra == \"cloud\" or extra == \"redshift\" or extra == \"warehouses\" or extra == \"all\" or extra == \"snowflake\") and (python_version <= \"3.13\" or extra == \"s3\" or extra == \"cloud\" or extra == \"redshift\" or extra == \"warehouses\" or extra == \"all\")"}
[package.dependencies]
botocore = ">=1.37.4,<2.0a0"
@@ -3130,7 +2759,7 @@ description = "Fundamental algorithms for scientific computing in Python"
optional = true
python-versions = ">=3.11"
groups = ["main"]
-markers = "(extra == \"statistical\" or extra == \"all\") and python_version >= \"3.11\""
+markers = "python_version >= \"3.11\" and (extra == \"statistical\" or extra == \"all\")"
files = [
{file = "scipy-1.17.0-cp311-cp311-macosx_10_14_x86_64.whl", hash = "sha256:2abd71643797bd8a106dff97894ff7869eeeb0af0f7a5ce02e4227c6a2e9d6fd"},
{file = "scipy-1.17.0-cp311-cp311-macosx_12_0_arm64.whl", hash = "sha256:ef28d815f4d2686503e5f4f00edc387ae58dfd7a2f42e348bb53359538f01558"},
@@ -3262,7 +2891,7 @@ files = [
{file = "snowflake_connector_python-3.18.0-cp39-cp39-win_amd64.whl", hash = "sha256:a8c570edff5a4888840dbe1e9e65c5e4d77d55c5c800cd359fe0903a769201e0"},
{file = "snowflake_connector_python-3.18.0.tar.gz", hash = "sha256:41a46eb9824574c5f8068e3ed5c02a2dc0a733ed08ee81fa1fb3dd0ebe921728"},
]
-markers = {main = "python_version <= \"3.13\" and (extra == \"snowflake\" or extra == \"warehouses\" or extra == \"all\")", dev = "python_version <= \"3.13\""}
+markers = {main = "(extra == \"snowflake\" or extra == \"warehouses\" or extra == \"all\") and python_version <= \"3.13\"", dev = "python_version <= \"3.13\""}
[package.dependencies]
asn1crypto = ">0.24.0,<2.0.0"
@@ -3300,7 +2929,7 @@ files = [
{file = "sortedcontainers-2.4.0-py2.py3-none-any.whl", hash = "sha256:a163dcaede0f1c021485e957a39245190e74249897e2ae4b2aa38595db237ee0"},
{file = "sortedcontainers-2.4.0.tar.gz", hash = "sha256:25caa5a06cc30b6b83d11423433f65d1f9d76c4c6a0c90e3379eaa43b9bfdb88"},
]
-markers = {main = "python_version <= \"3.13\" and (extra == \"snowflake\" or extra == \"warehouses\" or extra == \"all\")", dev = "python_version <= \"3.13\""}
+markers = {main = "(extra == \"snowflake\" or extra == \"warehouses\" or extra == \"all\") and python_version <= \"3.13\"", dev = "python_version <= \"3.13\""}
[[package]]
name = "sqlalchemy"
@@ -3458,7 +3087,7 @@ files = [
{file = "tomlkit-0.14.0-py3-none-any.whl", hash = "sha256:592064ed85b40fa213469f81ac584f67a4f2992509a7c3ea2d632208623a3680"},
{file = "tomlkit-0.14.0.tar.gz", hash = "sha256:cf00efca415dbd57575befb1f6634c4f42d2d87dbba376128adb42c121b87064"},
]
-markers = {main = "python_version <= \"3.13\" and (extra == \"snowflake\" or extra == \"warehouses\" or extra == \"all\")", dev = "python_version <= \"3.13\""}
+markers = {main = "(extra == \"snowflake\" or extra == \"warehouses\" or extra == \"all\") and python_version <= \"3.13\"", dev = "python_version <= \"3.13\""}
[[package]]
name = "typer"
@@ -3537,7 +3166,7 @@ files = [
{file = "urllib3-2.6.3-py3-none-any.whl", hash = "sha256:bf272323e553dfb2e87d9bfd225ca7b0f467b919d7bbd355436d3fd37cb0acd4"},
{file = "urllib3-2.6.3.tar.gz", hash = "sha256:1b62b6884944a57dbe321509ab94fd4d3b307075e0c2eae991ac71ee15ad38ed"},
]
-markers = {main = "(python_version <= \"3.13\" or extra == \"s3\" or extra == \"cloud\" or extra == \"redshift\" or extra == \"warehouses\" or extra == \"all\" or extra == \"gcs\" or extra == \"bigquery\" or extra == \"azure\") and (extra == \"s3\" or extra == \"cloud\" or extra == \"redshift\" or extra == \"warehouses\" or extra == \"all\" or extra == \"snowflake\" or extra == \"gcs\" or extra == \"bigquery\" or extra == \"azure\")"}
+markers = {main = "(extra == \"s3\" or extra == \"cloud\" or extra == \"redshift\" or extra == \"warehouses\" or extra == \"all\" or extra == \"snowflake\" or extra == \"bigquery\") and (python_version <= \"3.13\" or extra == \"s3\" or extra == \"cloud\" or extra == \"redshift\" or extra == \"warehouses\" or extra == \"all\" or extra == \"bigquery\")"}
[package.extras]
brotli = ["brotli (>=1.2.0) ; platform_python_implementation == \"CPython\"", "brotlicffi (>=1.2.0.0) ; platform_python_implementation != \"CPython\""]
@@ -3585,127 +3214,6 @@ markupsafe = ">=2.1.1"
[package.extras]
watchdog = ["watchdog (>=2.3)"]
-[[package]]
-name = "wrapt"
-version = "2.0.1"
-description = "Module for decorators, wrappers and monkey patching."
-optional = false
-python-versions = ">=3.8"
-groups = ["main", "dev"]
-files = [
- {file = "wrapt-2.0.1-cp310-cp310-macosx_10_9_universal2.whl", hash = "sha256:64b103acdaa53b7caf409e8d45d39a8442fe6dcfec6ba3f3d141e0cc2b5b4dbd"},
- {file = "wrapt-2.0.1-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:91bcc576260a274b169c3098e9a3519fb01f2989f6d3d386ef9cbf8653de1374"},
- {file = "wrapt-2.0.1-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:ab594f346517010050126fcd822697b25a7031d815bb4fbc238ccbe568216489"},
- {file = "wrapt-2.0.1-cp310-cp310-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl", hash = "sha256:36982b26f190f4d737f04a492a68accbfc6fa042c3f42326fdfbb6c5b7a20a31"},
- {file = "wrapt-2.0.1-cp310-cp310-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:23097ed8bc4c93b7bf36fa2113c6c733c976316ce0ee2c816f64ca06102034ef"},
- {file = "wrapt-2.0.1-cp310-cp310-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:8bacfe6e001749a3b64db47bcf0341da757c95959f592823a93931a422395013"},
- {file = "wrapt-2.0.1-cp310-cp310-musllinux_1_2_aarch64.whl", hash = "sha256:8ec3303e8a81932171f455f792f8df500fc1a09f20069e5c16bd7049ab4e8e38"},
- {file = "wrapt-2.0.1-cp310-cp310-musllinux_1_2_riscv64.whl", hash = "sha256:3f373a4ab5dbc528a94334f9fe444395b23c2f5332adab9ff4ea82f5a9e33bc1"},
- {file = "wrapt-2.0.1-cp310-cp310-musllinux_1_2_x86_64.whl", hash = "sha256:f49027b0b9503bf6c8cdc297ca55006b80c2f5dd36cecc72c6835ab6e10e8a25"},
- {file = "wrapt-2.0.1-cp310-cp310-win32.whl", hash = "sha256:8330b42d769965e96e01fa14034b28a2a7600fbf7e8f0cc90ebb36d492c993e4"},
- {file = "wrapt-2.0.1-cp310-cp310-win_amd64.whl", hash = "sha256:1218573502a8235bb8a7ecaed12736213b22dcde9feab115fa2989d42b5ded45"},
- {file = "wrapt-2.0.1-cp310-cp310-win_arm64.whl", hash = "sha256:eda8e4ecd662d48c28bb86be9e837c13e45c58b8300e43ba3c9b4fa9900302f7"},
- {file = "wrapt-2.0.1-cp311-cp311-macosx_10_9_universal2.whl", hash = "sha256:0e17283f533a0d24d6e5429a7d11f250a58d28b4ae5186f8f47853e3e70d2590"},
- {file = "wrapt-2.0.1-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:85df8d92158cb8f3965aecc27cf821461bb5f40b450b03facc5d9f0d4d6ddec6"},
- {file = "wrapt-2.0.1-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:c1be685ac7700c966b8610ccc63c3187a72e33cab53526a27b2a285a662cd4f7"},
- {file = "wrapt-2.0.1-cp311-cp311-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl", hash = "sha256:df0b6d3b95932809c5b3fecc18fda0f1e07452d05e2662a0b35548985f256e28"},
- {file = "wrapt-2.0.1-cp311-cp311-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:4da7384b0e5d4cae05c97cd6f94faaf78cc8b0f791fc63af43436d98c4ab37bb"},
- {file = "wrapt-2.0.1-cp311-cp311-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:ec65a78fbd9d6f083a15d7613b2800d5663dbb6bb96003899c834beaa68b242c"},
- {file = "wrapt-2.0.1-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:7de3cc939be0e1174969f943f3b44e0d79b6f9a82198133a5b7fc6cc92882f16"},
- {file = "wrapt-2.0.1-cp311-cp311-musllinux_1_2_riscv64.whl", hash = "sha256:fb1a5b72cbd751813adc02ef01ada0b0d05d3dcbc32976ce189a1279d80ad4a2"},
- {file = "wrapt-2.0.1-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:3fa272ca34332581e00bf7773e993d4f632594eb2d1b0b162a9038df0fd971dd"},
- {file = "wrapt-2.0.1-cp311-cp311-win32.whl", hash = "sha256:fc007fdf480c77301ab1afdbb6ab22a5deee8885f3b1ed7afcb7e5e84a0e27be"},
- {file = "wrapt-2.0.1-cp311-cp311-win_amd64.whl", hash = "sha256:47434236c396d04875180171ee1f3815ca1eada05e24a1ee99546320d54d1d1b"},
- {file = "wrapt-2.0.1-cp311-cp311-win_arm64.whl", hash = "sha256:837e31620e06b16030b1d126ed78e9383815cbac914693f54926d816d35d8edf"},
- {file = "wrapt-2.0.1-cp312-cp312-macosx_10_13_universal2.whl", hash = "sha256:1fdbb34da15450f2b1d735a0e969c24bdb8d8924892380126e2a293d9902078c"},
- {file = "wrapt-2.0.1-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:3d32794fe940b7000f0519904e247f902f0149edbe6316c710a8562fb6738841"},
- {file = "wrapt-2.0.1-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:386fb54d9cd903ee0012c09291336469eb7b244f7183d40dc3e86a16a4bace62"},
- {file = "wrapt-2.0.1-cp312-cp312-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl", hash = "sha256:7b219cb2182f230676308cdcacd428fa837987b89e4b7c5c9025088b8a6c9faf"},
- {file = "wrapt-2.0.1-cp312-cp312-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:641e94e789b5f6b4822bb8d8ebbdfc10f4e4eae7756d648b717d980f657a9eb9"},
- {file = "wrapt-2.0.1-cp312-cp312-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:fe21b118b9f58859b5ebaa4b130dee18669df4bd111daad082b7beb8799ad16b"},
- {file = "wrapt-2.0.1-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:17fb85fa4abc26a5184d93b3efd2dcc14deb4b09edcdb3535a536ad34f0b4dba"},
- {file = "wrapt-2.0.1-cp312-cp312-musllinux_1_2_riscv64.whl", hash = "sha256:b89ef9223d665ab255ae42cc282d27d69704d94be0deffc8b9d919179a609684"},
- {file = "wrapt-2.0.1-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:a453257f19c31b31ba593c30d997d6e5be39e3b5ad9148c2af5a7314061c63eb"},
- {file = "wrapt-2.0.1-cp312-cp312-win32.whl", hash = "sha256:3e271346f01e9c8b1130a6a3b0e11908049fe5be2d365a5f402778049147e7e9"},
- {file = "wrapt-2.0.1-cp312-cp312-win_amd64.whl", hash = "sha256:2da620b31a90cdefa9cd0c2b661882329e2e19d1d7b9b920189956b76c564d75"},
- {file = "wrapt-2.0.1-cp312-cp312-win_arm64.whl", hash = "sha256:aea9c7224c302bc8bfc892b908537f56c430802560e827b75ecbde81b604598b"},
- {file = "wrapt-2.0.1-cp313-cp313-macosx_10_13_universal2.whl", hash = "sha256:47b0f8bafe90f7736151f61482c583c86b0693d80f075a58701dd1549b0010a9"},
- {file = "wrapt-2.0.1-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:cbeb0971e13b4bd81d34169ed57a6dda017328d1a22b62fda45e1d21dd06148f"},
- {file = "wrapt-2.0.1-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:eb7cffe572ad0a141a7886a1d2efa5bef0bf7fe021deeea76b3ab334d2c38218"},
- {file = "wrapt-2.0.1-cp313-cp313-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl", hash = "sha256:c8d60527d1ecfc131426b10d93ab5d53e08a09c5fa0175f6b21b3252080c70a9"},
- {file = "wrapt-2.0.1-cp313-cp313-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:c654eafb01afac55246053d67a4b9a984a3567c3808bb7df2f8de1c1caba2e1c"},
- {file = "wrapt-2.0.1-cp313-cp313-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:98d873ed6c8b4ee2418f7afce666751854d6d03e3c0ec2a399bb039cd2ae89db"},
- {file = "wrapt-2.0.1-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:c9e850f5b7fc67af856ff054c71690d54fa940c3ef74209ad9f935b4f66a0233"},
- {file = "wrapt-2.0.1-cp313-cp313-musllinux_1_2_riscv64.whl", hash = "sha256:e505629359cb5f751e16e30cf3f91a1d3ddb4552480c205947da415d597f7ac2"},
- {file = "wrapt-2.0.1-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:2879af909312d0baf35f08edeea918ee3af7ab57c37fe47cb6a373c9f2749c7b"},
- {file = "wrapt-2.0.1-cp313-cp313-win32.whl", hash = "sha256:d67956c676be5a24102c7407a71f4126d30de2a569a1c7871c9f3cabc94225d7"},
- {file = "wrapt-2.0.1-cp313-cp313-win_amd64.whl", hash = "sha256:9ca66b38dd642bf90c59b6738af8070747b610115a39af2498535f62b5cdc1c3"},
- {file = "wrapt-2.0.1-cp313-cp313-win_arm64.whl", hash = "sha256:5a4939eae35db6b6cec8e7aa0e833dcca0acad8231672c26c2a9ab7a0f8ac9c8"},
- {file = "wrapt-2.0.1-cp313-cp313t-macosx_10_13_universal2.whl", hash = "sha256:a52f93d95c8d38fed0669da2ebdb0b0376e895d84596a976c15a9eb45e3eccb3"},
- {file = "wrapt-2.0.1-cp313-cp313t-macosx_10_13_x86_64.whl", hash = "sha256:4e54bbf554ee29fcceee24fa41c4d091398b911da6e7f5d7bffda963c9aed2e1"},
- {file = "wrapt-2.0.1-cp313-cp313t-macosx_11_0_arm64.whl", hash = "sha256:908f8c6c71557f4deaa280f55d0728c3bca0960e8c3dd5ceeeafb3c19942719d"},
- {file = "wrapt-2.0.1-cp313-cp313t-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl", hash = "sha256:e2f84e9af2060e3904a32cea9bb6db23ce3f91cfd90c6b426757cf7cc01c45c7"},
- {file = "wrapt-2.0.1-cp313-cp313t-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:e3612dc06b436968dfb9142c62e5dfa9eb5924f91120b3c8ff501ad878f90eb3"},
- {file = "wrapt-2.0.1-cp313-cp313t-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:6d2d947d266d99a1477cd005b23cbd09465276e302515e122df56bb9511aca1b"},
- {file = "wrapt-2.0.1-cp313-cp313t-musllinux_1_2_aarch64.whl", hash = "sha256:7d539241e87b650cbc4c3ac9f32c8d1ac8a54e510f6dca3f6ab60dcfd48c9b10"},
- {file = "wrapt-2.0.1-cp313-cp313t-musllinux_1_2_riscv64.whl", hash = "sha256:4811e15d88ee62dbf5c77f2c3ff3932b1e3ac92323ba3912f51fc4016ce81ecf"},
- {file = "wrapt-2.0.1-cp313-cp313t-musllinux_1_2_x86_64.whl", hash = "sha256:c1c91405fcf1d501fa5d55df21e58ea49e6b879ae829f1039faaf7e5e509b41e"},
- {file = "wrapt-2.0.1-cp313-cp313t-win32.whl", hash = "sha256:e76e3f91f864e89db8b8d2a8311d57df93f01ad6bb1e9b9976d1f2e83e18315c"},
- {file = "wrapt-2.0.1-cp313-cp313t-win_amd64.whl", hash = "sha256:83ce30937f0ba0d28818807b303a412440c4b63e39d3d8fc036a94764b728c92"},
- {file = "wrapt-2.0.1-cp313-cp313t-win_arm64.whl", hash = "sha256:4b55cacc57e1dc2d0991dbe74c6419ffd415fb66474a02335cb10efd1aa3f84f"},
- {file = "wrapt-2.0.1-cp314-cp314-macosx_10_13_universal2.whl", hash = "sha256:5e53b428f65ece6d9dad23cb87e64506392b720a0b45076c05354d27a13351a1"},
- {file = "wrapt-2.0.1-cp314-cp314-macosx_10_13_x86_64.whl", hash = "sha256:ad3ee9d0f254851c71780966eb417ef8e72117155cff04821ab9b60549694a55"},
- {file = "wrapt-2.0.1-cp314-cp314-macosx_11_0_arm64.whl", hash = "sha256:d7b822c61ed04ee6ad64bc90d13368ad6eb094db54883b5dde2182f67a7f22c0"},
- {file = "wrapt-2.0.1-cp314-cp314-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl", hash = "sha256:7164a55f5e83a9a0b031d3ffab4d4e36bbec42e7025db560f225489fa929e509"},
- {file = "wrapt-2.0.1-cp314-cp314-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:e60690ba71a57424c8d9ff28f8d006b7ad7772c22a4af432188572cd7fa004a1"},
- {file = "wrapt-2.0.1-cp314-cp314-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:3cd1a4bd9a7a619922a8557e1318232e7269b5fb69d4ba97b04d20450a6bf970"},
- {file = "wrapt-2.0.1-cp314-cp314-musllinux_1_2_aarch64.whl", hash = "sha256:b4c2e3d777e38e913b8ce3a6257af72fb608f86a1df471cb1d4339755d0a807c"},
- {file = "wrapt-2.0.1-cp314-cp314-musllinux_1_2_riscv64.whl", hash = "sha256:3d366aa598d69416b5afedf1faa539fac40c1d80a42f6b236c88c73a3c8f2d41"},
- {file = "wrapt-2.0.1-cp314-cp314-musllinux_1_2_x86_64.whl", hash = "sha256:c235095d6d090aa903f1db61f892fffb779c1eaeb2a50e566b52001f7a0f66ed"},
- {file = "wrapt-2.0.1-cp314-cp314-win32.whl", hash = "sha256:bfb5539005259f8127ea9c885bdc231978c06b7a980e63a8a61c8c4c979719d0"},
- {file = "wrapt-2.0.1-cp314-cp314-win_amd64.whl", hash = "sha256:4ae879acc449caa9ed43fc36ba08392b9412ee67941748d31d94e3cedb36628c"},
- {file = "wrapt-2.0.1-cp314-cp314-win_arm64.whl", hash = "sha256:8639b843c9efd84675f1e100ed9e99538ebea7297b62c4b45a7042edb84db03e"},
- {file = "wrapt-2.0.1-cp314-cp314t-macosx_10_13_universal2.whl", hash = "sha256:9219a1d946a9b32bb23ccae66bdb61e35c62773ce7ca6509ceea70f344656b7b"},
- {file = "wrapt-2.0.1-cp314-cp314t-macosx_10_13_x86_64.whl", hash = "sha256:fa4184e74197af3adad3c889a1af95b53bb0466bced92ea99a0c014e48323eec"},
- {file = "wrapt-2.0.1-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:c5ef2f2b8a53b7caee2f797ef166a390fef73979b15778a4a153e4b5fedce8fa"},
- {file = "wrapt-2.0.1-cp314-cp314t-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl", hash = "sha256:e042d653a4745be832d5aa190ff80ee4f02c34b21f4b785745eceacd0907b815"},
- {file = "wrapt-2.0.1-cp314-cp314t-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:2afa23318136709c4b23d87d543b425c399887b4057936cd20386d5b1422b6fa"},
- {file = "wrapt-2.0.1-cp314-cp314t-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:6c72328f668cf4c503ffcf9434c2b71fdd624345ced7941bc6693e61bbe36bef"},
- {file = "wrapt-2.0.1-cp314-cp314t-musllinux_1_2_aarch64.whl", hash = "sha256:3793ac154afb0e5b45d1233cb94d354ef7a983708cc3bb12563853b1d8d53747"},
- {file = "wrapt-2.0.1-cp314-cp314t-musllinux_1_2_riscv64.whl", hash = "sha256:fec0d993ecba3991645b4857837277469c8cc4c554a7e24d064d1ca291cfb81f"},
- {file = "wrapt-2.0.1-cp314-cp314t-musllinux_1_2_x86_64.whl", hash = "sha256:949520bccc1fa227274da7d03bf238be15389cd94e32e4297b92337df9b7a349"},
- {file = "wrapt-2.0.1-cp314-cp314t-win32.whl", hash = "sha256:be9e84e91d6497ba62594158d3d31ec0486c60055c49179edc51ee43d095f79c"},
- {file = "wrapt-2.0.1-cp314-cp314t-win_amd64.whl", hash = "sha256:61c4956171c7434634401db448371277d07032a81cc21c599c22953374781395"},
- {file = "wrapt-2.0.1-cp314-cp314t-win_arm64.whl", hash = "sha256:35cdbd478607036fee40273be8ed54a451f5f23121bd9d4be515158f9498f7ad"},
- {file = "wrapt-2.0.1-cp38-cp38-macosx_10_9_universal2.whl", hash = "sha256:90897ea1cf0679763b62e79657958cd54eae5659f6360fc7d2ccc6f906342183"},
- {file = "wrapt-2.0.1-cp38-cp38-macosx_10_9_x86_64.whl", hash = "sha256:50844efc8cdf63b2d90cd3d62d4947a28311e6266ce5235a219d21b195b4ec2c"},
- {file = "wrapt-2.0.1-cp38-cp38-macosx_11_0_arm64.whl", hash = "sha256:49989061a9977a8cbd6d20f2efa813f24bf657c6990a42967019ce779a878dbf"},
- {file = "wrapt-2.0.1-cp38-cp38-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl", hash = "sha256:09c7476ab884b74dce081ad9bfd07fe5822d8600abade571cb1f66d5fc915af6"},
- {file = "wrapt-2.0.1-cp38-cp38-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:d1a8a09a004ef100e614beec82862d11fc17d601092c3599afd22b1f36e4137e"},
- {file = "wrapt-2.0.1-cp38-cp38-musllinux_1_2_aarch64.whl", hash = "sha256:89a82053b193837bf93c0f8a57ded6e4b6d88033a499dadff5067e912c2a41e9"},
- {file = "wrapt-2.0.1-cp38-cp38-musllinux_1_2_x86_64.whl", hash = "sha256:f26f8e2ca19564e2e1fdbb6a0e47f36e0efbab1acc31e15471fad88f828c75f6"},
- {file = "wrapt-2.0.1-cp38-cp38-win32.whl", hash = "sha256:115cae4beed3542e37866469a8a1f2b9ec549b4463572b000611e9946b86e6f6"},
- {file = "wrapt-2.0.1-cp38-cp38-win_amd64.whl", hash = "sha256:c4012a2bd37059d04f8209916aa771dfb564cccb86079072bdcd48a308b6a5c5"},
- {file = "wrapt-2.0.1-cp39-cp39-macosx_10_9_universal2.whl", hash = "sha256:68424221a2dc00d634b54f92441914929c5ffb1c30b3b837343978343a3512a3"},
- {file = "wrapt-2.0.1-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:6bd1a18f5a797fe740cb3d7a0e853a8ce6461cc62023b630caec80171a6b8097"},
- {file = "wrapt-2.0.1-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:fb3a86e703868561c5cad155a15c36c716e1ab513b7065bd2ac8ed353c503333"},
- {file = "wrapt-2.0.1-cp39-cp39-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl", hash = "sha256:5dc1b852337c6792aa111ca8becff5bacf576bf4a0255b0f05eb749da6a1643e"},
- {file = "wrapt-2.0.1-cp39-cp39-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:c046781d422f0830de6329fa4b16796096f28a92c8aef3850674442cdcb87b7f"},
- {file = "wrapt-2.0.1-cp39-cp39-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:f73f9f7a0ebd0db139253d27e5fc8d2866ceaeef19c30ab5d69dcbe35e1a6981"},
- {file = "wrapt-2.0.1-cp39-cp39-musllinux_1_2_aarch64.whl", hash = "sha256:b667189cf8efe008f55bbda321890bef628a67ab4147ebf90d182f2dadc78790"},
- {file = "wrapt-2.0.1-cp39-cp39-musllinux_1_2_riscv64.whl", hash = "sha256:a9a83618c4f0757557c077ef71d708ddd9847ed66b7cc63416632af70d3e2308"},
- {file = "wrapt-2.0.1-cp39-cp39-musllinux_1_2_x86_64.whl", hash = "sha256:1e9b121e9aeb15df416c2c960b8255a49d44b4038016ee17af03975992d03931"},
- {file = "wrapt-2.0.1-cp39-cp39-win32.whl", hash = "sha256:1f186e26ea0a55f809f232e92cc8556a0977e00183c3ebda039a807a42be1494"},
- {file = "wrapt-2.0.1-cp39-cp39-win_amd64.whl", hash = "sha256:bf4cb76f36be5de950ce13e22e7fdf462b35b04665a12b64f3ac5c1bbbcf3728"},
- {file = "wrapt-2.0.1-cp39-cp39-win_arm64.whl", hash = "sha256:d6cc985b9c8b235bd933990cdbf0f891f8e010b65a3911f7a55179cd7b0fc57b"},
- {file = "wrapt-2.0.1-py3-none-any.whl", hash = "sha256:4d2ce1bf1a48c5277d7969259232b57645aae5686dba1eaeade39442277afbca"},
- {file = "wrapt-2.0.1.tar.gz", hash = "sha256:9c9c635e78497cacb81e84f8b11b23e0aacac7a136e73b8e5b2109a1d9fc468f"},
-]
-markers = {main = "extra == \"deltalake\" or extra == \"formats\" or extra == \"all\""}
-
-[package.extras]
-dev = ["pytest", "setuptools"]
-
[[package]]
name = "xmltodict"
version = "1.0.2"
@@ -3722,16 +3230,10 @@ files = [
test = ["pytest", "pytest-cov"]
[extras]
-all = ["azure-storage-blob", "boto3", "deltalake", "duckdb", "fastavro", "google-auth", "google-cloud-bigquery", "google-cloud-storage", "jsonschema", "mysql-connector-python", "psycopg2-binary", "pyodbc", "scipy", "snowflake-connector-python", "sqlalchemy"]
-avro = ["fastavro"]
-azure = ["azure-storage-blob"]
+all = ["boto3", "google-auth", "google-cloud-bigquery", "jsonschema", "mysql-connector-python", "psycopg2-binary", "pyodbc", "scipy", "snowflake-connector-python", "sqlalchemy"]
bigquery = ["google-auth", "google-cloud-bigquery"]
-cloud = ["azure-storage-blob", "boto3", "google-auth", "google-cloud-storage"]
-databases = ["duckdb", "mysql-connector-python", "psycopg2-binary", "pyodbc", "sqlalchemy"]
-deltalake = ["deltalake"]
-duckdb = ["duckdb"]
-formats = ["deltalake", "duckdb", "fastavro"]
-gcs = ["google-auth", "google-cloud-storage"]
+cloud = ["boto3"]
+databases = ["mysql-connector-python", "psycopg2-binary", "pyodbc", "sqlalchemy"]
mssql = ["pyodbc", "sqlalchemy"]
mysql = ["mysql-connector-python", "sqlalchemy"]
postgres = ["psycopg2-binary", "sqlalchemy"]
@@ -3746,4 +3248,4 @@ warehouses = ["boto3", "google-auth", "google-cloud-bigquery", "psycopg2-binary"
[metadata]
lock-version = "2.1"
python-versions = ">=3.10,<4.0"
-content-hash = "57f5ef0f9def33af10db85c6ef1e1473eaf563d6ebe5aec769b7e96cf91d1782"
+content-hash = "3ec4ffd949a462e23313f979e5cfdbf09997f47dccaf83d0b30709bde33c1cf8"
From 9f78e85156ff3a9317be5c04ffee42a1615fcc3c Mon Sep 17 00:00:00 2001
From: yash-chauhan-dev
Date: Tue, 24 Feb 2026 12:27:32 +0530
Subject: [PATCH 24/25] Fix all mypy type errors
- builder.py: use separate variable name for int(params) in min/max_length
to avoid type conflict with str-typed v used elsewhere in the function
- sample_data.py: add type annotation to nested seg() helper and data list
- engine.py: cast to_dict() result to dict[str, Any] for parse_results()
- loader.py: add type: ignore[call-overload] on pd.read_csv calls where
**kwargs spread prevents pandas-stubs overload resolution
- poetry.lock: regenerated after types-PyYAML and pandas-stubs were installed
Co-Authored-By: Claude Sonnet 4.6
---
datacheck/config/sample_data.py | 4 ++--
datacheck/engine.py | 2 +-
datacheck/loader.py | 4 ++--
datacheck/sql_pushdown/builder.py | 8 ++++----
4 files changed, 9 insertions(+), 9 deletions(-)
diff --git a/datacheck/config/sample_data.py b/datacheck/config/sample_data.py
index beb87c3..0464f94 100644
--- a/datacheck/config/sample_data.py
+++ b/datacheck/config/sample_data.py
@@ -56,7 +56,7 @@ def _random_datetime(start: date, end: date) -> str:
def _random_uuid() -> str:
h = "0123456789abcdef"
- def seg(n):
+ def seg(n: int) -> str:
return "".join(random.choice(h) for _ in range(n))
return f"{seg(8)}-{seg(4)}-4{seg(3)}-{random.choice('89ab')}{seg(3)}-{seg(12)}"
@@ -327,7 +327,7 @@ def generate_iot_data(num_rows: int = 1000) -> list[dict[str, Any]]:
today = date.today()
start = date(2024, 1, 1)
used_combos: set[tuple[str, str]] = set()
- data = []
+ data: list[dict[str, Any]] = []
i = 0
while len(data) < num_rows:
diff --git a/datacheck/engine.py b/datacheck/engine.py
index 1df26ab..c7b24a7 100644
--- a/datacheck/engine.py
+++ b/datacheck/engine.py
@@ -392,7 +392,7 @@ def validate_sources(
tbl, where, pushable, _dialect
)
_pd_result = _connector.execute_query(_sql)
- _pd_row = _pd_result.iloc[0].to_dict()
+ _pd_row: dict[str, Any] = {str(k): v for k, v in _pd_result.iloc[0].to_dict().items()}
_sql_results = _builder.parse_results(_pd_row, pushable)
all_results.extend(_sql_results)
if not non_pushable:
diff --git a/datacheck/loader.py b/datacheck/loader.py
index 842acfd..1a4b954 100644
--- a/datacheck/loader.py
+++ b/datacheck/loader.py
@@ -124,7 +124,7 @@ def load(self) -> pd.DataFrame:
usecols_kwarg = {"usecols": self.columns} if self.columns is not None else {}
try:
# Use PyArrow engine for faster CSV parsing + Arrow-backed dtypes
- df: pd.DataFrame = pd.read_csv(
+ df: pd.DataFrame = pd.read_csv( # type: ignore[call-overload]
self.file_path,
encoding=encoding,
delimiter=self.delimiter,
@@ -135,7 +135,7 @@ def load(self) -> pd.DataFrame:
)
except Exception:
# Fallback to default engine for exotic encodings or edge cases
- df = pd.read_csv(
+ df = pd.read_csv( # type: ignore[call-overload]
self.file_path,
encoding=encoding,
delimiter=self.delimiter,
diff --git a/datacheck/sql_pushdown/builder.py b/datacheck/sql_pushdown/builder.py
index 23cdd88..33c6f53 100644
--- a/datacheck/sql_pushdown/builder.py
+++ b/datacheck/sql_pushdown/builder.py
@@ -216,22 +216,22 @@ def _rule_to_sql(
]
if rule_type == "min_length":
- v = int(params)
+ n = int(params)
length_expr = dialect.str_length(dialect.cast_to_text(col))
return [
(alias_prefix,
f"SUM(CASE WHEN {col} IS NOT NULL"
- f" AND {length_expr} < {v}"
+ f" AND {length_expr} < {n}"
f" THEN 1 ELSE 0 END)")
]
if rule_type == "max_length":
- v = int(params)
+ n = int(params)
length_expr = dialect.str_length(dialect.cast_to_text(col))
return [
(alias_prefix,
f"SUM(CASE WHEN {col} IS NOT NULL"
- f" AND {length_expr} > {v}"
+ f" AND {length_expr} > {n}"
f" THEN 1 ELSE 0 END)")
]
From 3a6c54cc83ce34f53444109afb6289a774a6a5fb Mon Sep 17 00:00:00 2001
From: yash-chauhan-dev
Date: Tue, 24 Feb 2026 22:42:22 +0530
Subject: [PATCH 25/25] docs updated
---
docs/index.md | 1675 ++++++++++++++++++++++++++-----------------------
1 file changed, 897 insertions(+), 778 deletions(-)
diff --git a/docs/index.md b/docs/index.md
index b2479a2..99c277b 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -1,51 +1,67 @@
-# DataCheck - A Linter for Data Pipelines
+# DataCheck
-**Enforce data quality rules at the pipeline boundary.** Define rules in YAML. Run in CI. Fail fast on bad data. No servers, no dashboards, no infrastructure.
-
-```bash
-pip install datacheck-cli
-```
+DataCheck is a deterministic validation engine for data pipelines. Define rules in YAML, run validation in CI, and enforce data quality contracts through POSIX exit codes. No servers, no dashboards, no anomaly detection.
```
-Your data source -> [DataCheck rules] -> exit 0: pipeline continues
- -> exit 1: pipeline stops
+data source → [validation rules] → exit 0: pipeline continues
+ → exit 1: pipeline fails
```
-DataCheck provides the `datacheck` CLI and a Python API to enforce validation rules and schema contracts. Embed it in CI/CD pipelines, run it in Airflow DAGs, or use it locally before pushing data. Rules are binary, deterministic, and config-driven - no statistical anomaly detection, no dashboards, no infrastructure required.
+DataCheck executes rules in one pass over the data and produces a binary pass/fail signal. Rules are config-driven, deterministic, and reproducible — the same input always produces the same result. For database sources, eligible rules compile into a single aggregate SQL query and execute entirely inside the database engine.
+
+---
+
+## 1. Overview
+
+DataCheck runs as a CLI command or Python library. It reads a YAML config that defines validation checks, loads data from a file or database connection, evaluates every active rule, and exits with a POSIX code that encodes the outcome.
+
+**Where it runs**: Inside existing pipeline compute. It has no server component, no background process, and no external dependency at runtime. It runs wherever Python 3.10+ and the relevant database drivers are installed.
+
+**How it integrates**: CI/CD systems, Airflow operators, pre-commit hooks, and deployment gates all consume POSIX exit codes. DataCheck uses exit 0 for pass and exit 1 for error-severity failures. Any non-zero exit code aborts the pipeline step in standard CI environments.
+
+**What it does not do**: DataCheck does not perform statistical anomaly detection, machine learning inference, or schema drift suggestion. It enforces rules that were explicitly defined. A rule either passes or fails.
---
-## Installation
+## 2. Installation
### Requirements
-- **Python 3.10, 3.11, or 3.12**
-- **pip 21.0 or greater**
+- Python `>=3.10, <4.0`
+- pip 21.0 or greater
-### Install
+### Base install
```bash
pip install datacheck-cli
```
-### Install with extras
+The base install supports CSV and Parquet validation. No database connectivity is included.
+
+### Extras
Install only the connectors you need:
```bash
-# Databases
-pip install datacheck-cli[postgresql]
-pip install datacheck-cli[mysql]
-pip install datacheck-cli[mssql]
+# Individual database connectors
+pip install datacheck-cli[postgres]      # psycopg2-binary + SQLAlchemy
+pip install datacheck-cli[mysql] # mysql-connector-python + SQLAlchemy
+pip install datacheck-cli[mssql] # pyodbc + SQLAlchemy
+
+# All three databases
+pip install datacheck-cli[databases]
# Cloud warehouses
-pip install datacheck-cli[snowflake]
-pip install datacheck-cli[bigquery]
-pip install datacheck-cli[redshift]
-pip install datacheck-cli[warehouses] # All three warehouses
+pip install datacheck-cli[snowflake] # snowflake-connector-python
+pip install datacheck-cli[bigquery] # google-cloud-bigquery + google-auth
+pip install datacheck-cli[redshift] # boto3 + psycopg2-binary + SQLAlchemy
+
+# All three warehouses
+pip install datacheck-cli[warehouses]
# Cloud storage
-pip install datacheck-cli[cloud] # S3
+pip install datacheck-cli[s3] # boto3
+pip install datacheck-cli[cloud] # alias for s3
# Everything
pip install datacheck-cli[all]
@@ -59,197 +75,341 @@ datacheck version
---
-## Quickstart
+## 3. Core Concepts
-### 1. Generate a config with sample data
+### Rules
-```bash
-datacheck config init --with-sample-data
+A rule is a single constraint applied to a column. Rules are identified by type (`not_null`, `min`, `regex`, etc.) and parameterized inline. Each rule produces one `RuleResult`: pass, fail, or execution error.
+
+```yaml
+rules:
+ not_null: true
+ min: 0
+ regex: '^[A-Z]{2}[0-9]{4}$'
```
-This creates a `datacheck.yaml` config file and a sample CSV file. Use `--template` to pick an industry template:
+### Checks
-```bash
-datacheck config init --template ecommerce --with-sample-data
+A check groups one or more rules targeting a single column. Every check has a `name`, `column`, and `rules` map. Optional fields: `severity`, `enabled`, `description`, `source`, `table`.
+
+```yaml
+checks:
+ - name: order_id
+ column: id
+ rules:
+ not_null: true
+ unique: true
+ severity: error
```
-### 2. Run validation
+When a check defines multiple rules, each rule is evaluated independently and produces its own result entry.
-```bash
-datacheck validate
-```
+### Severity levels
-DataCheck auto-discovers config files in this order: `.datacheck.yaml` → `.datacheck.yml` → `datacheck.yaml` → `datacheck.yml`. To specify a config explicitly:
+| Level | Default | Effect on exit code |
+|-------|---------|---------------------|
+| `error` | Yes | Failure causes exit 1 |
+| `warning` | No | Failure reported; exit code unaffected |
+| `info` | No | Failure reported; exit code unaffected |
-```bash
-datacheck validate --config checks.yaml
+Only `error`-severity rule failures cause a non-zero exit code. Warning and info violations appear in output but do not block the pipeline.
+
+### Exit codes
+
+| Code | Condition |
+|------|-----------|
+| `0` | All `error`-severity rules passed (warnings and info violations are allowed) |
+| `1` | At least one `error`-severity rule failed |
+| `2` | Configuration error (invalid YAML, unknown rule type, missing required field) |
+| `3` | Data load error (file not found, connection refused, bad credentials) |
+| `4` | Execution error (rule threw an exception, unexpected runtime failure) |
+
+Exit codes are stable contracts. They do not change between patch releases.
+
+### SQL pushdown model
+
+For database sources that support it, DataCheck compiles all eligible rules into a single aggregate `SELECT` statement. This query executes inside the database engine and returns only violation counts — no rows are transferred to the client.
+
+```sql
+-- Generated by DataCheck for a PostgreSQL source
+SELECT
+ COUNT(*) AS _total_rows,
+ SUM(CASE WHEN "amount" IS NULL THEN 1 ELSE 0 END) AS _c0_not_null,
+ SUM(CASE WHEN "amount" IS NOT NULL AND "amount" < 0 THEN 1 ELSE 0 END) AS _c0_min,
+ SUM(CASE WHEN "status" IS NOT NULL AND "status"::text NOT IN ('active','inactive') THEN 1 ELSE 0 END) AS _c1_allowed_values
+FROM "orders"
+WHERE created_at > '2026-01-01'
```
-### 3. Minimal config example
+Rules that cannot be expressed as SQL aggregates (e.g., `date_format_valid`, `foreign_key_exists`) fall back to the in-process Python path, which requires loading the relevant rows.
-```yaml
-# .datacheck.yaml
+### Fail-fast behavior
-data_source:
- type: csv
- path: ./data/orders.csv
+DataCheck does not fail fast on individual rule failures. All rules run to completion and results are aggregated. The exit code reflects the combined outcome. This behavior is intentional: a single validation run reports all failures simultaneously.
-checks:
- - name: id_check
- column: id
- rules:
- not_null: true
- unique: true
+Data load errors and configuration errors abort immediately before any rules execute.
- - name: amount_check
- column: amount
- rules:
- not_null: true
- min: 0
- max: 10000
+### Enforcement boundary model
-```
+DataCheck enforces rules at the point where it is invoked. It does not monitor, poll, or persist state between runs. Each invocation is stateless and self-contained. Schema baselines (see Section 9) are the only persistent artifact; they are plain JSON files under version control.
---
-## Configuration
+## 4. Configuration
-### Config file structure
+### Config file discovery
+
+When no `--config` flag is provided, DataCheck searches the current working directory for config files in this order:
+
+1. `.datacheck.yaml`
+2. `.datacheck.yml`
+3. `datacheck.yaml`
+4. `datacheck.yml`
+
+The first match wins. If no config file is found, DataCheck exits with code 2.
-A `.datacheck.yaml` file can contain:
+### Config file structure
```yaml
-# Data source (inline, for file-based sources)
+# Optional: schema version
+version: "1.0"
+
+# Optional: metadata (informational only)
+metadata:
+ author: "data-eng-team"
+ description: "Order pipeline validation"
+ tags: ["orders", "production"]
+
+# Optional: inherit checks from a base config
+extends: base.yaml
+
+# Inline file-based data source
data_source:
- type: csv
+ type: csv # csv | parquet
path: ./data/orders.csv
- options:
+ options: # passed to pandas loader
delimiter: ","
encoding: utf-8
-# Or reference named sources
+# OR reference named sources from a separate file
sources_file: sources.yaml
-source: production_db
-table: orders
+source: production_db # default source name
+table: orders # default table
-# Validation checks
+# Required: validation checks
checks:
- - name: id_check
+ - name: order_id
column: id
rules:
not_null: true
unique: true
- severity: error # error (default), warning, info
- enabled: true # default: true
+ severity: error
+ enabled: true
+ description: "Primary key must be present and unique"
-# Config inheritance
-extends: base.yaml
+ - name: order_amount
+ column: amount
+ rules:
+ not_null: true
+ min: 0
+ max: 1000000
+ severity: error
+
+ - name: status_values
+ column: status
+ rules:
+ allowed_values:
+ - pending
+ - confirmed
+ - shipped
+ - cancelled
+ severity: warning
-# Reporting
+# Optional: output and reporting
reporting:
- output_path: ./reports
- export_failures: true
- failures_file: failures.csv
+ output_path: ./reports # directory for output files
+ export_failures: true # auto-export failures to CSV
+ failures_file: failures.csv # explicit failures CSV path
-# Notifications
+# Optional: Slack notifications
notifications:
slack_webhook: "${SLACK_WEBHOOK}"
- mention_on_failure: true
-
+ mention_on_failure: false
```
-### Checks definition
+### data_source definition
-Each check targets a column and applies one or more rules:
+The `data_source` block is used for file-based sources embedded in the config. For database sources, use `sources_file` + `source` instead.
-```yaml
-checks:
- - name: order_amount # Rule identifier
- column: amount # Target column
- rules:
- not_null: true # Rule type → parameters
- min: 0
- max: 100000
- severity: error # error (default), warning, info
- enabled: true # Toggle check on/off
-
- - name: warehouse_orders
- column: total
- source: snowflake_wh # Override source for this check
- table: orders # Override table for this check
- rules:
- min: 0
-```
+| Field | Required | Description |
+|-------|----------|-------------|
+| `type` | Yes | `csv` or `parquet` |
+| `path` | Yes | Path to file (relative to config dir) |
+| `options` | No | Loader keyword args passed to pandas |
-### Severity levels
+### checks definition
+
+Each item in `checks` requires `name`, `column`, and `rules`. All other fields are optional.
+
+| Field | Required | Default | Description |
+|-------|----------|---------|-------------|
+| `name` | Yes | — | Unique identifier for this check |
+| `column` | Yes | — | Column name in the dataset |
+| `rules` | Yes | — | Map of rule type → parameters |
+| `severity` | No | `error` | `error`, `warning`, or `info` |
+| `enabled` | No | `true` | Set to `false` to skip this check |
+| `description` | No | — | Human-readable description |
+| `source` | No | — | Override default named source for this check |
+| `table` | No | — | Override default table for this check |
-| Severity | Effect |
-|----------|--------|
-| `error` (default) | Causes exit code 1 on failure |
-| `warning` | Reported but does not fail the run |
-| `info` | Informational only |
+### Rules syntax
-Only `error`-severity failures cause a non-zero exit code.
+Rules are expressed as a map under each check's `rules` key. The key is the rule type; the value is the parameter (or `true` for boolean rules):
-### Environment variables
+```yaml
+rules:
+ not_null: true # boolean flag
+ unique: true
+ min: 0 # scalar
+ max: 10000
+ range: {min: 0, max: 10000} # dict
+ regex: '^[A-Z0-9]{8,16}$' # string
+ allowed_values: [active, inactive, pending] # list
+ min_length: 2
+ max_length: 64
+ max_age: '24h'
+ timestamp_range: {min: "2020-01-01", max: "2030-12-31"}
+ type: string
+```
-Config files support environment variable substitution:
+### Environment variable substitution
+
+Config files support shell-style variable substitution. Applies to all string values in both `datacheck.yaml` and `sources.yaml`.
```yaml
sources:
- production_db:
+ production:
type: postgresql
- host: ${DB_HOST} # Required — fails if not set
- port: ${DB_PORT:-5432} # Optional — uses default 5432
+ host: ${DB_HOST} # required — empty string if unset
+ port: ${DB_PORT:-5432} # optional — falls back to 5432
database: ${DB_NAME}
user: ${DB_USER}
password: ${DB_PASSWORD}
```
-Use `datacheck config env` to list all variables referenced in a config and their current values:
-
-```bash
-datacheck config env datacheck.yaml
-```
+| Syntax | Behavior |
+|--------|----------|
+| `${VAR}` | Substituted with the env var value; empty string if unset |
+| `${VAR:-default}` | Uses `default` if `VAR` is unset or empty |
### Config inheritance
-Use `extends` to inherit rules from a base config and override or add checks per environment:
+Use `extends` to inherit checks from a base config and add or override checks per environment:
```yaml
-# base.yaml — shared rules
+# base.yaml
data_source:
- type: csv
- path: ./data/orders.csv
+ type: parquet
+ path: ./data/transactions.parquet
checks:
- - name: id_check
- column: id
+ - name: tx_id
+ column: transaction_id
rules:
not_null: true
unique: true
```
```yaml
-# production.yaml — inherits base, adds stricter rules
+# production.yaml
extends: base.yaml
checks:
- - name: amount_check
+ - name: tx_amount
column: amount
rules:
- min: 0
- max: 50000
+ positive: true
+ max: 500000
severity: error
```
-For a complete walkthrough of every config field, all data source types, the full rules reference, per-environment patterns, and troubleshooting, see the **[Config File Guide](./config-guide)**.
+Checks in the child config are merged with the parent. The child takes precedence on name conflicts.
+
+---
+
+## 5. Supported Rules
+
+### Null and uniqueness
+
+| Rule | Parameter | Description |
+|------|-----------|-------------|
+| `not_null` | `true` | Fails if any value is null or missing |
+| `unique` | `true` | Fails if any duplicate values exist; nulls are excluded from uniqueness check |
+| `unique_combination` | `[col1, col2, ...]` | Fails if any combination of the listed columns is duplicated; null rows are excluded |
+
+### Numeric
+
+| Rule | Parameter | Description |
+|------|-----------|-------------|
+| `min` | number | Fails if any non-null value is less than the threshold |
+| `max` | number | Fails if any non-null value is greater than the threshold |
+| `range` | `{min: N, max: N}` | Fails if any non-null value falls outside the inclusive range |
+| `positive` | `true` | Fails if any non-null value is `<= 0` |
+| `non_negative` | `true` | Fails if any non-null value is `< 0` |
+
+Null values are always skipped by numeric rules. To enforce non-null numeric bounds, combine with `not_null: true`.
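+
+For example, to require that a column is both present and within bounds, combine the rules in one check:
+
+```yaml
+checks:
+  - name: amount_bounds
+    column: amount
+    rules:
+      not_null: true              # nulls fail here, not in the numeric rules
+      range: {min: 0, max: 10000}
+```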
+
+### String and pattern
+
+| Rule | Parameter | Description |
+|------|-----------|-------------|
+| `regex` | regex string | Fails if any non-null value does not match the pattern |
+| `allowed_values` | list | Fails if any non-null value is not in the list |
+| `min_length` | integer | Fails if any non-null string has fewer characters than the threshold |
+| `max_length` | integer | Fails if any non-null string has more characters than the threshold |
+| `type` | type string | Fails if the column's detected type does not match; accepts `int` (or `integer`), `float` (or `numeric`), `string`, `bool`, `date`, `datetime` |
+| `boolean` | `true` | Fails if any non-null value is not a boolean representation (`True`/`False`, `1`/`0`, `true`/`false`) |
+
+`regex` is applied per-value. Null values are skipped. Pattern matching is case-sensitive unless the pattern includes inline flags.
+
+### Temporal
+
+| Rule | Parameter | Description |
+|------|-----------|-------------|
+| `max_age` | duration string | Fails if any non-null timestamp is older than the specified duration from now |
+| `timestamp_range` | `{min: "ISO8601", max: "ISO8601"}` | Fails if any non-null timestamp falls outside the inclusive range |
+| `date_range` | `{min: "ISO8601", max: "ISO8601"}` | Equivalent to `timestamp_range`; use for date-only columns |
+| `no_future_timestamps` | `true` | Fails if any non-null timestamp is greater than the current time at execution |
+| `date_format_valid` | strftime string | Fails if any non-null value cannot be parsed with the given format |
+| `date_format` | `{format: strftime string}` | Alternate dict form of `date_format_valid` |
+
+**Duration syntax for `max_age`**: A numeric value followed by a unit suffix.
+
+| Suffix | Unit |
+|--------|------|
+| `m` | minutes |
+| `h` | hours |
+| `d` | days |
+| `w` | weeks |
+
+Examples: `'15m'`, `'24h'`, `'7d'`, `'4w'`
+
+### Relationship
+
+| Rule | Parameter | Description |
+|------|-----------|-------------|
+| `foreign_key_exists` | Python API only | Fails if any value in the column does not exist in a reference dataset |
+| `sum_equals` | `{column_a: col, column_b: col, tolerance: float}` | Fails for any row where `column_a + column_b` differs from the target column by more than the tolerance (default `0.01`) |
+
+`foreign_key_exists` requires a reference dataset passed via the Python API. It is not usable from CLI config alone.
+
+`sum_equals` is applied row-by-row, not on aggregates: each row passes only if its `column_a + column_b` equals that row's value in the target column, within the tolerance.
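+
+A sketch of the YAML form, assuming an orders dataset with `subtotal`, `tax`, and `total` columns (names illustrative):
+
+```yaml
+checks:
+  - name: total_reconciles
+    column: total                 # target column each row is checked against
+    rules:
+      sum_equals:
+        column_a: subtotal
+        column_b: tax
+        tolerance: 0.01           # absolute per-row tolerance
+```
+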
---
-## Data Sources
+## 6. Data Sources
-### File sources (inline in config)
+### File-based sources
**CSV**
@@ -270,892 +430,851 @@ data_source:
path: ./data/orders.parquet
```
-### Database sources (named sources)
+DataCheck uses column pruning for both CSV and Parquet: only columns referenced by checks are loaded. For wide files, this reduces memory usage in proportion to the number of unused columns.
-For databases, define named sources in a `sources.yaml` file:
+### Database sources via sources.yaml
+
+Database connections are defined in a separate `sources.yaml` file and referenced by name. This separates credentials from validation logic.
```yaml
# sources.yaml
sources:
production_db:
type: postgresql
- host: ${DB_HOST}
- port: ${DB_PORT:-5432}
- database: ${DB_NAME}
- user: ${DB_USER}
- password: ${DB_PASSWORD}
+ host: ${PG_HOST}
+ port: ${PG_PORT:-5432}
+ database: ${PG_DATABASE}
+ user: ${PG_USER}
+ password: ${PG_PASSWORD}
schema: public
- mysql_db:
- type: mysql
- host: ${MYSQL_HOST}
- port: ${MYSQL_PORT:-3306}
- database: ${MYSQL_DB}
- user: ${MYSQL_USER}
- password: ${MYSQL_PASSWORD}
-
- mssql_db:
- type: mssql
- host: ${MSSQL_HOST}
- port: ${MSSQL_PORT:-1433}
- database: ${MSSQL_DB}
- user: ${MSSQL_USER}
- password: ${MSSQL_PASSWORD}
-```
-
-### Cloud warehouse sources
-
-```yaml
-# sources.yaml
-sources:
- snowflake_wh:
+ warehouse:
type: snowflake
account: ${SF_ACCOUNT}
user: ${SF_USER}
password: ${SF_PASSWORD}
- warehouse: ${SF_WAREHOUSE:-COMPUTE_WH}
database: ${SF_DATABASE}
schema: ${SF_SCHEMA:-PUBLIC}
+ warehouse: ${SF_WAREHOUSE}
role: ${SF_ROLE}
- # SSO: authenticator: externalbrowser
- # Key pair: private_key_path: /path/to/key.p8
- bigquery_ds:
+ analytics:
type: bigquery
project_id: ${GCP_PROJECT}
dataset_id: ${GCP_DATASET}
- credentials_path: /path/to/service-account.json
+ credentials_path: /secrets/bq-service-account.json
location: US
+ mysql_db:
+ type: mysql
+ host: ${MYSQL_HOST}
+ port: ${MYSQL_PORT:-3306}
+ database: ${MYSQL_DATABASE}
+ user: ${MYSQL_USER}
+ password: ${MYSQL_PASSWORD}
+
+ sqlserver_db:
+ type: mssql
+ host: ${MSSQL_HOST}
+ port: ${MSSQL_PORT:-1433}
+ database: ${MSSQL_DATABASE}
+ user: ${MSSQL_USER}
+ password: ${MSSQL_PASSWORD}
+
redshift_db:
type: redshift
- host: ${REDSHIFT_HOST}
- port: ${REDSHIFT_PORT:-5439}
- database: ${REDSHIFT_DB}
- user: ${REDSHIFT_USER}
- password: ${REDSHIFT_PASSWORD}
+ host: ${RS_HOST}
+ port: ${RS_PORT:-5439}
+ database: ${RS_DATABASE}
+ user: ${RS_USER}
+ password: ${RS_PASSWORD}
schema: public
- # IAM auth: cluster_identifier, region, iam_auth: true
+ # IAM auth (optional):
+ # cluster_identifier: ${RS_CLUSTER}
+ # region: ${AWS_REGION}
+ # iam_auth: true
```
-Snowflake, BigQuery, and Redshift support **server-side filtering** — WHERE clauses and LIMIT execute on the warehouse to minimize data transfer before validation runs locally.
+Reference in config:
+
+```yaml
+# datacheck.yaml
+sources_file: sources.yaml
+source: production_db
+table: orders
+
+checks:
+ - name: order_id
+ column: id
+ rules:
+ not_null: true
+ unique: true
+```
-### Cloud storage sources
+### Cloud storage
```yaml
-# sources.yaml
sources:
- s3_data:
+ s3_source:
type: s3
- bucket: my-bucket
- path: data/orders.csv
- region: us-east-1
+ bucket: ${S3_BUCKET}
+ path: data/orders.csv # key within the bucket
+ region: ${AWS_REGION:-us-east-1}
access_key: ${AWS_ACCESS_KEY_ID}
secret_key: ${AWS_SECRET_ACCESS_KEY}
-
-```
-
-### Connection strings
-
-You can also pass connection strings directly to the CLI:
-
-```bash
-datacheck validate postgresql://user:pass@host:5432/db --table orders
-datacheck validate mysql://user:pass@host:3306/db --table orders
-datacheck validate mssql://user:pass@host:1433/database --table orders
-datacheck validate snowflake://account/database/schema --table orders
-datacheck validate bigquery://project/dataset --table orders
-datacheck validate redshift://user:pass@host:5439/database/schema --table orders
```
-### Named sources and per-check overrides
+### Named sources
-Reference a named source in your config:
+Reference a named source in config to use it as the default:
```yaml
-# .datacheck.yaml
sources_file: sources.yaml
source: production_db
table: orders
-
-checks:
- - name: customer_email
- column: email
- rules:
- not_null: true
-
- - name: order_total
- column: total
- source: snowflake_wh # Override source for this check
- table: orders
- rules:
- min: 0
```
-Switch sources at runtime:
+Override at runtime without modifying the config:
```bash
-datacheck validate --source snowflake_wh --config checks.yaml
-datacheck validate --source s3_data --sources-file sources.yaml
+datacheck validate --source staging_db --table orders --config checks.yaml
+datacheck validate --source analytics --table transactions --config checks.yaml
```
-### Connection pre-validation
+Per-check source overrides allow validating columns from different sources in a single run:
-When validating against database sources, DataCheck tests connectivity for **all** referenced sources before running any validation rules. If multiple sources are unreachable, all connection errors are reported together:
+```yaml
+checks:
+ - name: local_id
+ column: id
+ source: production_db
+ table: orders
+ rules:
+ not_null: true
+ - name: warehouse_total
+ column: revenue
+ source: warehouse
+ table: daily_revenue
+ rules:
+ positive: true
```
-Source connectivity check failed:
- - Source 'production_db' (postgresql): Connection failed — could not connect to server
- - Source 'analytics_wh' (snowflake): Connection failed — invalid credentials
-```
-For file-based sources, DataCheck verifies the file exists before validation begins.
+### Switching sources at runtime
+
+```bash
+# Override source
+datacheck validate --source production_db --table orders
-### SQL filtering
+# Override table
+datacheck validate --source production_db --table refunds
-Use `--table`, `--where`, and `--query` for server-side filtering:
+# Override with WHERE clause
+datacheck validate --source production_db --table orders --where "status = 'pending'"
-```bash
-datacheck validate --source production_db --table orders --where "status = 'active'"
+# Override with custom query
datacheck validate --source production_db --query "SELECT * FROM orders WHERE created_at > '2026-01-01'"
```
----
-
-## Validation Rules
+When `--query` is specified, SQL pushdown is disabled for that run regardless of source type.
-### Null and uniqueness
+---
-| Rule | YAML Syntax | Description |
-|------|------------|-------------|
-| `not_null` | `not_null: true` | No null or missing values |
-| `unique` | `unique: true` | No duplicate values (nulls ignored) |
-| `unique_combination` | `unique_combination: [col1, col2]` | Composite uniqueness across columns |
+## 7. CLI Reference
-### Numeric
+### `datacheck validate`
-| Rule | YAML Syntax | Description |
-|------|------------|-------------|
-| `min` | `min: 0` | Column >= value |
-| `max` | `max: 10000` | Column <= value |
-| `range` | `range: {min: 0, max: 100}` | Column value within inclusive range |
-| `positive` | `positive: true` | Column value > 0 |
-| `non_negative` | `non_negative: true` | Column value >= 0 |
-| `boolean` | `boolean: true` | Column contains only boolean values (`True`/`False`, `1`/`0`) |
+Run validation against a data source. The primary command.
-### String and pattern
+```
+datacheck validate [DATA_SOURCE] [OPTIONS]
+```
-| Rule | YAML Syntax | Description |
-|------|------------|-------------|
-| `regex` | `regex: '^[A-Z]{2}[0-9]{4}$'` | Match regex pattern |
-| `allowed_values` | `allowed_values: [active, inactive, pending]` | Value in allowed set |
-| `type` | `type: 'string'` | Data type check (`int`, `numeric`, `string`, `bool`, `date`, `datetime`) |
-| `length` | `length: {min: 1, max: 100}` | String length constraints |
-| `min_length` | `min_length: 1` | Minimum string length |
-| `max_length` | `max_length: 255` | Maximum string length |
+`DATA_SOURCE` is an optional positional argument: a file path or connection string. If omitted, the source is resolved from the config's `data_source`, `source`, or `sources_file` fields.
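+
+Both forms in practice (paths and names are illustrative):
+
+```bash
+# Positional source: file path given directly
+datacheck validate ./data/orders.csv --config checks.yaml
+
+# No positional source: resolved from the config's source/sources_file fields
+datacheck validate --config checks.yaml
+```
+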
-### Temporal
+**Data source options**
-| Rule | YAML Syntax | Description |
-|------|------------|-------------|
-| `max_age` | `max_age: '24h'` | Data freshness — supports `h` (hours), `d` (days), `w` (weeks), `m` (minutes) |
-| `timestamp_range` | `timestamp_range: {min: "2025-01-01", max: "2026-12-31"}` | Timestamps within range (ISO format) |
-| `date_range` | `date_range: {min: "2025-01-01", max: "2026-12-31"}` | Alias for `timestamp_range` |
-| `no_future_timestamps` | `no_future_timestamps: true` | No timestamps beyond current time |
-| `date_format_valid` | `date_format_valid: '%Y-%m-%d'` | Validates date format (Python strftime) |
-| `date_format` | `date_format: {format: '%Y-%m-%d'}` | Alias for `date_format_valid` (dict form) |
-### Cross-column and relationships
+| Flag | Short | Description |
+|------|-------|-------------|
+| `--config` | `-c` | Path to validation config YAML. Auto-discovered if omitted. |
+| `--source` | | Named source from sources.yaml |
+| `--sources-file` | | Path to sources YAML (overrides config `sources_file`) |
+| `--table` | `-t` | Database table name |
+| `--where` | `-w` | SQL WHERE clause |
+| `--query` | `-q` | Custom SQL query; disables SQL pushdown |
+| `--schema` | `-s` | Schema or dataset name |
-| Rule | YAML Syntax | Description |
-|------|------------|-------------|
-| `unique_combination` | `unique_combination: [col1, col2]` | Composite uniqueness across multiple columns |
-| `foreign_key_exists` | Python API | Foreign key validation against a reference DataFrame (use Python API to pass live data) |
-| `sum_equals` | `sum_equals: {column_a: col1, column_b: col2}` | Verify column equals sum of two other columns (with optional `tolerance`) |
+**Warehouse-specific options**
-### Example: complete config with rules
+| Flag | Description |
+|------|-------------|
+| `--warehouse` | Snowflake warehouse name |
+| `--credentials` | Path to credentials file (BigQuery service account JSON) |
+| `--region` | AWS region for Redshift IAM auth |
+| `--cluster` | Cluster identifier for Redshift IAM auth |
+| `--iam-auth` | Use IAM authentication for Redshift |
-```yaml
-data_source:
- type: csv
- path: ./data/orders.csv
+**Execution options**
-checks:
- - name: id_not_null
- column: id
- rules:
- not_null: true
- unique: true
+| Flag | Default | Description |
+|------|---------|-------------|
+| `--parallel` | off | Enable multi-threaded parallel execution |
+| `--workers` | CPU count | Worker thread count (used with `--parallel`) |
+| `--chunk-size` | 100000 | Rows per chunk for parallel execution |
+| `--progress / --no-progress` | on | Show/hide terminal progress indicator |
- - name: amount_range
- column: amount
- rules:
- not_null: true
- min: 0
- max: 100000
- severity: error
+**Output options**
- - name: order_date
- column: created_at
- rules:
- no_future_timestamps: true
- max_age: '30d'
- date_format_valid: '%Y-%m-%d %H:%M:%S'
+| Flag | Short | Default | Description |
+|------|-------|---------|-------------|
+| `--output` | `-o` | — | Save results to file |
+| `--format` | `-f` | `json` | Output format for `--output`: `json`, `sarif`, `markdown`, `csv` |
+| `--csv-export` | | — | Export failure details to a CSV file |
+| `--suggestions / --no-suggestions` | | on | Show actionable suggestions for failures |
+| `--slack-webhook` | | — | Slack webhook URL for result notifications |
- - name: status_values
- column: status
- rules:
- allowed_values:
- - pending
- - confirmed
- - shipped
- - delivered
- - cancelled
-```
+**Logging options**
----
+| Flag | Default | Description |
+|------|---------|-------------|
+| `--log-level` | `WARNING` | `DEBUG`, `INFO`, `WARNING`, `ERROR`, `CRITICAL` |
+| `--log-format` | `console` | `console` (human-readable) or `json` (structured) |
+| `--log-file` | — | Write logs to file with automatic rotation |
+| `--verbose` / `-v` | off | Shorthand for `--log-level DEBUG` |
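+
+Logging options combine freely; for example, structured debug logs written to a rotating file:
+
+```bash
+datacheck validate --config checks.yaml \
+  --log-level DEBUG --log-format json --log-file debug.log
+```
+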
-## Schema Detection and Evolution
+### `datacheck schema`
-### Commands
+Schema contract commands. See Section 9.
```bash
-datacheck schema capture # Save current schema as baseline
-datacheck schema compare # Compare current data against baseline
-datacheck schema show # Display detected schema
-datacheck schema list # List all saved baselines
-datacheck schema history # View capture history
+datacheck schema capture [DATA_SOURCE] [OPTIONS]
+datacheck schema compare [DATA_SOURCE] [OPTIONS]
+datacheck schema show [OPTIONS]
+datacheck schema list [OPTIONS]
+datacheck schema history [OPTIONS]
```
-### Schema capture
+### `datacheck config`
+
+Configuration management commands.
```bash
-datacheck schema capture data.csv
-datacheck schema capture --source production_db --sources-file sources.yaml
-datacheck schema capture --name v2-baseline
-datacheck schema capture --baseline-dir ./schemas
-datacheck schema capture --no-history
+datacheck config init [OPTIONS] # Generate config from template
+datacheck config validate [CONFIG_PATH] # Validate config syntax
+datacheck config show [CONFIG_PATH] # Show resolved config
+datacheck config env [CONFIG_PATH] # Show environment variables referenced
+datacheck config merge [FILES...] -o out # Merge multiple configs
+datacheck config templates # List available templates
```
-| Flag | Description |
-|------|-------------|
-| `--name / -n` | Baseline name (default: `baseline`) |
-| `--baseline-dir` | Storage directory (default: `.datacheck/schemas/`) |
-| `--save-history / --no-history` | Save to history (default: enabled) |
+**`config init` options**
-### Schema compare
+| Flag | Short | Default | Description |
+|------|-------|---------|-------------|
+| `--output` | `-o` | `datacheck.yaml` | Output config file path |
+| `--template` | `-t` | `basic` | Template: `basic`, `ecommerce`, `healthcare`, `finance`, `saas`, `iot` |
+| `--with-sample-data` | `-s` | off | Generate matching sample CSV file |
+| `--sample-rows` | | 100 | Rows in generated sample CSV |
+| `--force` | `-f` | off | Overwrite existing config file |
+
+### `datacheck version`
```bash
-datacheck schema compare data.csv
-datacheck schema compare --baseline v2-baseline
-datacheck schema compare --fail-on-breaking
-datacheck schema compare --rename-threshold 0.9
-datacheck schema compare --format json
+datacheck version
```
-| Flag | Description |
-|------|-------------|
-| `--baseline / -b` | Baseline name to compare against (default: `baseline`) |
-| `--rename-threshold` | Similarity threshold for rename detection (0.0-1.0, default: 0.8) |
-| `--fail-on-breaking` | Exit with code 1 on breaking changes |
-| `--format / -f` | Output format: `terminal` (default) or `json` |
+### Output formats
-### Schema compare exit codes
+| Format | Used with | Description |
+|--------|-----------|-------------|
+| Terminal | Always | Rich-formatted table output; cannot be suppressed |
+| `json` | `--output file.json` | Full results with all rule outcomes, failure details, and execution stats |
+| `sarif` | `--output file.sarif` | SARIF 2.1.0 for GitHub Security tab and SARIF-aware tools |
+| `markdown` | `--output report.md` | Human-readable report with results table and failure details |
+| `csv` | `--output failures.csv --format csv` or `--csv-export failures.csv` | Failure rows only: check name, column, severity, failed count, sample values |
-| Code | Meaning |
-|------|---------|
-| 0 | Compatible — no breaking changes |
-| 1 | Breaking changes detected (with `--fail-on-breaking`) |
-| 2 | Baseline not found |
-| 3 | Data load error |
-| 4 | Unexpected error |
+Terminal output is always shown regardless of `--output`. The `--format` flag controls only the file output format.
-### What schema tracks
+### Exit codes table
-For each column: name, data type, nullable status, position, unique value count, null percentage. For the dataset: row count, source identifier, capture timestamp.
+| Code | Meaning | Common causes |
+|------|---------|---------------|
+| 0 | Pass | All `error`-severity rules passed |
+| 1 | Validation failure | One or more `error`-severity rules failed |
+| 2 | Configuration error | Invalid YAML, unknown rule type, no data source defined |
+| 3 | Data load error | File not found, connection refused, authentication failure |
+| 4 | Execution error | Exception during rule evaluation; unexpected runtime failure |
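+
+Because the codes are distinct, pipelines can branch on the failure class rather than treating every non-zero exit the same. A sketch:
+
+```bash
+datacheck validate --config checks.yaml
+rc=$?
+case "$rc" in
+  0) echo "data OK" ;;
+  1) echo "validation failure: inspect the failure report" ;;
+  2) echo "configuration error: fix the YAML before retrying" ;;
+  3) echo "data load error: check paths and connectivity" ;;
+  *) echo "unexpected error (exit $rc)" ;;
+esac
+exit "$rc"
+```
+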
-### Change types detected
+---
-| Change | Compatibility Level |
-|--------|-------------------|
-| Column added | COMPATIBLE |
-| Column removed | BREAKING |
-| Column renamed | WARNING |
-| Nullable changed | WARNING |
-| Order changed | COMPATIBLE |
+## 8. CI/CD Integration
-### Type change compatibility
+### Generic CI usage
-**Compatible changes** (widening): `int→float`, `int→string`, `float→string`, `bool→string`, `date→datetime`, `date→string`, `datetime→string`
+```bash
+pip install datacheck-cli
+datacheck validate --config checks.yaml --output results.json
+```
-**Breaking changes** (narrowing): `float→int`, `string→int`, `string→float`, `string→bool`, `datetime→date`, `string→datetime`, `string→date`
+DataCheck exits non-zero when any `error`-severity rule fails. Most CI systems treat non-zero exit codes as build failures automatically.
-### Baseline storage
+### GitHub Actions — basic
-- Baselines are stored as JSON files in `.datacheck/schemas/`
-- History entries are stored in `.datacheck/schemas/history/` with timestamps (e.g. `schema_20260212_143000.json`)
-- Use `datacheck schema list` to see all baselines
-- Use `datacheck schema history --limit 20` to see recent history
+```yaml
+name: data-quality
+on: [push, pull_request]
----
+jobs:
+ validate:
+ runs-on: ubuntu-latest
+ steps:
+ - uses: actions/checkout@v4
+ - uses: actions/setup-python@v5
+ with:
+ python-version: "3.12"
+ - run: pip install datacheck-cli
+ - run: datacheck validate --config .datacheck.yaml
+```
-## CLI Command Reference
+### GitHub Actions — SARIF upload
-### `datacheck validate`
+SARIF output integrates with the GitHub Security tab. Findings appear as code scanning alerts on pull requests.
-Run validation against data files or databases.
+```yaml
+name: data-quality
+on: [push, pull_request]
-**Data source flags:**
+permissions:
+ contents: read
+ security-events: write
-| Flag | Description |
-|------|-------------|
-| `data_source` (positional) | File path or connection string |
-| `--config / -c` | Path to validation config YAML |
-| `--source` | Named source from sources.yaml |
-| `--sources-file` | Path to sources YAML file |
-| `--table / -t` | Database table name |
-| `--where / -w` | SQL WHERE clause for filtering |
-| `--query / -q` | Custom SQL query |
-| `--schema / -s` | Schema/dataset name |
+jobs:
+ validate:
+ runs-on: ubuntu-latest
+ steps:
+ - uses: actions/checkout@v4
+ - uses: actions/setup-python@v5
+ with:
+ python-version: "3.12"
+ - run: pip install datacheck-cli
+ - name: Run validation
+ run: |
+ datacheck validate \
+ --config .datacheck.yaml \
+ --format sarif \
+ --output results.sarif
+ continue-on-error: true # let the upload step run even on failure
+ - uses: github/codeql-action/upload-sarif@v3
+ with:
+ sarif_file: results.sarif
+```
-**Warehouse-specific flags:**
+**SARIF output**: DataCheck generates SARIF 2.1.0. Each failed rule produces a result entry with `ruleId` (rule type), `level` (mapped from severity), and `message` (failure description). The tool identifier is `datacheck`.
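+
+An abridged, illustrative entry from `results.sarif` (field values are examples following the SARIF 2.1.0 shape described above, not actual tool output):
+
+```json
+{
+  "runs": [{
+    "tool": { "driver": { "name": "datacheck" } },
+    "results": [{
+      "ruleId": "not_null",
+      "level": "error",
+      "message": { "text": "Column 'email': 3 null values found" }
+    }]
+  }]
+}
+```
+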
-| Flag | Description |
-|------|-------------|
-| `--warehouse` | Snowflake warehouse name |
-| `--credentials` | Path to credentials file (BigQuery service account) |
-| `--region` | AWS region (Redshift IAM auth) |
-| `--cluster` | Cluster identifier (Redshift IAM auth) |
-| `--iam-auth` | Use IAM authentication (Redshift) |
+### GitLab CI
-**Execution flags:**
+```yaml
+validate_data:
+ image: python:3.12
+ script:
+ - pip install datacheck-cli
+ - datacheck validate --config .datacheck.yaml --output results.json
+ artifacts:
+ paths:
+ - results.json
+ when: always
+```
-| Flag | Description |
-|------|-------------|
-| `--parallel` | Enable multi-core parallel execution |
-| `--workers` | Number of worker processes (default: CPU count) |
-| `--chunk-size` | Rows per chunk for parallel processing (default: 10,000) |
-| `--progress / --no-progress` | Show/hide progress bar |
+### Database sources in CI
-**Output flags:**
+Store credentials as CI secrets and substitute via environment variables:
-| Flag | Description |
-|------|-------------|
-| `--format / -f` | Output format: `sarif`, `json`, `markdown`, `csv` |
-| `--output / -o` | Save results to file (path) |
-| `--csv-export` | Export failure details as CSV |
-| `--suggestions / --no-suggestions` | Show improvement suggestions (default: enabled) |
-| `--slack-webhook` | Slack webhook URL for notifications |
+```yaml
+# .datacheck.yaml
+sources_file: sources.yaml
+source: production_db
+table: orders
-**Logging flags:**
+checks:
+ - name: order_amount
+ column: amount
+ rules:
+ not_null: true
+ positive: true
+```
-| Flag | Description |
-|------|-------------|
-| `--log-level` | Log level: `DEBUG`, `INFO`, `WARNING`, `ERROR`, `CRITICAL` |
-| `--log-format` | Log format: `console` (human-readable) or `json` (machine-parseable) |
-| `--log-file` | Path to log file (with automatic rotation) |
-| `--verbose / -v` | Shortcut for `--log-level DEBUG` |
+```yaml
+# sources.yaml
+sources:
+ production_db:
+ type: postgresql
+ host: ${DB_HOST}
+ port: ${DB_PORT:-5432}
+ database: ${DB_NAME}
+ user: ${DB_USER}
+ password: ${DB_PASSWORD}
+```
-### `datacheck config`
+```yaml
+# GitHub Actions job env block
+env:
+ DB_HOST: ${{ secrets.DB_HOST }}
+ DB_NAME: ${{ secrets.DB_NAME }}
+ DB_USER: ${{ secrets.DB_USER }}
+ DB_PASSWORD: ${{ secrets.DB_PASSWORD }}
+```
-Configuration management commands.
+### Airflow integration
-| Subcommand | Description |
-|------------|-------------|
-| `config init` | Generate config from template |
-| `config init --template ` | Use specific template (`basic`, `ecommerce`, `healthcare`, `finance`, `saas`, `iot`, `rules-reference`, `sources`) |
-| `config init --with-sample-data` | Also generate a sample CSV file |
-| `config init --sample-rows N` | Number of sample rows to generate (default: 100) |
-| `config init --force` | Overwrite existing config file |
-| `config validate ` | Validate config file syntax and rule definitions |
-| `config validate --strict` | Fail on warnings too |
-| `config show ` | Show fully resolved config (env vars + inheritance applied) |
-| `config show --format yaml/json` | Output format |
-| `config show --no-resolve-env` | Skip environment variable resolution |
-| `config show --no-resolve-extends` | Skip config inheritance resolution |
-| `config merge ` | Merge multiple configs (later files override earlier) |
-| `config merge -o output.yaml` | Write merged result to file |
-| `config templates` | List available templates with descriptions |
-| `config env ` | Show environment variables referenced in config |
+The Airflow operators below run validation as DAG tasks; for programmatic use outside Airflow, see Section 10 (Python API).
-### `datacheck schema`
+**BashOperator pattern** (simplest):
-Schema contract enforcement - capture baselines and fail on breaking changes.
+```python
+from airflow.operators.bash import BashOperator
-| Subcommand | Description |
-|------------|-------------|
-| `schema capture` | Save current schema as baseline |
-| `schema compare` | Compare current data against baseline |
-| `schema show` | Display detected schema (columns, types, nullable, stats) |
-| `schema list` | List all saved baseline schemas |
-| `schema history` | View capture history (newest first) |
+validate = BashOperator(
+ task_id="validate_orders",
+ bash_command=(
+ "datacheck validate "
+ "--config /opt/airflow/dags/checks/orders.yaml "
+ "--output /tmp/results_{{ ds }}.json"
+ ),
+ env={
+ "DB_HOST": "{{ var.value.db_host }}",
+ "DB_PASSWORD": "{{ var.value.db_password }}",
+ },
+)
+```
-### `datacheck version`
+**DataCheckOperator**:
-Display version information.
+```python
+from datacheck.airflow.operators import DataCheckOperator
-### Exit codes
+validate_orders = DataCheckOperator(
+ task_id="validate_orders",
+ config_path="/opt/airflow/dags/checks/orders.yaml",
+ source_name="production_db",
+ sources_file="/opt/airflow/dags/sources.yaml",
+ table="orders",
+ where="created_at >= '{{ ds }}'",
+ fail_on_error=True,
+ push_results=True,
+ min_pass_rate=None, # None = disabled; set to e.g. 95.0 to require 95% pass rate
+)
+```
-| Code | Meaning |
-|------|---------|
-| 0 | All rules passed (or only warning/info severity failures) |
-| 1 | Some error-severity rules failed |
-| 2 | Configuration error |
-| 3 | Data loading error |
-| 4 | Unexpected error |
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `config_path` | str | required | Path to validation config YAML |
+| `file_path` | str | None | Path to data file (CSV or Parquet) |
+| `source_name` | str | None | Named source from sources.yaml |
+| `sources_file` | str | None | Path to sources YAML |
+| `table` | str | None | Database table |
+| `where` | str | None | SQL WHERE clause |
+| `query` | str | None | Custom SQL query |
+| `parallel` | bool | False | Enable parallel execution |
+| `workers` | int | None | Worker thread count |
+| `min_pass_rate` | float | None | Minimum rule pass rate 0–100; None disables |
+| `fail_on_error` | bool | True | Fail Airflow task on validation failure |
+| `push_results` | bool | True | Push results dict to XCom |
----
+Template fields (Jinja-rendered): `config_path`, `file_path`, `source_name`, `sources_file`, `table`, `where`, `query`.
-## Output and Reporting
+`DataCheckOperator` raises `AirflowException` when `fail_on_error=True` and the validation exits non-zero.
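+
+With `push_results=True`, a downstream task can read the results from XCom. A sketch using the TaskFlow API (the payload keys are assumptions, not documented here):
+
+```python
+from airflow.decorators import task
+
+@task
+def summarize(ti=None):
+    # DataCheckOperator pushes a results dict to XCom when push_results=True;
+    # the 'passed'/'failed' keys below are illustrative.
+    results = ti.xcom_pull(task_ids="validate_orders")
+    if results:
+        print(f"rules passed: {results.get('passed')}, failed: {results.get('failed')}")
+```
+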
-### Terminal output
+**DataCheckSchemaOperator**:
-DataCheck uses Rich-formatted terminal output with color-coded results:
+```python
+from datacheck.airflow.operators import DataCheckSchemaOperator
-- **Green**: Passed rules
-- **Red**: Failed rules
-- **Yellow**: Errors during rule execution
+check_schema = DataCheckSchemaOperator(
+ task_id="check_schema",
+ file_path="/data/orders_{{ ds }}.parquet",
+ baseline_name="orders-baseline",
+ baseline_dir="/opt/airflow/schemas",
+ fail_on_breaking=True,
+ push_results=True,
+)
+```
-Output includes a statistics table (records, columns, rules, pass/fail counts), detailed failure tables (check name, column, failure count, sample values), and actionable improvement suggestions.
+Auto-captures a new baseline if none exists under `baseline_name`.
-### JSON export
+---
-```bash
-datacheck validate --output results.json
-```
+## 9. Schema Contracts
+
+Schema contracts capture the structural definition of a dataset (column names, types, nullability) as a baseline and fail if the current data deviates in a breaking way.
-Exports full validation results in machine-readable JSON format, including all rule results, failure details, and summary statistics. Use this for automation and CI/CD integration.
+### `schema capture`
-### CSV export
+Captures the current schema and saves it as a named baseline.
```bash
-datacheck validate --csv-export failures.csv
+datacheck schema capture data.parquet
+datacheck schema capture --source production_db --sources-file sources.yaml --table orders
+datacheck schema capture data.csv --name v2-baseline --baseline-dir ./schemas
+datacheck schema capture data.csv --no-history # skip history entry
```
-Exports failure details as CSV with columns: check_name, column, severity, failed_rows, reason, suggestion.
-
-### Slack notifications
-
-Configure the webhook in your config file so you don't need to pass it every time:
+| Flag | Short | Default | Description |
+|------|-------|---------|-------------|
+| `--name` | `-n` | `baseline` | Baseline identifier |
+| `--baseline-dir` | | `.datacheck/schemas` | Storage directory |
+| `--save-history / --no-history` | | on | Append to schema history |
+| `--config` | `-c` | auto | Config file |
+| `--source` | | — | Named source |
+| `--sources-file` | | — | Sources YAML path |
+| `--table` | `-t` | — | Table name |
+| `--query` | `-q` | — | Custom SQL query |
-```yaml
-notifications:
- slack_webhook: "${SLACK_WEBHOOK}"
- mention_on_failure: true # @channel on failures (default: false)
-```
+### `schema compare`
-Or pass it via the CLI (overrides the config value):
+Compares the current data schema against a saved baseline.
```bash
-datacheck validate --slack-webhook https://hooks.slack.com/services/...
+datacheck schema compare data.parquet
+datacheck schema compare data.parquet --baseline v2-baseline
+datacheck schema compare data.csv --fail-on-breaking
+datacheck schema compare data.parquet --rename-threshold 0.9 --format json
```
-Sends validation results to Slack with:
-- Color-coded messages (green for pass, red for fail)
-- Summary statistics and failed rules
-- Optional `@channel` mention on failures (via `mention_on_failure`)
-- Up to 5 failed rule details with row counts
+| Flag | Short | Default | Description |
+|------|-------|---------|-------------|
+| `--baseline` | `-b` | `baseline` | Baseline name to compare against |
+| `--fail-on-breaking` | | off | Exit 1 on breaking changes |
+| `--rename-threshold` | | 0.8 | Similarity threshold for rename detection (0.0–1.0) |
+| `--format` | `-f` | `terminal` | Output format: `terminal` or `json` |
----
+### Baseline storage model
-## Parallel Execution and Performance
+Baselines are stored as JSON files:
-### Enabling parallel mode
+- Active baselines: `.datacheck/schemas/<name>.json`
+- History entries: `.datacheck/schemas/history/schema_<timestamp>.json` (e.g. `schema_20260212_143000.json`)
-```bash
-datacheck validate --parallel
-datacheck validate --parallel --workers 4
-datacheck validate --parallel --chunk-size 50000
-datacheck validate --parallel --progress
-```
+The `.datacheck/schemas/` directory should be committed to version control to track schema evolution alongside code changes.
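+
+A CI schema gate then becomes a two-command pattern: compare on every run, and re-capture deliberately when a change is intended (paths illustrative):
+
+```bash
+# In CI: fail the build on breaking schema drift
+datacheck schema compare ./data/orders.parquet --fail-on-breaking
+
+# Locally, after an intentional schema change: refresh the baseline and commit it
+datacheck schema capture ./data/orders.parquet
+git add .datacheck/schemas/
+```
+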
-| Flag | Description |
-|------|-------------|
-| `--parallel` | Enable multi-core parallel execution |
-| `--workers` | Number of worker processes (default: CPU count) |
-| `--chunk-size` | Rows per chunk (default: 100,000) |
-| `--progress / --no-progress` | Show/hide progress bar |
+### Failure semantics
-### How parallel execution works
+`schema compare` reports each change with a compatibility level:
-1. Splits the DataFrame into chunks based on `--chunk-size`
-2. Processes chunks in parallel using `multiprocessing.Pool`
-3. Aggregates results across chunks (combines pass/fail counts, merges failure details)
-4. Automatically falls back to sequential execution for small datasets
-5. Shows a Rich progress bar with spinner, elapsed time, and remaining time
+| Change type | Compatibility |
+|-------------|--------------|
+| Column added | COMPATIBLE |
+| Column order changed | COMPATIBLE |
+| Column removed | BREAKING |
+| Column renamed (inferred) | WARNING |
+| Type narrowed (e.g., `string → int`) | BREAKING |
+| Type widened (e.g., `int → string`) | COMPATIBLE |
+| Nullable changed (non-null → nullable) | WARNING |
+| Nullable changed (nullable → non-null) | BREAKING |
-### Performance features
+Without `--fail-on-breaking`, `schema compare` exits 0 regardless of changes detected. With `--fail-on-breaking`, any BREAKING-level change causes exit 1.
-- **PyArrow backend**: Vectorized operations for faster validation (e.g. fast null count via Arrow)
-- **Lazy loading**: Cloud connectors are loaded only when needed — no unnecessary dependencies
-- **Memory optimization**: Memory-aware chunk sizing, worker auto-scaling, and large file handling
-- **Caching**: Regex compilation caching (`@lru_cache`) and compute-once patterns for expensive operations
-- **Vectorized rules**: NumPy/Pandas vectorized operations — no Python loops in hot paths
+Exit codes for `schema compare`:
+
+| Code | Condition |
+|------|-----------|
+| 0 | Compatible (no breaking changes), or breaking changes without `--fail-on-breaking` |
+| 1 | Breaking changes detected and `--fail-on-breaking` set |
+| 2 | Baseline not found |
+| 3 | Data load error |
+| 4 | Unexpected error |
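+
+The interaction between compatibility levels and `--fail-on-breaking` can be sketched in plain Python (illustrative only, not DataCheck's implementation; the change-type keys are made up for this sketch):
+
+```python
+# Illustrative sketch of the compatibility table above -- not DataCheck internals.
+LEVELS = {
+    "column_added": "COMPATIBLE",
+    "column_order_changed": "COMPATIBLE",
+    "column_removed": "BREAKING",
+    "column_renamed": "WARNING",
+    "type_narrowed": "BREAKING",
+    "type_widened": "COMPATIBLE",
+    "non_null_to_nullable": "WARNING",
+    "nullable_to_non_null": "BREAKING",
+}
+
+def compare_exit_code(changes, fail_on_breaking=False):
+    # exit 1 only when a BREAKING change meets --fail-on-breaking
+    breaking = any(LEVELS[c] == "BREAKING" for c in changes)
+    return 1 if breaking and fail_on_breaking else 0
+```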
---
-## Logging
+## 10. Python API
-### Log configuration
+### ValidationEngine
-```bash
-datacheck validate --verbose # DEBUG level
-datacheck validate --log-level WARNING # Specific level
-datacheck validate --log-format json # Machine-parseable JSON logs
-datacheck validate --log-file validation.log # Log to file (with rotation)
-datacheck validate --log-level DEBUG --log-format json --log-file debug.log
+```python
+from datacheck import ValidationEngine
```
-| Flag | Description |
-|------|-------------|
-| `--log-level` | `DEBUG`, `INFO`, `WARNING`, `ERROR`, `CRITICAL` |
-| `--log-format` | `console` (human-readable, default) or `json` (machine-parseable) |
-| `--log-file` | Path to log file (automatic rotation) |
-| `--verbose / -v` | Shortcut for `--log-level DEBUG` |
+**Constructor**:
-### Logging features
+```python
+engine = ValidationEngine(
+ config=None, # ValidationConfig object (mutually exclusive with config_path)
+ config_path=None, # str | Path to YAML file; auto-discovered if both are None
+ parallel=False, # bool
+ workers=None, # int | None — defaults to CPU count when parallel=True
+ chunk_size=None, # int | None — defaults to 100000
+ show_progress=True, # bool
+ notifier=None, # optional notifier (e.g. SlackNotifier)
+ sources_file=None, # str | Path — overrides sources_file from config
+)
+```
-- **Structured logging**: Console and JSON formatters for different use cases
-- **Sensitive data masking**: Automatically masks credentials and passwords in log output
-- **Trace IDs**: Unique trace ID per validation run for log correlation across systems
-- **File rotation**: Automatic log file rotation to prevent unbounded growth
+When both `config` and `config_path` are `None`, the constructor searches the current working directory for a config file in the standard discovery order. It raises `ConfigurationError` if none is found.
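+
+A minimal sketch of handling a failed discovery explicitly, using the exception class this section documents:
+
+```python
+from datacheck import ValidationEngine
+from datacheck.exceptions import ConfigurationError
+
+try:
+    engine = ValidationEngine()  # no config/config_path: discovery runs in the cwd
+except ConfigurationError:
+    print("No .datacheck.yaml (or variant) found in the working directory")
+```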
----
+**Methods**:
-## Security
+```python
+# Validate a file (CSV or Parquet)
+summary = engine.validate_file("data.parquet")
+summary = engine.validate_file("data.csv", delimiter="|", encoding="latin-1")
+
+# Validate a pre-loaded DataFrame
+import pandas as pd
+df = pd.read_parquet("data.parquet")
+summary = engine.validate_dataframe(df)
+
+# Validate against a named source
+summary = engine.validate_sources(
+ source_name="production_db", # None = use config default
+ table="orders",
+ where="status = 'pending'",
+ query=None, # use query or table, not both
+)
-### Credential handling
+# validate() is a lower-level method — requires file_path or df explicitly
+summary = engine.validate(file_path="data.parquet")
+summary = engine.validate(df=df)
+```
-- **Environment variables**: Use `${VAR}` and `${VAR:-default}` syntax in config files — never hardcode credentials
-- **Credential files**: Load credentials from external files
-- **Password masking**: Credentials are automatically masked in logs and terminal output
-- **Config env audit**: Use `datacheck config env` to verify all required variables are set
+### ValidationSummary
-### Connection security
+`validate_*` methods return a `ValidationSummary` object.
-- Connection string validation before attempting connections
-- SQL injection prevention: table name validation, WHERE clause scanning, parameterized queries
-- Path traversal prevention with null byte and symlink detection
-- SSL/TLS enforcement for warehouse connections
+**Properties**:
----
+| Property | Type | Description |
+|----------|------|-------------|
+| `total_rules` | int | Total rules evaluated |
+| `passed_rules` | int | Rules that passed |
+| `failed_rules` | int | Rules that failed (any severity) |
+| `failed_errors` | int | Failed rules with `error` severity |
+| `failed_warnings` | int | Failed rules with `warning` severity |
+| `failed_info` | int | Failed rules with `info` severity |
+| `error_rules` | int | Rules that threw an exception |
+| `all_passed` | bool | True if `failed_errors == 0` and `error_rules == 0` |
+| `has_failures` | bool | True if any rules failed |
+| `has_errors` | bool | True if any rules threw exceptions |
+| `results` | list[RuleResult] | All rule results |
+| `total_rows` | int | Rows in the dataset |
+| `total_columns` | int | Columns in the dataset |
-## Airflow Integration
+**Methods**:
-DataCheck provides two Airflow operators for use in DAGs, plus a simpler BashOperator pattern.
+```python
+summary.get_passed_results() # list[RuleResult]
+summary.get_failed_results() # list[RuleResult] — any severity
+summary.get_failed_by_severity("error") # list[RuleResult]
+summary.get_error_results() # list[RuleResult] — execution errors
+summary.to_dict() # dict
+```
-### DataCheckOperator
+### RuleResult
-Run data validation inside Airflow DAGs:
+| Property | Type | Description |
+|----------|------|-------------|
+| `rule_name` | str | Rule identifier (e.g., `order_id_min`) |
+| `check_name` | str | Check name from config |
+| `rule_type` | str | Rule type (e.g., `min`, `not_null`) |
+| `column` | str | Column name |
+| `passed` | bool | Pass/fail |
+| `total_rows` | int | Total rows checked |
+| `failed_rows` | int | Rows that violated the rule |
+| `severity` | str | `error`, `warning`, or `info` |
+| `error` | str \| None | Exception message if rule errored |
+| `has_error` | bool | True if `error` is not None |
+| `failure_details` | FailureDetail \| None | Detailed failure info including sample values |
-```python
-from datacheck.airflow.operators import DataCheckOperator
+### Failure iteration
-validate_orders = DataCheckOperator(
- task_id="validate_orders",
- config_path="/path/to/datacheck.yaml",
- file_path="/data/orders.csv",
- fail_on_error=True,
- push_results=True,
- min_pass_rate=95.0,
-)
+```python
+engine = ValidationEngine(config_path="checks.yaml")
+summary = engine.validate_file("orders.parquet")
+
+if not summary.all_passed:
+ for result in summary.get_failed_results():
+ rate = result.failed_rows / result.total_rows * 100 if result.total_rows else 0
+ print(
+ f"FAIL [{result.severity}] {result.rule_name} "
+ f"on {result.column}: "
+ f"{result.failed_rows}/{result.total_rows} rows ({rate:.1f}%)"
+ )
```
-**Parameters:**
+### Raising exceptions
-| Parameter | Type | Default | Description |
-|-----------|------|---------|-------------|
-| `config_path` | str | required | Path to validation config YAML |
-| `file_path` | str | None | Path to data file (CSV or Parquet) |
-| `sources_file` | str | None | Path to sources YAML (overrides config) |
-| `source_name` | str | None | Named source from sources.yaml |
-| `table` | str | None | Database table name |
-| `where` | str | None | SQL WHERE clause |
-| `query` | str | None | Custom SQL query |
-| `parallel` | bool | False | Enable multi-core validation |
-| `workers` | int | None | Number of worker processes |
-| `min_pass_rate` | float | 0 | Minimum rule pass rate (0-100, 0=disabled) |
-| `fail_on_error` | bool | True | Fail Airflow task on validation failure |
-| `push_results` | bool | True | Push results to XCom |
+`ValidationEngine` does not raise on validation failure by default — it returns a `ValidationSummary`. To raise on failure:
-**Template fields**: `config_path`, `file_path`, `sources_file`, `source_name`, `table`, `where`, `query` (supports `.yaml` and `.yml` extensions)
+```python
+from datacheck import ValidationEngine
+from datacheck.exceptions import DataCheckError, ConfigurationError, DataLoadError
-**XCom output:**
-- `validation_results`: Full results dictionary
-- `passed`: Boolean pass/fail result
-- `pass_rate`: Percentage of rules passed
+engine = ValidationEngine(config_path="checks.yaml")
-**Data source resolution order:**
-1. `file_path` — file-based validation
-2. `source_name` + `sources_file` — named source validation
-3. Config default (`source` or `data_source` from config)
+try:
+ summary = engine.validate_file("data.parquet")
+except ConfigurationError as e:
+ raise SystemExit(2) from e
+except DataLoadError as e:
+ raise SystemExit(3) from e
+except DataCheckError as e:
+ raise SystemExit(4) from e
-### DataCheckSchemaOperator
+if not summary.all_passed:
+ raise SystemExit(1)
+```
-Enforce schema contracts inside Airflow DAGs:
+### Exception hierarchy
-```python
-from datacheck.airflow.operators import DataCheckSchemaOperator
+| Exception | Exit code analog | When raised |
+|-----------|-----------------|-------------|
+| `DataCheckError` | 4 | Base class for all DataCheck exceptions |
+| `ConfigurationError` | 2 | Invalid config, missing required fields |
+| `DataLoadError` | 3 | File not found, connection failure, bad credentials |
+| `ValidationError` | 4 | Unexpected failure during rule execution |
+| `RuleDefinitionError` | 2 | Invalid rule parameters |
+| `ColumnNotFoundError` | 4 | Column referenced in rule does not exist in data |
-check_schema = DataCheckSchemaOperator(
- task_id="check_schema",
- file_path="/data/orders.csv",
- baseline_name="orders-v2",
- fail_on_breaking=True,
- push_results=True,
-)
-```
+---
-**Parameters:**
+## 11. Performance Model
-| Parameter | Type | Default | Description |
-|-----------|------|---------|-------------|
-| `file_path` | str | None | Path to data file |
-| `sources_file` | str | None | Path to sources YAML |
-| `source_name` | str | None | Named source from sources.yaml |
-| `table` | str | None | Database table name |
-| `baseline_name` | str | `"baseline"` | Baseline identifier |
-| `baseline_dir` | str | `".datacheck/schemas"` | Baseline storage directory |
-| `fail_on_breaking` | bool | True | Fail Airflow task on breaking schema changes |
-| `push_results` | bool | True | Push results to XCom |
+### SQL pushdown
-**XCom output:**
-- `schema_results`: Schema comparison results dictionary
-- `schema_compatible`: Boolean compatibility flag
+For database sources (PostgreSQL, Redshift, MySQL, SQL Server, Snowflake, BigQuery), DataCheck compiles all eligible rules into a single aggregate `SELECT` per table. The query returns one row of violation counts. No data rows are transferred to the Python process.
-Auto-captures a new baseline if none exists yet.
+**Mechanism**: The `SqlAggregateBuilder` partitions checks into pushable and non-pushable sets using each dialect's `pushable_rules` property. It then generates one query with a `CASE WHEN … THEN 1 ELSE 0 END` expression per rule, wrapped in `SUM()`. A single database round-trip produces all violation counts.
-### BashOperator pattern
+```sql
+SELECT
+ COUNT(*) AS _total_rows,
+ SUM(CASE WHEN "id" IS NULL THEN 1 ELSE 0 END) AS _c0_not_null,
+ SUM(CASE WHEN "amount" IS NOT NULL AND "amount" < 0 THEN 1 ELSE 0 END) AS _c1_min,
+ SUM(CASE WHEN "status" IS NOT NULL AND "status"::text NOT IN ('a','b')
+ THEN 1 ELSE 0 END) AS _c2_allowed_values,
+ COUNT(*) - COUNT(DISTINCT "id") AS _c3_unique
+FROM "orders"
+WHERE created_at > '2026-01-01'
+```
-For simpler integration, use Airflow's `BashOperator` directly:
+SQL pushdown activates automatically when all conditions hold:
+- Source type is a supported database (not CSV, Parquet, or S3)
+- No `--query` argument (custom queries disable pushdown; use `--where` instead)
+- The dialect supports the rule types in the check
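+
+For example, assuming the `--where` and `--query` flags behave as described above:
+
+```bash
+# Pushdown stays active: the filter is applied inside the generated aggregate query
+datacheck validate --config checks.yaml --where "created_at > '2026-01-01'"
+
+# Pushdown disabled: a custom query routes rows through the Python path
+datacheck validate --config checks.yaml --query "SELECT * FROM orders"
+```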
-```python
-from airflow.operators.bash import BashOperator
+### Rule pushdown support by dialect
-validate = BashOperator(
- task_id="validate_data",
- bash_command="datacheck validate --config /path/to/config.yaml --output /tmp/results.json",
-)
-```
+All six database dialects support the base rule set. `regex` and `max_age` require dialect-specific functions and are supported on a per-dialect basis.
-Exit codes work directly with Airflow task status — exit code 0 means success, any non-zero code fails the task.
+**Base rules (all dialects):**
+`not_null`, `boolean`, `min`, `max`, `range`, `positive`, `non_negative`, `allowed_values`, `unique`, `unique_combination`, `sum_equals`, `min_length`, `max_length`, `no_future_timestamps`, `timestamp_range`, `date_range`
----
+| Rule | PostgreSQL | Redshift | MySQL | SQL Server | Snowflake | BigQuery |
+|------|:---:|:---:|:---:|:---:|:---:|:---:|
+| Base rules (16) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
+| `max_age` | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
+| `regex` | ✓ | ✓ | ✓ | — | ✓ | ✓ |
-## CI/CD Integration
+Rules not in this table (`type`, `date_format_valid`, `date_format`, `foreign_key_exists`) always execute in Python.
-DataCheck uses standard exit codes for automation. Any non-zero exit code fails the pipeline.
+When a check contains any non-pushable rule, the entire check falls back to the Python path. Other checks in the same run that are fully pushable still execute via SQL.
-| Code | Meaning | CI/CD Effect |
-|------|---------|-------------|
-| 0 | All rules passed | Pipeline continues |
-| 1 | Error-severity failures | Pipeline fails (blocks deploy) |
-| 2 | Configuration error | Pipeline fails |
-| 3 | Data loading error | Pipeline fails |
-| 4 | Unexpected error | Pipeline fails |
+### No row extraction
-### GitHub Actions
+The SQL path transfers zero data rows from the database. The only data movement is the single aggregate result row (one integer per rule). For a table with 100M rows and 20 pushable rules, the network payload is roughly 20 integers.
-Use the native DataCheck Action for the simplest setup — results appear in the **GitHub Security tab** via SARIF:
+### Memory characteristics for file sources
-```yaml
-name: Data Quality Check
-on: [push, pull_request]
+For CSV and Parquet sources, DataCheck loads only the columns referenced by active checks (column pruning). If a file has 50 columns but only 10 are referenced, only those 10 columns are loaded into memory. Memory usage scales with: `(referenced columns) × (row count) × (average cell size)`.
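+
+DataCheck applies the pruning internally; the effect is the same as passing an explicit column list to pandas yourself (a sketch of the mechanism, not DataCheck's code):
+
+```python
+import pandas as pd
+
+needed = ["order_id", "amount", "status"]  # columns referenced by active checks
+
+# Only the referenced columns are materialized in memory
+df_csv = pd.read_csv("orders.csv", usecols=needed)
+df_parquet = pd.read_parquet("orders.parquet", columns=needed)
+```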
-permissions:
- contents: read
- security-events: write # Required for SARIF upload
+### Parallel execution
-jobs:
- validate:
- runs-on: ubuntu-latest
- steps:
- - uses: actions/checkout@v4
+`--parallel` splits the loaded DataFrame into chunks (default 100,000 rows) and processes the chunks concurrently in a pool of worker processes. This is beneficial for large in-memory datasets where rule evaluation is CPU-bound. It does not apply to the SQL pushdown path; the database handles parallelism internally.
- - uses: squrtech/datacheck-action@v1
- with:
- config: .datacheck.yaml
+```bash
+datacheck validate --parallel --workers 8 --chunk-size 50000
```
-Or use the CLI directly for full control:
+### Scaling considerations
-```yaml
- - uses: actions/setup-python@v5
- with:
- python-version: "3.12"
- - run: pip install datacheck-cli
- - run: datacheck validate --format sarif --output results.sarif
-```
+| Scenario | Recommended approach |
+|----------|----------------------|
+| Large database table (10M+ rows) | Use SQL pushdown (omit `--query`); add `--where` to filter if needed |
+| Large Parquet file (multi-GB) | Use `--parallel`; ensure only needed columns are referenced |
+| Many small files | Script sequential calls; aggregate exit codes externally |
+| Wide tables (100+ columns) | Reference only needed columns in checks; column pruning applies automatically |
-### GitLab CI
+---
-```yaml
-validate_data:
- image: python:3.12
- script:
- - pip install datacheck-cli
- - datacheck validate --output results.json
- artifacts:
- paths:
- - results.json
- when: always
-```
+## 12. Troubleshooting
-### Jenkins
-
-```groovy
-pipeline {
- agent any
- stages {
- stage('Data Validation') {
- steps {
- sh 'pip install datacheck-cli'
- sh 'datacheck validate --output results.json'
- }
- post {
- always {
- archiveArtifacts artifacts: 'results.json', allowEmptyArchive: true
- }
- }
- }
- }
-}
-```
+### Configuration errors (exit 2)
----
+**`Configuration Error: No configuration provided and no config file found`**
-## Python API
+DataCheck searched for `.datacheck.yaml`, `.datacheck.yml`, `datacheck.yaml`, `datacheck.yml` in the current directory and found none. Either pass `--config path/to/file.yaml` or create a config file in the working directory.
-### ValidationEngine
+**`Configuration Error: Unknown rule type: 'xyz'`**
-```python
-from datacheck import ValidationEngine
+The rule type `xyz` is not in the supported rule set. Check the rule name against Section 5. Rule names are case-sensitive and use underscores (e.g., `not_null`, `max_age`).
-engine = ValidationEngine(config_path=".datacheck.yaml")
-summary = engine.validate()
+**`Configuration Error: 'checks' is a required field`**
-print(f"Records: {summary.total_rows:,} rows, {summary.total_columns} columns")
-print(f"Passed: {summary.passed_rules}/{summary.total_rules}")
+The config file exists but does not contain a `checks` key. A minimal valid config requires at least an empty `checks` list.
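+
+A minimal config that parses successfully looks like this (no checks will run):
+
+```yaml
+checks: []
+```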
-for result in summary.get_failed_results():
- print(f" FAIL: {result.rule_name} on {result.column} ({result.failed_rows} rows)")
-```
+**Config parses but rules don't run**
-**Constructor parameters:**
+Check whether `enabled: false` is set on the check. Also verify the column name matches exactly (case-sensitive).
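+
+For example, a check like the following parses cleanly but never executes (the field layout here is illustrative, not a schema reference):
+
+```yaml
+checks:
+  - name: orders_amount_check   # hypothetical check for illustration
+    enabled: false              # parsed, but skipped at run time
+```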
-| Parameter | Description |
-|-----------|-------------|
-| `config` / `config_path` | Configuration object or path to YAML file |
-| `parallel` | Enable parallel execution (bool) |
-| `workers` | Number of worker processes (int) |
-| `chunk_size` | Rows per chunk for parallel execution (int) |
-| `show_progress` | Show progress bar (bool) |
-| `notifier` | Optional notifier instance (e.g. `SlackNotifier`) |
-| `sources_file` | Path to sources YAML (overrides config) |
+### Data loading errors (exit 3)
-**Methods:**
+**`Data Load Error: File not found: data.csv`**
-| Method | Description |
-|--------|-------------|
-| `validate()` | Validate using config defaults |
-| `validate_file(file_path, **kwargs)` | Validate a file (supports sampling) |
-| `validate_sources(source_name, table, where, query, **kwargs)` | Validate a named source |
-| `validate_dataframe(df)` | Validate a pre-loaded pandas DataFrame |
+Relative file paths resolve against the current working directory, not the config file's location. The one exception is a `data_source.path` set inside the config, which resolves relative to the config file. Use an absolute path or run DataCheck from the expected directory.
-### ValidationSummary
+**`Data Load Error: Connection failed`**
-| Property | Type | Description |
-|----------|------|-------------|
-| `total_rules` | int | Total number of rules executed |
-| `passed_rules` | int | Rules that passed |
-| `failed_rules` | int | Rules that failed |
-| `failed_errors` | int | Failed rules with `error` severity |
-| `failed_warnings` | int | Failed rules with `warning` severity |
-| `failed_info` | int | Failed rules with `info` severity |
-| `error_rules` | int | Rules that encountered execution errors |
-| `all_passed` | bool | Whether all rules passed |
-| `has_errors` | bool | Whether any execution errors occurred |
-| `results` | list | List of `RuleResult` objects |
-| `total_rows` | int | Number of data rows |
-| `total_columns` | int | Number of columns |
-| `timestamp` | str | Execution timestamp |
-| `duration` | float | Execution duration in milliseconds |
-| `trace_id` | str | Unique run identifier for log correlation |
-
-| Method | Returns | Description |
-|--------|---------|-------------|
-| `get_passed_results()` | list | RuleResults that passed |
-| `get_failed_results()` | list | RuleResults that failed |
-| `get_error_results()` | list | RuleResults with execution errors |
-| `to_dict()` | dict | Serialize to dictionary |
+Check that the database host is reachable and the port is correct. Verify that environment variables are set: `echo $DB_HOST`. Use `--verbose` to see the connection attempt details.
-### RuleResult
+**`Data Load Error: Authentication failed`**
-| Property | Type | Description |
-|----------|------|-------------|
-| `rule_name` | str | Rule identifier |
-| `column` | str | Target column |
-| `passed` | bool | Whether the rule passed |
-| `total_rows` | int | Total rows checked |
-| `failed_rows` | int | Rows that failed |
-| `rule_type` | str | Rule type name |
-| `check_name` | str | Check name from config |
-| `severity` | str | `error`, `warning`, or `info` |
-| `failure_details` | FailureDetail | Detailed failure information |
-| `error` | str | Error message if rule errored |
-| `execution_time` | float | Execution time in milliseconds |
+Verify credentials. For PostgreSQL, confirm `DB_USER` and `DB_PASSWORD` are correct and the user has `SELECT` access on the target table. For BigQuery, confirm the service account JSON path is correct and the account has `bigquery.dataViewer` role.
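+
+To rule out DataCheck itself, verify access with the database's own client first (PostgreSQL shown; `DB_NAME` is an assumed variable here):
+
+```bash
+# Independent check: can this user connect and SELECT from the target table?
+psql -h "$DB_HOST" -U "$DB_USER" -d "$DB_NAME" -c 'SELECT 1 FROM orders LIMIT 1;'
+```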
----
+**`Data Load Error: Missing extra: postgresql`**
-## Industry Templates
+The source type requires an optional dependency that is not installed. Install it: `pip install 'datacheck-cli[postgresql]'` (quoted so the shell does not expand the brackets).
-DataCheck ships with 8 config templates:
+### Environment variable issues
-| Template | Use Case |
-|----------|----------|
-| `basic` | Generic starter config for any data |
-| `ecommerce` | Order data, product catalogs, customer records |
-| `healthcare` | Patient data, HIPAA compliance, date formats |
-| `finance` | Transaction data, SOX compliance, sum validations |
-| `saas` | User activity, subscription data, engagement metrics |
-| `iot` | Sensor data, time-series, device telemetry |
-| `rules-reference` | Complete reference of all validation rules with examples |
-| `sources` | Data source connection templates with environment variable support |
+Environment variables that use `${VAR}` syntax resolve to an empty string if unset — they do not raise an error. If a connection fails with unexpected values, confirm the variable is set in the shell where DataCheck runs:
```bash
-datacheck config init --template ecommerce --with-sample-data
-datacheck config init --template healthcare --with-sample-data --sample-rows 500
-datacheck config templates # List all templates with descriptions
+printenv | grep DB_
```
----
+Use `datacheck config env checks.yaml` to list all variables referenced in a config and their current values.
-## Error Handling
+### Debugging with `--verbose`
-### Exception hierarchy
+`--verbose` sets log level to DEBUG and prints detailed information about each step:
-| Exception | When |
-|-----------|------|
-| `DataCheckError` | Base exception for all DataCheck errors |
-| `ConfigurationError` | Invalid config structure, missing required fields |
-| `ValidationError` | Rule execution failures |
-| `DataLoadError` | File not found, encoding issues, connection failures |
-| `RuleDefinitionError` | Invalid rule parameters or missing required arguments |
-| `UnsupportedFormatError` | Unknown file format or missing optional library |
-| `ColumnNotFoundError` | Column not found in DataFrame |
-| `EmptyDatasetError` | No rows in loaded dataset |
+```bash
+datacheck validate --config checks.yaml --verbose
+datacheck validate --config checks.yaml --verbose --log-format json --log-file debug.log
+```
-All exceptions inherit from `DataCheckError`, so you can catch them broadly:
+Debug output includes: config file path, source resolution, columns loaded, rules evaluated per check, SQL query generated (for database sources), and timing per rule.
-```python
-from datacheck.exceptions import DataCheckError, ConfigurationError, DataLoadError
+### SQL pushdown not activating
-try:
- engine = ValidationEngine(config_path="config.yaml")
- summary = engine.validate()
-except ConfigurationError as e:
- print(f"Config error: {e}")
-except DataLoadError as e:
- print(f"Data load error: {e}")
-except DataCheckError as e:
- print(f"DataCheck error: {e}")
-```
+If you expect SQL pushdown but validation is slow, check:
+
+1. The source type is a supported database (not CSV/Parquet/S3)
+2. `--query` is not specified (custom queries disable pushdown; use `--where` instead)
+3. The rules in the failing check are all in the pushable set for that dialect
+
+Use `--verbose` to confirm whether pushdown is active — the generated SQL query is logged at DEBUG level.
+
+### Arrow/type conversion errors
+
+If a rule throws an execution error on an Arrow-backed column or a Parquet decimal column, this indicates a type conversion issue in the rule implementation. Report the issue at [https://github.com/squrtech/datacheck/issues](https://github.com/squrtech/datacheck/issues) with the column dtype, rule type, and error message.
+
+As a workaround, add a `type: numeric` check on the same column ahead of the failing numeric rule; it forces explicit type validation and surfaces the root cause earlier.