…olumns

The previous implementation converted a datetime column to a string using the user's format string, then immediately parsed it back with that same format — a tautological round-trip that always passed regardless of whether the format actually matched the data.

Fix: after strftime(format) → to_datetime(format), compare the parsed result against the original timestamp. Formats that discard information (e.g. "%d/%m/%Y" on a column with time-of-day values) produce a different timestamp on the round-trip, correctly signalling a format mismatch.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
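The fixed check boils down to a lossless round-trip test. A minimal sketch in pandas (function and variable names are illustrative, not DataCheck's actual API):

```python
import pandas as pd

def format_matches(series: pd.Series, fmt: str) -> bool:
    """Return True only if `fmt` losslessly represents every timestamp."""
    rendered = series.dt.strftime(fmt)             # datetime -> string
    parsed = pd.to_datetime(rendered, format=fmt)  # string -> datetime
    # The old check stopped here: parsing back always succeeds by construction.
    # The fix: the round-trip must also reproduce the original values.
    return parsed.equals(series)

ts = pd.Series(pd.to_datetime(["2024-01-05 10:30:00", "2024-01-06 00:00:00"]))
print(format_matches(ts, "%Y-%m-%d %H:%M:%S"))  # True: lossless format
print(format_matches(ts, "%d/%m/%Y"))           # False: drops time-of-day
```

A format that drops fields (here, time-of-day) cannot reproduce the original timestamps, so the mismatch is detected instead of silently passing.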
… PyPI updates

- Add airflow-provider/ package with DataCheckOperator for DAG-based validation
- Add github-action/ with action.yml for CI/CD pipeline integration
- Add SARIF exporter for GitHub Code Scanning / security tooling compatibility
- Update README and README_PYPI with feature comparisons and integration guides
- Expand pyproject.toml keywords and classifiers for PyPI discoverability
- Add COMPETITIVE_COMPARISON.md and MARKET_REPORT.md
- Extend validate CLI and reporting module for new output integrations

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- testing/csv/run_all.py — master runner (python run_all.py [suite...])
- testing/csv/helpers.py — TestSuite class, UTF-8 stdout, CLI detection
- testing/csv/test_{users,products,orders}.py — 137 test cases across 9 groups:
A. Validate passing rules
B. Failure detection (exit 1/2/3)
C. Output formats (json, sarif, markdown, csv)
D. Sampling modes (top, count, rate, stratified, time_based, reservoir...)
E. Profiling (terminal, json, markdown, iqr, zscore, suggestions)
F. Schema evolution (capture, list, show, compare, history)
G. Config management (validate, show, generate, templates)
H/I. Extended rule coverage (distribution_type, min/max_length,
date_format, no_future_timestamps, business_days_only, max_age,
foreign_key_exists) — both pass and fail detection
- testing/csv/configs/ — 12 YAML configs (pass/fail/extended per source)
- .gitignore — excludes testing/venv/ and testing/csv/results/
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Rule implementation gaps (factory.py, numeric_rules.py, composite_rules.py):
- Add factory handlers and rule classes for positive, negative, non_negative,
range, and boolean — previously exit=4 (no handler)
- BooleanRule handles both bool dtype and True/False string values
Severity propagation (engine.py, numeric_rules.py):
- Replace replace("_min","").replace("_max","") with removesuffix() so that
check names containing "_max" or "_min" mid-string are not corrupted;
severity: warning checks no longer incorrectly exit=1
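The suffix corruption described above is easy to reproduce; the check name below is illustrative:

```python
# A check named "age_max_years_min" contains "_max" mid-string.
name = "age_max_years_min"

# Old approach: str.replace removes EVERY occurrence, corrupting the name.
corrupted = name.replace("_min", "").replace("_max", "")
print(corrupted)  # "age_years": the "_max" in the middle was stripped too

# Fixed approach: str.removesuffix (Python 3.9+) only strips a trailing match.
fixed = name.removesuffix("_min").removesuffix("_max")
print(fixed)      # "age_max_years": only the real suffix is removed
```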
Config validation (loader.py, schema.py):
- Remove "must have at least one check" guard so enabled:false-only configs
exit=0 instead of exit=2
- Unify rule-type allowlist against schema.py's VALID_RULE_TYPES to stay
in sync automatically; add missing date_range to schema
- Replace unimplemented html output format with sarif in VALID_OUTPUT_FORMATS
Temporal rules (temporal_rules.py):
- TimestampRangeRule and NoFutureTimestampsRule now match tz-awareness of
the column before comparison to avoid tz-naive vs tz-aware TypeError
- DateFormatValidRule handles Arrow date32[day] columns via ISO string path
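The tz-awareness fix for the temporal rules above can be sketched like this (a minimal illustration, not the actual rule code):

```python
import pandas as pd

# A tz-aware column (e.g. loaded from Parquet) vs a tz-naive rule boundary
col = pd.Series(pd.to_datetime(["2024-01-01 12:00", "2024-03-01 09:30"])).dt.tz_localize("UTC")
max_ts = pd.Timestamp("2024-06-01")  # tz-naive

try:
    mask = col > max_ts  # mixing tz-naive and tz-aware raises TypeError
except TypeError:
    # Fix: align the boundary's tz-awareness with the column before comparing.
    if col.dt.tz is not None and max_ts.tz is None:
        max_ts = max_ts.tz_localize(col.dt.tz)
    mask = col > max_ts

print(mask.tolist())  # [False, False]
```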
Profiling / statistics (profiler.py, statistics.py, schema/detector.py):
- Guard df.duplicated(), series.nunique(), and value_counts() against
unhashable Arrow complex types (list, struct, map)
- Cast Arrow decimal128 to float64 before numeric stats to avoid ArrowTypeError
- Fix re.error from duplicate named group %H in inferred date format strings
by tracking has_hour and capping hour-segment detection to one emission
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
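The unhashable-type guard from the profiling fixes above can be sketched generically (the helper name is hypothetical):

```python
import pandas as pd

def safe_nunique(series: pd.Series):
    """Best-effort nunique: unhashable element types (list/struct/map) return None."""
    try:
        return series.nunique()
    except TypeError:  # e.g. list-valued cells cannot be hashed for uniqueness
        return None

print(safe_nunique(pd.Series([1, 2, 2])))      # 2
print(safe_nunique(pd.Series([[1, 2], [3]])))  # None: lists are unhashable
```

The same try/except shape applies to df.duplicated() and value_counts(), which also hash cell values internally.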
Removed the 5 aggregate/statistical rules (mean_between, std_dev_less_than, percentile_range, z_score_outliers, distribution_type), which are anomaly detection tools rather than row-level data quality rules. This simplifies the rule set and avoids user confusion about what validation means.

Also includes prior go-to-market work committed together:
- SQL pushdown engine (datacheck/sql_pushdown/) for PostgreSQL, Redshift, MySQL, SQL Server, Snowflake, BigQuery — zero data transfer validation
- Removed profiling feature (datacheck/profiling/, cli/profile.py, config/generator.py)
- Removed custom rule plugin system (datacheck/plugins/)
- Removed sampling feature (datacheck/sampling/)
- Advanced templates for all 6 domains with sample data generation
- Performance improvements: 11x speedup for temporal rules via PyArrow, vectorized ops in type/bool/length rules, ThreadPoolExecutor parallelism
- Updated all docs, guides, templates, and benchmarks

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ing folder

- Bump version to 2.1.0 across pyproject.toml, __init__.py, sarif_exporter.py, airflow-provider, github-action
- Remove comparison table and stale competitor references from README
- Add boolean rule and fix missing rules (range, positive, non_negative) in all summary tables
- Remove positive/non_negative from high-level summary tables (redundant with min/max)
- Fix Named Sources heading and email_valid stale reference in README
- Add DataCheckSchemaOperator query parameter (code + docs + airflow-provider README)
- Add large-table tip for schema operator using LIMIT in query
- Add guides/config-guide.md comprehensive config file reference
- Update cli-guide.md and guides to remove redundancy, add cross-links
- Remove testing/ folder and internal market/competitive reports

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- New tagline: "Catch data quality issues before they reach production"
- Rewrite Highlights to lead with benefits (bold), not features
- Surface SARIF, GitHub Action, and Airflow in top-level highlights
- Remove comparison table from README_PYPI.md, sync with README.md

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the old stats block + per-failure listing with a single Rich rounded table showing every rule as a row (Result | Check | Column | Details):
- Result cell: passed (green) / failed (red) / warning (yellow) / info (blue) / error (red)
- Details cell: failure rate + sample bad values for failures; error message for execution errors
- One-line footer: 🟢/🟡/🔴 status, check count, row count, per-severity counts, elapsed time
- Warnings-only runs show 🟡 "Passed with warnings" instead of red

Track elapsed time in validate.py (time.monotonic) and pass to reporter.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
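A rough approximation of this layout with Rich (the rows and footer text are invented sample data, not DataCheck's reporter code):

```python
from rich import box
from rich.console import Console
from rich.table import Table

# One row per rule, as in the new reporter: Result | Check | Column | Details
table = Table(box=box.ROUNDED)
for header in ("Result", "Check", "Column", "Details"):
    table.add_column(header)

table.add_row("[green]passed[/green]", "not_null", "order_id", "")
table.add_row("[red]failed[/red]", "min", "amount", "2.3% failed; e.g. -5, -12")
table.add_row("[yellow]warning[/yellow]", "max_length", "note", "0.1% failed")

console = Console()
console.print(table)
# Warnings-only runs get the yellow status instead of red:
console.print("🟡 Passed with warnings · 3 checks · 10,000 rows · 0.42s")
```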
- Show "Validating <source>" line above the table so users know what ran
  - Named source: "production_db → orders"
  - Inline file: "orders.csv (csv)"
  - Warehouse connection: "snowflake → orders"
  - File arg: "orders.parquet" (just filename, not full path)
- Execution errors truncated at 60 chars in the table Details cell with "… (see below)"
- Full error messages printed separately after the footer in a red "Execution Errors" panel

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
JSON (--format json):
- Switch from basic JSONExporter to JsonReporter (metadata, distributions, suggestions)
- Add source, elapsed_seconds to metadata
- Status now "PASSED" / "PASSED_WITH_WARNINGS" / "FAILED"
- Summary adds failed_errors, failed_warnings, failed_info, total_rows, total_columns
- Results add severity field, cleaner status values (PASS/FAIL/WARNING/INFO/ERROR)

Markdown (--format markdown):
- Source line, status icon, run summary with counts and timing at the top
- Full results table: Result | Check | Column | Details | Severity (all rules, not just failures)
- Failure details section with sample values table per failed rule
- Execution errors section with full error messages in code blocks

SARIF (--format sarif):
- Add startTimeUtc derived from elapsed time
- Add automationDetails.description for source info

CSV (--format csv / --csv-export):
- Add severity column to both export_failures and export_summary
- Drop redundant rule_name column (check_name is cleaner)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
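The three-way JSON status can be collapsed from the per-severity failure counts with a small helper (the function name is illustrative):

```python
def overall_status(failed_errors: int, failed_warnings: int) -> str:
    """Collapse per-severity failure counts into the report-level status."""
    if failed_errors:
        return "FAILED"
    if failed_warnings:
        return "PASSED_WITH_WARNINGS"
    return "PASSED"

print(overall_status(0, 0))  # PASSED
print(overall_status(0, 2))  # PASSED_WITH_WARNINGS
print(overall_status(1, 2))  # FAILED
```

This mirrors the terminal behavior: error-severity failures go red, warnings-only runs go yellow instead of red.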
- Remove BusinessDaysOnlyRule (country_code was never implemented)
- Remove GCS and Azure connectors (stubs with no real implementation)
- Remove DuckDB and SQLite loaders
- Remove Delta Lake and Avro loaders
- Remove min_quality_score from Airflow operator (profiling removed)
- Delete empty stub directories: core/, plugins/, profiling/, sampling/
- Fix output_path → output_file in all 7 config templates
- Clean all guides, docs, and templates of stale references
- Update airflow-provider package to match supported sources

Supported file formats: CSV, Parquet only
Supported cloud storage: S3 only
Supported databases: PostgreSQL, MySQL, MSSQL, Snowflake, BigQuery, Redshift

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Rewrite README headline: 'A Linter for Data Pipelines'
- Add enforcement-first description with fail-fast diagram
- Add 'Why not observability?' section to README
- Expand CI/CD section: SARIF upload, Airflow gate, plain shell examples
- Add SQL pushdown callout in database sources section
- Remove 'continuous monitoring' roadmap item (wrong direction)
- Add Python API halt-on-failure pattern
- Rewrite README_PYPI.md with matching positioning
- Create docs/philosophy.md: detection vs enforcement, deterministic vs statistical, SQL pushdown rationale, zero-infra rationale, opinionated design principles
- Replace 'monitoring dashboards' with 'informational checks' in config guide
- Replace 'schema monitoring' with 'schema enforcement' in python-api guide

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- docs/index.md: update title and opening to 'A Linter for Data Pipelines' with enforcement diagram and deterministic/zero-infra framing
- guides/guide-who-uses-datacheck.md: update opening from detection language to enforcement/gate language
- pyproject.toml: update description to 'A linter for data pipelines. Enforce data quality rules in CI/CD, Airflow, and beyond.'
- .github/workflows/data-quality.yml: add ready-to-use GitHub Actions workflow with SARIF upload to Security tab

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add Mental Model section: "Code has linters. Data pipelines need gates."
- "data quality rules" -> "deterministic validation rules" throughout
- Determinism bullet: "No heuristics. No anomaly scoring. No statistical guessing."
- Add "Validate Where Data Lives" section surfacing SQL pushdown as a differentiator
- Add "What DataCheck Is Not" block after observability section
- Quickstart: add echo $? to reinforce gating behavior
- "Detect Schema Changes" -> "Enforce Schema Contracts" + enforcement framing
- Remove stability self-declaration from Roadmap
- Monitoring multiple tables in parallel
"""

from datetime import datetime, timedelta
Check notice (Code scanning / CodeQL): Unused import (Note)

Copilot Autofix:
To fix the problem, remove the unused timedelta symbol from the import statement so that only the actually used datetime is imported. This removes the unnecessary dependency and makes the code cleaner without changing behavior.
Concretely, in airflow-provider/example_dags/example_schema_dag.py, update line 10 from from datetime import datetime, timedelta to from datetime import datetime. No other changes are needed, since timedelta is not referenced anywhere else in the shown code. This keeps the DAG’s behavior identical while resolving the CodeQL finding.
@@ -7,7 +7,7 @@
 - Monitoring multiple tables in parallel
 """

-from datetime import datetime, timedelta
+from datetime import datetime

 from airflow import DAG
 from airflow.operators.python import PythonOperator
series.dtype.pyarrow_dtype
):
    return series.astype("float64")
except Exception:
Check notice (Code scanning / CodeQL): Empty except (Note)

Copilot Autofix:
In general, to fix an empty except you either (a) narrow the exception to the specific expected types and explain why it is safe to ignore them, or (b) handle the exception in a meaningful way, such as logging, then proceed with a safe fallback. Here, _ensure_numeric wants to be resilient: failures in pyarrow/decimal detection should simply mean “don’t convert, just return the original series”. We should keep that behavior but avoid swallowing unexpected errors silently.
The best fix with minimal functional change is:
- Narrow each except to a more specific set where reasonable (e.g. ImportError and AttributeError); but given limited context, and to avoid changing behavior, we'll keep Exception and add a small handling action.
- Log the exception at debug level using the logging module so users can diagnose issues when needed, but default behavior remains unaffected.
- Add an explanatory comment clarifying that on any error we fall back to returning the original series.

Concretely in datacheck/rules/numeric_rules.py:
- Add import logging near the top (without modifying existing imports).
- In the first try block (lines 20–27), replace the empty except Exception: pass with except Exception as exc: and add a comment and a logging call such as logging.getLogger(__name__).debug(...).
- Do the same in the second try block (lines 31–37).

This preserves existing logic: _ensure_numeric still returns the original series when conversion fails, but the exceptions are no longer entirely ignored.
@@ -1,6 +1,7 @@
 """Numeric validation rules."""

 import pandas as pd
+import logging

 from datacheck.exceptions import ColumnNotFoundError, RuleDefinitionError
 from datacheck.results import RuleResult
@@ -24,8 +25,11 @@
             series.dtype.pyarrow_dtype
         ):
             return series.astype("float64")
-    except Exception:
-        pass
+    except Exception as exc:
+        # If pyarrow is unavailable or dtype inspection fails, fall back to the original series.
+        logging.getLogger(__name__).debug(
+            "Failed to coerce Arrow-backed decimal series to float64: %s", exc
+        )
     # Handle object dtype containing Python decimal.Decimal objects
     if series.dtype == object:
         try:
@@ -33,8 +37,11 @@
             first_valid = series.dropna()
             if len(first_valid) > 0 and isinstance(first_valid.iloc[0], decimal.Decimal):
                 return pd.to_numeric(series, errors="coerce")
-        except Exception:
-            pass
+        except Exception as exc:
+            # If decimal import or conversion fails, fall back to the original series.
+            logging.getLogger(__name__).debug(
+                "Failed to coerce Decimal-containing series to numeric: %s", exc
+            )
     return series
first_valid = series.dropna()
if len(first_valid) > 0 and isinstance(first_valid.iloc[0], decimal.Decimal):
    return pd.to_numeric(series, errors="coerce")
except Exception:
Check notice (Code scanning / CodeQL): Empty except (Note)

Copilot Autofix:
General approach: Avoid bare “do-nothing” except Exception blocks. Either (a) narrow the exception type and document why it is safe to ignore, or (b) log/record the error while preserving the current non-failing behavior.
Best fix here without changing existing functionality:
- Keep the behavior that _ensure_numeric never raises from the decimal-detection logic and instead falls back to returning series.
- Add a comment explaining that failures are intentionally ignored because conversion is best-effort.
- Optionally capture the exception as e so future logging can be added; but to keep behavior strictly identical, we will not log or re-raise.
- Apply the same pattern to both except blocks in _ensure_numeric (lines 20–27 and 31–37) for consistency.

Concretely in datacheck/rules/numeric_rules.py:
- Replace except Exception: followed by pass in the Arrow/pyarrow block with a short explanatory comment inside the block (e.g. "best-effort Arrow decimal handling").
- Replace the except Exception: in the decimal.Decimal block the same way.

We do not need new imports or helper methods; the changes are limited to these except blocks and comments.
@@ -25,6 +25,8 @@
         ):
             return series.astype("float64")
     except Exception:
+        # Best-effort Arrow decimal handling: if detection/conversion fails,
+        # fall back to returning the original series unchanged.
         pass
     # Handle object dtype containing Python decimal.Decimal objects
     if series.dtype == object:
@@ -34,6 +36,7 @@
             if len(first_valid) > 0 and isinstance(first_valid.iloc[0], decimal.Decimal):
                 return pd.to_numeric(series, errors="coerce")
         except Exception:
+            # Best-effort Decimal handling: on any failure, return the series as-is.
             pass
     return series
)
except Exception:
    pass
except Exception:
Check notice (Code scanning / CodeQL): Empty except (Note)

Copilot Autofix:
In general, to fix empty except blocks, either (a) narrow the exception type and handle it explicitly, (b) add at least logging or a comment explaining why it is safe to ignore, or (c) re-raise after doing necessary cleanup. Here, the intent is to attempt an optimized PyArrow-based conversion but to fall back silently to pd.to_datetime if PyArrow is unavailable or the cast fails. The best fix without changing functionality is to:
- Add explanatory comments to both except blocks clarifying that the function will fall back to pd.to_datetime.
- Optionally narrow the outer exception to ImportError (the main expected failure), while still keeping behavior identical for callers. However, to avoid changing functionality, we'll keep the broad catch but document it.

Because the only snippet we may edit is _to_datetime_fast in datacheck/rules/temporal_rules.py, we will replace the two except Exception: pass blocks with versions that include clear comments explaining the intentional fallbacks. We will not add imports or logging libraries (to avoid new dependencies and behavior changes). No new methods or definitions are needed; we only modify these lines in this file.
@@ -36,8 +36,13 @@
             name=series.name,
         )
     except Exception:
+        # If any error occurs in the fast Arrow-based path, fall back to
+        # pandas' to_datetime below to preserve correctness.
         pass
 except Exception:
+    # If pyarrow is not available or Arrow dtype handling fails, silently fall
+    # back to pandas' to_datetime below. This keeps behavior identical while
+    # only sacrificing the fast path.
     pass
 return pd.to_datetime(series, errors="coerce", format="mixed")
- CLI help text: "Lightweight data quality validation tool" -> "A linter for data pipelines"
- CLI schema command: "Schema evolution detection" -> "Enforce schema contracts"
- datacheck/__init__.py: update module docstring
- pyproject.toml: "data quality rules" -> "deterministic validation rules"; remove data-quality/data-observability keywords, add data-linter/schema-contracts
- airflow/operators.py + __init__.py: "data quality checks" -> "validation rules"; "Detect schema" -> "Enforce schema contracts"
- airflow-provider: pyproject.toml description/keywords cleaned; provider __init__ docstring; example DAG docstring
- airflow-provider/README.md: "Detects schema changes" -> "Enforces schema contracts"
- github-action/README.md: "Validate data quality" -> "Enforce deterministic validation rules"
- guides/cli-guide.md: schema section heading + command table + code comment
- guides/python-api.md: schema operator description + Airflow example comment
- guides/guide-who-uses-datacheck.md: "schema evolution detection" -> enforcement framing; pipeline diagram comments
- docs/index.md: "detect schema changes" -> "enforce schema contracts" (3 occurrences)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
LoaderFactory.create_loader extracted 'columns' explicitly but also left it in file_kwargs, causing CSVLoader to receive it twice. Added 'columns' to the exclusion list in file_kwargs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
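The duplicate-keyword failure mode is generic Python; a sketch with illustrative names (not DataCheck's actual loader signatures):

```python
def load_csv(path, columns=None, **file_kwargs):
    # `columns` is consumed explicitly; anything left in file_kwargs is
    # forwarded to the underlying reader.
    return {"path": path, "columns": columns, **file_kwargs}

options = {"columns": ["id", "amount"], "sep": ";"}

# Old behavior: `columns` extracted but also left in the forwarded kwargs,
# so the same keyword arrives twice and Python raises TypeError.
try:
    load_csv("orders.csv", options["columns"], **options)
except TypeError as e:
    print(e)  # ... got multiple values for argument 'columns'

# Fix: exclude already-consumed keys before forwarding.
file_kwargs = {k: v for k, v in options.items() if k not in {"columns"}}
print(load_csv("orders.csv", options["columns"], **file_kwargs))
```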
- docs/index.md: remove Dagster and Prefect (no integrations exist)
- pyproject.toml: remove dagster/prefect keywords
- github-action/README.md: remove gcs/azure from extras list; CSV/Parquet only for data-source input
- SECURITY.md: remove GCS and Azure from optional dependencies

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
validate:
- Add all 20+ options in grouped tables (data source / output / execution / logging)
- Add positional [DATA_SOURCE] argument and direct file example
- Add echo $? to reinforce gating behavior

schema compare:
- Fix incorrect comment: compare does NOT fail by default; it only fails with --fail-on-breaking
- Add --fail-on-breaking to examples
- Add full schema compare options table

README_PYPI.md: add direct file and echo $? examples to validate quickstart

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Fix all GitHub Actions using non-existent @v6 versions across ci.yml, security.yml, auto-release.yml, release.yml, pr-version-check.yml (checkout@v4, setup-python@v5, upload-artifact@v4)
- Remove data-quality.yml from this repo's CI: it is a user template, not a workflow for the DataCheck repo itself (no .datacheck.yaml here)
- Fix validate command one-line description to enforcement language
- Fix 30 ruff linting errors: unused imports, dead variable, loop variable, Optional[X] -> X | None modernisation, quoted type annotations

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- builder.py: use separate variable name for int(params) in min/max_length to avoid type conflict with str-typed v used elsewhere in the function
- sample_data.py: add type annotation to nested seg() helper and data list
- engine.py: cast to_dict() result to dict[str, Any] for parse_results()
- loader.py: add type: ignore[call-overload] on pd.read_csv calls where **kwargs spread prevents pandas-stubs overload resolution
- poetry.lock: regenerated after types-PyYAML and pandas-stubs were installed

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>