sans

Small, deterministic compiler and executor for a strict SAS‑like batch subset.

sans compiles SAS‑like scripts into a machine-readable IR (plan.ir.json), executes them against tabular data, and emits a detailed execution report (report.json). It is built for auditability, reproducibility, and strict safety.

  • Strict by Default: Unsupported constructs cause the entire script to be refused, with stable error codes.
  • Deterministic: Bit‑identical outputs (CSV/XPT) across Windows and Linux.
  • Audit‑Ready: Every run generates a signed manifest (SHA‑256) of all inputs and outputs.
  • Portable: No SAS installation required; the runtime has no dependencies beyond pydantic (used for schema validation).

Installation

pip install -e .

This installs the sans CLI command. You can also use python -m sans.
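
Both entry points should accept the same subcommands; for example, the check command from the Quickstart below can be invoked either way:

sans check example.sans --out out
python -m sans check example.sans --out out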


Quickstart

1. Create a pipeline

sans supports a modern .sans DSL or a strict SAS‑like subset.

# example.sans
# sans 0.1

datasource in = inline_csv do
  a,b
  6,7
  3,2
end

table t = from(in) do
  derive(base2 = a * 2)
  filter(base2 > 10)
  select a, base2
end

save t to "out.csv"
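
With the inline data above, only the first row passes the filter (6 * 2 = 12 > 10, while 3 * 2 = 6 does not), so, assuming a header row and default CSV formatting, out.csv should contain roughly:

a,base2
6,12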

2. Compile and Check

Verify a script without executing it. Emits the execution plan (plan.ir.json) and a refusal/ok report.

sans check example.sans --out out

3. Emit Canonical sans.ir

Generate deterministic canonical sans.ir directly from a script using the same compile/check path as sans check.

sans emit-ir example.sans --out out/example.sans.ir
sans ir-validate --strict out/example.sans.ir

Use --cwd to set the compilation working directory (for relative paths in script context):

sans emit-ir script.expanded.sans --out out/script.sans.ir --cwd fixtures/inputs
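
Because emission is deterministic, emitting the same script twice should produce byte-identical IR, so a plain diff is a meaningful check:

sans emit-ir example.sans --out out/a.sans.ir
sans emit-ir example.sans --out out/b.sans.ir
diff out/a.sans.ir out/b.sans.ir   # expect no output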

4. Execute

Compile, validate, and run. Emits output tables (CSV/XPT) and the final signed manifest.

sans run example.sans --out out

5. Verify

Verify that a previously generated report matches the current state of files on disk.

sans verify out/report.json
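
A sketch of the intended workflow (exact file names depend on your pipeline):

sans run example.sans --out out
sans verify out/report.json    # report matches the files on disk
# ...modify or delete any file recorded in the report...
sans verify out/report.json    # verification should now report a mismatch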

6. Format

Canonicalize .sans formatting (presentation only).

sans fmt example.sans
sans fmt example.sans --check
sans fmt example.sans --in-place

FMT Usage

sans fmt is a pure formatter: it changes presentation only and guarantees parse‑equivalence and idempotence.

Modes

  1. canonical (default): applies the canonical v0 style.
  2. identity: preserves bytes (except newline normalization to \n).

Flags

  1. --check: exit non‑zero if formatting would change the file.
  2. --in-place: rewrite the file atomically (writes a temp file, then replaces).

Examples:

sans fmt script.sans
sans fmt script.sans --mode identity
sans fmt script.sans --check
sans fmt script.sans --in-place

Native .sans DSL

The native DSL provides a clean, linear syntax for data pipelines. It is safer than SAS, with strict rules for column creation and overwrites.

  • Additive by default: Use derive(col = expr) to create new columns only (error if column exists).
  • Explicit overwrites: Use update!(col = expr) to modify existing columns only (error if missing).
  • Explicit output: Outputs are defined only via save; there is no implicit "last table wins."
  • Explicit cast: Use cast(col -> type [on_error=null] [trim=true], ...) for deterministic type conversion; target types: str, int, decimal, bool, date, datetime. Evidence (cast_failures, nulled) is emitted in runtime.evidence.json.
  • Stable ties: Sorting is stable; nodupkey preserves the first encountered row.

expanded.sans is the canonical human-readable form (fully explicit, no blocks, kernel vocabulary only); scripts are sugar that lower to the same IR. Compiling expanded.sans must reproduce the same plan.ir.json (byte-identical aside from quarantined metadata).

# process.sans
# sans 0.1
datasource raw = csv("raw.csv")

table enriched = from(raw) do
  derive(base_val = a + 1)
  filter(base_val > 0)
  update!(base_val = base_val * 10)
  derive(risk = if(base_val > 100, "HIGH", "LOW"))
  cast(base_val -> str)
  select(subjid, base_val, risk)
end

save enriched to "enriched.csv"

Supported SAS Subset

  • DATA Step: set, merge (with in=), by (first./last.), retain, if/then/else, keep/drop/rename.
  • Dataset Options: (keep= drop= rename= where=).
  • Procs:
    • proc sort (nodupkey)
    • proc transpose (by, id, var)
    • proc sql (Inner/Left joins, where, group by, aggregates)
    • proc format (Value mappings + put() lookups)
    • proc summary (Class means with autoname)
  • Macro‑lite: %let, %include, &var, single‑line %if/%then/%else.
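
A minimal sketch of a script in this subset (dataset and column names are illustrative; constructs outside the list above are refused):

%let cutoff = 3;

data flagged;
  set raw;
  if a > &cutoff then flag = 1;
  else flag = 0;
  keep a flag;
run;

proc sort data=flagged nodupkey;
  by a;
run;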

Determinism & Runtime Semantics

sans guarantees stability through strict runtime rules:

  • Missing Values: Nulls sort before all non-null data and compare as less than any non-null value (null < value holds).
  • Numeric Precision: Uses Decimal to prevent float precision loss.
  • I/O Normalization: Enforces LF (\n) and deterministic CSV quoting.
  • Stable Hashes: Artifact hashes are invariant across OS platforms.
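
As an illustration of the missing-value rule (a sketch, assuming filter comparisons follow the same semantics):

# Given rows with b = 5, b = null, b = 20:
filter(b < 10)
# keeps b = 5 and the null row (null < 10 holds); drops b = 20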

See DETERMINISM.md for the sacred v1 invariants.


Schema lock

Typed CSV ingestion without hand-typing every column: a run must have either typed pinning in the datasource (e.g. columns(a:int, b:decimal)) or a schema lock file.

  • Generate a lock
    Recommended: use the dedicated subcommand (no --out required; lock written next to the script by default):

    sans schema-lock script.sans

    This writes <script_dir>/<script_stem>.schema.lock.json (e.g. demo_high.schema.lock.json next to demo_high.sans). Use -o PATH or --write PATH to override; relative paths are resolved against the script directory. Optionally add --out DIR to also write report.json and stage inputs under DIR/inputs. No pipeline execution.

    Alternatively, after a successful run or from untyped CSVs via run:

    sans run script.sans --out out --emit-schema-lock schema.lock.json

    The lock is written under --out when the path is relative. With untyped datasources the tool runs in lock-only mode; otherwise it runs normally and emits the lock after success. Stdout shows (lock-only) or (after run).

  • Run with a lock
    Omit column types in the script and pass the lock:

    sans run script.sans --out out --schema-lock schema.lock.json

    The lock is copied into out/schema.lock.json so the bundle is self-contained. Extra columns in input are ignored; missing columns or type mismatches fail with clear codes. The report includes schema_lock_sha256, schema_lock_used_path, and schema_lock_copied_path; when a lock was emitted, also schema_lock_mode and schema_lock_path.

  • Lock-only via run
    To generate a lock under --out without running (e.g. for CI bundles): sans run script.sans --out out --emit-schema-lock schema.lock.json --lock-only. Prefer sans schema-lock script.sans when you only need the lock file.

See SCHEMA_LOCK_V0.md for the full contract, path resolution, report fields, inference rules, and error codes.


Deep References
