add name to build source data export #2174
Conversation
Codecov Report ❌
Additional details and impacted files are in the full report in Codecov by Sentry.
Force-pushed from 5845c61 to bc4f0fd
Force-pushed from bc4f0fd to 3bb1884
assert planned.is_resolved, "Dataset is not resolved"

def test_input_datasets(self):
    add_required_version_var_to_env()
nit: if you have a sec, this should be a fixture that adds the env var, yields, then pops the env var
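Something like this, for instance (a sketch; the variable name and value are placeholders for whatever add_required_version_var_to_env actually sets):

```python
import os

import pytest


@pytest.fixture
def required_version_env():
    # Placeholder variable name/value: substitute whatever
    # add_required_version_var_to_env actually sets.
    os.environ["RECIPE_REQUIRED_VERSION"] = "latest"
    yield
    os.environ.pop("RECIPE_REQUIRED_VERSION", None)


def test_input_datasets(required_version_env):
    ...  # runs with the var set; the fixture pops it afterwards
```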
alexrichey left a comment
Good stuff! A few questions, but LGTM. You might try planning/loading a few recipes, in particular maybe the factfinder recipe_qa.py.
Force-pushed from 1222f07 to 50fd00f
.github/workflows/cbbr_build.yml (Outdated)
  - name: Dataloading
-   run: python -m dcpy.lifecycle.builds.load load
+   run: python3 -m dcpy lifecycle builds load load load
too many loads here?
On that note, can you also run a nightly QA on this branch, just as a sanity check?
Nightly QA run successful (minus CDBG for unrelated reasons, and PLUTO taking forever).
products/ceqr/models/_sources.yml (Outdated)
- name: dataset_id
- name: dataset_name
- name: version
🙏
Actually, I must've done this when I was still hoping to change the csv's column names, but I kept the old ones.
Would prefer to keep the old ones for now (of course adding a new one) and change them later.
For posterity, before the rebase blasts them away: I was gonna make them dataset_id, dataset_name, version, file_type.
Fair, but also sad.
This function is only used in PFF; every other relevant product uses dcpy.lifecycle for this.
Force-pushed from c5dda68 to 53c0df9
Force-pushed from 53c0df9 to 032540a
Related to #987

Most of the commits are cleanup of things I bumped into; the sections below cover the more significant changes.

- all builds on this branch
- successful nightly QA build
source data versions csv
- a human-friendly "name" or "display name" of our input datasets is not something we've had to worry about before, and it seems like data from `edm-recipes` is what we should focus on first
- added `name` to the `InputDataset` model and added `get_name()` to the ingest_datastore `Connector` (rough sketch below)
- one of the new relevant tests is in the integration test file `test_plan_load`, because `plan.write_source_data_versions` is only used in `load.load_source_data_from_resolved_recipe`. would like to punt on optimizing where these things and their tests live
- to avoid issues reading old csvs, I held off on improving the column names, even though the frequently used `publishing.get_source_data_versions()` renames the columns

Template DB's source data versions from this branch:
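For the `InputDataset`/`Connector` change mentioned above, roughly the shape of it (a sketch, not the actual diff; the base class and the surrounding fields are assumptions):

```python
from pydantic import BaseModel


class InputDataset(BaseModel):
    # Existing fields are assumed for illustration; `name` is the addition.
    id: str
    version: str | None = None
    name: str | None = None  # human-friendly display name; None for old library templates


class Connector:
    """Ingest datastore connector, sketched with the new name lookup."""

    def get_name(self, dataset_id: str, version: str) -> str | None:
        # Assumed behavior: read the dataset's metadata from edm-recipes and
        # return its declared name, or None if the template predates the field.
        ...
```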
Seems like some datasets are old (library?) and don't have a "name" declared. Library templates only had an id and a body of text (`description`) that sometimes had a name in the first line. That's not something worth parsing IMO; per @fvankrieken's suggestion they'll be NULL.

FacDB source data versions from this branch:
FacDB cli

before
The cli arg wasn't being used: tried run_pipelines for a non-existent dataset named `a_name` and it ran all pipelines declared in `datasets.yml`.

after
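Roughly the post-fix behavior described above, as a sketch (the function and the pipeline objects here are illustrative, not the repo's actual API):

```python
# Illustrative sketch only: names here are stand-ins, not FacDB's actual code.
def run_pipelines(dataset_name: str | None, datasets: dict) -> None:
    """Run the pipeline for one declared dataset, or all of them if no name is given."""
    if dataset_name is None:
        selected = datasets  # no argument: run everything, as before
    elif dataset_name in datasets:
        selected = {dataset_name: datasets[dataset_name]}
    else:
        # a non-existent name like `a_name` should now fail loudly instead of
        # silently falling through to "run all pipelines"
        raise ValueError(f"No dataset named {dataset_name!r} declared in datasets.yml")
    for name, pipeline in selected.items():
        pipeline.run()  # assumed interface, for illustration
```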