Update data validation btp by JunAishima · Pull Request #21 · NSLS2/smi-workflows

JunAishima · 2026-02-12T22:20:05Z

Use the data validation from bluesky-tiled-plugin, which performs more checks and also corrects dataset errors. Its data reading checks are more efficient than reading out the entire dataset: only the first and last images are read.
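To illustrate why reading only the first and last images is cheaper, here is a minimal sketch in plain Python. It is not the bluesky-tiled-plugin implementation: `CountingStack`, `check_full_read`, and `check_endpoints` are hypothetical names, and the image stack is a list standing in for lazily loaded detector data.

```python
class CountingStack:
    """Hypothetical stand-in for a lazily loaded image stack.

    Counts how many frames are actually read so the cost of each
    check is visible. Not part of bluesky-tiled-plugin or Tiled.
    """

    def __init__(self, frames):
        self._frames = frames
        self.frames_read = 0

    def __len__(self):
        return len(self._frames)

    def __getitem__(self, index):
        # Every frame access is counted as one read.
        self.frames_read += 1
        return self._frames[index]


def check_full_read(stack):
    """Previous approach: read every frame to show the data is readable."""
    for i in range(len(stack)):
        stack[i]
    return True


def check_endpoints(stack):
    """Cheaper approach: read only the first and last frames."""
    if len(stack) == 0:
        return True
    stack[0]
    if len(stack) > 1:
        stack[len(stack) - 1]
    return True


# 100 tiny placeholder "images".
frames = [[0] * 16 for _ in range(100)]

full = CountingStack(frames)
check_full_read(full)

ends = CountingStack(frames)
check_endpoints(ends)

print(full.frames_read, ends.frames_read)  # 100 2
```

The endpoint check still fails if the first or last frame cannot be read, but it will not notice a problem in a frame in the middle of the stack; that is the trade-off for the lower cost.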

The data validation functionality in bluesky-tiled-plugin requires additional dependencies that are included in the full "tiled" installation, which are not in tiled-client.

Also, the validation functionality requires use of the migration catalog, so this must be used instead of raw.

Changes required in other repos:

Correcting errors requires the ability to write back into Tiled, which in turn requires the write:metadata and register scopes in a new Tiled API key (ansible).

* replace the task that just reads the entire datasets with what is in btp, which also includes checks on the data

* use tiled instead of tiled-client - validation requires more dependencies

genematx · 2026-02-13T14:44:09Z

data_validation.py


-@task(retries=2, retry_delay_seconds=10)
-def read_all_streams(uid, beamline_acronym):
+@flow(retries=2, retry_delay_seconds=10)


Is there a conceptual improvement in defining flows directly (instead of flows composed of tasks)? Asking for a friend. :)

The idea is that tasks are smaller components and can be retried individually - https://docs.prefect.io/v3/concepts/flows#why-both-flows-and-tasks
You could argue that reading each individual stream in the previous read_all_streams() should have been its own task, with the main function as a flow.
There are no restrictions on flows calling tasks and vice versa! (The same goes for a flow calling a flow and a task calling a task.)

Just a note on a task calling a task, it is possible, but not recommended (see discussion here).
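The point about tasks being smaller retryable units can be sketched without Prefect. In the plain-Python example below, `with_retries` is a hypothetical stand-in for the `retries` and `retry_delay_seconds` options seen in the diff, and the stream names and the simulated failure are illustrative only. Because the retry wraps each stream read rather than the whole function, a transient failure in one stream does not cause the other stream to be read again.

```python
import functools
import time


def with_retries(retries=2, retry_delay_seconds=0.0):
    """Plain-Python stand-in for retry options; not the Prefect API."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(retries + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    # Give up after the last allowed attempt.
                    if attempt == retries:
                        raise
                    time.sleep(retry_delay_seconds)

        return wrapper

    return decorator


calls = []            # records every read attempt, in order
_failed_once = set()  # streams that have already failed one time


@with_retries(retries=2)
def read_stream(name):
    """Small retryable unit: read one stream (simulated)."""
    calls.append(name)
    if name == "primary" and name not in _failed_once:
        _failed_once.add(name)
        raise IOError("transient read error")
    return f"{name}: ok"


def read_all_streams(stream_names):
    """Outer function: calls the small unit once per stream."""
    return [read_stream(name) for name in stream_names]


results = read_all_streams(["baseline", "primary"])
print(results)  # ['baseline: ok', 'primary: ok']
print(calls)    # ['baseline', 'primary', 'primary']
```

If the retry were placed only on `read_all_streams`, the same transient error would cause `baseline` to be read a second time as well.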

genematx · 2026-02-13T14:46:37Z

data_validation.py

-@task(retries=2, retry_delay_seconds=10)
-def read_all_streams(uid, beamline_acronym):
+@flow(retries=2, retry_delay_seconds=10)
+def data_validation(uid, beamline_acronym="smi"):


We may still need to keep the old validation task that uses the /smi/raw catalog while the data is still being written to Mongo (and add the new validation of the /smi/migration catalog). It doesn't cost us much (correct me if I'm wrong), but for completeness I think it makes sense to keep it.
We can remove the old function once Mongo is turned off and /migration is renamed /raw.

So you want to keep the task that reads all data as a check for /smi/raw? We should talk about this more. Keeping it would remove one of the major advantages of this PR, namely that we no longer read all of the data; reading everything could be the main reason for the prefect-worker2 CPU usage issues.

Ah okay, if improving the performance was the main reason -- maybe just remove it then. I don't know whether anyone paid any attention when this read failed.
Validating /migration (what you've added) is important though; it fixes any inconsistencies in structures to make the data readable.

* continue to check that data in raw catalog is readable

JunAishima added 3 commits February 12, 2026 14:09

replace data validation with bluesky-tiled-plugin validation

c66575e

* replace the task that just reads the entire datasets with what is in btp, which also includes checks on the data

new validation requires using the migration catalog

c84a030

update dependencies

1f5c8b3

* use tiled instead of tiled-client - validation requires more dependencies

JunAishima requested review from AbbyGi and genematx February 12, 2026 22:20

genematx approved these changes Feb 13, 2026


JunAishima added 3 commits February 13, 2026 10:42

restore reading data streams

242db8d

* continue to check that data in raw catalog is readable

turn tasks into concurrently running tasks

a2eef40

run tasks concurrently

e8e9450
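As a rough illustration of running the two checks concurrently, the sketch below uses the standard library's `ThreadPoolExecutor`. It does not show the code in these commits: `read_raw_streams` and `validate_migration` are hypothetical placeholders, and the `data_validation` signature is simplified. In Prefect, concurrent execution would typically go through a task's `.submit()` method instead of a thread pool used directly.

```python
from concurrent.futures import ThreadPoolExecutor


def read_raw_streams(uid):
    """Hypothetical placeholder for the readability check on the raw catalog."""
    return f"raw readable: {uid}"


def validate_migration(uid):
    """Hypothetical placeholder for the validation of the migration catalog."""
    return f"migration validated: {uid}"


def data_validation(uid):
    """Run both checks concurrently and wait for both to finish."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        raw_future = pool.submit(read_raw_streams, uid)
        migration_future = pool.submit(validate_migration, uid)
        # result() blocks until the check is done and re-raises
        # any exception that occurred in the worker thread.
        return raw_future.result(), migration_future.result()


print(data_validation("example-uid"))
```

The two checks are independent, so neither has to wait for the other to start; a failure in either one still surfaces when its result is collected.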

JunAishima requested review from danielballan and removed request for AbbyGi February 18, 2026 20:17

JunAishima wants to merge 6 commits into NSLS2:main from JunAishima:update-data-validation-btp

