Conversation
* replace the task that just reads the entire datasets with what is in btp, which also includes checks on the data
* use tiled instead of tiled-client - validation requires more dependencies
data_validation.py
Outdated
|
|
||
| @task(retries=2, retry_delay_seconds=10) | ||
| def read_all_streams(uid, beamline_acronym): | ||
| @flow(retries=2, retry_delay_seconds=10) |
There was a problem hiding this comment.
Is there a conceptual improvement in defining flows directly (instead of flows comprised of tasks)? Asking for a friend. :)
There was a problem hiding this comment.
the idea is that tasks are smaller components and can be retried - https://docs.prefect.io/v3/concepts/flows#why-both-flows-and-tasks
you could say that we could have made reading the individual streams in the previous read_all_streams() as an individual task, while making the main function as a flow.
no restrictions on flows calling tasks and vice versa! (as well as a flow calling a flow and task calling a task)
There was a problem hiding this comment.
Just a note on a task calling a task, it is possible, but not recommended (see discussion here).
| @task(retries=2, retry_delay_seconds=10) | ||
| def read_all_streams(uid, beamline_acronym): | ||
| @flow(retries=2, retry_delay_seconds=10) | ||
| def data_validation(uid, beamline_acronym="smi"): |
There was a problem hiding this comment.
We may still need to keep the old validation task that uses the /smi/raw catalog while the data is still being written to Mongo (and add the new validation of the /smi/migration catalog). It doesn't cost us much (correctly me if I'm wrong), but for completeness I think it makes sense to keep it.
We can the old function once Mongo is turned off and /migration is renamed /raw.
There was a problem hiding this comment.
so you want to keep the task that reads all data as a check for /smi/raw? we should talk about this more - this change would remove one of the major advantages of this PR, that we are not reading all of the data, which could be the main reason for prefect-worker2 CPU usage issues.
There was a problem hiding this comment.
Ah okay, if improving the performance was the main reason -- maybe just remove it then. I don't know if anyone paid any attention to this reading failing.
Validating /migration (what you've added) is important though; it fixes any inconsistencies in structures to make the data readable.
* continue to check that data in raw catalog is readable
Use data validation from bluesky-tiled-plugin, which does more checks and correction of dataset errors, as well as performing data reading checks that are more efficient than reading out the entire dataset (read first and last images).
The data validation functionality in bluesky-tiled-plugin requires additional dependencies that are included in the full "tiled" installation, which are not in tiled-client.
Also, the validation functionality requires use of the migration catalog, so this must be used instead of raw.
Changes required in other repos: