Description
User story
As a Reporting Analyst (Osama M), I would like a new table in mlwh_reporting for tracking at a sample level, so that we can monitor pipelines that cherry-pick from source plates over time.
Who are the primary contacts for this story
TW
Ya-Lin H
Osama M
Who is the nominated tester for UAT
Osama M
Acceptance criteria
Consider whether these features can be feature flagged to decouple testing and deployment.
To be considered successful the solution must allow:
- A table named seq_ops_tracking_per_sample with the columns given by the query below is created in the mlwh_reporting database.
- Indexes are created for id_sample_lims_composite (unique), id_sample_lims, sanger_sample_id, supplier_name and study_name columns.
- A MySQL Event is scheduled to re-populate the seq_ops_tracking_per_sample table once a day at 1 am.
- The table and event are present in each environment: Training, UAT and Prod.
- Check with DBA team that weekend cron jobs which re-populate your test/development databases from dumps of the production databases still operate correctly.
- Sanity check the results by comparing at least 5 samples on different plates and studies between this report and the sample_tracking_view plate level.
- Sanity check the results by comparing at least 5 samples on different plates and studies with SequenceScape.
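The table and indexes in the acceptance criteria could be sketched as DDL along these lines. This is a sketch only: the column types and index names are assumptions and should be checked against the source columns in mlwarehouse before implementation; the column list mirrors the SELECT list of the draft query below.

```sql
-- Sketch only: types and index names are assumptions, not confirmed
-- against the source schemas.
CREATE TABLE mlwh_reporting.seq_ops_tracking_per_sample (
  id_sample_lims_composite VARCHAR(64) NOT NULL,  -- CONCAT(id_lims, '_', id_sample_lims)
  id_lims                  VARCHAR(16) NOT NULL,
  id_sample_lims           VARCHAR(32) NOT NULL,
  sanger_sample_id         VARCHAR(255),
  supplier_name            VARCHAR(255),
  study_name               VARCHAR(255),
  -- GROUP_CONCAT outputs; TEXT because several values may be joined
  manifest_plate_barcode   TEXT,
  study_id                 TEXT,
  programme                TEXT,
  faculty_sponsor          TEXT,
  data_access_group        TEXT,
  library_type             TEXT,
  bait_names               TEXT,
  sequencing_cost_code     TEXT,
  platform                 TEXT,
  -- per-sample event timestamps
  manifest_created         DATETIME,
  manifest_uploaded        DATETIME,
  labware_received         DATETIME,
  order_made               DATETIME,
  working_dilution         DATETIME,
  library_start            DATETIME,
  library_complete         DATETIME,
  sequencing_run_start     DATETIME,
  sequencing_qc_complete   DATETIME,
  UNIQUE KEY uq_id_sample_lims_composite (id_sample_lims_composite),
  KEY ix_id_sample_lims (id_sample_lims),
  KEY ix_sanger_sample_id (sanger_sample_id),
  KEY ix_supplier_name (supplier_name),
  KEY ix_study_name (study_name)
);
```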
Additional context
RVI & GSU would like to create dashboards monitoring their pipelines in SeqOps.
However, these pipelines work by selecting a subset of samples from a given stock plate over time, as priorities and result data develop. This means the existing sample_tracking report, which operates at a per-stock-plate level, is not useful for measuring things like turnaround time.
As part of this story we would like to try using the MySQL Event Scheduler instead of making a view. This essentially creates a 'materialised' view in MySQL.
I have checked this with the DBAs (FS ticket SR-8179), who think it is okay as long as we observe the following:
- Avoid DELETE and prefer TRUNCATE TABLE (see the Confluence page "MLWH - Guidelines for bulk updates, migrations and performance").
- Be aware that all MySQL events have an owner, or more accurately, a definer. When a database is dumped using the mysqldump tool with the --events option included, the dump file contains one or more CREATE EVENT commands to re-create the events in the dumped database, and the name of the definer is included in these commands. If the dump file is then imported into a different MySQL instance, such as dumping a production database to import into a test/development instance, the definer account must exist in the receiving instance; otherwise the CREATE EVENT command will fail and the entire import will be abandoned, leaving a partially-imported database. We may need to work with you to ensure that the weekend cron jobs which re-populate your test/development databases from dumps of the production databases will not fall foul of this.
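Given those constraints, the daily 1 am refresh could be sketched as an event like the following. This is a sketch only: the event name and the definer account are assumptions, and the event body would wrap the full draft SELECT below. It uses TRUNCATE rather than DELETE per the DBA guidance, and an explicit DEFINER so dump/restore behaviour across instances is predictable.

```sql
-- Sketch only: 'mlwh_admin' is a placeholder definer account, and the
-- event name is an assumption. The body TRUNCATEs (per DBA guidance)
-- then re-inserts from the draft query below.
CREATE DEFINER = 'mlwh_admin'@'%'
EVENT mlwh_reporting.repopulate_seq_ops_tracking_per_sample
ON SCHEDULE EVERY 1 DAY
STARTS (CURRENT_DATE + INTERVAL 1 DAY + INTERVAL 1 HOUR)  -- next 01:00
COMMENT 'Daily rebuild of seq_ops_tracking_per_sample'
DO
BEGIN
  TRUNCATE TABLE mlwh_reporting.seq_ops_tracking_per_sample;
  INSERT INTO mlwh_reporting.seq_ops_tracking_per_sample
    -- the draft WITH ... SELECT below goes here
    SELECT ...;
END;
```

Note that creating a multi-statement event body via the mysql client needs a DELIMITER change, the Event Scheduler must be enabled (event_scheduler=ON), and scheduled events can be checked with SHOW EVENTS or by querying information_schema.events.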
I have attached a draft version of SQL below that is currently being used in Tableau for them to create a draft dashboard from.
Note that the CTEs have a fair amount of overlap with existing view logic in mlwh_reporting, so it may be worth considering whether and what to refactor in the next user story.
WITH
-- create temporary table filtering for the events we are interested in the given period
sample_events AS (
SELECT wh_event_id, event_type, occured_at, subject_uuid_bin
FROM mlwh_events.flat_events_view
WHERE role_type = 'sample'
AND event_type IN ('sample_manifest.updated', 'labware.received', 'library_start', 'library_complete', 'sequencing_start', 'sequencing_complete', 'order_made')
AND occured_at >= DATE_SUB(NOW(), INTERVAL 2 YEAR)
GROUP BY wh_event_id, subject_uuid_bin
),
-- create temporary table filtering for the earliest sample submissions we are interested in the given period
labware_manifest_created_event AS (
SELECT subject_friendly_name AS labware_human_barcode, MIN(occured_at) AS occured_at
FROM mlwh_events.flat_events_view
WHERE role_type = 'labware'
AND event_type = 'sample_manifest.created'
AND occured_at >= DATE_SUB(NOW(), INTERVAL 2 YEAR)
GROUP BY subject_friendly_name
),
-- create temporary table of study ids that have had something happen in the given time interval
studies_of_interest AS (
SELECT BIN_TO_UUID(s.uuid) AS uuid
FROM mlwh_events.role_types rt
JOIN mlwh_events.roles r ON (r.role_type_id=rt.id)
JOIN mlwh_events.subjects s ON (r.subject_id=s.id)
JOIN mlwh_events.events e ON (r.event_id=e.id)
JOIN mlwh_events.event_types et ON (e.event_type_id=et.id)
WHERE rt.key='study'
AND et.key='sample_manifest.updated'
AND e.occured_at >= NOW() - INTERVAL 2 YEAR
GROUP BY s.id
),
-- create temporary table of samples of interest from studies of interest
samples_of_interest AS (
SELECT
sample.id_sample_tmp AS id_sample_tmp,
sample.id_lims AS id_lims,
sample.id_sample_lims AS id_sample_lims,
sample.sanger_sample_id AS sanger_sample_id,
sample.supplier_name AS supplier_name,
UUID_TO_BIN(sample.uuid_sample_lims) AS sample_uuid_bin, -- convert to same uuid format used in events
studies_of_interest.uuid AS uuid_study_lims,
study.id_study_tmp,
study.name AS study_name,
study.programme,
study.faculty_sponsor,
study.id_study_lims,
study.data_access_group,
stock_resource.labware_human_barcode
FROM studies_of_interest
JOIN mlwarehouse.study ON study.uuid_study_lims = studies_of_interest.uuid
JOIN mlwarehouse.stock_resource ON stock_resource.id_study_tmp = study.id_study_tmp
JOIN mlwarehouse.sample ON sample.id_sample_tmp = stock_resource.id_sample_tmp
),
-- create a temporary table getting Working Dilution timestamps per sample
dilution_timestamps AS (
SELECT
samples_of_interest.id_sample_tmp,
-- earliest and latest QC Working Dilution timestamps
MIN(qc_result.recorded_at) qc_early,
MAX(qc_result.recorded_at) qc_late
FROM samples_of_interest
LEFT JOIN mlwarehouse.qc_result ON (
qc_result.id_sample_tmp = samples_of_interest.id_sample_tmp
AND qc_result.assay = 'Working Dilution - Plate Reader v1.0'
)
WHERE
-- allow pipelines where no QC result is measured OR where it is measured recently
qc_result.id_qc_result_tmp IS NULL
OR
qc_result.recorded_at >= NOW() - INTERVAL 2 YEAR
GROUP BY samples_of_interest.id_sample_tmp
),
-- create a temporary table linking samples to qc results and flowcell/run information
-- this will have multiple rows per sequencing attempt
sample_flowcell AS (
SELECT
samples_of_interest.id_sample_tmp,
samples_of_interest.id_lims AS id_lims,
samples_of_interest.id_sample_lims AS id_sample_lims,
samples_of_interest.sanger_sample_id AS sanger_sample_id,
samples_of_interest.supplier_name AS supplier_name,
samples_of_interest.sample_uuid_bin AS sample_uuid_bin, -- already converted to the uuid format used in events
samples_of_interest.id_study_tmp,
samples_of_interest.study_name AS study_name,
samples_of_interest.programme,
samples_of_interest.faculty_sponsor,
samples_of_interest.id_study_lims,
samples_of_interest.data_access_group,
samples_of_interest.labware_human_barcode,
iseq_flowcell.id_iseq_flowcell_tmp,
iseq_product_metrics.id_iseq_product,
iseq_product_metrics.id_run,
iseq_flowcell.pipeline_id_lims,
iseq_flowcell.bait_name,
iseq_flowcell.cost_code AS sequencing_cost_code,
iseq_run_lane_metrics.instrument_model
FROM samples_of_interest
-- left join to flowcell so we don't lose samples that have not yet been sequenced
LEFT JOIN mlwarehouse.iseq_flowcell ON iseq_flowcell.id_sample_tmp = samples_of_interest.id_sample_tmp
LEFT JOIN mlwarehouse.iseq_product_metrics ON iseq_product_metrics.id_iseq_flowcell_tmp = iseq_flowcell.id_iseq_flowcell_tmp
LEFT JOIN mlwarehouse.iseq_run_lane_metrics ON iseq_run_lane_metrics.id_run = iseq_product_metrics.id_run
)
-- create the desired report grouped by sample
-- select the first timestamp where a sample has multiple timestamps - concat any other fields
SELECT
-- the combination of id_lims and id_sample_lims is unique and we will group on this combination
CONCAT(sample_flowcell.id_lims, '_', sample_flowcell.id_sample_lims) AS id_sample_lims_composite,
sample_flowcell.id_lims, -- assumed unique by grouping on id_sample_lims_composite
sample_flowcell.id_sample_lims, -- assumed unique by grouping on id_sample_lims_composite
sample_flowcell.sanger_sample_id, -- assumed unique by grouping on id_sample_lims_composite
sample_flowcell.supplier_name, -- assumed unique by grouping on id_sample_lims_composite
sample_flowcell.study_name, -- assumed unique by grouping on id_sample_lims_composite
GROUP_CONCAT(DISTINCT sample_flowcell.labware_human_barcode SEPARATOR '; ') AS manifest_plate_barcode,
GROUP_CONCAT(DISTINCT sample_flowcell.id_study_lims SEPARATOR '; ') AS study_id,
GROUP_CONCAT(DISTINCT sample_flowcell.programme SEPARATOR '; ') AS programme,
GROUP_CONCAT(DISTINCT sample_flowcell.faculty_sponsor SEPARATOR '; ') AS faculty_sponsor,
GROUP_CONCAT(DISTINCT sample_flowcell.data_access_group SEPARATOR '; ') AS data_access_group,
GROUP_CONCAT(DISTINCT sample_flowcell.pipeline_id_lims SEPARATOR '; ') AS library_type,
GROUP_CONCAT(DISTINCT sample_flowcell.bait_name SEPARATOR '; ') AS bait_names,
GROUP_CONCAT(DISTINCT sample_flowcell.sequencing_cost_code SEPARATOR '; ') AS sequencing_cost_code,
GROUP_CONCAT(DISTINCT sample_flowcell.instrument_model SEPARATOR '; ') AS platform,
MIN(labware_manifest_created_event.occured_at) AS manifest_created,
MIN(IF(sample_events.event_type = 'sample_manifest.updated', sample_events.occured_at, NULL)) AS manifest_uploaded,
MIN(IF(sample_events.event_type = 'labware.received', sample_events.occured_at, NULL)) AS labware_received,
MIN(IF(sample_events.event_type = 'order_made', sample_events.occured_at, NULL)) order_made,
MIN(dilution_timestamps.qc_early) working_dilution,
MIN(IF(sample_events.event_type = 'library_start', sample_events.occured_at, NULL)) library_start,
MIN(IF(sample_events.event_type = 'library_complete', sample_events.occured_at, NULL)) library_complete,
MIN(IF(sample_events.event_type = 'sequencing_start', sample_events.occured_at, NULL)) sequencing_run_start,
MIN(IF(sample_events.event_type = 'sequencing_complete', sample_events.occured_at, NULL)) sequencing_qc_complete
FROM sample_flowcell
LEFT JOIN labware_manifest_created_event ON (labware_manifest_created_event.labware_human_barcode = sample_flowcell.labware_human_barcode)
LEFT JOIN sample_events ON (sample_events.subject_uuid_bin = sample_flowcell.sample_uuid_bin)
LEFT JOIN dilution_timestamps ON (dilution_timestamps.id_sample_tmp=sample_flowcell.id_sample_tmp)
LEFT JOIN mlwh_events.metadata md ON (
sample_events.event_type='sequencing_complete'
AND sample_events.wh_event_id=md.event_id
AND md.key='result'
)
-- speed up query by restricting to a given programme. With filter takes about 3 mins
-- WHERE programme IN ('Pathogen Variation', 'Malaria')
GROUP BY id_sample_lims_composite
ORDER BY sample_flowcell.programme,
sample_flowcell.study_name,
id_sample_lims_composite
;