Description
User story
As a Reporting Analyst (Osama M), I would like a new table in mlwh_reporting for tracking at a sample level, so that we can monitor pipelines that cherry-pick from source plates over time.
Who are the primary contacts for this story
TW
Ya-Lin H
Osama M
Who is the nominated tester for UAT
Osama M
Acceptance criteria
Consider whether these features can be feature flagged to decouple testing and deployment.
To be considered successful the solution must allow:
- A table named seq_ops_tracking_per_sample with the columns given by the query below is created in the mlwh_reporting database.
- Indexes are created for id_sample_lims_composite (unique), id_sample_lims, sanger_sample_id, supplier_name and study_name columns.
- A MySQL Event is scheduled to re-populate the seq_ops_tracking_per_sample table once a day at 1 am.
- The table and event are present in each environment: Training, UAT and Prod.
- Check with DBA team that weekend cron jobs which re-populate your test/development databases from dumps of the production databases still operate correctly.
- Sanity check the results by comparing at least 5 samples on different plates and studies between this report and the sample_tracking_view plate level.
- Sanity check the results by comparing at least 5 samples on different plates and studies with SequenceScape.
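The table and indexes in the acceptance criteria could be sketched as DDL along these lines. This is a sketch only: the column types and index names are assumptions and should be checked against the source columns in mlwarehouse before implementation; the column list mirrors the SELECT list of the draft query below.

```sql
-- Sketch only: types and index names are assumptions, not confirmed
-- against the source schemas.
CREATE TABLE mlwh_reporting.seq_ops_tracking_per_sample (
  id_sample_lims_composite VARCHAR(64) NOT NULL,  -- CONCAT(id_lims, '_', id_sample_lims)
  id_lims                  VARCHAR(16) NOT NULL,
  id_sample_lims           VARCHAR(32) NOT NULL,
  sanger_sample_id         VARCHAR(255),
  supplier_name            VARCHAR(255),
  study_name               VARCHAR(255),
  -- GROUP_CONCAT outputs; TEXT because several values may be joined
  manifest_plate_barcode   TEXT,
  study_id                 TEXT,
  programme                TEXT,
  faculty_sponsor          TEXT,
  data_access_group        TEXT,
  library_type             TEXT,
  bait_names               TEXT,
  sequencing_cost_code     TEXT,
  platform                 TEXT,
  -- per-sample event timestamps
  manifest_created         DATETIME,
  manifest_uploaded        DATETIME,
  labware_received         DATETIME,
  order_made               DATETIME,
  working_dilution         DATETIME,
  library_start            DATETIME,
  library_complete         DATETIME,
  sequencing_run_start     DATETIME,
  sequencing_qc_complete   DATETIME,
  UNIQUE KEY uq_id_sample_lims_composite (id_sample_lims_composite),
  KEY ix_id_sample_lims (id_sample_lims),
  KEY ix_sanger_sample_id (sanger_sample_id),
  KEY ix_supplier_name (supplier_name),
  KEY ix_study_name (study_name)
);
```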
Additional context
RVI & GSU would like to create dashboards monitoring their pipelines in SeqOps.
However, these pipelines work by selecting a subset of samples from a given stock plate over time, as priorities and result data develop. This means the existing sample_tracking report, which operates at a per-stock-plate level, is not useful for measuring things like turnaround time.
As part of this story we would like to try using the MySQL Event Scheduler instead of making a view. This essentially creates a 'materialised' view in MySQL.
I have checked this with the DBAs (FS ticket SR-8179), who think it is okay as long as we observe the following:
- Avoid DELETE and prefer TRUNCATE TABLE (see the Confluence page "MLWH - Guidelines for bulk updates, migrations and performance").
- Be aware that all MySQL events have an owner, or more accurately, a definer. When a database is dumped using the mysqldump tool with the --events option included, the dump file contains one or more CREATE EVENT commands to re-create the events in the dumped database, and the name of the definer is included in these commands. If the dump file is then imported into a different MySQL instance, such as dumping a production database to import into a test/development instance, the definer account must exist in the receiving instance; otherwise the CREATE EVENT command will fail and the entire import will be abandoned, leaving a partially-imported database. We may need to work with you to ensure that the weekend cron jobs which re-populate your test/development databases from dumps of the production databases will not fall foul of this.
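Given those constraints, the daily 1 am refresh could be sketched as an event like the following. This is a sketch only: the event name and the definer account are assumptions, and the event body would wrap the full draft SELECT below. It uses TRUNCATE rather than DELETE per the DBA guidance, and an explicit DEFINER so dump/restore behaviour across instances is predictable.

```sql
-- Sketch only: 'mlwh_admin' is a placeholder definer account, and the
-- event name is an assumption. The body TRUNCATEs (per DBA guidance)
-- then re-inserts from the draft query below.
CREATE DEFINER = 'mlwh_admin'@'%'
EVENT mlwh_reporting.repopulate_seq_ops_tracking_per_sample
ON SCHEDULE EVERY 1 DAY
STARTS (CURRENT_DATE + INTERVAL 1 DAY + INTERVAL 1 HOUR)  -- next 01:00
COMMENT 'Daily rebuild of seq_ops_tracking_per_sample'
DO
BEGIN
  TRUNCATE TABLE mlwh_reporting.seq_ops_tracking_per_sample;
  INSERT INTO mlwh_reporting.seq_ops_tracking_per_sample
    -- the draft WITH ... SELECT below goes here
    SELECT ...;
END;
```

Note that creating a multi-statement event body via the mysql client needs a DELIMITER change, the Event Scheduler must be enabled (event_scheduler=ON), and scheduled events can be checked with SHOW EVENTS or by querying information_schema.events.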
I have attached a draft version of SQL below that is currently being used in Tableau for them to create a draft dashboard from.
Note that the CTEs have a fair amount of overlap with existing view logic in mlwh_reporting, so it may be worth considering whether and what to refactor in the next user story.
WITH
-- create temporary table filtering for the events we are interested in the given period
sample_events AS (
SELECT wh_event_id, event_type, occured_at, subject_uuid_bin
FROM mlwh_events.flat_events_view
WHERE role_type = 'sample'
AND event_type IN ('sample_manifest.updated', 'labware.received', 'library_start', 'library_complete', 'sequencing_start', 'sequencing_complete', 'order_made')
AND occured_at >= DATE_SUB(NOW(), INTERVAL 2 YEAR)
GROUP BY wh_event_id, subject_uuid_bin
),
-- create temporary table filtering for the earliest sample submissions we are interested in the given period
labware_manifest_created_event AS (
SELECT subject_friendly_name AS labware_human_barcode, MIN(occured_at) AS occured_at
FROM mlwh_events.flat_events_view
WHERE role_type = 'labware'
AND event_type = 'sample_manifest.created'
AND occured_at >= DATE_SUB(NOW(), INTERVAL 2 YEAR)
GROUP BY subject_friendly_name
),
-- create temporary table of study ids that have had something happen in the given time interval
studies_of_interest AS (
SELECT BIN_TO_UUID(s.uuid) AS uuid
FROM mlwh_events.role_types rt
JOIN mlwh_events.roles r ON (r.role_type_id=rt.id)
JOIN mlwh_events.subjects s ON (r.subject_id=s.id)
JOIN mlwh_events.events e ON (r.event_id=e.id)
JOIN mlwh_events.event_types et ON (e.event_type_id=et.id)
WHERE rt.key='study'
AND et.key='sample_manifest.updated'
AND e.occured_at >= NOW() - INTERVAL 2 YEAR
GROUP BY s.id
),
-- create temporary table of samples of interest from studies of interest
samples_of_interest AS (
SELECT
sample.id_sample_tmp AS id_sample_tmp,
sample.id_lims AS id_lims,
sample.id_sample_lims AS id_sample_lims,
sample.sanger_sample_id AS sanger_sample_id,
sample.supplier_name AS supplier_name,
UUID_TO_BIN(sample.uuid_sample_lims) AS sample_uuid_bin, -- convert to same uuid format used in events
studies_of_interest.uuid AS uuid_study_lims,
study.id_study_tmp,
study.name AS study_name,
study.programme,
study.faculty_sponsor,
study.id_study_lims,
study.data_access_group,
stock_resource.labware_human_barcode
FROM studies_of_interest
JOIN mlwarehouse.study ON study.uuid_study_lims = studies_of_interest.uuid
JOIN mlwarehouse.stock_resource ON stock_resource.id_study_tmp = study.id_study_tmp
JOIN mlwarehouse.sample ON sample.id_sample_tmp = stock_resource.id_sample_tmp
),
-- create a temporary table getting Working Dilution timestamps per sample
dilution_timestamps AS (
SELECT
samples_of_interest.id_sample_tmp,
-- earliest and latest QC Working Dilution timestamps
MIN(qc_result.recorded_at) qc_early,
MAX(qc_result.recorded_at) qc_late
FROM samples_of_interest
LEFT JOIN mlwarehouse.qc_result ON (
qc_result.id_sample_tmp = samples_of_interest.id_sample_tmp
AND qc_result.assay = 'Working Dilution - Plate Reader v1.0'
)
WHERE
-- allow pipelines where no QC result is measured OR where it is measured recently
qc_result.id_qc_result_tmp IS NULL
OR
qc_result.recorded_at >= NOW() - INTERVAL 2 YEAR
GROUP BY samples_of_interest.id_sample_tmp
),
-- create a temporary table linking samples to qc results and flowcell/run information
-- this will have multiple rows per sequencing attempt
sample_flowcell AS (
SELECT
samples_of_interest.id_sample_tmp,
samples_of_interest.id_lims AS id_lims,
samples_of_interest.id_sample_lims AS id_sample_lims,
samples_of_interest.sanger_sample_id AS sanger_sample_id,
samples_of_interest.supplier_name AS supplier_name,
samples_of_interest.sample_uuid_bin AS sample_uuid_bin, -- already converted to the uuid format used in events
samples_of_interest.id_study_tmp,
samples_of_interest.study_name AS study_name,
samples_of_interest.programme,
samples_of_interest.faculty_sponsor,
samples_of_interest.id_study_lims,
samples_of_interest.data_access_group,
samples_of_interest.labware_human_barcode,
iseq_flowcell.id_iseq_flowcell_tmp,
iseq_product_metrics.id_iseq_product,
iseq_product_metrics.id_run,
iseq_flowcell.pipeline_id_lims,
iseq_flowcell.bait_name,
iseq_flowcell.cost_code AS sequencing_cost_code,
iseq_run_lane_metrics.instrument_model
FROM samples_of_interest
-- left join to flowcell so we don't lose samples that have not yet been sequenced
LEFT JOIN mlwarehouse.iseq_flowcell ON iseq_flowcell.id_sample_tmp = samples_of_interest.id_sample_tmp
LEFT JOIN mlwarehouse.iseq_product_metrics ON iseq_product_metrics.id_iseq_flowcell_tmp = iseq_flowcell.id_iseq_flowcell_tmp
LEFT JOIN mlwarehouse.iseq_run_lane_metrics ON iseq_run_lane_metrics.id_run = iseq_product_metrics.id_run
)
-- create the desired report grouped by sample
-- select the first timestamp where a sample has multiple timestamps - concat any other fields
SELECT
-- the combination of id_lims and id_sample_lims is unique and we will group on this combination
CONCAT(sample_flowcell.id_lims, '_', sample_flowcell.id_sample_lims) AS id_sample_lims_composite,
sample_flowcell.id_lims, -- assumed unique by grouping on id_sample_lims_composite
sample_flowcell.id_sample_lims, -- assumed unique by grouping on id_sample_lims_composite
sample_flowcell.sanger_sample_id, -- assumed unique by grouping on id_sample_lims_composite
sample_flowcell.supplier_name, -- assumed unique by grouping on id_sample_lims_composite
sample_flowcell.study_name, -- assumed unique by grouping on id_sample_lims_composite
GROUP_CONCAT(DISTINCT sample_flowcell.labware_human_barcode SEPARATOR '; ') AS manifest_plate_barcode,
GROUP_CONCAT(DISTINCT sample_flowcell.id_study_lims SEPARATOR '; ') AS study_id,
GROUP_CONCAT(DISTINCT sample_flowcell.programme SEPARATOR '; ') AS programme,
GROUP_CONCAT(DISTINCT sample_flowcell.faculty_sponsor SEPARATOR '; ') AS faculty_sponsor,
GROUP_CONCAT(DISTINCT sample_flowcell.data_access_group SEPARATOR '; ') AS data_access_group,
GROUP_CONCAT(DISTINCT sample_flowcell.pipeline_id_lims SEPARATOR '; ') AS library_type,
GROUP_CONCAT(DISTINCT sample_flowcell.bait_name SEPARATOR '; ') AS bait_names,
GROUP_CONCAT(DISTINCT sample_flowcell.sequencing_cost_code SEPARATOR '; ') AS sequencing_cost_code,
GROUP_CONCAT(DISTINCT sample_flowcell.instrument_model SEPARATOR '; ') AS platform,
MIN(labware_manifest_created_event.occured_at) AS manifest_created,
MIN(IF(sample_events.event_type = 'sample_manifest.updated', sample_events.occured_at, NULL)) AS manifest_uploaded,
MIN(IF(sample_events.event_type = 'labware.received', sample_events.occured_at, NULL)) AS labware_received,
MIN(IF(sample_events.event_type = 'order_made', sample_events.occured_at, NULL)) order_made,
MIN(dilution_timestamps.qc_early) working_dilution,
MIN(IF(sample_events.event_type = 'library_start', sample_events.occured_at, NULL)) library_start,
MIN(IF(sample_events.event_type = 'library_complete', sample_events.occured_at, NULL)) library_complete,
MIN(IF(sample_events.event_type = 'sequencing_start', sample_events.occured_at, NULL)) sequencing_run_start,
MIN(IF(sample_events.event_type = 'sequencing_complete', sample_events.occured_at, NULL)) sequencing_qc_complete
FROM sample_flowcell
LEFT JOIN labware_manifest_created_event ON (labware_manifest_created_event.labware_human_barcode = sample_flowcell.labware_human_barcode)
LEFT JOIN sample_events ON (sample_events.subject_uuid_bin = sample_flowcell.sample_uuid_bin)
LEFT JOIN dilution_timestamps ON (dilution_timestamps.id_sample_tmp=sample_flowcell.id_sample_tmp)
LEFT JOIN mlwh_events.metadata md ON (
sample_events.event_type='sequencing_complete'
AND sample_events.wh_event_id=md.event_id
AND md.key='result'
)
-- speed up query by restricting to a given programme. With filter takes about 3 mins
-- WHERE programme IN ('Pathogen Variation', 'Malaria')
GROUP BY id_sample_lims_composite
ORDER BY sample_flowcell.programme,
sample_flowcell.study_name,
id_sample_lims_composite
;