This repository hosts the code used for the analyses in the manuscript "Systematic cross-study assessment of RNA-Seq experimental workflows for plasma cell-free transcriptome profiling" by Tuñí et al.
fl-cfRNAmeta/
├── README.md
├── nextflow/
├── sra_metadata/
├── src/
└── tables/
The src/ directory contains the following scripts, notebooks, and mapping files:
- preprocess_metadata_functions.py: Main script for preprocessing and harmonizing metadata from multiple cfRNA-seq studies.
  - Loads and standardizes sample-level metadata from various studies.
  - Merges with dataset-level metadata.
  - Applies dataset-specific cleaning, exclusion, and annotation logic.
  - Outputs harmonized per-sample and per-batch metadata tables.
- sra_columns_mapping.py: Helper functions for renaming and standardizing column names and values across metadata tables.
- dataset_mappings.json: JSON file with mappings for dataset names, colours, orders, or values used across scripts.
- boxplots_fig2.R: R script that generates boxplots of QC metrics, including the percentage of spliced reads, percentage of exonic reads, percentage of fragments mapping to the correct gene orientation, NG80, fragment number, percentage of reads mapping to the human reference genome, percentage of reads mapping to ERCC spike-ins, and effective fragment length distribution. It also creates the NG80 vs. spliced reads scatterplot and the percentage of human reads vs. percentage of microbial reads scatterplot.
- fig_1a_stacked_barplot.R: R script for creating the donor phenotype stacked bar plot.
- fig_1b_heatmap.R: R script for creating the pre-analytical variables heatmap.
- gene_coverage_profile_fig2_tmpH.ipynb: Jupyter notebook for plotting gene coverage profiles.
- join_count_matrix_and_qc_table.ipynb: Jupyter notebook for merging sliced count matrices and QC tables into a single count matrix or QC table file.
- merge_fastqs_array_isolate.sh: Shell script for merging FASTQ files by array or isolate.
- ng80.R: R script to compute the NG80 metric reported in the manuscript. Requires the count matrix as input.
- spliced_reads.sh: Shell script to obtain the number and percentage of spliced reads. Requires the deduplicated BAM file as input.
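To illustrate the kind of metric spliced_reads.sh reports, here is a minimal Python sketch. It is not the repository's script. It assumes that a "spliced read" is an alignment whose CIGAR string contains an `N` (skipped region) operation, and that unmapped, secondary, and supplementary records are excluded; the actual script may use different criteria. The function names and the example records are hypothetical, and the sketch counts alignment records rather than fragments.

```python
import re

# Matches one CIGAR operation, e.g. "76M" or "1200N".
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def is_spliced(cigar):
    """Return True if the CIGAR string contains a skipped-region (N) operation."""
    return any(op == "N" for _, op in CIGAR_OP.findall(cigar))


def spliced_read_stats(sam_lines):
    """Count mapped primary alignments and how many of them are spliced.

    sam_lines: iterable of SAM-format text lines; header lines ('@...') are skipped.
    Returns (n_spliced, n_total, percent_spliced).
    """
    n_total = 0
    n_spliced = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag, cigar = int(fields[1]), fields[5]
        # Assumption: skip unmapped (0x4), secondary (0x100) and
        # supplementary (0x800) records, and records without a CIGAR.
        if flag & (0x4 | 0x100 | 0x800) or cigar == "*":
            continue
        n_total += 1
        if is_spliced(cigar):
            n_spliced += 1
    percent = 100.0 * n_spliced / n_total if n_total else 0.0
    return n_spliced, n_total, percent


# Hypothetical SAM records, for illustration only.
example = [
    "@HD\tVN:1.6\tSO:coordinate",
    "r1\t0\tchr1\t100\t255\t50M\t*\t0\t0\t*\t*",
    "r2\t0\tchr1\t200\t255\t20M1000N30M\t*\t0\t0\t*\t*",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
    "r4\t256\tchr1\t300\t0\t10M500N40M\t*\t0\t0\t*\t*",
]
print(spliced_read_stats(example))  # (1, 2, 50.0)
```

On real data, the input lines could come from the text output of `samtools view -h` run on the deduplicated BAM file.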
- Purpose: The nextflow/ directory contains configuration files and parameter sets for running the nf-core/rnaseq Nextflow pipeline on each dataset.
- Files:
  - base.config, base_params.yml: Base Nextflow configuration and parameters.
  - smarter.config, smarter_v2_params.yml, smarter_v3_params.yml: Configs for SMARTer protocols.
  - non_smarter.config, non_smarter_reverse_params.yml, non_smarter_unstranded_params.yml: Configs for non-SMARTer protocols.
  - hg38_gencodev39_params.yml: Parameters for the hg38/Gencode v39 reference.
  - two-color-illumina.config: Config for two-color Illumina sequencing.
- Purpose: The sra_metadata/ directory stores raw and preprocessed metadata files for each study, as well as supplementary tables and GEO series matrix files.
- Files:
  - <dataset>_metadata.csv / <dataset>_metadata_preprocessed.csv: Raw and processed sample metadata.
  - <dataset>_supp_table_*.xlsx / .tsv: Supplementary tables with additional sample/batch info.
  - <dataset>_GSE*_series_matrix.txt: GEO series matrix files for extracting sample annotations.
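As an illustration of how sample annotations can be pulled from a GEO series matrix file, the sketch below collects the tab-separated `!Sample_*` rows into one record per sample. It is a hedged example, not the repository's preprocessing code: the function name and the example content are hypothetical, and fields that occur on several rows (such as `!Sample_characteristics_ch1`) are kept as lists.

```python
from collections import defaultdict


def parse_series_matrix_samples(lines):
    """Collect the '!Sample_*' rows of a GEO series matrix file.

    Returns a list with one dict per sample. Each dict maps a field name
    (e.g. 'Sample_title') to a list of values, because some fields occur
    on more than one row.
    """
    rows = []
    for line in lines:
        if not line.startswith("!Sample_"):
            continue
        key, *values = line.rstrip("\n").split("\t")
        # Drop the leading '!' and the double quotes around each value.
        rows.append((key.lstrip("!"), [v.strip('"') for v in values]))

    n_samples = max((len(values) for _, values in rows), default=0)
    samples = [defaultdict(list) for _ in range(n_samples)]
    for key, values in rows:
        for sample, value in zip(samples, values):
            sample[key].append(value)
    return [dict(s) for s in samples]


# Hypothetical series matrix content, for illustration only.
example = [
    '!Series_title\t"Hypothetical series"',
    '!Sample_title\t"plasma_1"\t"plasma_2"',
    '!Sample_geo_accession\t"GSM0000001"\t"GSM0000002"',
    '!Sample_characteristics_ch1\t"tissue: plasma"\t"tissue: plasma"',
    '!Sample_characteristics_ch1\t"disease state: control"\t"disease state: case"',
]
for sample in parse_series_matrix_samples(example):
    print(sample["Sample_geo_accession"][0], sample["Sample_characteristics_ch1"])
```

Keeping repeated fields as lists avoids silently overwriting earlier characteristics rows; splitting each `key: value` characteristic into its own column would be a natural next step.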
The tables/ directory contains output and intermediate tables generated by the preprocessing scripts and downstream analyses:
- cfRNA-meta_per_sample_metadata.tsv: Harmonized per-sample metadata table for all included cfRNA-seq datasets.
- cfRNA-meta_per_batch_metadata.tsv: Harmonized per-batch metadata table summarizing batch-level information.
- sampleinfo_all-batches.tsv: Sample information table including all batches.
- taxa_simple_df_w_batch.tsv: Simplified taxa table for downstream taxonomic analyses.
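The tables above are tab-separated files, so they can be inspected with the Python standard library alone. The following sketch counts rows per value of one column. The column names (`sample_id`, `dataset_batch`) and the example rows are hypothetical placeholders, not the actual schema of the harmonized tables.

```python
import csv
import io
from collections import Counter


def count_samples_by(tsv_file, column):
    """Count rows per distinct value of `column` in a tab-separated file object."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    if column not in (reader.fieldnames or []):
        raise KeyError(f"column {column!r} not found in table header")
    return Counter(row[column] for row in reader)


# Hypothetical table content; real column names may differ.
example_tsv = io.StringIO(
    "sample_id\tdataset_batch\n"
    "s1\tbatchA\n"
    "s2\tbatchA\n"
    "s3\tbatchB\n"
)
counts = count_samples_by(example_tsv, "dataset_batch")
print(counts["batchA"], counts["batchB"])  # 2 1
```

To run it on a real table, pass a file object opened with `open(path, newline="")` in place of the in-memory example.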
See the LICENSE file for details.
For questions or contributions, please open an issue.