Skip to content

progenetix/CNseg_process

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

CNseg_process

A pipeline for processing and refining Copy Number Alteration (CNA) calls.

Key Features

  • Core Calling: Integrates the labelSeg and CNAdjust algorithms for robust CNA calling.
  • Noise Reduction: Implements a 1D fused lasso solution (from AM_process pipeline) to merge segments, which is particularly effective for noisy samples.
  • Genome Assembly Liftover: Includes segment-liftover algorithm to lift over segment coordinates from the hg19 to the hg38 reference genome.

Prerequisites

  1. Java 11 or later
  2. Local install (recommended) or docker

For locally installing all dependencies, R and Python are required.

R Dependencies:

install.packages(c('R.utils','MASS','remotes','genlasso'))
remotes::install_github('baudisgroup/labelSeg')

Python Dependencies (use pip as example):

pip install click pandas numpy setuptools

Installation

  1. Install nextflow by using the follwowing command:
curl -s https://get.nextflow.io | bash

This command will create a nextflow file in the current directory.

  1. Make Nextflow executable:
chmod +x nextflow
  1. Move Nextflow into an executable path:
sudo mv nextflow /usr/local/bin
  1. Confirm that Nextflow is installed correctly:
nextflow info
  1. Install this pipeline:

You have two options for installing the pipeline.

Install using nextflow: This will automatically install the pipeline in the $HOME directory under the .nextflow/assets sub-directory.

nextflow pull baudisgroup/CNseg_process

Clone the repository: Alternatively, you can clone the pipeline repository into a desired directory using:

git clone https://github.com/baudisgroup/CNseg_process.git

When you run the pipeline using this method, you will need to provide the path to the local repository (e.g., ./CNseg_process).

Usage

  1. Test the Installation

First, ensure the pipeline is working correctly by running it with the built-in test profile. This profile uses a small, included dataset to verify that all components are functioning as expected.

# If you used 'nextflow pull' to install:
nextflow run ~/.nextflow/assets/baudisgroup/CNseg_process -profile test

# If you cloned the repository locally:
nextflow run ./CNseg_process -profile test

This command will run a small test and place the output in a results directory within your current folder. A successful run is the best confirmation of a correct installation.

  1. Basic Command Structure
nextflow run /path/to/CNseg_process --inputdir /path/to/Inputdir --series <series1>,<series2> --outputdir /path/to/Outputdir

A detailed description of all available parameters is provided below.

  1. Example: Running on Test Datasets

The repository includes test data from two studies that highlight the pipeline's features:

  • prad_su2c_2019: : A dataset with noisy segments.
  • lung_pdx_msk_2021: A dataset with prominant baseline shift issues.

This example command runs the pipeline on both series, and performs a liftover of the segment coordinates from hg19 to hg38.

nextflow run /path/to/CNseg_process --inputdir /path/to/CNseg_process/test-data --outputdir /path/to/Outputdir --series prad_su2c_2019,lung_pdx_msk_2021 --genome hg19 --seg_liftover true --output_seg labeled_hg19_segment.seg --output_plot labeled_hg19_segment.pdf

After the run completes, there are following key files:

  • labeled_hg19_segment.seg: The final, processed segment file with corrected CNA calls (in hg19 coordinates).
  • labeled_hg19_segment.pdf: A plot visualizing the corrected CNA profile for the sample.
  • hg38_labeled_hg19_segment.seg: The final segment file lifted over to hg38 coordinates (this is only created if --seg_liftover true and --genome hg19).

Input Data Structure

The pipeline expects a specific directory structure for input data. All input files for a given project or dataset (a "series") should be grouped into a subdirectory. The main input directory, specified with --inputdir, should be organized as follows:

Inputdir/
├── <series1>
│   ├── <series1>.seg.txt            segment data
│   ├── cohort-assignment.txt        cohort assignments
│   ├── cohort-cna-pattern.txt       cohort-specific CNA frequency
│   ├── sampleid-mapping.txt         optional: sample ID mapping
│   └── cohort-cna-region.txt        optional: custom genomic bin location
└── <series2>
    └── ...     

Each series subdirectory is processed in parallel.

1. Segment files (required)

The segment files are the primary input and should be tab-delimited. To be recognized by the workflow, filenames must include the ".seg" extension. These files must contain at least six columns in a specific order: sample ID, chromosome, start, end, number of markers, and the log R ratio. Headers are necessary, but the pipeline identifies columns by their order, not by their names. The pipeline behavior depends on whether your input files already contain CNA calls. If the files only have the six required columns, the pipeline will by default use the labelSeg algorithm to generate relative CNA calls. However, if your files already contain CNA calls, you must include a column named "label" to provide these calls as strings. In this case, you must also set the parameter --fromseg false to instruct the pipeline to use your provided labels. The label column should use values such as "+2" for high-level duplication, "+1" for low-level duplication, "0" for copy neutral, "-1" for low-level deletion, and "-2" for high-level deletion.

External information

External information includes cohort assignments for samples and cohort-specific CNA prior patterns.

If external information is not provided, you should set the parameter --use_external_ref to "false". This setting assumes that all samples within the series are biologically and technologically similar. Consequently, the workflow utilizes the CNA frequency calculated from all samples as prior information.

2. Cohort assignment files

The cohort assignment file should be tab-delimited and include two columns: sample IDs and cohort identifiers. The default filename for this file is "cohort-assignment.txt" and can be changed by the parameter --cohort_assign_file.

3. Cohort CNA occurence file

The cohort CNA occurrence file should also be tab-delimited. Different columns represent CNA occurence probability (CNA frequency) from different cohorts. Column names should be consistent with cohort identifiers specified in the cohort assignment file. More details about CNA occurence probability see below. The default filename for this file is "cohort-cna-pattern.txt" and can be changed by the parameter --prior_file.

4. Sample ID mapping file (optional)

This optional file is used for mapping IDs used in segment data to expected sample identifiers, such as mapping UUIDs to barcodes in TCGA data. If provided and the parameter --use_idmapping is set to "true", the sample IDs used in the cohort assignment file should be the expected identifiers. All sample identifiers used in output files will be mapped to these new identifiers. The file format should be tab-delimited and include two columns: original sample identifiers and new sample identifiers. The default filename for this file is "sampleid-mapping.txt" and can be changed by the parameter --idmapping_file.

5. Genomic region file (optional)

This optional file indicates the genomic regions used to calculate cohort CNA frequency. If provided and the parameter --use_custom_region is set to "true", the prior computation will be based on the provided regions. The file format should be tab-delimited and include four columns: region index, chromosome, start position, and end position (example see data/hg38_bin.txt). The default filename for this file is "cohort-cna-region.txt" and can be changed by the parameter --region_file.

Output Data Structure

By default, the output directory is a folder named "output" in the current directory. The structure of the output files is as follows:

The output directory is organized by series, then by sample:

Outputdir/
├── <series1>
│   ├─── analyse_report.txt
│   ├─── <sample1>
│   │   ├── segments.pdf
│   │   └── result.seg.txt
│   ├─── <sample2>
│   │   ├── segments_before_shift.pdf
│   │   ├── segments.pdf
│   │   └── result.seg.txt
│   └── ...
└─ <series2>   
    └── ...                 
  1. The analysis report

Inside each series directory, you will find a summary report named "analysis_report.txt". This file records CNA freatures from final segment profiles and logs the processing steps and decisions made by the pipeline for each sample in that series.

The "note" column records the sequence of actions applied, separated by a semicolon (;). Key actions include:

  • initial: No change to the data.

  • shift-baseline-lower/higher: Attempt to shift the baseline, with the result being better than the original one, therefore used as the final output.

  • shift-baseline-lower/higher_reverse: Attempt to shift the baseline, but the result was not satisfactory, so the original callings were retained.

  • segmentmerge: Attempt to merge segments for noisy samples.

  • noisy_lowthre_<lowcutoff>_highthre_<highcutoff>": Using cutoffs for calling noisy samples.

  • noisy_nochange: Attempt to apply cutoffs for noisy samples, but the result was not satisfactory, so the original callings were retained.

  1. Sample-specific output

Within each series directory, a subdirectory is created for every sample. Each sample folder contains the final results for that sample:

  • Segment Data (*.seg): Final segment data files with adjusted callings. The default filename for this file is "result.seg.txt" and can be changed by the parameter --output_seg.

  • Segment Plots (*.pdf): Visual representations of segments colored by CNA states. The default filename for this file is "segments.pdf" and can be changed by the parameter --output_plot.

  • Pre-shift plots (conditional): If successful baseline shifting was performed on the sample, an additional plot named "<plotfileBasename>_before_shift.pdf" is generated, which allows you to visually compare the CNA profile before and after the baseline correction was applied.

Profile

By default, the pipeline is locally executed, but it can be run using docker engine by setting -profile docker.

Parameters

Mandatory parameters

  • --inputdir: Path to the input data directory.
  • --series: A comma-separated list of the series to analyze. These names must match the subfolder names under --inputdir

Optional parameters

  • --outputdir: Path to the directory where results will be saved. Default: a folder named "output" in the current directory.
  • --fromseg: Logical value indicating if labelSeg should be used to generate CNA calls. Set to false if your input files already contain a "label" column. Default: true.
  • --seg_liftover: Logical value to enable lifting over final segment coordinates from hg19 to hg38. This only applies if --genome is "hg19". Default: false.
  • --genome: The reference genome assembly of the input data. Options: "hg38", "hg19", "hg18". Default: "hg38".
  • --use_external_ref: Logical value to control whether external cohort and prior information is used to adjust CNA calls. Default: true.
  • --use_idmapping: Logical value to enable sample ID mapping using the provided mapping file. Default: false.
  • --use_custom_region: Logical value to enable the use of a custom genomic bin file for prior calculations. Default: false.
  • --cohort_assign_file: Filename for the cohort assignment file. Default: "cohort-assignment.txt".
  • --prior_file: Filename for the cohort CNA occurrence file. Default: "cohort-cna-pattern.txt".
  • --idmapping_file: Filename for the sample ID mapping file. Default: "sampleid-mapping.txt".
  • --region_file: Filename for the custom genomic region file. Default: "cohort-cna-region.txt".
  • --output_seg: Base filename for the final output segment file. Default: "result.seg.txt".
  • --output_plot: Base filename for the output segment plot. Default: "segments.pdf".
  • --logrsd: Upper limit of weighted standard deviation of logR by the number of involved markers, used in the first (strict) criteria for flagging problematic samples. Default: 0.35.
  • --cnafrac: Upper limit of CNA fraction, used in the first (strict) criteria. Default: 0.5.
  • --dupdelratio: Upper limit of the duplication-to-deletion fraction ratio and its reciprocal as lower limit, used in the first (strict) criteria. Default: 3.
  • --cnafrac2: Upper limit of CNA fraction, used in the second (relaxed) criteria for flagging problematic samples. Default: 0.2.
  • --dupdelratio2: Upper limit of the duplication-to-deletion fraction ratio and its reciprocal as lower limit, used in the second (relaxed) criteria. Default: 5.
  • --cnafrac3: Upper limit of CNA fraction, used in the third (very relaxed) criteria for flagging problematic samples. Default: 0.7.
  • --segnum: The maximum number of segments allowed before a sample is flagged as potentially noisy. Default: 1000.
  • --highfrac: The maximum fraction of high-level CNAs allowed before a sample is flagged as potentially noisy. Default: 0.1.
  • --lowthre: A comma-separated list of logR ratio thresholds to test for calling low-level gains in noisy profiles. The negative of these values will be used for losses. The pipeline will try each threshold sequentially and select the one that yields the best result. Default: "0.1,0.15,0.3".
  • --highthre: A comma-separated list of logR ratio thresholds to test for calling high-level gains in noisy profiles. The negative of these values will be used for losses. The pipeline will try each threshold sequentially and select the one that yields the best result. Default: "1,1.5,2".

Cohort CNA pattern data preparation

By default, the prior CNA pattern in each cohort represents the occurrence probability of CNAs in genomic bins. The bin size is 1 MB. For GRCh38, there are 2 * 3106 bins (gain_frequency + loss_frequency). Detailed genomic bin locations for GRCh38, GRCh37, and GRCh36 can be found in data/hg38_bin.txt, data/hg19_bin.txt, and data/hg18_bin.txt respectively. You can define the CNA prior pattern yourself based on the same genomic region or custom genomic regions by setting the parameter --use_custom_region to "true", as long as the values can be used as probabilities. Here we introduce two simple approaches to get the cohort-specific CNA reference pattern.

from Progenetix

The simplest way to get the cohort CNA pattern is to access pre-computed CNA frequency data from the Progenetix database via the Beacon v2 API. It is defined as the percentage of samples showing a CNV for a genomic region over the total number of samples in a cohort specified by filters. For example, if you want to get the reference CNA pattern from glioblastoma samples, you can use the pgxRpi R package to query:

library(pgxRpi)
library(SummarizedExperiment)

# choose the filter to specify glioblastoma cohort using NCIt encoding system
tumor_filter <- "NCIT:C3058" 
# query
cnafreq <- pgxLoader(type = 'cnv_frequency',filters = tumor_filter,output = 'pgxmatrix')  
# transform from frequency value to probability value
cnaprob <- assay(cnafreq)/100 
# Write data in the format that the workflow accepts
write.table(cnaprob,"cohort-cna-pattern.txt"),quote = F,sep = '\t',row.names = F,col.names = T)

You can also get the CNA frequency data directly via the REST API, but the data need to be modified slightly to follow the format which CNAdjust uses.

from custom segment data

You can utilize the function segtoFreq from pgxRpi R package (version >= 1.0.1 or >= 1.1.2) to derive CNA frequency from segment data. These frequencies need to be transformed into probabilities. By default, the binning aligns with that used in CNAdjust. However, if you calculate CNA frequency for different genomic bins, such as using different bin sizes, you should provide the region file accordingly. Example usage is as follows:

nextflow run /path/to/CNseg_process --inputdir /path/to/Inputdir --series <series1>,<series2> --outputdir /path/to/Outputdir --use_custom_prior true 

Processing Tips

  1. Verify the quality of the segment file before running this pipeline. Check for potential issues such as segment overlaps, coordinate inversions (where start > end), duplicate records, or anomalous values for probe counts, segment values, and CNA coverage. Pre-cleaning data can prevent unexpected errors

  2. Check for duplicate segments after segment-liftover. The process of converting coordinates from hg19 to hg38 can occasionally generate redundant or overlapping segments. These may need to be reviewed and filtered before downstream analysis.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published