Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
134 commits
Select commit Hold shift + click to select a range
253e971
Merging and batch processing improvements
jonperdomo Oct 30, 2024
e7211ed
Add inversion detection
jonperdomo Oct 31, 2024
71e375e
Fix vcf writer error and add invdup
jonperdomo Nov 1, 2024
e4df9d1
Update sv types
jonperdomo Nov 1, 2024
6fb4b72
Update docs
jonperdomo Nov 1, 2024
a38ada7
Update docs
jonperdomo Nov 1, 2024
1cf3a12
Remove unused code and update docs
jonperdomo Nov 1, 2024
181f6d1
simplify predictions
jonperdomo Nov 1, 2024
fa4f744
Simplify split sv calling
jonperdomo Nov 1, 2024
956e7e5
Fix GT error and reduce fps
jonperdomo Nov 4, 2024
e88f562
Fix alt allele and likelihood comparisons
jonperdomo Nov 11, 2024
41f8382
Fix CN length filter error
jonperdomo Nov 12, 2024
a19c21a
Venn diagram plots and vcf writer error
jonperdomo Nov 13, 2024
5dc01cb
Fix vcf writer error
jonperdomo Nov 14, 2024
b9b4890
Fix overlap errors
jonperdomo Nov 15, 2024
9665e8e
Create a cpp module for memory tests
jonperdomo Nov 16, 2024
a505308
Fix all memory leaks
jonperdomo Nov 17, 2024
7cac258
Improve memory management
jonperdomo Nov 20, 2024
d5d7dcd
Add merging and efficiency updates
jonperdomo Nov 23, 2024
7d36424
Update pfb to htslib
jonperdomo Nov 23, 2024
ba19aeb
update reading snp vcf to htslib
jonperdomo Nov 25, 2024
53473af
Fix reading AFs
jonperdomo Nov 25, 2024
4b83e62
Fix mean chr cov
jonperdomo Nov 25, 2024
302598e
Update gitignore
jonperdomo Nov 25, 2024
9d3164c
Update test
jonperdomo Nov 25, 2024
2077f26
Fix trimming
jonperdomo Nov 26, 2024
942ce74
Reduce debug outputs
jonperdomo Nov 26, 2024
f7a17e2
Update test
jonperdomo Nov 26, 2024
bb4d2c4
fix warnings
jonperdomo Nov 26, 2024
f241af5
Read hmm once only
jonperdomo Nov 26, 2024
39602fe
remove some comments
jonperdomo Nov 26, 2024
1884f67
Fix multithreading
jonperdomo Nov 26, 2024
128575a
implement thread pooling
jonperdomo Nov 27, 2024
d62fe12
fix duplication error
jonperdomo Nov 27, 2024
da4d72f
add arguments and version and reduce copies
jonperdomo Nov 30, 2024
ea06c1c
Fix errors and reduce copying
jonperdomo Dec 1, 2024
50cc505
reduce output
jonperdomo Dec 1, 2024
18bd4a9
Improve merging
jonperdomo Dec 1, 2024
3458bc8
add support back and improve merging
jonperdomo Dec 2, 2024
ef4df0b
remove filter
jonperdomo Dec 2, 2024
c79c1da
Fix breakpoint error and improve filtering
jonperdomo Dec 3, 2024
3afeaac
Update filtering
jonperdomo Dec 3, 2024
6f6ecc5
Improve multithreading
jonperdomo Dec 4, 2024
1486176
Fix warnings
jonperdomo Dec 5, 2024
2115492
add threadpool
jonperdomo Dec 5, 2024
934350f
Calculate ME rates
jonperdomo Dec 5, 2024
d5923fe
Add ME debug output
jonperdomo Dec 6, 2024
7e3bac4
Fix alt allele error
jonperdomo Dec 11, 2024
a169b73
Replace fixed window with sample size
jonperdomo Dec 11, 2024
9e43678
log memory usage
jonperdomo Dec 12, 2024
2d1337c
improve mem eff
jonperdomo Dec 14, 2024
56507d4
Fix vector error
jonperdomo Dec 15, 2024
a9b6fcc
Fix alignments error
jonperdomo Dec 15, 2024
d175682
work on error handling
jonperdomo Dec 16, 2024
6b61117
Fix 1 off breakpoint error
jonperdomo Dec 17, 2024
b35c107
thread safe ref genome
jonperdomo Dec 19, 2024
11180be
Fix snp thread locks
jonperdomo Dec 20, 2024
d884e80
Split read update
jonperdomo Dec 21, 2024
5d6a53c
Use single mutex and update dup seqsim threshold
jonperdomo Dec 29, 2024
917a68c
update thresholds
jonperdomo Dec 29, 2024
5083aa2
Improve multi-threading and add min-reads parameter
jonperdomo Jan 23, 2025
b31875c
efficiency updates
jonperdomo Jan 28, 2025
249cb64
resolve primary overlaps to increase speed
jonperdomo Jan 29, 2025
29333bb
fix errors preventing 1 recall for ins dup
jonperdomo Jan 29, 2025
0b3d205
high recall merging
jonperdomo Jan 30, 2025
8e4da06
improved merging to obtain 1 recall for sv types in chr21
jonperdomo Jan 30, 2025
07551d9
dbscan in cpp
jonperdomo Feb 1, 2025
4793bbf
update dbscan
jonperdomo Feb 3, 2025
0c329a3
add dbscan parameters
jonperdomo Feb 3, 2025
686fe99
cluster primary alignments
jonperdomo Feb 7, 2025
d6f1b6f
add invdel
jonperdomo Feb 8, 2025
ddf6d50
improve split read breakpoints
jonperdomo Feb 11, 2025
3341a92
remove test code
jonperdomo Feb 11, 2025
12ae0f7
get chromosomes from bam
jonperdomo Feb 15, 2025
f39fd8a
Fix large cnv detection errors
jonperdomo Feb 27, 2025
2c0a8a0
use pct mean coverage for minpts
jonperdomo Mar 1, 2025
f282196
remove unused code
jonperdomo Mar 1, 2025
972a271
fix memory leaks
jonperdomo Mar 5, 2025
f0f8427
Fix inversion detection
jonperdomo Mar 6, 2025
96f18df
inversion fixes
jonperdomo Mar 6, 2025
766b220
improve insertion detection
jonperdomo Mar 9, 2025
0776822
fix insertions
jonperdomo Mar 13, 2025
84d1d6e
fix sv duplicate merging
jonperdomo Mar 13, 2025
0f305ba
improved ins and del
jonperdomo Mar 14, 2025
63ee46c
remove comments and fix warnings
jonperdomo Mar 14, 2025
c10a530
cnv merge fix
jonperdomo Mar 15, 2025
62ba567
fix genotypes
jonperdomo Mar 18, 2025
877d1ff
clean up comments and save cnv in json format
jonperdomo Mar 19, 2025
53cc23d
fix json format
jonperdomo Mar 19, 2025
56fc8d2
json plots
jonperdomo Mar 19, 2025
2e23c50
fix multi-cnv json
jonperdomo Mar 20, 2025
347e5ad
update plots
jonperdomo Mar 20, 2025
708e782
simplify snp analysis
jonperdomo Mar 20, 2025
c6fe7af
remove debug output
jonperdomo Mar 20, 2025
255b9a3
remove test code
jonperdomo Mar 20, 2025
57c6c0c
add clipped base insertions
jonperdomo Mar 21, 2025
190a699
improve split read merging
jonperdomo Mar 22, 2025
c7f7f89
fix split read merge and reduce svcall overhead
jonperdomo Mar 28, 2025
b1cf23f
remove mismatch filter
jonperdomo Mar 29, 2025
a0e713f
update read depth and mismatch rate
jonperdomo Mar 31, 2025
b7eed21
add cn state and read distance features
jonperdomo Apr 11, 2025
153db48
fix save cnv error
jonperdomo Apr 15, 2025
883bfc7
improve cnv state predictions
jonperdomo Apr 20, 2025
98d41d4
remove test code
jonperdomo May 1, 2025
fe77d3d
fix false positive in split reads
jonperdomo May 5, 2025
fd75f7f
fix split sv detection errors
jonperdomo May 12, 2025
d86e1c7
filter assembly gaps
jonperdomo May 16, 2025
0acbf13
improve split sv merging
jonperdomo May 16, 2025
5affdd3
fix dup prediction error
jonperdomo May 17, 2025
34f8a9a
reduce cnv false positives
jonperdomo May 17, 2025
7948062
assembly gaps
jonperdomo May 19, 2025
32e14ea
add inversion hmm prediction and improve merging
jonperdomo May 19, 2025
ab27436
achieve highest recall for large svs
jonperdomo May 19, 2025
940f61e
revert merge
jonperdomo May 21, 2025
87be75f
remove comments
jonperdomo May 21, 2025
b62c9d8
save cluster plot data and merge duplicates
jonperdomo May 21, 2025
17ebf23
add debug mode
jonperdomo Jun 26, 2025
4ba3680
Update .gitignore
jonperdomo Jun 27, 2025
81b09ac
update makefile
jonperdomo Jul 9, 2025
9f2aea2
update cnv plots
jonperdomo Aug 1, 2025
8357052
simplify environment and update installation
jonperdomo Aug 1, 2025
f4f964b
update environment
jonperdomo Aug 1, 2025
94ae67d
update python version
jonperdomo Aug 1, 2025
8ff85bf
update build yml
jonperdomo Aug 1, 2025
d13f68c
update build yml
jonperdomo Aug 1, 2025
05ff3f8
update build yml
jonperdomo Aug 1, 2025
562a040
update build yml
jonperdomo Aug 1, 2025
3b1f02a
build yml update
jonperdomo Aug 1, 2025
b58db51
update build yml
jonperdomo Aug 1, 2025
d4b6c0c
set build env
jonperdomo Aug 1, 2025
a93076a
update build yml
jonperdomo Aug 1, 2025
82ad9b2
htslib debug output
jonperdomo Aug 1, 2025
7e2b09f
update integer type
jonperdomo Aug 1, 2025
be81457
update unit test
jonperdomo Aug 1, 2025
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
36 changes: 25 additions & 11 deletions .github/workflows/build-tests.yml
Original file line number Diff line number Diff line change
Expand Up @@ -23,26 +23,40 @@ jobs:
shell: bash --login {0}
run: unzip TestData.zip

- name: Set up conda environment
- name: Set up conda (Miniconda only)
uses: conda-incubator/setup-miniconda@v2
with:
activate-environment: contextsv
environment-file: environment.yml
python-version: 3.9
auto-activate-base: false
auto-activate-base: true

- name: Install samtools and bcftools using sudo apt-get
- name: Configure conda channels and create environment
shell: bash -l {0}
run: |
sudo apt-get update
sudo apt-get install -y samtools bcftools
conda config --remove channels defaults || true
conda config --add channels conda-forge
conda config --add channels bioconda
conda config --set channel_priority strict
conda info # confirm the change
conda env create -f environment.yml

- name: Build C++ code
shell: bash --login {0} # --login enables PATH variable access
run: |
make
source $(conda info --base)/etc/profile.d/conda.sh
conda activate contextsv
echo "CONDA_PREFIX=$CONDA_PREFIX"
ls -l $CONDA_PREFIX/include/htslib
make CONDA_PREFIX=$CONDA_PREFIX

- name: Run unit tests
shell: bash --login {0}
run: |
mkdir -p tests/output
python -m pytest -s -v tests/test_general.py
source $(conda info --base)/etc/profile.d/conda.sh
conda activate contextsv
./build/contextsv --version
./build/contextsv --help

# run: |
# source $(conda info --base)/etc/profile.d/conda.sh
# conda activate contextsv
# mkdir -p tests/output
# python -m pytest -s -v tests/test_general.py
18 changes: 18 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -55,6 +55,7 @@ CMakeSettings.json

# Output folder
output/
python/

# Doxygen
docs/html/
Expand All @@ -67,12 +68,15 @@ python/dbscan
python/agglo
linktoscripts
tests/data
tests/cpp_module_out

# Population allele frequency filepaths
data/gnomadv2_filepaths.txt
data/gnomadv3_filepaths.txt
data/gnomadv4_filepaths.txt
data/gnomadv4_filepaths_ssd.txt
data/gnomadv4_hg19_filepaths.txt
data/gnomadv4_hg19_filepaths_ssd.txt

# Training data
data/sv_scoring_dataset/
Expand All @@ -84,3 +88,17 @@ data/hg19ToHg38.over.chain.gz
# Test images
python/dbscan_clustering*.png
python/dist_plots
upset_plot*.png

# Temporary files
lib/.nfs*
valgrind.log

# Log files
*.log
*.err
*.out

# Snakemake files
.snakemake
snakemake_bench/results/
52 changes: 47 additions & 5 deletions Makefile
Original file line number Diff line number Diff line change
@@ -1,11 +1,53 @@
# Directories
INCL_DIR := $(CURDIR)/include
SRC_DIR := $(CURDIR)/src
BUILD_DIR := $(CURDIR)/build
LIB_DIR := $(CURDIR)/lib

# Version header
VERSION := $(shell git describe --tags --always)
VERSION_HEADER := $(INCL_DIR)/version.h
.PHONY: $(VERSION_HEADER)
@echo "#pragma once" > $@
@echo "#define VERSION \"$(VERSION)\"" >> $@

all:
# Generate the SWIG wrapper (C++ -> Python)
swig -c++ -python -I$(INCL_DIR) -o $(SRC_DIR)/swig_wrapper.cpp -outdir $(LIB_DIR) $(SRC_DIR)/swig_wrapper.i
# Conda environment directories
CONDA_PREFIX := $(shell echo $$CONDA_PREFIX)
CONDA_INCL_DIR := $(CONDA_PREFIX)/include
CONDA_LIB_DIR := $(CONDA_PREFIX)/lib

# Compile the SWIG wrapper using setuptools
python3 setup.py build_ext --build-lib $(LIB_DIR)
# Compiler and Flags
CXX := g++
CXXFLAGS := -std=c++17 -g -I$(INCL_DIR) -I$(CONDA_INCL_DIR) -Wall -Wextra -pedantic

# Linker Flags
# Ensure that the library paths are set correctly for linking
LDFLAGS := -L$(LIB_DIR) -L$(CONDA_LIB_DIR) -Wl,-rpath=$(CONDA_LIB_DIR) # Add rpath for shared libraries
LDLIBS := -lhts # Link with libhts.a or libhts.so

# Sources and Output
SOURCES := $(filter-out $(SRC_DIR)/swig_wrapper.cpp, $(wildcard $(SRC_DIR)/*.cpp)) # Filter out the SWIG wrapper from the sources
OBJECTS := $(patsubst $(SRC_DIR)/%.cpp,$(BUILD_DIR)/%.o,$(SOURCES))
TARGET := $(BUILD_DIR)/contextsv

# Default target
all: $(TARGET)

# Debug target
debug: CXXFLAGS += -DDEBUG
debug: all

# Link the executable
$(TARGET): $(OBJECTS)
@mkdir -p $(BUILD_DIR)
$(CXX) $(CXXFLAGS) -o $@ $^ $(LDFLAGS) $(LDLIBS)

# Compile source files
$(BUILD_DIR)/%.o: $(SRC_DIR)/%.cpp
@mkdir -p $(BUILD_DIR)
$(CXX) $(CXXFLAGS) -c $< -o $@

# Clean the build directory
clean:
rm -rf $(BUILD_DIR)

115 changes: 34 additions & 81 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -12,33 +12,51 @@ corresponding reference genome (FASTA), a VCF with high-quality SNPs
Class documentation is available at <a href="https://wglab.openbioinformatics.org/ContextSV">https://wglab.openbioinformatics.org/ContextSV</a>
</p>

## Installation (Linux)
### Using Anaconda (recommended)
First, install [Anaconda](https://www.anaconda.com/).
## Installation

Next, create a new environment. This installation has been tested with Python 3.11:

```
conda create -n contextsv python=3.11
conda activate contextsv
```

ContextSV can then be installed using the following command:
### Building from source (for testing/development)
ContextSV requires HTSLib as a dependency that can be installed using [Anaconda](https://www.anaconda.com/). Create an environment
containing HTSLib:

```
conda install -c bioconda -c wglab contextsv=1.0.0
conda create -n htsenv -c bioconda -c conda-forge htslib
conda activate htsenv
```

### Building from source (for testing/development)
First install [Anaconda](https://www.anaconda.com/). Then follow the instructions below to install LongReadSum and its dependencies:
Then follow the instructions below to build ContextSV:

```
git clone https://github.com/WGLab/ContextSV
cd ContextSV
conda env create -f environment.yml
make
```

ContextSV can then be run:
```
./build/contextsv --help

Usage: ./build/contextsv [options]
Options:
-b, --bam <bam_file> Long-read BAM file (required)
-r, --ref <ref_file> Reference genome FASTA file (required)
-s, --snp <vcf_file> SNPs VCF file (required)
-o, --outdir <output_dir> Output directory (required)
-c, --chr <chromosome> Chromosome
-r, --region <region> Region (start-end)
-t, --threads <thread_count> Number of threads
-h, --hmm <hmm_file> HMM file
-n, --sample-size <size> Sample size for HMM predictions
--min-cnv <min_length> Minimum CNV length
--eps <epsilon> DBSCAN epsilon
--min-pts-pct <min_pts_pct> Percentage of mean chr. coverage to use for DBSCAN minimum points
-e, --eth <eth_file> ETH file
-p, --pfb <pfb_file> PFB file
--save-cnv Save CNV data
--debug Debug mode with verbose logging
--version Print version and exit
-h, --help Print usage and exit
```

## Downloading gnomAD SNP population frequencies
SNP population allele frequency
information is used for copy number predictions in this tool (see
Expand All @@ -53,7 +71,7 @@ Download links for genome VCF files are located here (last updated April 3,
- **gnomAD v2.1.1 (GRCh37)**: https://gnomad.broadinstitute.org/downloads#2


### Example download
### Script for downloading gnomAD VCFs
```
download_dir="~/data/gnomad/v4.0.0/"

Expand All @@ -78,71 +96,6 @@ X=~/data/gnomad/v4.0.0/gnomad.genomes.v4.0.sites.chrX.vcf.bgz
Y=~/data/gnomad/v4.0.0/gnomad.genomes.v4.0.sites.chrY.vcf.bgz
```

## Calling structural variants
### Example full script generating a merged VCF of structural variants
```
# Activate the environment
conda activate contextsv

# Set the input reference genome
ref_file="~/data/GRCh38.fa"

# Set the input alignment file (e.g. from minimap2)
long_read_bam="~/data/HG002.GRCh38.bam"

# Set the input SNPs file (e.g. from NanoCaller)
snps_file="~/data/variant_calls.snps.vcf.gz"

# Set the SNP population frequencies filepath
pfb_file="~/data/gnomadv4_filepaths.txt"

# Set the output directory
output_dir=~/data/contextSV_output

# Specify the number of threads (system-specific)
thread_count=40

# Run SV calling (~3-4 hours for whole-genome, 40 cores)
python contextsv --threads $thread_count -o $output_dir -lr $long_read_bam --snps $snps_file --reference $ref_file --pfb $pfb_file

# The output VCF filepath is located here:
output_vcf=$output_dir/sv_calls.vcf

# Merge SVs (~3-4 hours for whole-genome, 40 cores)
python contextsv --merge $output_vcf

# The final merged VCF filepath is located here:
merged_vcf=$output_dir/sv_calls.merged.vcf
```

## Input arguments

```
python contextsv --help

ContextSV: A tool for integrative structural variant detection.

options:
-h, --help show this help message and exit
-lr LONG_READ, --long-read LONG_READ
path to the long read alignment BAM file
-g REFERENCE, --reference REFERENCE
path to the reference genome FASTA file
-s SNPS, --snps SNPS path to the SNPs VCF file
--pfb PFB path to the file with SNP population frequency VCF filepaths (see docs for format)
-o OUTPUT, --output OUTPUT
path to the output directory
-r REGION, --region REGION
region to analyze (e.g. chr1, chr1:1000-2000). If not provided, the entire genome will be analyzed
-t THREADS, --threads THREADS
number of threads to use
--hmm HMM path to the PennCNV HMM file
--window-size WINDOW_SIZE
window size for calculating log2 ratios for CNV predictions (default: 10 kb)
-d, --debug debug mode (verbose logging)
-v, --version print the version number and exit
```

## Revision history
For release history, please visit [here](https://github.com/WGLab/ContextSV/releases).

Expand Down
1 change: 0 additions & 1 deletion __main__.py
Original file line number Diff line number Diff line change
Expand Up @@ -214,7 +214,6 @@ def main():
# Set input parameters
input_data = contextsv.InputData()
input_data.setVerbose(args.debug)
input_data.setShortReadBam(args.short_read)
input_data.setLongReadBam(args.long_read)
input_data.setRefGenome(args.reference)
input_data.setSNPFilepath(args.snps)
Expand Down
16 changes: 2 additions & 14 deletions environment.yml
Original file line number Diff line number Diff line change
@@ -1,21 +1,9 @@
name: contextsv
channels:
- defaults
- anaconda
- conda-forge
- bioconda
- conda-forge
dependencies:
- python
- python=3.10
- numpy
- htslib
- swig
- pytest
- plotly

# [A] Generate directly from the file:
# conda env create -f environment.yml -n contextsv
# [B] Generate after creating a new environment:
# conda create -n contextsv
# conda activate contextsv
# conda env update -f environment.yml --prune # Prune removes unused packages

Loading
Loading