
Remove absolute filepaths, use conda for more dependencies #5

Open

EthanHolleman wants to merge 11 commits into kernco:master from EthanHolleman:master

Conversation

@EthanHolleman

Hello again! I originally made a pull request called "Use Bioconda ChromHMM distribution", but I have since made some additional changes that might be useful, so I am wrapping everything up in a new pull request here.

I will try to summarize the changes I have made below. I realized after making the commits that it would have made more sense to refer to "local paths" as absolute paths; the takeaway is that I mean paths that were specific to one machine.

Add chromHMM.yaml

Here I basically just copied the format of the other conda environment yaml files and set chromhmm as the dependency, so the environment could be used in the snakemake rule files.

Use conda ChromHMM in Rules/ChromHMM.smk

It seems that, once installed with conda, ChromHMM can be called as ChromHMM.sh. I first tested this in a personal conda environment on Ubuntu, so I am not sure how this may differ on other operating systems.

First, I followed the format used in other rules and added

conda:
    '../Envs/chromHMM.yaml'

to any rule that called ChromHMM.

Next, in the shell commands, I replaced java -mx10000M -jar /home/ckern/ChromHMM/ChromHMM.jar with ChromHMM.sh wherever it occurred.

Testing ChromHMM changes

I do not have data to fully test the pipeline with these changes, but I did create a small proof-of-concept snakefile, named snake-test, which is below.

rule all:
    input: "results.txt"

rule test_chip:
    output:
        "results.txt"
    conda:
        "Envs/chromHMM.yaml"
    shell:'''
    ChromHMM.sh Version > {output}
    '''

I created a clean conda environment with the command conda create -y --name clean -c conda-forge -c bioconda snakemake-minimal and then executed this tiny snakefile with the command snakemake -s snake-test -j 1 --use-conda. This was just to test that the chromHMM.yaml environment file could be used successfully and that ChromHMM could be called in this way.

The snakefile ran successfully and produced results.txt with the following content.

This is Version 1.14 of ChromHMM (c) Copyright 2008-2012 Massachusetts Institute of Technology

which is what I was expecting.

Based on this, I believe the ChromHMM calls should execute normally. However, I have not tested on Windows or macOS, or in a full-scale run.

Remove absolute paths in Rules/DeployTrackHub.smk

It looks like some absolute filepaths made it into this file. There were a couple of instances of /home/ckern/bin/bedClip being used instead of just bedClip. The conda environment used by these rules already specifies ucsc-bedclip as a dependency, so with just that change those rules should be good to go.

Similarly, in some rules bedToBigBed was being called as /home/ckern/bin/bedToBigBed. It is available through conda, so I added it as a dependency to the bdg2bw.yaml environment file and added conda: '../Envs/bdg2bw.yaml' to the affected rules.

Use bedtools.yaml environment file in a couple of rules

There were a couple of rules that seemed to call bedtools but did not specify a conda environment, so I just added conda: '../Envs/bedtools.yaml' to these.

Add note about call to run_spp.R

This looked like another possible absolute path issue. The rule run_spp executes Rscript /home/ckern/phantompeakqualtools/run_spp.R. I am not sure if this rule is actually run in the pipeline, but if it is, it would likely fail on other machines. I also did not see run_spp.R in the scripts folder, so I was mostly wondering if this should be marked as deprecated.

Small change for clarity to the README

I noticed that in the run.sh script the --cluster-config argument has an absolute path specified, so I just added a note that this would have to be changed to run on your machine.

That is all, thanks and have a good one!

EthanHolleman and others added 11 commits February 17, 2021

…m conda installation (seems to be run using "ChromHMM.sh")
…her ucsc tool in there and then replace rules that were calling bedToBigBed from a local path with the conda env
… current state since the shell command refers to a local script (run_spp.r) which does not seem to be included in the repo. Not sure if the rule is actually run though
Add run_spp.R script and modify spp_stats rule to use local path
@EthanHolleman
Author

Just tested the pipeline out with some real data in dry-run mode. I modified the run.sh file to use the path to the Snakefile in my cloned repo, but that was about the only modification I made. The jobs that would have been run are below.

Job counts:
        count   jobs
        10      BamCoverage_BigWig
        1       Create_TrackDB
        10      align_library
        1       all
        8       bam_compare_bedgraph
        10      bam_coverage
        10      bam_to_tagalign
        10      bwa_mem
        5       calculate_jaccard
        12      call_peaks
        4       combine_peaks
        4       combined_frip
        1       combined_signal_depth_table
        18      correlation_graph
        4       coverage_graph
        8       deeptools_jsd
        10      filter_alignments
        4       fingerprint_assay_graph
        2       fingerprint_sample_graph
        10      fixmate_bam
        4       fold_enrichment
        10      get_alignment_stats
        10      index_bam
        12      individual_frip
        5       jaccard_matrix
        4       log_likelihood
        6       make_alignment_report
        8       make_library_stats
        5       make_peak_summary_report
        4       make_quality_metrics_report
        10      mark_duplicates
        5       multibigwig_summary
        4       multibigwig_vs_input_summary
        10      namesort_bam
        8       overlap_peaks_with_replicate_fe
        17      pca_plot
        10      remove_duplicates
        1       signal_depth_table
        10      sort_bam
        10      spp_stats
        1       tpm_density_plot
        10      trim_reads
        1       tsi_density_plot
        8       validate_peaks
        8       zscore_bedgraph_to_bigwig
        8       zscore_normalize_bedgraph
        331

I think all of these should run on any machine except spp_stats.

My config.yaml file is below.

tissues: [Colon, Gizzard, Jejunum, Cecum, Duodenum, Ileum, Duodenum2, Colon1, Colon2, Provent]
reps: [CA, CB]
narrow_peaks: [H3K4me3, H3K4me1, H3K27ac]
broad_peaks: [H3K27me3, ATAC]
no_input: [ATAC]
inputs: [Input]
H3K4me3_input: Input
H3K27me3_input: Input
H3K4me1_input: Input
H3K27ac_input: Input
CTCF_input: Input
genome: /group/zhougrp/Genomes/galGal6/galGal6.fa
genomesize: 1055588788
annotation: /group/zhougrp/Genomes/galGal6/galGal6_Ensembl95.gtf
mapq: 30
tempdir: Temp
ChromHMM_genome: galGal6
chromsizes: Chromosome_Lengths.txt

Since it does not look like the ChromHMM-related rules are run, I was wondering: are there additional parameters that need to be specified?

Thanks Colin!

@kernco
Owner

kernco commented Feb 18, 2021

Hi Ethan, thanks for making all these changes! Getting rid of the absolute paths is essential to making the pipeline portable, but it wasn't something I ever had time to finish.

There is one issue with switching to use the ChromHMM conda package, which is why I hadn't been using it already. There are files ChromHMM expects to find in its own installation directory with information such as the genomic positions of TSS and exons. ChromHMM is shipped with these files for human, mouse and fruit fly, but any other genomes (such as the farm animal ones) need to be added before ChromHMM can be used on those genomes. This is why I had it set up to run ChromHMM from my own directory, because that installation of ChromHMM has the necessary files for the farm animal genomes.

I sent a message to the ChromHMM devs letting them know that because of this, it was nearly impossible to integrate ChromHMM into a pipeline in a portable way. They added some arguments in v1.16 (-u and -v) to specify alternate directories for these files, but I never got around to integrating it into the pipeline. I also suggested they add a ChromHMM command to convert a GTF annotation file into the format ChromHMM uses. Unfortunately, they added a command to convert a UCSC browser gene table to their format, but not a GTF.

I have some scripts for this, but they're not on GitHub. I can send them to you if you think you could integrate them into the pipeline. Basically, there would need to be rules added to convert the GTF file set as the annotation in the pipeline configuration into the ChromHMM format, and then the ChromHMM rules would need to be changed to use the -u and -v parameters with the location of where those converted files were saved.

@EthanHolleman
Author

Ah, that makes total sense for ChromHMM, and it looks like the latest version on conda is 1.14, so that would not help with the -u and -v arguments. I'd definitely be willing to take a look at your scripts and give at least a partial ChromHMM integration a go.

Also, pull-request-wise, should I close this one and send another without the ChromHMM changes? I am not sure if GitHub lets you select which commits in a request to merge, if the other changes look alright to you. Let me know what you think.

Thanks again Colin!

@EthanHolleman
Author

EthanHolleman commented Mar 3, 2021

A recent discovery relating to conda ChromHMM.

If you install the package using conda and navigate to the location of the conda environment, you will find two scripts relating to ChromHMM.

  1. ChromHMM.sh: the actual ChromHMM executable.
  2. download_chromhmm_data.sh: seems to download the additional files the program needs to run, which you mentioned in your last comment.

Contents of download_chromhmm_data.sh:

#!/bin/bash
set -e
set -o pipefail
set -x

if [[ -z "$1" ]]; then
    echo
    echo "Downloads ChromHMM example data to <outdir>"
    echo "Usage: $0 <outdir>"
    echo
    exit 1
fi
OUTDIR=$1
mkdir -p $OUTDIR
TMP=$(mktemp -d)
cd $TMP
curl -O -L http://compbio.mit.edu/ChromHMM/ChromHMM.zip > ChromHMM.zip
unzip ChromHMM

mv $TMP/ChromHMM/ANCHORFILES $OUTDIR
mv $TMP/ChromHMM/CHROMSIZES $OUTDIR
mv $TMP/ChromHMM/COORDS $OUTDIR
mv $TMP/ChromHMM/SAMPLEDATA_HG18 $OUTDIR

So it seems the people who packaged ChromHMM for conda ran into a similar issue and therefore included a script to do the downloading, presumably at some point when ChromHMM is run, but I am not 100% sure of this.

If that is the case, getting the conda installation working would involve figuring out where ChromHMM expects these downloaded files to be and then putting our own files there.

Additional Thought

If the above is the case, and the location where conda ChromHMM expects these files can be determined, a super hacky fix could be to replace download_chromhmm_data.sh, while the pipeline is running but before ChromHMM is ever called, with a script that puts data for the target genome there instead.
