Aligning transcripts to their genomic source locations is a surprisingly difficult problem. This project seeks to answer the following questions:
- Do aligners fail some simple sanity checks?
- Which aligner is the most accurate?
- Which aligner is the most efficient?
- What kinds of sequence and gene features create the most problems?
- What improvements can be made in spliced alignment?
- Install conda (e.g. Miniforge3)
- Clone this repo
- Create conda environment
- Run the demos in the `bakeoff` usage statement
See `TUTORIAL.md` for a step-by-step walkthrough.
- `README.md`: this document
- `TUTORIAL.md`: a quick walk-through to check that things work
- `INFO.md`: some behind-the-scenes information
- `NOTES.md`: random stuff the devs are thinking about or working on
- `bakeoff`: top-level program for assessing aligners
- `env/`: directory of conda environments for different platforms
- `data/`: directory with some sample files (1% of favorite genomes)
- `src/`: directory with programs that run various parts of the analyses
- `2025/`: directory with specifics for the 2025 study
Testing

    conda activate sabr-linux-x86
    python3 src/read-simulator.py data/ce01.fa.gz data/ce01.ftx.gz --seed 1 --samplegene 0.05 --samplereads 0.01 > rtest.fa
    python3 src/est-genome.py data/ce01.fa.gz rtest.fa --verbose

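Before running the aligner step, it can help to confirm that the simulator actually wrote some reads. The sketch below is not part of this repo. It assumes `rtest.fa` is ordinary FASTA (one `>` header line per read), which the file extension suggests but the commands above do not confirm. The helper name `count_fasta_records` is hypothetical.

```python
import gzip


def count_fasta_records(path):
    """Count records in a FASTA file by counting header lines.

    Assumption: the file is plain FASTA, where each record starts
    with a line beginning with '>'. Files ending in '.gz' are read
    through gzip; everything else is read as plain text.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

Calling `count_fasta_records("rtest.fa")` after the simulator step should return a positive number. A result of zero would suggest the sampling fractions were too small for the sample genome or that the redirect failed.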
Running
- Remove the `--verbose` flag
- Add `--threads n` for however many CPUs you are using
- Remove the `temp` directory (it doesn't auto-delete for debugging reasons)
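The steps above could be wrapped in a small helper script. This is only a sketch under some assumptions: that `--threads` is passed to `src/est-genome.py` the same way `--verbose` was in the test command, and that the `temp` directory is created in the current working directory. The function names `build_command` and `remove_temp` are hypothetical and not part of this repo.

```python
import shutil
from pathlib import Path


def build_command(genome, reads, threads):
    """Assemble the est-genome.py call for a real run.

    Mirrors the test command, but drops --verbose and adds --threads.
    Assumption: est-genome.py accepts '--threads <n>'.
    """
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return [
        "python3", "src/est-genome.py",
        str(genome), str(reads),
        "--threads", str(threads),
    ]


def remove_temp(path="temp"):
    """Delete the leftover temp directory if it exists.

    Assumption: the directory is named 'temp' and sits in the
    current working directory. Returns True if it was removed.
    """
    target = Path(path)
    if target.is_dir():
        shutil.rmtree(target)
        return True
    return False
```

The list returned by `build_command("data/ce01.fa.gz", "rtest.fa", 4)` can be handed to `subprocess.run(cmd, check=True)`, followed by `remove_temp()` once the output has been inspected. Keeping cleanup as a separate call preserves the option of looking inside `temp` when something goes wrong.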