Description
DSRC may corrupt data if the input contains non-ACGTN nucleotide codes.
Example files (5 kB each):
a) http://kirill.med.u-tokai.ac.jp/data/temp/dsrc-repro-1-a.fastq
b) http://kirill.med.u-tokai.ac.jp/data/temp/dsrc-repro-1-b.fastq
"a" can be compressed and decompressed without problems. "b" is identical except for one "H" in the middle of the second read. Compressing and then decompressing "b" produces a file that differs from the original.
DSRC's README.md mentions:
Full support for Illumina, ABI SOLiD, and 454/Ion Torrent dataset formats with non-standard (AGCTN) IUPAC base values.
Note that "H" is a standard IUPAC nucleotide code, meaning "A or C or T".
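To check whether a given input contains such codes before compressing it, something like the following could be used. This is only a rough sketch, not part of DSRC: the function name `find_nonstandard_bases` is made up for illustration, and it assumes strict 4-line FASTQ records (header, sequence, "+", quality) with no wrapped sequence lines.

```python
from typing import Iterable, Iterator, Tuple

# Bases treated as "standard" here; everything else is reported.
STANDARD_BASES = set("ACGTN")


def find_nonstandard_bases(lines: Iterable[str]) -> Iterator[Tuple[int, int, str]]:
    """Yield (read_index, column, base) for every base outside ACGTN.

    Hypothetical helper. Assumes 4-line FASTQ records, so the sequence
    is always the second line of each record. Indices are 1-based.
    """
    for line_no, line in enumerate(lines):
        if line_no % 4 != 1:
            continue  # skip header, separator and quality lines
        read_index = line_no // 4 + 1
        for column, base in enumerate(line.strip(), start=1):
            if base.upper() not in STANDARD_BASES:
                yield read_index, column, base


# Small in-memory example: the second read contains one "H".
sample = ["@r1", "ACGT", "+", "IIII", "@r2", "ACHT", "+", "IIII"]
print(list(find_nonstandard_bases(sample)))  # [(2, 3, 'H')]
```

For a real file, an open file handle can be passed in place of the list, for example `find_nonstandard_bases(open("dsrc-repro-1-b.fastq"))`.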
Full repro steps:
cd /tmp
mkdir dsrc-repro-1
cd dsrc-repro-1
git clone https://github.com/refresh-bio/DSRC
cd DSRC
make -f Makefile.c++11 bin
cd ..
wget http://kirill.med.u-tokai.ac.jp/data/temp/dsrc-repro-1-a.fastq
wget http://kirill.med.u-tokai.ac.jp/data/temp/dsrc-repro-1-b.fastq
./DSRC/bin/dsrc c -t1 dsrc-repro-1-a.fastq dsrc-repro-1-a.dsrc
./DSRC/bin/dsrc c -t1 dsrc-repro-1-b.fastq dsrc-repro-1-b.dsrc
./DSRC/bin/dsrc d -t1 dsrc-repro-1-a.dsrc dsrc-repro-1-a-decompressed.fastq
./DSRC/bin/dsrc d -t1 dsrc-repro-1-b.dsrc dsrc-repro-1-b-decompressed.fastq
echo Verifying decompressed files
diff -q dsrc-repro-1-a.fastq dsrc-repro-1-a-decompressed.fastq
diff -q dsrc-repro-1-b.fastq dsrc-repro-1-b-decompressed.fastq
For me this produces:
Verifying decompressed files
Files dsrc-repro-1-b.fastq and dsrc-repro-1-b-decompressed.fastq differ
The corruption affects not only the read containing "H" but also all reads after it.
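To see which reads are affected, the original and decompressed files can be compared record by record rather than with `diff -q`. The sketch below is one possible way to do that. The helper names `read_records` and `differing_reads` are hypothetical, and it again assumes 4-line FASTQ records.

```python
from itertools import zip_longest
from typing import Iterable, List, Tuple


def read_records(lines: Iterable[str]) -> List[Tuple[str, ...]]:
    """Group lines into FASTQ records, assuming 4 lines per record."""
    stripped = [line.rstrip("\n") for line in lines]
    return [tuple(stripped[i:i + 4]) for i in range(0, len(stripped), 4)]


def differing_reads(original: Iterable[str], decompressed: Iterable[str]) -> List[int]:
    """Return 1-based indices of records that differ between the two inputs.

    A record present in only one input also counts as differing.
    """
    a = read_records(original)
    b = read_records(decompressed)
    return [
        index
        for index, (rec_a, rec_b) in enumerate(zip_longest(a, b), start=1)
        if rec_a != rec_b
    ]
```

With the repro files, this could be called as `differing_reads(open("dsrc-repro-1-b.fastq"), open("dsrc-repro-1-b-decompressed.fastq"))`. If the corruption behaves as described, the result would start at the read containing "H" and run to the last read.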
Test machine: Ubuntu 18.04.1 LTS, GCC: 7.4.0.
Let me know if you need any additional information or tests.