Wrote the following projects:
- GenomicsDB Performance Profiling
- Docker for Building GenomicsDB utilities
- GenomicsDB Wrapping Dockers
GenomicsDB Profiler, or Genomics-Benchmark as late name,d is a self-start project for profiling performance of GenomicsDB utillities, see GenomicsDB wiki at https://github.com/Intel-HLS/GenomicsDB for details about GenomicsDB utilities.
The profiling targets are GenomicsDB utilities ‘vcf2tiledb’ (data loader) and ‘gt_mpi_gather’ (data retriever). Both utilities require a json config file as input, see detals ay https://github.com/Intel-HLS/GenomicsDB. Initially the work was manually creating a config file then profile a GenomicsDB utility against the file. I created GenomicsDB Profiler that automatically generates different config files, then profile GenimicsDB utilities with the config files. I have profiled 2000+ in a batch. Technically speaking, the number of runs in a batch is unlimited. GenomicsDB Profiler also has visualization for profiling results, 2-D and 3-D results comparisons and query geno position distribution.
A single run of vcf2tiledb or gt_mpi_gather is a very simple task. However managing thousands runs with similar configurations on many nodes become very challenging. To avoid conflicts, GenomicsDB Profiler has a complex name scheme for JSON configuration files, DB names and result file names. Because running the loader often takes hours, the code enable quick test, dry run, option of reuse existing DB, automatically process results and so on.
In addition to profiling GenomicsDB utilities, the profiler collects the system information and optionally collect process stats or/and IO stats data.
This is a self-started project. Building GenomicsDB could be quite time consuming, see GenomicsDB wiki. The project offers scripts for building docker image and run the docker. The created docker image is ready for building GenomicsDB utilities. Running the run script will launch a docker contain that pulls fresh source code from GenomicsDB github repo and builds the GenomicsDB utilities. By default, the container pulls from master branch. However it will pull code from the brach if a branch name is passed in.
The source code of making the docker could be found at GenomicsDB github branch "GenomicsDB_dockers"
The docker image "genomicsdb_builder" is published at https://hub.docker.com/r/intelhlsgenomicsdb/genomicsdb_builder/.
Two GenomicsDB utility dockers are available: importer and querier, see source code at GenomicsDB github branch "GenomicsDB_dockers"
To build importer docker, run:
$ bash docker_src/VCFImporter_builder/build_docker.bash
To build querier docker, run:
$ bash docker_src/GDBQuerier_builder/build_docker.bash
Assum we have the following directory and files layout:
ROOT
|_ mapping_data
|_ customer_vid.json # vid mapper file
[ |_ callsets.json ] # callsets file. if not preset, code will generates one
|_ Homo_sapiens_assembly19.fasta # reference file
|_ vcfs
|_ xxx.gz # block compressed VCF file
|_ xxx.gz.tbi # index file
|_ ....
|_ genomicsdb_ws # where genomicsDB will be
See https://github.com/Intel-HLS/GenomicsDB/wiki/Importing-VCF-data-into-GenomicsDB for more details on how to make xxx.gz and xxx.gz.tbi from xxx.vcf file.
For syntax,
$ docker run vcf_importer:0.6
or
$ docker run genomicsdb_querier:0.9.2-93da4b0-0.5
$ docker run -v $ROOT:/myenv vcf_importer:0.6 vcf_importer.py -R /myenv/Homo_sapiens_assembly19.fasta -V /myenv/mapping_data/customer_vid.json -C /myenv/mapping_data/ -o /myenv/genomicsdb_ws/ -i /myenv/vcfs --range 1:1-249250621
example of querying positions with GenomicsDB
$ docker run -v $ROOT:/myenv genomicsdb_querier:0.9.2-93da4b0-0.5 genomicsdb_querier.py -R /myenv/Homo_sapiens_assembly19.fasta -V /myenv/mapping_data/customer_vid.json -C /myenv/mapping_data/callsets.json -o /myenv/genomicsdb_ws/ --print-AC --positions /examples/query_1_10_f1.json
/examples/query_1_10_f1.json is an example query position file. You can find more by
$ docker run genomicsdb_querier:0.9.2-93da4b0-0.5 ls /examples
The docker images, "vcf_importer", "vcf_combiner" and "genomicsdb_querier" are published at https://hub.docker.com/u/intelhlsgenomicsdb/
