DxFit is a Python-based tool for matching individual-level disorder (Mondo) and ICD data to Mondo disease ontology terms using graph-based proximity and NLP-based semantic similarity scoring. It integrates ontology mapping, string preprocessing, and similarity metrics to prioritize disease matches.
1. Clone the repository
git clone https://github.com/GeisingerResearchPublic/DxFit.git
cd DxFit
2. Create and activate the Conda environment
A dxfit_env.yml file will be provided. Use it to create the environment:
conda env create -f dxfit_env.yml
conda activate dxfit_env
3. Install SpaCy and download the language model
DxFit uses SpaCy for NLP preprocessing. After activating the environment, download en_core_sci_md (or an NLP model of your choice) and install it with pip:
pip install /path/to/en_core_sci_md.<version>.tar.gz
If you plan to use a different SpaCy model, update the config or CLI argument accordingly.
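Before running DxFit, it can help to confirm that the chosen model is visible to the active environment. The snippet below is a minimal sketch, not part of DxFit: the helper name `model_is_installed` is hypothetical, and it relies only on the fact that pip-installed SpaCy models are regular Python packages.

```python
import importlib.util


def model_is_installed(model_name: str) -> bool:
    """Return True if a pip-installed package with this name can be found."""
    # SpaCy models installed via pip are ordinary Python packages, so
    # find_spec can locate them without loading the model itself.
    return importlib.util.find_spec(model_name) is not None


if __name__ == "__main__":
    for name in ("en_core_sci_md", "en_core_web_sm"):
        status = "installed" if model_is_installed(name) else "missing"
        print(f"{name}: {status}")
```

A "missing" result usually means the model was installed into a different environment than the one currently active.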
4. Create an environment variable for the source directory
export PROJECT_ROOT=/path/to/cloned/dxfit
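How DxFit itself consumes PROJECT_ROOT is not described here. As a hedged sketch, a wrapper script could verify that the variable is set and points at a real directory before doing anything else; the function name `get_project_root` is hypothetical.

```python
import os
from pathlib import Path


def get_project_root(env_var: str = "PROJECT_ROOT") -> Path:
    """Return the directory named by env_var, or raise a clear error."""
    value = os.environ.get(env_var)
    if not value:
        raise RuntimeError(f"{env_var} is not set; export it before running.")
    root = Path(value).expanduser()
    if not root.is_dir():
        raise RuntimeError(f"{env_var} points to a missing directory: {root}")
    return root
```

Failing early with a descriptive message is easier to debug than a missing-file error further down the pipeline.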
5. Install the local DxFit package
pip install ./dxfit/
Example files are provided in the dxfit/tests/data directory.
From the project root directory dxfit, run:
python -m unittest dxfit.tests.test_dxfit
Warning: The first time DxFit is run, an error will occur if the directory ~/.DxFit/ with config.ini does not exist. In that case, the directory is created, the default input files are copied there, and a pruned Mondo file is created.
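To see whether the first-run setup has already happened, the directory and config file can be inspected from Python. This is a minimal sketch under the assumption that the layout is exactly ~/.DxFit/config.ini as stated above; the function name `describe_dxfit_home` is hypothetical, and no section or key names inside config.ini are assumed.

```python
import configparser
from pathlib import Path


def describe_dxfit_home(home=None) -> dict:
    """Report whether the DxFit directory and config.ini exist."""
    base = Path(home) if home is not None else Path.home() / ".DxFit"
    config_path = base / "config.ini"
    report = {
        "dir_exists": base.is_dir(),
        "config_exists": config_path.is_file(),
        "sections": [],
    }
    if report["config_exists"]:
        parser = configparser.ConfigParser()
        parser.read(config_path)
        # Section names are listed as found; none are assumed here.
        report["sections"] = parser.sections()
    return report


if __name__ == "__main__":
    print(describe_dxfit_home())
```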
Once the default files have been copied to ~/.DxFit/data and a pruned Mondo file exists, the following can be run:
End-to-end test with output
python -m dxfit \
--disorder_file ./dxfit/tests/data/individual_genetic_disorder.tsv \
--icd_file ./dxfit/tests/data/individual_icd.tsv \
--output_file ./dxfit/tests/data/test_output.tsv
Run the tool from the command line:
python -m dxfit \
--pruned_mondo_file path/to/pruned_mondo.pkl \
--disorder_file path/to/disorder_file.tsv \
--icd_file path/to/icd_file.tsv \
--notes_file path/to/notes_file.tsv \
--output_file path/to/DxFit_output.txt \
--plot_output_file path/to/DxFit_plot.png
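When DxFit is driven from a script, the same invocation can be assembled as an argument list. The sketch below is illustrative only: `build_dxfit_command` is a hypothetical helper, it covers only options that take a value, and the boolean flags are left out because their exact command-line syntax is not documented here.

```python
import shlex
import sys


def build_dxfit_command(options: dict) -> list:
    """Turn {"disorder_file": "a.tsv", ...} into an argv list for python -m dxfit."""
    argv = [sys.executable, "-m", "dxfit"]
    for name, value in options.items():
        if value is None:
            continue  # skip options that were not supplied
        argv += [f"--{name}", str(value)]
    return argv


cmd = build_dxfit_command({
    "disorder_file": "path/to/disorder_file.tsv",
    "icd_file": "path/to/icd_file.tsv",
    "notes_file": None,  # omitted from the command
    "output_file": "path/to/DxFit_output.txt",
})
print(shlex.join(cmd))
# To execute the command: subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems with paths that contain spaces.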
| Argument | Description |
|---|---|
| --pruned_mondo_file | Pickled NetworkX graph of the pruned Mondo ontology. |
| --disorder_file | TSV file with individual disorder data. Assumes columns individual_id, disease_title (set of disorder names), and gene_symbol are present. If a disease_curie column (set of Mondo IDs) is provided along with the --icd_mondo_map_file option, ICD-to-Mondo code mapping will be performed. |
| --icd_file | TSV file with individual ICD data. Assumes columns individual_id, DX_CD, and DX_NM are present. |
| --notes_file | TSV file with individual notes data. Assumes columns individual_id and note_text are present. |
| --icd_mondo_map_file | Optional. TSV file mapping ICD codes to Mondo IDs. If this file is not provided, Mondo matching will not be performed. |
| --custom_stopwords_file | Text file with additional stopwords. |
| --discard_terms_file | Text file with terms to exclude from NLP processing. |
| --common_terms_file | Text file with terms to exclude from linguistic matching. |
| --output_file | Output file name. |
| --plot_output_file | Plot output file name. |
| --nlp_model | SpaCy model name (e.g., en_core_web_sm). |
| --semantic_score_threshold | Minimum score for a significant semantic similarity match. Defaults to 0.8. |
| --jaccard_score_threshold | Minimum score for a significant Jaccard similarity match. Defaults to 0.8. |
| --create_control_cohort | Boolean flag for whether to scramble individual IDs to create a random control cohort. Defaults to False. Alternatively, the user can provide a "cohort" column in either the disorder or ICD file. |
| --update_icd_to_mondo_map | Flag for whether the file given by --icd_mondo_map_file should be updated. Defaults to False. |
| --use_mondo_synonyms | Flag for whether to use Mondo synonyms during linguistic matching. Defaults to False. |
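Because each input file is expected to carry specific columns, a quick header check can catch problems before a run. This is a sketch rather than DxFit's own validation: `missing_columns` and `REQUIRED_COLUMNS` are hypothetical names, the column sets are taken from the argument descriptions above, and the sample row is made up for illustration.

```python
import csv
import io

# Required columns per input file, as listed in the argument descriptions.
REQUIRED_COLUMNS = {
    "disorder_file": {"individual_id", "disease_title", "gene_symbol"},
    "icd_file": {"individual_id", "DX_CD", "DX_NM"},
    "notes_file": {"individual_id", "note_text"},
}


def missing_columns(handle, kind: str) -> set:
    """Return required columns absent from the header row of a TSV stream."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader, [])  # empty file -> every column is missing
    return REQUIRED_COLUMNS[kind] - set(header)


# Illustrative ICD file that lacks the DX_NM column.
sample = io.StringIO("individual_id\tDX_CD\nP001\tE11.9\n")
print(missing_columns(sample, "icd_file"))  # {'DX_NM'}
```

For files on disk, pass an open handle, for example `open(path, newline="")`.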
- Annotated dataframe with DxFit prioritization
- Similarity scores (SpaCy embeddings and Jaccard)
- Optional visualizations (if enabled with --plot_output_file)
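To make the Jaccard score and its threshold concrete, the sketch below computes Jaccard similarity over word sets and compares it to the default threshold of 0.8. It is an illustration only: DxFit's actual tokenization and preprocessing are not described here, so lowercase whitespace splitting is an assumption, and the two strings are made-up examples.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the lowercase word sets of two strings."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 0.0  # define the similarity of two empty strings as 0
    return len(set_a & set_b) / len(set_a | set_b)


score = jaccard("Marfan syndrome", "marfan syndrome type 1")
print(score)         # 0.5 (2 shared words out of 4 distinct words)
print(score >= 0.8)  # False: below the default threshold
```

Under this definition, only strings that share most of their words clear a threshold of 0.8, which is why stopword and common-term filtering can change the outcome noticeably.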
- Python 3.9+
- pandas, numpy, tqdm, networkx
- SpaCy
- plotnine (for visualization)
- obonet (for ontology parsing)