DxFit: Disease Matching and Similarity Scoring Tool

DxFit is a Python-based tool for matching individual disorder (Mondo) and ICD data to Mondo disease ontology terms using graph-based proximity and NLP-based semantic similarity scoring. It integrates ontology mapping, string preprocessing, and similarity metrics to prioritize disease matches.
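
To make "graph-based proximity" concrete, here is a minimal sketch that measures how many edges separate two terms in a small ontology graph. It is illustrative only: the term IDs are hypothetical placeholders rather than real Mondo identifiers, the function name `shortest_path_length` is an assumption, and DxFit's own proximity measure on the pruned Mondo graph may be defined differently.

```python
from collections import deque

# Hypothetical toy ontology stored as an undirected adjacency map.
# These IDs are placeholders, not real Mondo identifiers.
TOY_ONTOLOGY = {
    "TERM:root": {"TERM:cardio", "TERM:neuro"},
    "TERM:cardio": {"TERM:root", "TERM:cardiomyopathy"},
    "TERM:neuro": {"TERM:root", "TERM:epilepsy"},
    "TERM:cardiomyopathy": {"TERM:cardio"},
    "TERM:epilepsy": {"TERM:neuro"},
}


def shortest_path_length(graph, source, target):
    """Return the number of edges on the shortest path, or None if unreachable."""
    if source not in graph or target not in graph:
        return None
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    # Breadth-first search visits nodes in order of increasing distance.
    while queue:
        node, dist = queue.popleft()
        for neighbor in graph[node]:
            if neighbor == target:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


if __name__ == "__main__":
    print(shortest_path_length(TOY_ONTOLOGY, "TERM:cardiomyopathy", "TERM:epilepsy"))
```

A smaller path length means two terms sit closer together in the ontology, which is the intuition behind using graph distance to rank candidate matches.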

Installation

1. Clone the repository

git clone https://github.com/GeisingerResearchPublic/DxFit.git
cd DxFit

2. Create and activate the Conda environment

A dxfit_env.yml file is provided. Use it to create the environment:

conda env create -f dxfit_env.yml
conda activate dxfit_env

3. Install SpaCy and download the language model

DxFit uses SpaCy for NLP preprocessing. After activating the environment, download the en_core_sci_md model archive (or another SpaCy model of your choice) and install it:

pip install /path/to/en_core_sci_md.<version>.tar.gz

If you plan to use a different SpaCy model, update the config or pass the model name with the `--nlp_model` argument.

4. Create an environment variable for the source directory

export PROJECT_ROOT=/path/to/cloned/dxfit

5. Install local DxFit package

pip install ./dxfit/

Test Install

Example files are provided in the dxfit/tests/data directory. From the project root directory dxfit, run:

python -m unittest dxfit.tests.test_dxfit

Warning: The first time DxFit is run, an error will occur if the directory ~/.DxFit/ with config.ini does not exist. DxFit then creates the directory, copies the default input files there, and creates a pruned Mondo file.
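
To see whether the first-run setup has already happened, a short check like the one below can help. It is a sketch, not part of DxFit: the function name `dxfit_config_status` is hypothetical, and it only looks for the paths named in this README (~/.DxFit/, config.ini, and the data subdirectory).

```python
from pathlib import Path


def dxfit_config_status(home=None):
    """Report which first-run paths exist under the given home directory."""
    # Default to the current user's home directory when none is given.
    home = Path(home) if home is not None else Path.home()
    config_dir = home / ".DxFit"
    return {
        "config_dir_exists": config_dir.is_dir(),
        "config_ini_exists": (config_dir / "config.ini").is_file(),
        "data_dir_exists": (config_dir / "data").is_dir(),
    }


if __name__ == "__main__":
    for name, present in dxfit_config_status().items():
        print(f"{name}: {present}")
```

If every value is `False`, expect the first DxFit run to report the error described above while it sets the directory up.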

Once the default files have been copied to ~/.DxFit/data and a pruned Mondo file exists, the following end-to-end test with output can be run:

python -m dxfit \
--disorder_file ./dxfit/tests/data/individual_genetic_disorder.tsv \
--icd_file ./dxfit/tests/data/individual_icd.tsv \
--output_file ./dxfit/tests/data/test_output.tsv

Usage

Run the tool from the command line:

python -m dxfit \
--pruned_mondo_file path/to/pruned_mondo.pkl \
--disorder_file path/to/disorder_file.tsv \
--icd_file path/to/icd_file.tsv \
--notes_file path/to/notes_file.tsv \
--output_file path/to/DxFit_output.txt \
--plot_output_file path/to/DxFit_output.txt
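
Before running the command, it can be useful to confirm that each input TSV has the columns listed in the Arguments section. The following is a minimal sketch using only the standard library. The helper name `missing_columns` and the `REQUIRED_COLUMNS` mapping are assumptions made for illustration, and DxFit may perform its own validation.

```python
import csv
import io

# Required columns per input file, as listed in the Arguments section.
REQUIRED_COLUMNS = {
    "disorder_file": {"individual_id", "disease_title", "gene_symbol"},
    "icd_file": {"individual_id", "DX_CD", "DX_NM"},
    "notes_file": {"individual_id", "note_text"},
}


def missing_columns(handle, kind):
    """Return a sorted list of required columns absent from a TSV header row."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader, [])  # an empty file yields an empty header
    return sorted(REQUIRED_COLUMNS[kind] - set(header))


if __name__ == "__main__":
    # In-memory example; for a real file use open(path, newline="").
    sample = io.StringIO("individual_id\tDX_CD\n")
    print(missing_columns(sample, "icd_file"))
```

An empty list means the header contains every required column for that file type.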

Arguments

Argument	Description
`--pruned_mondo_file`	Pickled NetworkX graph of pruned Mondo ontology.
`--disorder_file`	TSV file with individual disorder data. Assumes columns `individual_id`, `disease_title` (set of disorder names), and `gene_symbol` are present. If a `disease_curie` column (set of Mondo IDs) is provided along with the `--icd_mondo_map_file` option, ICD-to-Mondo code mapping will be performed.
`--icd_file`	TSV file with individual ICD data. Assumes columns `individual_id`, `DX_CD`, and `DX_NM` are present.
`--notes_file`	TSV file with individual notes data. Assumes columns `individual_id` and `note_text` are present.
`--icd_mondo_map_file`	Optional. TSV file mapping ICD codes to Mondo IDs. If this file is not provided, Mondo matching will not be performed.
`--custom_stopwords_file`	Text file with additional stopwords.
`--discard_terms_file`	Text file with terms to exclude from NLP processing.
`--common_terms_file`	Text file with terms to exclude from linguistic matching.
`--output_file`	Output file name.
`--plot_output_file`	Plot output file name.
`--nlp_model`	SpaCy model name (e.g., `en_core_web_sm`).
`--semantic_score_threshold`	Minimum score for a significant semantic similarity match. Defaults to 0.8.
`--jaccard_score_threshold`	Minimum score for a significant Jaccard similarity match. Defaults to 0.8.
`--create_control_cohort`	Boolean flag for whether to scramble individual IDs to create a random control cohort. Defaults to `False`. Alternatively, the user can provide a `cohort` column in either the disorder file or the ICD file.
`--update_icd_to_mondo_map`	Flag for whether the file given by `--icd_mondo_map_file` should be updated. Defaults to `False`.
`--use_mondo_synonyms`	Flag for whether to use Mondo synonyms during linguistic matching. Defaults to `False`.
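
The idea behind `--create_control_cohort` (scrambling individual IDs) can be illustrated with a small sketch. This is not DxFit's implementation: the function name `scramble_ids`, the `seed` parameter, and the `"control"` label written to the `cohort` column are all assumptions made for the example.

```python
import random


def scramble_ids(rows, seed=0):
    """Return copies of rows with individual_id values randomly permuted."""
    rng = random.Random(seed)  # seeded so the permutation is reproducible
    ids = [row["individual_id"] for row in rows]
    rng.shuffle(ids)
    # Build new dicts so the input rows are left unchanged.
    return [
        {**row, "individual_id": new_id, "cohort": "control"}
        for row, new_id in zip(rows, ids)
    ]


if __name__ == "__main__":
    icd_rows = [
        {"individual_id": "A", "DX_CD": "code1"},
        {"individual_id": "B", "DX_CD": "code2"},
        {"individual_id": "C", "DX_CD": "code3"},
    ]
    for row in scramble_ids(icd_rows):
        print(row)
```

Permuting IDs in one file breaks the link between a person's disorder record and their ICD record, so matches found in the scrambled data estimate what arises by chance.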

Outputs

Annotated dataframe with DxFit prioritization
Similarity scores (SpaCy embeddings and Jaccard)
Optional visualizations (if enabled with `--plot_output_file`)
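
For readers unfamiliar with the Jaccard score mentioned above, here is a minimal sketch of Jaccard similarity between two disease names, compared against the default threshold of 0.8. The lowercase whitespace tokenization is an assumption made for this example; DxFit applies its own string preprocessing first, so its scores may differ.

```python
def jaccard(a, b):
    """Jaccard similarity of the lowercase word sets of two strings."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0  # convention chosen here for two empty strings
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


if __name__ == "__main__":
    threshold = 0.8  # default of --jaccard_score_threshold
    score = jaccard("dilated cardiomyopathy", "familial dilated cardiomyopathy")
    # Two shared words out of three distinct words gives 2/3.
    print(round(score, 3), score >= threshold)
```

Here the two names share two of three distinct words, so the score is about 0.667 and falls below the default threshold.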

Dependencies

Python 3.9+
pandas, numpy, tqdm, networkx
SpaCy
plotnine (for visualization)
obonet (for ontology parsing)
