GeisingerResearchPublic/DxFit

DxFit: Disease Matching and Similarity Scoring Tool

DxFit is a Python-based tool for matching individual disorder (Mondo) and ICD data to Mondo disease ontology terms using graph-based proximity and NLP-based semantic similarity scoring. It integrates ontology mapping, string preprocessing, and similarity metrics to prioritize disease matches.
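As a rough illustration of what graph-based proximity can mean, the sketch below counts the edges on the shortest path between two terms in a tiny hierarchy. It is not DxFit's scoring code: the function name hop_distance, the plain-dict graph (DxFit itself uses a NetworkX graph of Mondo), and the TERM:* identifiers are all made up for this example.

```python
from collections import deque


def hop_distance(graph, source, target):
    """Return the number of edges on the shortest path, or None if unreachable.

    `graph` is a plain adjacency dict (node -> list of neighbours); this is a
    stand-in for an ontology graph, not DxFit's own data structure.
    """
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor == target:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


# Toy undirected hierarchy with hypothetical term IDs (not real Mondo IDs).
toy_graph = {
    "TERM:root": ["TERM:a", "TERM:b"],
    "TERM:a": ["TERM:root", "TERM:a1"],
    "TERM:b": ["TERM:root"],
    "TERM:a1": ["TERM:a"],
}

print(hop_distance(toy_graph, "TERM:a1", "TERM:b"))  # 3
```

A smaller hop count means two terms sit closer together in the hierarchy. How DxFit turns such distances into a prioritization score is not specified here.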

Installation

1. Clone the repository

git clone https://github.com/GeisingerResearchPublic/DxFit.git
cd DxFit

2. Create and activate the Conda environment

A Conda environment file is provided with the repository. Use it to create the environment (the command below assumes the file is named dxfit_env.yml):

conda env create -f dxfit_env.yml
conda activate dxfit_env

3. Install SpaCy and download the language model

DxFit uses SpaCy for NLP preprocessing. After activating the environment, download en_core_sci_md (or an NLP model of your choice) and run:

pip install /path/to/en_core_sci_md.<version>.tar.gz

If you plan to use a different SpaCy model, update the config or pass the model name with the --nlp_model argument.

4. Create an environment variable for the source directory

export PROJECT_ROOT=/path/to/cloned/dxfit

5. Install local DxFit package

pip install ./dxfit/

Test Install

Example files are provided in the dxfit/tests/data directory. From the project root directory dxfit, run:

python -m unittest dxfit.tests.test_dxfit

Warning: The first time DxFit is run, an error will occur if the directory ~/.DxFit/ with config.ini does not exist. DxFit will then create the directory, copy the default input files there, and create a pruned Mondo file.
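To see in advance whether a first run is likely to hit this, the sketch below checks for the paths named in this README (~/.DxFit/, config.ini, and the data subdirectory). The helper name dxfit_config_status and its home parameter are hypothetical and not part of DxFit. The demo runs against a temporary directory so it does not touch a real home directory.

```python
import tempfile
from pathlib import Path


def dxfit_config_status(home=None):
    """Report which of the expected DxFit paths exist under `home`.

    Hypothetical helper: the paths come from this README, not from DxFit's code.
    """
    base = Path(home) if home is not None else Path.home()
    config_dir = base / ".DxFit"
    return {
        "config_dir_exists": config_dir.is_dir(),
        "config_file_exists": (config_dir / "config.ini").is_file(),
        "data_dir_exists": (config_dir / "data").is_dir(),
    }


# Demonstrate on a throwaway directory standing in for the home directory.
with tempfile.TemporaryDirectory() as fake_home:
    before = dxfit_config_status(fake_home)
    (Path(fake_home) / ".DxFit" / "data").mkdir(parents=True)
    (Path(fake_home) / ".DxFit" / "config.ini").write_text("")
    after = dxfit_config_status(fake_home)

print(before)  # all False: nothing has been created yet
print(after)   # all True: directory, config.ini, and data/ are present
```

Calling dxfit_config_status() with no argument checks the current user's home directory.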

End-to-end test with output

Once the default files have been copied to ~/.DxFit/data and a pruned Mondo file exists, run:

python -m dxfit \
--disorder_file ./dxfit/tests/data/individual_genetic_disorder.tsv \
--icd_file ./dxfit/tests/data/individual_icd.tsv \
--output_file ./dxfit/tests/data/test_output.tsv

Usage

Run the tool from the command line:

python -m dxfit \
--pruned_mondo_file path/to/pruned_mondo.pkl \
--disorder_file path/to/disorder_file.tsv \
--icd_file path/to/icd_file.tsv \
--notes_file path/to/notes_file.tsv \
--output_file path/to/DxFit_output.txt \
--plot_output_file path/to/DxFit_plot.png

Arguments

Argument Description
--pruned_mondo_file Pickled NetworkX graph of pruned Mondo ontology.
--disorder_file TSV file with individual disorder data. Assumes columns individual_id, disease_title (set of disorder names), and gene_symbol are present. If a disease_curie column (set of Mondo IDs) is provided along with the --icd_mondo_map_file option, ICD-to-Mondo code mapping will be performed.
--icd_file TSV file with individual ICD data. Assumes columns individual_id, DX_CD, and DX_NM are present.
--notes_file TSV file with individual notes data. Assumes columns individual_id and note_text are present.
--icd_mondo_map_file Optional. TSV file mapping ICD codes to Mondo IDs. If this file is not provided, Mondo matching will not be performed.
--custom_stopwords_file Text file with additional stopwords.
--discard_terms_file Text file with terms to exclude from NLP processing.
--common_terms_file Text file with terms to exclude from linguistic matching.
--output_file Output file name.
--plot_output_file Plot output file name.
--nlp_model SpaCy model name (e.g., en_core_web_sm).
--semantic_score_threshold Minimum score for a significant semantic similarity match. Defaults to 0.8.
--jaccard_score_threshold Minimum score for a significant Jaccard similarity match. Defaults to 0.8.
--create_control_cohort Boolean flag for whether to scramble individual IDs to create a random control cohort. Defaults to False. Alternatively, the user can provide a "cohort" column in either the disorder or ICD file.
--update_icd_to_mondo_map Flag for whether the file given by --icd_mondo_map_file should be updated. Defaults to False.
--use_mondo_synonyms Flag for whether to use Mondo synonyms during linguistic matching. Defaults to False.
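Because the input files are expected to carry specific columns, it can help to check the headers before a long run. The sketch below does this with the standard library only. The column names come from the argument descriptions above, while the REQUIRED_COLUMNS table and the missing_columns helper are hypothetical and not part of DxFit.

```python
import csv
import io

# Required columns per input type, as listed in the argument descriptions.
REQUIRED_COLUMNS = {
    "disorder_file": {"individual_id", "disease_title", "gene_symbol"},
    "icd_file": {"individual_id", "DX_CD", "DX_NM"},
    "notes_file": {"individual_id", "note_text"},
}


def missing_columns(handle, kind):
    """Return the sorted list of required columns absent from a TSV header."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader, [])  # empty file -> every required column is missing
    return sorted(REQUIRED_COLUMNS[kind] - set(header))


# In-memory example: an ICD file that lacks the DX_NM column.
sample = io.StringIO("individual_id\tDX_CD\n1\tE11.9\n")
print(missing_columns(sample, "icd_file"))  # ['DX_NM']
```

For a file on disk, pass an open handle instead, for example open("path/to/icd_file.tsv", newline="").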

Outputs

  • Annotated dataframe with DxFit prioritization
  • Similarity scores (SpaCy embeddings and Jaccard)
  • Optional visualizations (if enabled with --plot_output_file)
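For readers unfamiliar with the Jaccard score listed above, the following is a minimal sketch of token-level Jaccard similarity between two disease names, compared against the documented default threshold of 0.8. The whitespace tokenization and the function name jaccard are assumptions made for illustration. DxFit's own preprocessing (stopwords, discard terms, common terms) is not reproduced here.

```python
def jaccard(a, b):
    """Jaccard similarity of the lowercase whitespace-token sets of two strings."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0  # convention chosen here for two empty strings
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


THRESHOLD = 0.8  # documented default of --jaccard_score_threshold

same_words = jaccard("type 2 diabetes mellitus", "diabetes mellitus type 2")
partial = jaccard("marfan syndrome", "marfan syndrome type 1")

print(same_words, same_words >= THRESHOLD)  # 1.0 True
print(partial, partial >= THRESHOLD)        # 0.5 False
```

Word order does not affect the score, so reordered names match fully, while extra qualifiers lower it.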

Dependencies

  • Python 3.9+
  • pandas, numpy, tqdm, networkx
  • SpaCy
  • plotnine (for visualization)
  • obonet (for ontology parsing)
