IHC-LLMiner is a Python module for automatically extracting immunohistochemistry (IHC) marker-tumour profiles from PubMed abstracts. It leverages LLMs and BERT-based models for:
- Downloading abstracts for specific IHC markers
- Classifying abstract relevance
- Extracting structured IHC marker data
- Normalising entity mentions using UMLS

Requires Python 3.10. To install:

git clone https://github.com/knowlab/IHC-LLMiner.git
cd IHC-LLMiner
conda create -n ihcllminer python=3.10
conda activate ihcllminer
pip install .

python download.py --markers BOB1 TTF1 --max_per_marker 9999 --output_file pmid_list_w_abstract.tsv

python classify.py \
--input_file pmid_list_w_abstract.tsv \
--output_file predictions.json

python extract.py \
--input_file predictions.json \
--output_file extraction_result.tsv

Normalisation requires a local copy of the UMLS Metathesaurus, which you must download by logging in with your own UMLS credentials. Download the Full Subset, then run generate_UMLS_data.ipynb.

python normalize.py \
--input_file extraction_result.tsv \
--output_file inference_umls_mapped_data.tsv

For data analysis, please refer to data_analysis.ipynb.

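
The four command-line steps above can also be chained from a single Python script. The sketch below is not part of the repository: it only assembles the same commands, with the script names, flags, and file names shown above, and runs them in order. The helper names `build_pipeline` and `run_pipeline` are hypothetical, and the sketch assumes it is run from the repository root with the UMLS data already generated.

```python
import subprocess
import sys


def build_pipeline(markers, max_per_marker=9999):
    """Return the four pipeline commands as argument lists, in run order.

    File names mirror the README examples; each step reads the previous
    step's output file.
    """
    abstracts = "pmid_list_w_abstract.tsv"
    predictions = "predictions.json"
    extracted = "extraction_result.tsv"
    mapped = "inference_umls_mapped_data.tsv"
    py = sys.executable  # use the interpreter of the active environment
    return [
        [py, "download.py", "--markers", *markers,
         "--max_per_marker", str(max_per_marker),
         "--output_file", abstracts],
        [py, "classify.py", "--input_file", abstracts,
         "--output_file", predictions],
        [py, "extract.py", "--input_file", predictions,
         "--output_file", extracted],
        [py, "normalize.py", "--input_file", extracted,
         "--output_file", mapped],
    ]


def run_pipeline(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        print("Running:", " ".join(cmd))
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure


# Example (run from the repository root):
# run_pipeline(build_pipeline(["BOB1", "TTF1"]))
```

Passing `check=True` stops the chain as soon as one step fails, so a later step never runs on a missing or partial input file.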
The code was tested on an A5000 GPU with 24 GB of memory.
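
To compare a local machine against the tested configuration, the following sketch lists the total memory of each visible NVIDIA GPU. It is not part of the repository: it assumes the `nvidia-smi` tool is installed, and the function names `parse_total_memory_mib` and `gpu_memory_mib` are hypothetical.

```python
import shutil
import subprocess


def parse_total_memory_mib(smi_output):
    """Parse nvidia-smi CSV output with one memory.total value (MiB) per line."""
    values = []
    for line in smi_output.splitlines():
        line = line.strip()
        if line.isdigit():  # skip blank or non-numeric lines such as "[N/A]"
            values.append(int(line))
    return values


def gpu_memory_mib():
    """Return total memory in MiB for each visible NVIDIA GPU, or [] if none."""
    if shutil.which("nvidia-smi") is None:
        return []  # nvidia-smi is not installed or not on PATH
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return []
    return parse_total_memory_mib(result.stdout)


# Example:
# for index, mib in enumerate(gpu_memory_mib()):
#     print(f"GPU {index}: {mib / 1024:.1f} GiB")
```

An empty list means no NVIDIA GPU could be queried. Smaller GPUs may still work, but only the 24 GB configuration above is reported as tested.
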
@misc{kim2025ihcllminer,
title={IHC-LLMiner: Automated extraction of tumour immunohistochemical profiles from PubMed abstracts using large language models},
author={Yunsoo Kim and Michal W. S. Ong and Daniel W. Rogalsky and Manuel Rodriguez-Justo and Honghan Wu and Adam P. Levine},
year={2025},
eprint={2504.00748},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2504.00748},
}