microbELP is a text mining pipeline for automatic recognition and normalisation of microbiome-related entities in biomedical literature.
It identifies microbial mentions (excluding viruses) in full-text scientific articles and links them to their NCBI Taxonomy identifiers.
The system was developed to support large-scale microbiome curation and downstream text mining tasks by providing consistent and standardised microbial annotations in BioC-formatted JSON files (from Auto-CORPus).
- Overview
- Pipeline Overview
- Installation
- Input and Output Format of the main functions (microbELP, parallel_microbELP, microbELP_DL)
- Usage
- PMCID retrieval and conversion to BioC
- Microbiome Entity Recognition (DL)
- Normalisation Utility
- Visualisation Module
- Linux / macOS / Cygwin (Linux-like on Windows)
- Ecosystem
- Important - Please Read!
- Acknowledgements
- Citing
- Code Contributors
- Version History
The microbELP pipeline processes research articles encoded in BioC JSON format.
It automatically detects mentions of microbiome entities (covering archaea, bacteria, and fungi) and attaches standardised taxonomy identifiers from NCBI.
- Automatic annotation of microbiome mentions in BioC-formatted research articles.
- Entity normalisation to NCBI Taxonomy IDs, providing consistent reference identifiers.
- Parallel processing: leverages multiprocessing to distribute workloads across multiple CPU cores, reducing runtime on large datasets.
- Visualisation module: generates interactive and comparative phylogenetic trees to explore microbial diversity and overlap across studies or datasets.
- Incremental annotation: previously annotated files are automatically skipped on rerun.
- Standalone normalisation functions: convert microbial name strings to NCBI Taxonomy identifiers, now available for both CPU and GPU usage.
- Standalone recognition function: detect microbiome entities directly from free text using the DL version.
- Flexible pipelines:
- A non-DL pipeline optimised for CPU processing.
- A DL-based pipeline optimised for CPU and GPU processing.
- OA article support: includes an automatic PMCID downloader and converter, transforming Open Access PubMed Central articles into BioC JSON format.
- Output in BioC JSON, ensuring full compatibility with existing biomedical NLP pipelines.
The pipeline consists of the following main stages:

- Input ingestion
  - Identifies all BioC JSON files from a specified input directory.
  - Only files with `_bioc` in their names are processed.
- Named Entity Recognition (NER)
  - Recognised mentions are annotated with their text offsets and types (e.g. `bacteria_species`, `bacteria_genus`, etc.).
- Entity Normalisation
  - Each detected entity is mapped to an NCBI Taxonomy identifier using curated lexical resources.
- Output generation
  - Annotated BioC JSON files are written to a new output directory called `microbELP_result/` or `microbELP_DL_result/`.
  - Each annotation includes:
    - The text span of the entity
    - The type (e.g. `bacteria_species`)
    - The NCBI identifier
    - The parent taxonomic identifier (if available; non-DL pipeline only)
    - Metadata: annotator name, date, and annotation ID
- Incremental updates
  - On re-execution, files that already contain microbiome annotations are skipped, ensuring efficient updates.
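The ingestion and incremental-update stages can be sketched in plain Python. This is a simplified illustration of the behaviour described above, not the package's actual implementation; the skip check here simply looks for any existing passage annotations.

```python
import json
import os

def files_to_process(input_dir):
    """Return BioC files in input_dir that still need annotating (sketch)."""
    todo = []
    for name in os.listdir(input_dir):
        # Stage 1: only files with '_bioc' in their names are considered.
        if '_bioc' not in name or not name.endswith('.json'):
            continue
        with open(os.path.join(input_dir, name), encoding='utf-8') as fh:
            doc = json.load(fh)
        # Incremental updates: skip files whose passages already carry annotations.
        annotated = any(
            passage.get('annotations')
            for document in doc.get('documents', [])
            for passage in document.get('passages', [])
        )
        if not annotated:
            todo.append(name)
    return todo
```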
MicrobELP has a number of dependencies on other Python packages; it is recommended to install it in an isolated environment.
```shell
git clone https://github.com/omicsNLP/microbELP.git
pip install ./microbELP
```
A directory containing BioC JSON files (e.g. exported from Auto-CORPus). E.g. the unannotated test set available from Zenodo.
Example filename:
PMC92037_bioc.json
Each file contains a standard BioC structure:
```json
{
  "source": "Auto-CORPus (full-text)",
  "documents": [
    {
      "id": "PMC92037",
      "passages": [
        {
          "text": "Phylogenetic Relationships of Butyrate-Producing Bacteria from the Human Gut",
          "annotations": []
        }
      ]
    }
  ]
}
```

A new directory (`microbELP_result/`) is created, containing the same files with additional microbiome annotations:
```json
{
  "annotations": [
    {
      "text": "Eubacterium rectale",
      "infons": {
        "identifier": "NCBI:txid39491",
        "type": "bacteria_species",
        "annotator": "microbELP@omicsNLP.ic.ac.uk",
        "date": "2025-10-07 14:23:02",
        "parent_taxonomic_id": "NCBI:txid186928"
      },
      "locations": {"offset": 1418, "length": 19}
    }
  ]
}
```

Run the pipeline on a folder of BioC files with names ending in `_bioc.json`:
```python
from microbELP import microbELP

microbELP('$input_folder$')  # type str
```

Optional arguments:

```python
from microbELP import microbELP

microbELP(
    '$input_folder$',                 # type str
    output_directory='$output_path$'  # type str # Path where the results should be saved. Default: './'
)
```

The output_directory parameter lets you specify where to save the results. By default, output files are stored in the current working directory ('./') under 'microbELP_result/'.
Run the pipeline on a folder of BioC files with names ending in `_bioc.json`:
```python
from microbELP import microbELP_DL

microbELP_DL('$input_folder$')  # type str
```

Optional arguments:

```python
from microbELP import microbELP_DL

microbELP_DL(
    '$input_folder$',                 # type str
    output_directory='$output_path$', # type str  # Path where the results should be saved. Default: './'
    cpu=False,                        # type bool # If True, the code runs on CPU; otherwise it uses a GPU if one is available.
    normalisation=True                # type bool # If False, performs NER only instead of NER+NEN/EL. Default: True
)
```

The output_directory parameter lets you specify where to save the results. By default, output files are stored in the current working directory ('./') under 'microbELP_DL_result/'. The cpu parameter controls whether the pipeline runs on the CPU or the GPU; on the CPU the slowest step is loading the vocabulary, which loads much faster on the GPU. The normalisation parameter controls whether Named Entity Normalisation / Entity Linking is performed; when set to False, only Named Entity Recognition is carried out.
Retrieve Open Access publications from PubMed Central and automatically convert them into BioC JSON format:
```python
from microbELP import pmcid_to_microbiome

pmcid_to_microbiome(
    '$path_PMCID_text_file$',  # type str
    '$email_address$'          # type str
)
```

It takes 2 mandatory parameters:

- `pmcid_file` <class 'str'>: Path to a text file where each line contains a single PMCID.
- `email` <class 'str'>: An email address required by the NCBI API for query identification.
The function queries the NCBI API to retrieve all Open Access publications from the provided PMCID list and automatically converts them into BioC JSON format.
Optional argument:

```python
pmcid_to_microbiome(
    '$path_PMCID_text_file$',  # type str
    '$email_address$',         # type str
    output_directory='./'      # type str # Path to save the retrieved and converted files. Default: './'
)
```

By default, all results are saved in a new directory named 'microbELP_PMCID_microbiome/', which contains two subfolders:

- `PMCID_XML/`: raw XML files retrieved from NCBI.
- `bioc/`: BioC-converted versions of the publications, ready for downstream processing with `microbELP`, `parallel_microbELP` or `microbELP_DL`.
The package includes a standalone function for Microbiome Entity Recognition using a DL model optimised for both CPU and GPU.
```python
from microbELP import microbiome_DL_ner

input_text = 'The first microbiome I learned about is called Helicobacter pylori.'  # type str or list of str
print(microbiome_DL_ner(input_text))
```

Output:

```python
[{'Entity': 'Helicobacter pylori', 'locations': {'offset': 47, 'length': 19}}]
```

If a match is found, the function returns a list of detected entities, each represented as a dictionary containing:

- `Entity`: the recognised microbiome name
- `locations`: the character offset and length of the entity within the text
You can also provide a list of text strings as input for batch processing to reduce loading overhead:
```python
from microbELP import microbiome_DL_ner

input_list = [
    'The first microbiome I learned about is called Helicobacter pylori.',
    'Then I learned about Eubacterium rectale.'
]
print(microbiome_DL_ner(input_list))
```

Output:

```python
[
    [{'Entity': 'Helicobacter pylori', 'locations': {'offset': 47, 'length': 19}}],
    [{'Entity': 'Eubacterium rectale', 'locations': {'offset': 21, 'length': 19}}]
]
```

Each element in the output list corresponds to one input text, containing its recognised microbiome entities and their text locations.

There is one optional parameter to this function, cpu <class 'bool'>, with default value False, i.e. it runs on a GPU if one is available. To force usage of the CPU, call microbiome_DL_ner(input_list, cpu=True).
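The offset/length pairs in the output can be reproduced with plain string operations, which is handy when post-processing the NER results. A short sketch using the example sentences above:

```python
# Recompute the offset/length pairs reported by the NER examples above.
texts = [
    'The first microbiome I learned about is called Helicobacter pylori.',
    'Then I learned about Eubacterium rectale.'
]
entities = ['Helicobacter pylori', 'Eubacterium rectale']

locations = []
for text, entity in zip(texts, entities):
    offset = text.find(entity)  # character offset of the mention in the text
    locations.append({'offset': offset, 'length': len(entity)})

print(locations)
# [{'offset': 47, 'length': 19}, {'offset': 21, 'length': 19}]
```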
The package includes two helper functions for standalone microbial name normalisation, available for both non-DL and DL usage.
```python
from microbELP import microbiome_normalisation

microbiome_normalisation('Eubacterium rectale')  # type str # Output: NCBI:txid39491
```

If a match is found, it returns the NCBI Taxonomy identifier; otherwise None.

For time efficiency (as loading the vocabulary requires the most time), you can provide a list of strings as input and receive a list of dictionaries as output in the format:

```python
input = ['Eubacterium rectale', 'bacteria']
output = [{'Eubacterium rectale': 'NCBI:txid39491'}, {'bacteria': 'NCBI:txid2'}]
```
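If a single lookup dictionary is more convenient than a list of single-entry dictionaries, the batch output can be merged with a short sketch (plain Python, using the example values above):

```python
# Merge the batch output (a list of single-entry dicts) into one mapping.
batch_output = [{'Eubacterium rectale': 'NCBI:txid39491'}, {'bacteria': 'NCBI:txid2'}]

merged = {name: taxid for entry in batch_output for name, taxid in entry.items()}
print(merged['Eubacterium rectale'])  # NCBI:txid39491
```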
```python
from microbELP import microbiome_normalisation

microbiome_normalisation(['Eubacterium rectale', 'bacteria'])  # type list # Output: [{'Eubacterium rectale': 'NCBI:txid39491'}, {'bacteria': 'NCBI:txid2'}]
```

For deep learning-based name normalisation using the BioSyn model, the package provides the following function:
```python
from microbELP import microbiome_biosyn_normalisation

input_list = ['bacteria', 'Eubacterium rectale', 'Helicobacter pylori']  # type list
print(microbiome_biosyn_normalisation(input_list))
```

Output:

```python
[
    {'mention': 'bacteria', 'candidates': [
        {'NCBI:txid2': 'bacteria'},
        {'NCBI:txid2': 'Bacteria'},
        {'NCBI:txid1869227': 'bacteria bacterium'},
        {'NCBI:txid1869227': 'Bacteria bacterium'},
        {'NCBI:txid1573883': 'bacterium associated'}
    ]},
    {'mention': 'Eubacterium rectale', 'candidates': [
        {'NCBI:txid39491': 'eubacterium rectale'},
        {'NCBI:txid39491': 'Eubacterium rectale'},
        {'NCBI:txid39491': 'pseudobacterium rectale'},
        {'NCBI:txid39491': 'Pseudobacterium rectale'},
        {'NCBI:txid39491': 'e. rectale'}
    ]},
    {'mention': 'Helicobacter pylori', 'candidates': [
        {'NCBI:txid210': 'Helicobacter pylori'},
        {'NCBI:txid210': 'helicobacter pylori'},
        {'NCBI:txid210': 'Campylobacter pylori'},
        {'NCBI:txid210': 'campylobacter pylori'},
        {'NCBI:txid210': 'campylobacter pyloridis'}
    ]}
]
```
This function supports both single strings and lists of microbiome mentions as input.
```python
from microbELP import microbiome_biosyn_normalisation

input_text = 'Helicobacter pylori'
print(microbiome_biosyn_normalisation(input_text))
```

Output:

```python
[{'mention': 'Helicobacter pylori', 'candidates': [
    {'NCBI:txid210': 'Helicobacter pylori'},
    {'NCBI:txid210': 'helicobacter pylori'},
    {'NCBI:txid210': 'Campylobacter pylori'},
    {'NCBI:txid210': 'campylobacter pylori'},
    {'NCBI:txid210': 'campylobacter pyloridis'}
]}]
```
Parameters:
- `to_normalise` (<class 'str'> or list['str']): Text or list of microbial names to normalise.
- `cpu` (<class 'bool'>, default=False): When set to False, runs on any available GPU. The slowest part of CPU inference is loading the vocabulary used to predict the identifier.
- `candidates_number` (<class 'int'>, default=5): Number of top candidate matches to return (from most to least likely).
- `max_lenght` (<class 'int'>, default=25): Maximum token length allowed for the model input.
- `ontology` (<class 'str'>, default=''): Path to a custom vocabulary text file in id||entity format. If left empty, the default curated NCBI Taxonomy vocabulary is used.
- `save` (<class 'bool'>, default=False): If True, saves results to microbiome_biosyn_normalisation_output.json in the current directory.
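The id||entity vocabulary format expected by the ontology parameter can be illustrated with a small sketch. The identifiers and names below are taken from the examples above; the filename is arbitrary.

```python
import os
import tempfile

# Build a minimal custom vocabulary in the id||entity format (one pair per line).
entries = [
    ('NCBI:txid39491', 'Eubacterium rectale'),
    ('NCBI:txid210', 'Helicobacter pylori'),
]

path = os.path.join(tempfile.mkdtemp(), 'custom_vocab.txt')
with open(path, 'w', encoding='utf-8') as fh:
    for taxid, name in entries:
        fh.write(f'{taxid}||{name}\n')

# Parse it back to confirm the format round-trips.
with open(path, encoding='utf-8') as fh:
    parsed = [line.rstrip('\n').split('||') for line in fh]

print(parsed[0])  # ['NCBI:txid39491', 'Eubacterium rectale']
```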
The library includes a visualisation module designed to generate phylogenetic trees of identified microbial taxa. This allows users to visually inspect the microbial composition of individual datasets or compare across different domains or study sections.
To generate a single phylogenetic tree from the output of the pipeline or from a list of NCBI Taxonomy identifiers:
```python
from microbELP import phylogenetic_tree_microbiome

phylogenetic_tree_microbiome(
    input_path,              # type str
    figure_text='',          # type str
    ioa_filter=[],           # type list
    output_image_path='./',  # type str
    save=False,              # type bool
    verbose=False            # type bool
)
```

Parameters:

- `input_path` <class 'str'>: Path to a .csv, .tsv, or .txt file containing a list of NCBI Taxonomy IDs, or a directory containing annotated files produced by the pipeline.
- `figure_text` <class 'str'>: Optional text displayed in the centre of the generated figure (e.g., study or dataset name).
- `ioa_filter` <class 'list'>: Only used when providing a directory as input. Enables filtering by specific sections of a paper (e.g., ['IAO:0000318'] or ['IAO:0000318', 'IAO:0000319'], where IAO:0000318 = results section and IAO:0000319 = discussion section).
- `output_image_path` <class 'str'>: Path where the output will be saved. The images are stored under a new subdirectory microbiome_visualisation/.
- `save` <class 'bool'>: If True, saves the generated images automatically.
- `verbose` <class 'bool'>: If True, displays detailed logs of the tree generation process.
This function produces a phylogenetic tree based on the counts of microbial taxa found in the provided dataset.
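When input_path is a file rather than a directory, it is simply a list of NCBI Taxonomy identifiers. A minimal sketch of preparing such a file (assuming one identifier per line; the IDs are taken from the examples above and the filename is arbitrary):

```python
import os
import tempfile

# Write one NCBI Taxonomy ID per line, as a plain .txt input file.
taxids = ['NCBI:txid39491', 'NCBI:txid210', 'NCBI:txid2']

path = os.path.join(tempfile.mkdtemp(), 'taxids.txt')
with open(path, 'w', encoding='utf-8') as fh:
    fh.write('\n'.join(taxids) + '\n')

# Read it back to confirm the contents.
with open(path, encoding='utf-8') as fh:
    loaded = [line.strip() for line in fh if line.strip()]

print(len(loaded))  # 3
```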
To compare microbial profiles between two datasets or domains:
```python
from microbELP import comparative_phylogenetic_tree_microbiome

comparative_phylogenetic_tree_microbiome(
    back_input_path,         # type str
    front_input_path,        # type str
    ioa_filter=[],           # type list
    back_text='',            # type str
    front_text='',           # type str
    overlap_text='',         # type str
    output_image_path='./',  # type str
    save=False,              # type bool
    verbose=False            # type bool
)
```

Parameters:

- `back_input_path` <class 'str'>: The reference dataset or domain to compare against.
- `front_input_path` <class 'str'>: The dataset or domain to highlight over the reference.
- `ioa_filter` <class 'list'>, `output_image_path` <class 'str'>, `save` <class 'bool'>, and `verbose` <class 'bool'>: Same as in the previous function.
- `back_text` <class 'str'>, `front_text` <class 'str'>, `overlap_text` <class 'str'>: Optional labels to display in the generated figures (e.g., dataset names or conditions).
This function generates four comparative images:
- Phylogenetic tree and counts for the back_input_path.
- Phylogenetic tree and counts for the front_input_path.
- Tree of front_input_path with q-values compared to back_input_path.
- Overlay of front_input_path on top of back_input_path, showing comparative abundance.
To reduce processing time, microbELP leverages Python's multiprocessing library. This allows the workload to be distributed across multiple CPU cores, significantly speeding up the overall execution of the pipeline.
Run the pipeline on a folder containing BioC files:
```python
from microbELP import parallel_microbELP

if __name__ == "__main__":
    parallel_microbELP(
        '$input_folder$',          # type str
        NUMBER_OF_CORES_ALLOCATED  # type int
    )
```

Optional arguments:

```python
from microbELP import parallel_microbELP

if __name__ == "__main__":
    parallel_microbELP(
        '$input_folder$',                 # type str
        NUMBER_OF_CORES_ALLOCATED,        # type int
        output_directory='$output_path$'  # type str # Default: './'
    )
```

The output_directory parameter lets you specify where to save the results. By default, output files are stored in the current working directory ('./') under 'microbELP_result/'.
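One common, conservative way to choose NUMBER_OF_CORES_ALLOCATED (a suggestion on our part, not a package requirement) is to leave one core free for the rest of the system:

```python
import multiprocessing

# Use all but one of the available cores, with a floor of one core.
NUMBER_OF_CORES_ALLOCATED = max(1, multiprocessing.cpu_count() - 1)
print(NUMBER_OF_CORES_ALLOCATED)
```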
Published literature can be subject to copyright with restrictions on redistribution. Users need to be mindful of the data storage requirements and how the derived products are presented and shared. Many publishers provide guidance on the use of content for redistribution and use in research.
The corpus was collected using cadmus, and the raw file formats were then standardised to BioC using Auto-CORPus. The original weights before fine-tuning for named entity recognition and entity linking were obtained from dmis-lab/biobert-base-cased-v1.1. The entity linking fine-tuning and inference scripts were obtained and modified from BioSyn. Finally, thank you to Adrien Pavão for answering all our questions regarding setting up our benchmarks on Codabench.
If you find this repository useful, please consider giving a star and a citation:
```bibtex
@article{Patel2025.08.29.671515,
    author = {Patel, Dhylan and Lain, Antoine D. and Vijayaraghavan, Avish and Mirzaei, Nazanin Faghih and Mweetwa, Monica N. and Wang, Meiqi and Beck, Tim and Posma, Joram M.},
    title = {Microbial Named Entity Recognition and Normalisation for AI-assisted Literature Review and Meta-Analysis},
    elocation-id = {2025.08.29.671515},
    year = {2025},
    doi = {10.1101/2025.08.29.671515},
    publisher = {Cold Spring Harbor Laboratory},
    URL = {https://www.biorxiv.org/content/early/2025/08/30/2025.08.29.671515},
    eprint = {https://www.biorxiv.org/content/early/2025/08/30/2025.08.29.671515.full.pdf},
    journal = {bioRxiv}
}
```

If you used microbiome_biosyn_normalisation or microbELP_DL with the normalisation module, please also cite:
```bibtex
@inproceedings{sung2020biomedical,
    title = {Biomedical Entity Representations with Synonym Marginalization},
    author = {Sung, Mujeen and Jeon, Hwisang and Lee, Jinhyuk and Kang, Jaewoo},
    booktitle = {ACL},
    year = {2020}
}
```

If you used pmcid_to_microbiome, please also cite:
```bibtex
@article{10.3389/fdgth.2022.788124,
    author = {Beck, Tim and Shorter, Tom and Hu, Yan and Li, Zhuoyu and Sun, Shujian and Popovici, Casiana M. and McQuibban, Nicholas A. R. and Makraduli, Filip and Yeung, Cheng S. and Rowlands, Thomas and Posma, Joram M.},
    title = {Auto-CORPus: A Natural Language Processing Tool for Standardizing and Reusing Biomedical Literature},
    journal = {Frontiers in Digital Health},
    volume = {4},
    year = {2022},
    url = {https://www.frontiersin.org/journals/digital-health/articles/10.3389/fdgth.2022.788124},
    doi = {10.3389/fdgth.2022.788124}
}
```
- Dhylan
- Antoine
- Avish
- Nazanin
- Joram
- First public release.