zyacub commented Dec 8, 2025

BiLM + BiLSTM NER Example (Biomedical NER)

This example demonstrates a simple end-to-end pipeline for biomedical named entity recognition (NER) using:

  • A bidirectional language model (BiLM) over tokens for unsupervised pretraining.
  • A BiLSTM token classifier for NER (no CRF for simplicity).
  • A small CoNLL-style biomedical NER dataset (or a built-in synthetic toy dataset).

The goal is to show how a research-style reproduction (BiLM + NER) can be packaged as a reusable PyHealth example, improving the reproducibility of AI4H models.

This example is adapted from a course project that reproduces a published biomedical NER architecture and evaluates the effect of BiLM pretraining on NER performance.


Files

  • bilm_ner.py
    Main script which:

    • Loads a CoNLL-style token-level NER dataset (or a synthetic toy dataset).
    • Builds a BiLM over token IDs.
    • Trains the BiLM for a few epochs on unlabeled sentences.
    • Builds a BiLSTM-based NER model.
    • Trains:
      • a baseline NER model (no pretraining), and
      • a BiLM-pretrained NER model (word embeddings + forward LSTM initialized from BiLM).
    • Reports test F1 for both models.
  • test_bilm_ner.py
    unittest test suite which:

    • Verifies that the synthetic dataset builder works.
    • Verifies that the BiLM forward pass runs and returns a finite scalar loss.
    • Verifies that the NER model forward + backward passes work and produce gradients.
    • Verifies that decode() returns sequences whose lengths match the unpadded token lengths.

These tests are lightweight and designed to run quickly on CPU.
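
The description says bilm_ner.py reports test F1 for both models but does not say how F1 is computed. The sketch below assumes entity-level (exact span match) micro F1 over BIO tags, which is a common choice for NER; the actual script may use a different definition, such as token-level F1. The names `bio_spans` and `span_f1` are hypothetical and are not taken from the PR.

```python
from typing import List, Set, Tuple

# (sentence index, start, end exclusive, entity type)
Span = Tuple[int, int, int, str]


def bio_spans(tags: List[str], sent_idx: int = 0) -> Set[Span]:
    """Extract entity spans from one BIO-tagged sentence.

    Stray I- tags that do not continue a span of the same type are ignored.
    """
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" flushes the last span
        inside = tag.startswith("I-") and label == tag[2:]
        if start is not None and not inside:
            spans.add((sent_idx, start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def span_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged F1 over exact (position, type) span matches."""
    gold_spans: Set[Span] = set()
    pred_spans: Set[Span] = set()
    for idx, (g, p) in enumerate(zip(gold, pred)):
        gold_spans |= bio_spans(g, idx)
        pred_spans |= bio_spans(p, idx)
    tp = len(gold_spans & pred_spans)
    precision = tp / len(pred_spans) if pred_spans else 0.0
    recall = tp / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative tags only; these are not results from the PR.
gold = [["B-GENE", "O", "O", "B-DISEASE"], ["B-GENE", "O", "O", "O"]]
pred = [["B-GENE", "O", "O", "O"], ["B-GENE", "O", "O", "O"]]
f1 = span_f1(gold, pred)  # 2 of 3 gold spans found, no false positives
```

Under this definition a prediction counts only if both its boundaries and its type match the gold span, so the missed DISEASE entity lowers recall while precision stays at 1.0.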


Dataset Format

By default, the example runs entirely on a synthetic toy dataset, so no external files are required.

To use a real dataset, provide files in a simple CoNLL-style TSV format:

  • One token per line.
  • Columns: TOKEN<TAB>TAG
  • Sentences separated by blank lines.

Example:

BRAF    B-GENE
mutation    O
in  O
melanoma    B-DISEASE

EGFR    B-GENE
mutations   O
are O
common  O
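
A minimal sketch of a reader for this format, assuming tab-separated columns and blank-line sentence boundaries as described above. The function name `read_conll` is hypothetical; the loader in bilm_ner.py may be named and structured differently.

```python
from typing import Iterable, List, Tuple

Sentence = Tuple[List[str], List[str]]  # (tokens, tags)


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Parse TOKEN<TAB>TAG lines into one (tokens, tags) pair per sentence."""
    sentences: List[Sentence] = []
    tokens: List[str] = []
    tags: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # Blank line: close the current sentence, if there is one.
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        parts = line.split("\t")
        if len(parts) != 2:
            raise ValueError(f"expected TOKEN<TAB>TAG, got: {line!r}")
        tokens.append(parts[0])
        tags.append(parts[1])
    if tokens:  # the input may not end with a blank line
        sentences.append((tokens, tags))
    return sentences


sample = (
    "BRAF\tB-GENE\n"
    "mutation\tO\n"
    "in\tO\n"
    "melanoma\tB-DISEASE\n"
    "\n"
    "EGFR\tB-GENE\n"
    "mutations\tO\n"
    "are\tO\n"
    "common\tO\n"
)
sentences = read_conll(sample.splitlines())
```

To read from disk, pass an open file handle instead of `sample.splitlines()`. The reader raises on malformed lines rather than skipping them, which surfaces files that use spaces instead of tabs.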
