[CS 598DLH] Add BiLM + BiLSTM NER biomedical example to PyHealth #708

zyacub · 2025-12-08T01:36:45Z

BiLM + BiLSTM NER Example (Biomedical NER)

This example demonstrates a simple end-to-end pipeline for biomedical named entity recognition (NER) using:

- A bidirectional language model (BiLM) over tokens for unsupervised pretraining.
- A BiLSTM token classifier for NER (no CRF, for simplicity).
- A small CoNLL-style biomedical NER dataset (or a built-in synthetic toy dataset).

The goal is to show how a research-style reproduction (BiLM + NER) can be packaged as a reusable PyHealth example, improving the reproducibility of AI4H models.

This example is adapted from a course project that reproduces a published biomedical NER architecture and evaluates the effect of BiLM pretraining on NER performance.

Files

bilm_ner.py
Main script, which:
- Loads a CoNLL-style token-level NER dataset (or a synthetic toy dataset).
- Builds a BiLM over token IDs.
- Trains the BiLM for a few epochs on unlabeled sentences.
- Builds a BiLSTM-based NER model.
- Trains:
  - a baseline NER model (no pretraining), and
  - a BiLM-pretrained NER model (word embeddings and forward LSTM initialized from the BiLM).
- Reports test F1 for both models.

test_bilm_ner.py
unittest test suite, which:
- Verifies that the synthetic dataset builder works.
- Verifies that the BiLM forward pass runs and returns a finite scalar loss.
- Verifies that the NER model forward + backward passes work and produce gradients.
- Verifies that decode() returns sequences whose lengths match the unpadded token lengths.

These tests are lightweight and designed to run quickly on CPU.
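
For readers comparing the baseline and pretrained models, the sketch below shows one common way to compute an F1 score for BIO-tagged output. It is an illustration under assumptions, not the code in bilm_ner.py: this description does not say whether the script reports token-level or entity-level F1, and the helper names `bio_spans` and `entity_f1` are hypothetical. The sketch uses strict entity-level matching, where a predicted entity counts only if its boundaries and type both match the gold entity.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, entity_type)


def bio_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO-tagged sentence (hypothetical helper)."""
    spans: Set[Span] = set()
    start, etype = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
        # An I- tag with no matching open span is ignored (a strict choice).
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged entity-level F1 over a list of sentences."""
    tp = n_gold = n_pred = 0
    for gold_tags, pred_tags in zip(gold, pred):
        gold_spans, pred_spans = bio_spans(gold_tags), bio_spans(pred_tags)
        tp += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


gold = [["B-GENE", "O", "O", "B-DISEASE"]]
pred = [["B-GENE", "O", "O", "O"]]
f1 = entity_f1(gold, pred)  # one of two gold entities found, no false positives
```

Entity-level scoring is stricter than token-level scoring because a partially matched multi-token entity earns no credit, so the two metrics are not directly comparable.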

Dataset Format

By default, the example runs entirely on a synthetic toy dataset, so no external files are required.

To use a real dataset, provide files in a simple CoNLL-style TSV format:

- One token per line.
- Columns: TOKEN<TAB>TAG
- Sentences separated by blank lines.

Example:

BRAF    B-GENE
mutation    O
in  O
melanoma    B-DISEASE

EGFR    B-GENE
mutations   O
are O
common  O
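
A minimal sketch of a reader for this format follows. It is not the loader from bilm_ner.py; the function name `read_conll` and the `(tokens, tags)` return shape are assumptions made for illustration. It relies only on the three rules stated above: one token per line, tab-separated TOKEN and TAG columns, and blank lines between sentences.

```python
from typing import Iterable, List, Tuple

Sentence = Tuple[List[str], List[str]]  # (tokens, tags)


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Parse TOKEN<TAB>TAG lines into sentences (hypothetical helper)."""
    sentences: List[Sentence] = []
    tokens: List[str] = []
    tags: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # A blank line ends the current sentence, if one is open.
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        parts = line.split("\t")
        if len(parts) != 2:
            raise ValueError(f"expected TOKEN<TAB>TAG, got: {line!r}")
        tokens.append(parts[0])
        tags.append(parts[1])
    # Flush the last sentence when the input does not end with a blank line.
    if tokens:
        sentences.append((tokens, tags))
    return sentences


sample = (
    "BRAF\tB-GENE\nmutation\tO\nin\tO\nmelanoma\tB-DISEASE\n"
    "\n"
    "EGFR\tB-GENE\nmutations\tO\nare\tO\ncommon\tO\n"
)
sentences = read_conll(sample.splitlines())
```

To read from disk, pass an open file object instead of `sample.splitlines()`, since the function accepts any iterable of lines. Raising on malformed lines is a design choice for this sketch; a loader for noisier data might skip or log them instead.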

Commit a61fd77: Test cases + Model
