This repository contains the source code, experiments, and documentation for the "Breast Cancer Prognosis Prediction" project. The project aims to improve prognostic predictions for breast cancer patients through a multimodal deep learning approach. It integrates imaging and clinical data with a joint fusion strategy: a CNN processes DICOM breast MRI images (with .nrrd segmentation masks applied to the relevant series), while an RNN (LSTM/GRU) handles clinical data. Features extracted from both modalities are combined in a fusion layer to predict patient outcomes. This work serves as a reference implementation of multimodal fusion techniques in medical applications, aiming to enhance accuracy and support precision oncology.
We aim to predict breast cancer recurrence by fusing imaging features (from a CNN on MRI scans) with clinical features (via an RNN on electronic health record data). Our pipeline:
- Extract CNN features from DICOM MRI volumes (TumorFeatureCNN).
- Extract RNN features from tabular clinical data (advanced bidirectional LSTM).
- Fuse the two feature sets in a simple fully‑connected “fusion” model.
This approach leverages contextual patient information alongside imaging to improve prognostic accuracy.
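The joint fusion step can be illustrated with a minimal PyTorch sketch. Layer sizes and the class name `FusionModel` are illustrative assumptions, not the project's exact architecture: pre-extracted CNN and RNN feature vectors are concatenated and passed through a small fully-connected head that outputs a recurrence probability.

```python
import torch
import torch.nn as nn

class FusionModel(nn.Module):
    """Toy joint-fusion head: concatenate modality features, predict recurrence."""
    def __init__(self, cnn_dim=128, rnn_dim=64, hidden=32):
        super().__init__()
        self.fusion = nn.Sequential(
            nn.Linear(cnn_dim + rnn_dim, hidden),  # joint fully-connected layer
            nn.ReLU(),
            nn.Linear(hidden, 1),                  # single recurrence logit
        )

    def forward(self, cnn_feats, rnn_feats):
        fused = torch.cat([cnn_feats, rnn_feats], dim=1)  # concatenate modalities
        return torch.sigmoid(self.fusion(fused))          # probability in [0, 1]

model = FusionModel()
out = model(torch.randn(4, 128), torch.randn(4, 64))
print(out.shape)  # torch.Size([4, 1])
```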
Literature Review & Architecture Design
- Studied multimodal fusion in medical imaging.
- Chose a CNN to capture spatial tumor characteristics.
- Chose an RNN to model time‑series and tabular clinical features.
- Designed a fusion layer to join both modalities for final prediction.
Implementation Steps
- CNN component (Mason)
  - Load & preprocess DICOM series + optional segmentation masks.
  - Define `TumorFeatureCNN` (3 conv layers + adaptive pool).
  - Serialize extracted features to `cnn_features.pkl`.
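A hypothetical sketch of what a `TumorFeatureCNN`-style extractor could look like (layer widths and the 2D input shape are assumptions for illustration; the real model lives in the CNN scripts): three convolutional blocks, adaptive average pooling, and pickling of the resulting feature matrix.

```python
import pickle
import torch
import torch.nn as nn

class TumorFeatureCNN(nn.Module):
    """Illustrative 3-conv-layer feature extractor with adaptive pooling."""
    def __init__(self, in_channels=1, feat_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, feat_dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # collapse spatial dims -> (N, feat_dim, 1, 1)
        )

    def forward(self, x):
        return self.features(x).flatten(1)  # (N, feat_dim) feature vectors

cnn = TumorFeatureCNN()
feats = cnn(torch.randn(2, 1, 64, 64)).detach().numpy()  # two toy slices
with open("cnn_features.pkl", "wb") as f:
    pickle.dump(feats, f)  # serialize extracted features for the fusion step
```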
- Baseline RNN & Fusion (Alex)
  - Preprocess & encode clinical Excel sheet.
  - Train a SimpleRNN baseline.
  - Fuse baseline CNN+RNN features in `fusion_layer.py`.
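The clinical preprocessing step might look like the following sketch. The column names `Age`, `Menopause`, and `Recurrence` are stand-ins for the actual sheet's columns: categorical fields are integer-encoded and numeric fields standardized before feeding the RNN.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Toy stand-in for Clinical_and_Other_Features.xlsx (column names assumed)
df = pd.DataFrame({
    "Age": [45, 62, 58],
    "Menopause": ["pre", "post", "post"],
    "Recurrence": [0, 1, 0],
})

df["Menopause"] = LabelEncoder().fit_transform(df["Menopause"])  # categorical -> int
df[["Age"]] = StandardScaler().fit_transform(df[["Age"]])        # zero mean, unit variance

X = df.drop(columns="Recurrence").to_numpy()  # model inputs
y = df["Recurrence"].to_numpy()               # recurrence labels
```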
- Advanced RNN & Fusion (Austine)
  - Build bidirectional LSTM with regularization.
  - Improve fusion layer with threshold tuning & cross-validation.
  - Serialize final features to `rnn_features.pkl`.
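A minimal PyTorch sketch of a bidirectional LSTM feature extractor with dropout regularization, as described above (the class name and all dimensions are illustrative assumptions):

```python
import torch
import torch.nn as nn

class ClinicalBiLSTM(nn.Module):
    """Illustrative regularized BiLSTM that emits fixed-size clinical features."""
    def __init__(self, n_features=16, hidden=32, feat_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, num_layers=2, batch_first=True,
                            bidirectional=True, dropout=0.3)  # dropout between layers
        self.proj = nn.Linear(2 * hidden, feat_dim)  # forward+backward -> feat_dim

    def forward(self, x):            # x: (batch, seq_len, n_features)
        out, _ = self.lstm(x)
        return self.proj(out[:, -1]) # features from the last timestep

feats = ClinicalBiLSTM()(torch.randn(4, 10, 16))
print(feats.shape)  # torch.Size([4, 64])
```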
- Language: Python 3.8+
- Core Libraries:
- TensorFlow 2.x / Keras
- PyTorch
- scikit-learn, imbalanced-learn
- pandas, numpy, matplotlib
- pydicom, SimpleITK (for DICOM/NRRD I/O)
- Optional: Google Colab (for free GPU)
- Data Files:
  - `Clinical_and_Other_Features.xlsx` (included)
  - DICOM MRI volumes & NRRD masks (download separately; see below)
Install via:

```bash
pip install tensorflow torch torchvision scikit-learn imbalanced-learn pandas numpy matplotlib pydicom SimpleITK
```
Google Colab
- We developed and tested the notebooks (`.ipynb`) on Colab to leverage free GPU/TPU resources, speeding up CNN training on large MRI volumes and RNN training on tabular data.
- Simply open a notebook in Colab, connect to a GPU runtime, install any missing packages, and run the cells in order.
Local Python
- All steps are mirrored in standalone `.py` scripts so you can run end-to-end on your own machine.
- Requires installing the Python dependencies (see Section 3).
- Supports GPU if you have CUDA-enabled hardware; otherwise runs on CPU (expect longer training times).
- Already included: `Clinical_and_Other_Features.xlsx`.
- Not included (too large): DICOM MRI volumes & NRRD masks.
- Download the Duke Breast Cancer MRI dataset from TCIA:
  https://wiki.cancerimagingarchive.net/display/Public/Duke-Breast-Cancer-MRI
- Organize as:

  ```
  data/images/<patient_id>/*.dcm
  data/masks/<patient_id>/Segmentation_<patient_id>_Breast.seg.nrrd
  ```
- Place a CSV (`clinical.csv`) with columns `Name`, `Recurrence` under `data/clinical.csv`.
- `CNN_with_pickle.ipynb` → produces `cnn_features.pkl`
- `Clinical_Data_RNN.ipynb` → produces `rnn_features.pkl`
- `Fusion_layer.ipynb` → trains & evaluates the fused model
```bash
python cnn_with_pickle.py \
  --images_dir data/images/ \
  --masks_dir data/masks/ \
  --clinical_csv data/clinical.csv \
  --output cnn_features.pkl
```
```bash
python clinical_data_rnn.py \
  --input Clinical_and_Other_Features.xlsx \
  --output rnn_features.pkl
```
```bash
python fusion_layer.py \
  --cnn_features cnn_features.pkl \
  --rnn_features rnn_features.pkl
```
- Inspect console output for confusion matrices, ROC AUC, and precision‑recall metrics.
- Visual artifacts (plots) are saved in the working directory.
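The reported metrics can be reproduced on any set of predictions with scikit-learn; the sketch below (toy labels and probabilities, not real results) also illustrates the kind of probability-threshold tuning mentioned for the advanced fusion layer.

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, roc_auc_score,
                             precision_score, recall_score)

y_true = np.array([0, 1, 1, 0, 1, 0])                 # toy recurrence labels
y_prob = np.array([0.2, 0.8, 0.6, 0.3, 0.4, 0.1])     # toy model probabilities

print("ROC AUC:", roc_auc_score(y_true, y_prob))       # threshold-independent

for thr in (0.3, 0.5):                                 # simple threshold tuning
    y_pred = (y_prob >= thr).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(f"thr={thr}: TN={tn} FP={fp} FN={fn} TP={tp}",
          "precision:", precision_score(y_true, y_pred),
          "recall:", recall_score(y_true, y_pred))
```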
This project uses the Duke Breast Cancer MRI dataset, including the clinical and other features, which is licensed under Creative Commons (CC BY-NC 4.0). The dataset is provided by The Cancer Imaging Archive (TCIA). For more details and to access the dataset, please visit: https://www.cancerimagingarchive.net/collection/duke-breast-cancer-mri DOI: 10.7937/TCIA.e3sv-re93
- Mason — Advanced CNN model, feature extraction, & serialization
- Alex — Presentation, sprint planning, baseline RNN model & baseline fusion layer
- Austine — Clinical data encoding, advanced RNN model & advanced fusion layer