This data science project analyzes the MIMIC-III (Medical Information Mart for Intensive Care III) clinical database to investigate factors affecting mortality rates of ICU patients at Beth Israel Deaconess Medical Center from 2001 to 2012. The analysis explores temporal patterns, medication effects, and demographic factors that may influence patient outcomes.
The primary research question addressed in this project is:
"What factors affected the mortality rate of ICU patients hospitalized at Beth Israel Deaconess Medical Center from 2001 to 2012?"
- Temporal Analysis: Are there seasonal or monthly patterns in ICU mortality rates?
- Medication Analysis: Do prescribed drugs and their duration correlate with mortality rates?
- Demographics Analysis: Can demographic factors (ethnicity, language, religion) predict patient mortality?
The project uses the MIMIC-III Clinical Database v1.4, which contains de-identified health data from approximately 60,000 ICU stays. Key datasets analyzed include:
- PATIENTS.csv: Patient demographics and basic information
- ADMISSIONS.csv: Hospital admission details
- ICUSTAYS.csv: ICU stay information
- PRESCRIPTIONS.csv: Medication prescriptions
- NOTEEVENTS.csv: Clinical notes
- D_ICD_DIAGNOSES.csv: ICD diagnosis codes
- Python 3.7+
- Access to MIMIC-III dataset
pip install pandas numpy matplotlib seaborn scikit-learn statsmodels scipy

- Get access to the MIMIC-III dataset
- Place the dataset files in the mimic-iii-clinical-database-1.4/ directory
- Run CleanedData.ipynb to preprocess the raw data
- Monthly variation exists: April showed the highest mortality rate, June the lowest
- Statistical significance: A two-sample proportion z-test revealed significant differences between months (z > 3)
- Limited predictability: Multi-class classification models achieved only ~7% accuracy in predicting month of death from diagnosis variables, no better than the roughly 8.3% (1/12) expected from uniform random guessing among 12 months
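The two-sample proportion test behind the monthly comparison can be sketched as follows. This is a minimal, dependency-free illustration rather than the notebook's actual code: the function name `two_proportion_z_test` and the death and admission counts are hypothetical, and the pooled-variance form of the test is an assumption about how the comparison was run.

```python
import math


def two_proportion_z_test(deaths_a, n_a, deaths_b, n_b):
    """Return (z, two-sided p-value) for H0: both groups share one mortality rate."""
    p_a = deaths_a / n_a
    p_b = deaths_b / n_b
    # Pooled proportion under the null hypothesis of equal rates
    p_pool = (deaths_a + deaths_b) / (n_a + n_b)
    standard_error = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / standard_error
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts for illustration only, not values from MIMIC-III
z, p = two_proportion_z_test(deaths_a=600, n_a=5000, deaths_b=500, n_b=5000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

When twelve months are compared pairwise, many tests are run at once, so a multiple-comparison correction may be worth considering when interpreting individual z-scores.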
- Weak correlations: Associations between drug categories (ATC levels) and mortality were statistically significant but weak, with Cramér's V ranging from 0.0516 to 0.1448
- Drug duration: No strong correlation found between prescription duration and mortality
- Feature engineering: External ATC (Anatomical Therapeutic Chemical) dataset was integrated for drug classification
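For readers unfamiliar with Cramér's V, the sketch below shows how the statistic is derived from a chi-square value on a contingency table. It is an illustration under stated assumptions: the function name `cramers_v` and the counts are hypothetical, no bias correction is applied, and every row and column is assumed to have a nonzero total.

```python
import math


def cramers_v(table):
    """Return (chi2, V) for a contingency table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if drug category and outcome were independent
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # V rescales chi2 by sample size and the smaller table dimension minus one
    k = min(len(table), len(table[0])) - 1
    return chi2, math.sqrt(chi2 / (n * k))


# Hypothetical counts: rows are drug categories, columns are (survived, died)
table = [
    [900, 100],
    [850, 150],
    [800, 200],
]
chi2, v = cramers_v(table)
print(f"chi2 = {chi2:.2f}, V = {v:.4f}")
```

Because chi-square grows with sample size while V does not, a large dataset can produce a highly significant test alongside a small V, which is the pattern reported above.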
- Strong predictive power: Support Vector Machine models achieved ~90% accuracy in predicting mortality from demographic variables
- Key factors: Religion, ethnicity, and language were all individually predictive of mortality
- Statistical significance: Chi-square tests revealed significant associations between all demographic categories and mortality outcomes
- Quality control: Removed invalid entries, handled missing values, normalized categorical variables
- Date consistency: Converted temporal variables to pandas datetime format
- Feature engineering: Created derived variables including monthly mortality rates, prescription durations, and one-hot encoded categorical variables
- Data validation: Dropped inconsistent demographic entries and invalid date ranges
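The date handling and duration steps above can be sketched as follows. The project performed this work with pandas; this version uses only the standard library so it runs without the dataset. The column names `STARTDATE` and `ENDDATE`, the timestamp format, the helper names, and the sample rows are assumptions for illustration.

```python
from datetime import datetime

# Assumed timestamp layout of the raw CSV values
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_date(value):
    """Return a datetime, or None when the value is missing or malformed."""
    try:
        return datetime.strptime(value, DATE_FORMAT)
    except (TypeError, ValueError):
        return None


def clean_prescriptions(rows):
    """Drop rows with unusable dates and add a prescription duration in days."""
    cleaned = []
    for row in rows:
        start = parse_date(row.get("STARTDATE"))
        end = parse_date(row.get("ENDDATE"))
        # Skip missing dates and ranges that end before they start
        if start is None or end is None or end < start:
            continue
        cleaned.append({
            **row,
            "STARTDATE": start,
            "ENDDATE": end,
            "DURATION_DAYS": (end - start).days,
        })
    return cleaned


# Hypothetical rows: one valid, one with a missing end date, one reversed range
rows = [
    {"SUBJECT_ID": 1, "STARTDATE": "2175-06-11 00:00:00", "ENDDATE": "2175-06-14 00:00:00"},
    {"SUBJECT_ID": 2, "STARTDATE": "2175-06-11 00:00:00", "ENDDATE": ""},
    {"SUBJECT_ID": 3, "STARTDATE": "2180-01-05 00:00:00", "ENDDATE": "2180-01-02 00:00:00"},
]
cleaned = clean_prescriptions(rows)
print(len(cleaned), cleaned[0]["DURATION_DAYS"])
```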
- Temporal patterns: Created monthly mortality rate calculations and seasonal analysis
- Association testing: Used Cramér's V for categorical associations and chi-square tests for independence
- Hypothesis testing: Two-sample proportion tests for monthly mortality differences
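The monthly mortality rate calculation listed above can be sketched as a simple group-by on calendar month. This is one plausible reading, not necessarily the notebook's definition: it assumes the rate is deaths divided by admissions, grouped by admission month, and the function name and sample records are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def monthly_mortality_rates(admissions):
    """Map calendar month (1-12) to deaths / admissions for that month.

    `admissions` is an iterable of (admit_time, died) pairs, where `died`
    is 1 if the patient died during the stay and 0 otherwise.
    """
    counts = defaultdict(lambda: [0, 0])  # month -> [deaths, admissions]
    for admit_time, died in admissions:
        counts[admit_time.month][0] += int(died)
        counts[admit_time.month][1] += 1
    return {month: deaths / total for month, (deaths, total) in sorted(counts.items())}


# Hypothetical records; the years are far in the future because MIMIC-III shifts dates
admissions = [
    (datetime(2150, 4, 3), 1),
    (datetime(2150, 4, 20), 0),
    (datetime(2170, 4, 11), 1),
    (datetime(2161, 6, 1), 0),
    (datetime(2161, 6, 9), 0),
    (datetime(2170, 6, 30), 1),
]
rates = monthly_mortality_rates(admissions)
for month, rate in rates.items():
    print(month, round(rate, 3))
```

Grouping by month rather than by full date is what makes this workable on shifted dates, since the shifted years carry no meaning.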
- Multi-class classification: Logistic regression for predicting month of death
- Binary classification: Support Vector Machine (SVM) for mortality prediction
- Cross-validation: 100-iteration cross-validation for model reliability
- Sampling strategy: In each iteration, 20% of the data was used for training and 5% for testing; the remaining 75% was discarded for computational efficiency
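One way to read the sampling and validation scheme above is as repeated random subsampling: each iteration draws a fresh 20% training set and a disjoint 5% test set, and accuracy is summarized across iterations. The sketch below follows that reading, which is an assumption. The function names are hypothetical, the labels are synthetic, and a majority-class predictor stands in for fitting the SVM.

```python
import random


def subsample_split(n_rows, train_frac=0.20, test_frac=0.05, rng=None):
    """Return disjoint (train, test) index lists; the remaining rows are unused."""
    rng = rng or random.Random()
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_train = int(n_rows * train_frac)
    n_test = int(n_rows * test_frac)
    return indices[:n_train], indices[n_train:n_train + n_test]


def repeated_evaluation(n_rows, evaluate, iterations=100, seed=0):
    """Run `evaluate(train, test)` on fresh splits; return (mean, min, max) score."""
    rng = random.Random(seed)
    scores = []
    for _ in range(iterations):
        train, test = subsample_split(n_rows, rng=rng)
        scores.append(evaluate(train, test))
    return sum(scores) / len(scores), min(scores), max(scores)


# Synthetic labels for illustration: 1 = died, 0 = survived
labels = [1 if i % 10 == 0 else 0 for i in range(1000)]


def majority_class_accuracy(train, test):
    """Placeholder model: always predict the most common training label."""
    majority = 1 if 2 * sum(labels[i] for i in train) > len(train) else 0
    return sum(labels[i] == majority for i in test) / len(test)


mean_acc, low, high = repeated_evaluation(len(labels), majority_class_accuracy)
print(f"mean = {mean_acc:.3f}, range = {low:.3f}-{high:.3f}")
```

To use this with a real model, replace `majority_class_accuracy` with a function that fits on the training indices and scores on the test indices.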
Demographics (ethnicity, language, religion) are strong predictors of ICU mortality at this hospital (90% accuracy), while temporal factors show statistical differences but limited practical predictability. Medication factors show weak associations with outcomes.
- Demographics SVM: 90% average accuracy (88-93% range)
- Temporal multi-class: 7% average accuracy (5-12% range)
- Medication correlation: Weak but significant associations (Cramér's V < 0.15)
- Single institution: Results specific to Beth Israel Deaconess Medical Center
- Date shifting: MIMIC-III dates are shifted to protect privacy, though temporal relationships are preserved
- Sample bias: Demographics findings may not generalize to other hospitals or populations
- Model constraints: Limited independent variables available for temporal analysis
- Investigate more detailed medication interactions and dosage effects
- Expand temporal analysis with additional clinical variables
- Validate demographic findings across multiple healthcare institutions
- Explore potential biases in healthcare delivery indicated by demographic predictors
- Data Cleaning: Start with CleanedData.ipynb to process raw MIMIC-III files
- Temporal Analysis: Run VisualizationTime.ipynb for mortality pattern analysis
- Medication Analysis: Execute VisualizationMedication.ipynb for drug correlation studies
- Demographics Analysis: Use VisualizationDemographics.ipynb
- Johnson, A., Pollard, T., & Mark, R. (2019). MIMIC-III Clinical Database Demo (version 1.4). PhysioNet. RRID:SCR_007345. https://doi.org/10.13026/C2HM2Q
- Johnson, A. E. W., Pollard, T. J., Shen, L., Lehman, L. H., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Celi, L. A., & Mark, R. G. (2016). MIMIC-III, a freely accessible critical care database. Scientific Data, 3, 160035.
- Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220. RRID:SCR_007345.
- MIMIC-III Documentation: https://mimic.physionet.org/