This project implements a machine learning pipeline for analyzing EEG frequency patterns to detect Alzheimer's Disease (AD). It focuses on identifying and analyzing frequency-based differences between AD patients and age-matched healthy controls using EEG data.
The data come from the article "Resting state EEG biomarkers of cognitive decline associated with Alzheimer's disease and mild cognitive impairment", a copy of which is attached to the repository.
The project analyzes EEG data to:
- Identify characteristic frequencies that distinguish between AD patients and controls
- Develop robust feature selection methods for EEG analysis
- Build and validate machine learning models for AD detection
- Analyze the trade-offs between different classification approaches
.
├── data/ # Data directory (not included in repo)
│ ├── processed_data.csv
│ ├── X_ml.csv
│ └── y_ml.csv
├── results/ # Results and model outputs
│ ├── models/
│ ├── plots/
│ └── metrics/
├── exploration.py # Initial data exploration
├── preprocessing.py # Data preprocessing pipeline
├── feature_engineering.py # Feature selection and engineering
├── model_training.py # Model training and evaluation
├── report.md # Detailed analysis report
└── README.md # This file
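Because the `data/` directory is not shipped with the repository, it can help to check the layout before running anything. The snippet below is a minimal sketch, not part of the project: the helper names `missing_data_files` and `ensure_result_dirs` are hypothetical, and the paths are copied from the tree above. The README does not say which of these files are inputs and which are produced by earlier steps, so the check only reports what is absent.

```python
from pathlib import Path

# Relative paths copied from the project tree above.
EXPECTED_DATA_FILES = ("data/processed_data.csv", "data/X_ml.csv", "data/y_ml.csv")
RESULT_DIRS = ("results/models", "results/plots", "results/metrics")


def missing_data_files(root="."):
    """Return the expected data files that are absent under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED_DATA_FILES if not (root / rel).is_file()]


def ensure_result_dirs(root="."):
    """Create the results sub-directories if they do not exist yet."""
    for rel in RESULT_DIRS:
        (Path(root) / rel).mkdir(parents=True, exist_ok=True)


if __name__ == "__main__":
    missing = missing_data_files()
    if missing:
        print("Missing data files:", ", ".join(missing))
    else:
        print("All expected data files are present.")
```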
The analysis identified several key frequency bands that distinguish AD patients from controls:
- Theta (3-5 Hz) - Global
- Delta (1-3 Hz) - Temporal regions
- Alpha (10-13 Hz) - Central regions
- Beta (13-20 Hz) - Parietal regions
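To make the band definitions above concrete, here is a minimal sketch of how per-band power could be computed for a single EEG channel with NumPy. It is an illustration under stated assumptions, not the project's actual feature extraction: the function name `band_powers`, the plain periodogram estimate, the half-open `[low, high)` band edges, and the 128 Hz sampling rate in the demo are all assumptions.

```python
import numpy as np

# Band edges in Hz, copied from the list above.
# Treated here as half-open intervals [low, high); that choice is an assumption.
BANDS = {
    "delta": (1.0, 3.0),
    "theta": (3.0, 5.0),
    "alpha": (10.0, 13.0),
    "beta": (13.0, 20.0),
}


def band_powers(signal, fs):
    """Return the mean periodogram power per band for one channel."""
    signal = np.asarray(signal, dtype=float)
    signal = signal - signal.mean()  # remove the DC offset
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / fs)
    power = np.abs(np.fft.rfft(signal)) ** 2 / signal.size
    return {
        name: float(power[(freqs >= low) & (freqs < high)].mean())
        for name, (low, high) in BANDS.items()
    }


if __name__ == "__main__":
    fs = 128  # hypothetical sampling rate in Hz
    t = np.arange(0, 4, 1 / fs)  # 4 seconds of samples
    demo = np.sin(2 * np.pi * 4 * t)  # synthetic 4 Hz tone
    powers = band_powers(demo, fs)
    print(max(powers, key=powers.get))
```

The signal must be long enough for each band to contain at least one frequency bin; with very short windows a smoothed estimate such as Welch's method is usually a better fit than a raw periodogram.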
The best-performing model (CatBoost) achieved:
- 82% sensitivity
- 85% specificity
- 0.90 ROC-AUC score
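For readers unfamiliar with these metrics, the sketch below shows how sensitivity, specificity, and ROC-AUC are defined, using only the standard library. It assumes labels are coded as 1 for AD and 0 for control, which this README does not state, and it is not the evaluation code from `model_training.py`.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity); labels assumed 1 = AD, 0 = control.

    Both classes must be present, otherwise a division by zero occurs.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def roc_auc(y_true, scores):
    """ROC-AUC as the share of (AD, control) pairs ranked correctly.

    Ties between an AD score and a control score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    y_true = [1, 1, 1, 1, 0, 0, 0, 0]
    y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
    print(sensitivity_specificity(y_true, y_pred))
```

The pairwise AUC is quadratic in the number of samples, which is fine for illustration; for real evaluation a library routine such as scikit-learn's `roc_auc_score` is the usual choice.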
- Python 3.8+
- Required packages:
  - pandas
  - numpy
  - scikit-learn
  - xgboost
  - catboost
  - optuna
  - imbalanced-learn
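A quick way to confirm the environment matches the requirements above is to look up each package's import name. The following is a minimal sketch; the helper `missing_packages` is hypothetical. Note that two packages are imported under a different name than they are installed with: `scikit-learn` as `sklearn` and `imbalanced-learn` as `imblearn`.

```python
import sys
from importlib.util import find_spec

# Install name -> import name, for the packages listed above.
REQUIRED = {
    "pandas": "pandas",
    "numpy": "numpy",
    "scikit-learn": "sklearn",
    "xgboost": "xgboost",
    "catboost": "catboost",
    "optuna": "optuna",
    "imbalanced-learn": "imblearn",
}


def missing_packages(required=REQUIRED):
    """Return install names of packages whose module cannot be found."""
    return [name for name, module in required.items() if find_spec(module) is None]


if __name__ == "__main__":
    if sys.version_info < (3, 8):
        print("Python 3.8 or newer is required.")
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```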
- Data Preprocessing:
  `python preprocessing.py`
- Feature Engineering:
  `python feature_engineering.py`
- Model Training:
  `python model_training.py`

Detailed results can be found in the report.md file, which includes:
- Complete feature importance analysis
- Model performance metrics
- Clinical implications
- Methodological trade-offs
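The three pipeline scripts (preprocessing, feature engineering, model training) can also be chained from a single entry point. The sketch below is one possible wrapper, not a file in this repository: the names `build_commands` and `run_pipeline` are hypothetical, and it assumes the scripts take no command-line arguments and are run from the repository root.

```python
import subprocess
import sys

# Script names from the project layout, in the order they are meant to run.
STEPS = ["preprocessing.py", "feature_engineering.py", "model_training.py"]


def build_commands(steps=STEPS, python=sys.executable):
    """Return one command (as an argument list) per pipeline step."""
    return [[python, step] for step in steps]


def run_pipeline(commands, runner=subprocess.run):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        # check=True makes subprocess.run raise CalledProcessError
        # when a step exits with a non-zero status.
        runner(cmd, check=True)


if __name__ == "__main__":
    run_pipeline(build_commands())
```

Passing the runner as a parameter keeps the wrapper easy to dry-run: supplying a function that only records the commands shows what would be executed without starting any process.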