🎓 Academic Context: This project is a production refactor of a research thesis. The original "notebook-style" research code and full experiments can be found here: 📂 Link to Original Thesis Repository
A production-grade Machine Learning microservice for real-time Network Intrusion Detection. This project refactors an academic thesis ("Optimization of SVM utilizing PCA") into a scalable, containerized REST API.
It uses Principal Component Analysis (PCA) to reduce the network traffic feature space by about 70% (78 → 23 features).
The system transforms raw network traffic vectors into threat predictions using a strict scikit-learn pipeline.
graph LR
A[Client Request] -- "JSON - 78 Features" --> B(FastAPI Endpoint)
B --> C{Input Validation}
C -- Valid --> D[Standard Scaler]
D -- Normalized --> E[PCA Transform]
E -- "Reduced - 23 Features" --> F[SVM Classifier]
F -- Prediction --> G[Response]
C -- Invalid --> H[400 Error]
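
The flow in the diagram above maps naturally onto a scikit-learn `Pipeline`. The following is a minimal sketch, not the project's actual training code: the synthetic data, the RBF kernel, and `probability=True` are assumptions made for illustration. Only the three stages and the 78 → 23 dimensions come from this README.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

N_FEATURES = 78     # raw flow features per request (from the README)
N_COMPONENTS = 23   # principal components kept (from the README)

# Synthetic stand-in for CIC-IDS-2017 rows; the real training data is not shown here.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, N_FEATURES))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # arbitrary binary label for the demo

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=N_COMPONENTS)),
    # Kernel and probability=True are assumptions; the README only names SVM.
    ("svm", SVC(kernel="rbf", probability=True, random_state=0)),
])
pipeline.fit(X, y)

# Slice off the classifier to inspect the scaler + PCA output on one sample.
reduced = pipeline[:-1].transform(X[:1])
print(reduced.shape)  # (1, 23)
print(pipeline.predict(X[:1]), pipeline.predict_proba(X[:1]))
```

Keeping all three stages in one `Pipeline` object means the scaler and PCA fitted at training time are applied identically at inference time, which is presumably what "strict" refers to.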
- Dimensionality Reduction: Compresses 78 CIC-IDS-2017 features into 23 principal components using PCA.
- Production API: Exposes the model via FastAPI with strict Pydantic schema validation.
- Containerized: Fully dockerized environment using python:3.10-slim for consistent deployment.
- Performance:
  - Accuracy: ~86-88% (benchmarked against the CIC-IDS-2017 dataset).
  - Latency: Sub-millisecond internal inference time.
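
The latency figure is described as internal inference time, so it presumably excludes network and serialization overhead. One way such a number (and the `processing_time_ms` field in the example response) could be measured is with `time.perf_counter`. The helper name `timed_call` and the stand-in predictor below are hypothetical and not taken from the project.

```python
import time
from typing import Any, Callable, Tuple


def timed_call(fn: Callable[..., Any], *args: Any) -> Tuple[Any, float]:
    """Run fn(*args) and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = fn(*args)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


def fake_predict(features):
    # Hypothetical stand-in for the real scaler -> PCA -> SVM call.
    return "ATTACK" if sum(features) > 0 else "BENIGN"


label, elapsed_ms = timed_call(fake_predict, [0.5] * 78)
print(label, f"{elapsed_ms:.3f} ms")
```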
├── app/
│ ├── core/ # Config & Settings
│ ├── schemas/ # Pydantic Models (Input/Output Contracts)
│ ├── services/ # Inference Engine (Singleton Pattern)
│ └── main.py # API Entrypoint
├── models/ # Serialized Artifacts (Scaler, PCA, SVM)
├── Dockerfile # Multi-stage build instructions
└── requirements.prod.txt
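
The `services/` entry mentions a singleton inference engine, meaning the serialized artifacts are loaded once per process rather than on every request. A minimal sketch of that idea using `functools.lru_cache` follows. The class name `InferenceEngine`, the function `get_engine`, and the stubbed loading step are assumptions, and the project may implement the pattern differently.

```python
from functools import lru_cache


class InferenceEngine:
    """Holds the model artifacts for the lifetime of the process."""

    def __init__(self, model_dir: str = "models") -> None:
        # A real service would deserialize the scaler, PCA, and SVM here;
        # this stub only records the directory so the sketch stays runnable.
        self.model_dir = model_dir
        self.ready = True


@lru_cache(maxsize=1)
def get_engine() -> InferenceEngine:
    """Create the engine on first use and return the same instance afterwards."""
    return InferenceEngine()


first = get_engine()
second = get_engine()
print(first is second)  # True
```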
- Run with Docker (Recommended)
# Build the image
docker build -t ids-api:v1 .
# Run container (Exposed on port 8000)
docker run -d -p 8000:8000 --name ids-service ids-api:v1
- API Endpoints
| Method | Endpoint | Description |
|---|---|---|
| GET | /health | Health check and model status |
| POST | /predict | Main inference endpoint |
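
For callers that prefer Python over `curl`, a request to `/predict` could be assembled with the standard library as sketched below. The helper name `build_predict_request` and the client-side length check are illustrative assumptions. The host and port match the Docker command above, and the `features` key matches the request body used in this README.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # port published by the docker run command
N_FEATURES = 78


def build_predict_request(features: list) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /predict endpoint."""
    if len(features) != N_FEATURES:
        raise ValueError(f"expected {N_FEATURES} features, got {len(features)}")
    body = json.dumps({"features": [float(x) for x in features]}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_predict_request([0.0] * N_FEATURES)
print(request.get_method(), request.full_url)

# Sending requires the service to be running:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```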
- Example Request
Input: Raw feature vector (78 floats) representing network flow statistics.
curl -X POST "http://localhost:8000/predict" \
-H "Content-Type: application/json" \
-d '{"features": [80, 10452, 0, ... (78 features) ...]}'
Output:
{
"threat_detected": true,
"confidence": 0.985,
"label": "ATTACK",
"processing_time_ms": 0.42
}

The transition from raw features to PCA features demonstrated a massive reduction in complexity with minimal loss in detection capability.
| Metric | Original (78 Features) | PCA (23 Features) | Impact |
|---|---|---|---|
| Information Retained | 100% | 95% | 5% Loss |
| Training Time | High | Low | Speedup |
| Accuracy (Weighted) | 0.88 | 0.86 | ~2% Drop |
Note: The slight drop in accuracy is a strategic trade-off for the massive gain in throughput required for real-time network monitoring.
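
The headline numbers can be cross-checked with simple arithmetic. The snippet below uses only values stated in this README (78 and 23 features, weighted accuracy 0.88 and 0.86); the variable names are illustrative.

```python
original_features = 78
pca_features = 23
acc_original = 0.88
acc_pca = 0.86

# Fraction of input dimensions removed by PCA.
feature_reduction = 1 - pca_features / original_features

# Accuracy cost, in absolute points and relative to the original model.
absolute_drop = acc_original - acc_pca
relative_drop = absolute_drop / acc_original

print(f"Feature reduction: {feature_reduction:.1%}")  # 70.5%
print(f"Accuracy drop: {absolute_drop:.2f} absolute, {relative_drop:.1%} relative")
```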