This research is funded by the Research Council of Norway through the SFI Norwegian Centre for Cybersecurity in Critical Sectors (NORCICS), under project number #310105. It is also supported by the European Union’s HORIZON Research and Innovation Programme under grant agreement No 101120657, project ENFIELD (European Lighthouse to Manifest Trustworthy and Green AI).
A comprehensive machine learning pipeline for classifying IoT network traffic attacks using deep learning models and GAN-based synthetic data generation for data augmentation.
- Overview
- Project Structure
- Pipeline Workflow
- Installation & Setup
- Usage
- File Descriptions
- Dependencies
- Configuration
- Output Structure
- Troubleshooting
This project implements a complete pipeline for IoT network traffic classification that includes:
- Data Balancing: Processes the CIC IoT Dataset 2023 and creates balanced training/test splits
- Feature Engineering: Extracts and scales features using StandardScaler
- GAN Training: Trains Generative Adversarial Networks for synthetic data generation
- Data Augmentation: Generates synthetic samples to balance dataset categories
- Classification: Trains deep neural networks (DNN/CNN) for attack classification
- Evaluation: Tests models on unseen attack types to evaluate generalization
The pipeline classifies network traffic into the following categories:
- Benign: Normal network traffic
- DDoS: Distributed Denial of Service attacks
- DoS: Denial of Service attacks
- Recon: Reconnaissance attacks
- Mirai: Mirai botnet attacks
- Web-based: Web application attacks (SQL injection, XSS, etc.)
- Brute Force: Dictionary-based attacks
- Spoofing: DNS and ARP spoofing attacks
SecureIoT/
├── 📄 Project Documentation
│ ├── README.md # This documentation file
│ └── requirements.txt # Python dependencies
│
├── 📁 Source Code (src/)
│ ├── 📁 data_processing/ # Data preprocessing pipeline
│ │ ├── balance_data.py # Data balancing and train/test splitting
│ │ ├── scaler_extraction.py # Feature scaling and preprocessing
│ │ ├── preprocess_features.py # Feature preprocessing utilities
│ │ └── drop_features.py # Feature selection utilities
│ │
│ ├── 📁 models/ # Model training and generation
│ │ ├── gan_training.py # GAN model training for data augmentation
│ │ ├── generate_samples_gan.py # Synthetic data generation
│ │ └── train_classify_models.py # Classifier training (DNN/CNN)
│ │
│ ├── 📁 analysis/ # Model evaluation and analysis
│ │ ├── unseen_test.py # Evaluation on unseen attack types
│ │ ├── feature_importance_analysis.py # Feature importance analysis
│ │ └── time_test.py # Performance timing analysis
│ │
│ └── 📁 utils/ # Utility functions and constants
│ └── constants.py # Attack mappings and configuration constants
│
├── 📁 Execution Scripts (scripts/)
│ ├── run_all.ps1 # PowerShell execution script (Windows)
│ └── run_all.sh # Bash execution script (Linux/Mac)
│
├── 📁 Data Directories (Created during execution)
│ ├── CIC_IOT_Dataset2023/
│ │ ├── CSV/ # Original dataset files
│ │ └── Balanced_CSV/ # Processed balanced datasets
│ ├── models/
│ │ └── gan/ # Trained GAN models
│ ├── classifiers/ # Trained classification models
│ ├── generated_data/ # Synthetic data generated by GANs
│ ├── scaler/ # Fitted scalers and preprocessors
│ └── plots/ # Generated analysis plots and figures
│
└── 📁 Virtual Environment
    └── .venv/ # Python virtual environment
graph TD
A[Raw CSV Data] --> B[src/data_processing/balance_data.py]
B --> C[Balanced Train/Test Sets]
C --> D[src/data_processing/scaler_extraction.py]
D --> E[Feature Scaler]
C --> F[src/models/gan_training.py]
E --> F
F --> G[Trained GAN Models]
G --> H[src/models/generate_samples_gan.py]
H --> I[Synthetic Data]
C --> J[src/models/train_classify_models.py]
I --> J
E --> J
J --> K[Trained Classifiers]
K --> L[src/analysis/unseen_test.py]
L --> M[Evaluation Results]
- Data Balancing (`src/data_processing/balance_data.py`)
  - Processes raw CSV files from CIC IoT Dataset 2023
  - Creates balanced train/test splits per attack category
  - Handles file-level and row-level splitting strategies
  - Outputs balanced datasets and tracking metadata
- Feature Scaling (`src/data_processing/scaler_extraction.py`)
  - Aggregates all training data across attack types
  - Fits a global StandardScaler on combined features
  - Handles outliers, infinities, and missing values
  - Saves fitted scaler for consistent preprocessing
- GAN Training (`src/models/gan_training.py`)
  - Trains separate GAN models for each attack type
  - Uses Generator-Discriminator architecture
  - Implements MinMax scaling for GAN training data
  - Saves trained models and training loss curves
- Synthetic Data Generation (`src/models/generate_samples_gan.py`)
  - Uses trained GANs to generate synthetic samples
  - Balances dataset by augmenting underrepresented classes
  - Maintains feature distributions and relationships
  - Outputs synthetic data with metadata
- Classifier Training (`src/models/train_classify_models.py`)
  - Trains deep neural networks (DNN/CNN architectures)
  - Combines real and synthetic data for training
  - Implements class weighting and regularization
  - Saves trained models with performance metrics
- Unseen Attack Evaluation (`src/analysis/unseen_test.py`)
  - Evaluates models on previously unseen attack types
  - Tests generalization capability of trained classifiers
  - Provides detailed classification reports and confusion matrices
- Python 3.11 or higher
- Windows (PowerShell) or Linux/Mac (Bash)
- At least 8GB RAM (16GB recommended for GAN training)
- GPU support recommended for faster training
- Clone the repository and navigate to project directory

      cd SecureIoT/

- Create and activate virtual environment

      # Windows
      python -m venv .venv
      .venv\Scripts\activate

      # Linux/Mac
      python3 -m venv .venv
      source .venv/bin/activate

- Install dependencies

      pip install -r requirements.txt

- Place dataset files
  - Download CIC IoT Dataset 2023 (updated 2024)
  - Extract CSV files to the `CIC_IOT_Dataset2023/CSV/` directory
  - Ensure folder structure matches the expected format
Windows (PowerShell):

    .\scripts\run_all.ps1

Linux/Mac (Bash):

    ./scripts/run_all.sh

    # Step 1: Balance and split data
    python -m src.data_processing.balance_data
    # Step 2: Extract and fit feature scaler
    python -m src.data_processing.scaler_extraction
    # Step 3: Train GAN models (time-intensive)
    python -m src.models.gan_training
    # Step 4: Generate synthetic data
    python -m src.models.generate_samples_gan
    # Step 5: Train classification models
    python -m src.models.train_classify_models
    # Step 6: Evaluate on unseen attacks (needs manual insertion of model path)
    python -m src.analysis.unseen_test

    # Feature importance analysis
    python -m src.analysis.feature_importance_analysis
    # Performance timing analysis
    python -m src.analysis.time_test

    # Train DNN classifier with synthetic data
    python -m src.models.train_classify_models

Note: All scripts should be run from the project root directory (SecureIoT/) to ensure proper module imports and file paths.
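The step order above can also be driven from Python. The following is a minimal sketch, not part of the repository: `PIPELINE_MODULES`, `build_command`, and `run_pipeline` are hypothetical names, and it assumes the script is launched from the project root so that `python -m` can resolve the `src` package. Step 6 is left out because it needs a model path inserted by hand.

```python
import subprocess
import sys

# Module names copied from the individual steps listed above, in execution order.
# Step 6 (src.analysis.unseen_test) is omitted: it needs a manually set model path.
PIPELINE_MODULES = [
    "src.data_processing.balance_data",
    "src.data_processing.scaler_extraction",
    "src.models.gan_training",
    "src.models.generate_samples_gan",
    "src.models.train_classify_models",
]


def build_command(module):
    """Return the argv that runs one pipeline module with the current interpreter."""
    return [sys.executable, "-m", module]


def run_pipeline(dry_run=True):
    """Run each stage in order and stop at the first failure.

    With dry_run=True the commands are only printed, not executed.
    """
    commands = [build_command(m) for m in PIPELINE_MODULES]
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError if a stage exits non-zero.
            subprocess.run(cmd, check=True)
    return commands


commands = run_pipeline(dry_run=True)
```

Calling `run_pipeline(dry_run=False)` would execute the stages for real; the dry run is the default here because GAN training alone can take hours.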
Purpose (`src/data_processing/balance_data.py`): Data preprocessing and balancing
- Input: Raw CSV files from CIC IoT Dataset 2023
- Output: Balanced train/test splits, metadata files
- Key Functions:
  - `find_files_for_attack_type()`: Discovers CSV files for each attack type
  - File-level vs row-level splitting strategies
  - Category-based balancing with configurable target sizes
- Configuration: `CATEGORY_TARGET_SIZE = 500_000` (500,000 samples per category)
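The source does not spell out how the two splitting strategies differ, so the sketch below is only one plausible reading: when an attack type has several CSV files, whole files are assigned to train or test (file-level); when it has a single file, its rows are split (row-level). `split_files`, `split_rows`, `split_attack_type`, and `test_fraction` are hypothetical names, not taken from `balance_data.py`.

```python
import random


def split_files(files, test_fraction=0.2, seed=42):
    """File-level split: each file goes entirely to train or entirely to test."""
    shuffled = sorted(files)
    random.Random(seed).shuffle(shuffled)
    # Reserve at least one test file, but always leave at least one for training.
    n_test = min(len(shuffled) - 1, max(1, round(len(shuffled) * test_fraction)))
    return shuffled[n_test:], shuffled[:n_test]


def split_rows(n_rows, test_fraction=0.2, seed=42):
    """Row-level split: shuffle the row indices of a single file and cut once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]


def split_attack_type(files, rows_in_single_file=0):
    """Use file-level splitting when possible, otherwise fall back to rows."""
    if len(files) >= 2:
        return ("file-level",) + split_files(files)
    return ("row-level",) + split_rows(rows_in_single_file)


# Hypothetical file names, for illustration only.
strategy, train, test = split_attack_type(["part-0.csv", "part-1.csv", "part-2.csv"])
```

Splitting by file where possible keeps rows from the same capture out of both sets at once, which is one reason a pipeline might prefer it over row-level shuffling.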
Purpose (`src/data_processing/scaler_extraction.py`): Feature engineering and scaling
- Input: Balanced training datasets
- Output: Fitted StandardScaler and feature metadata
- Key Functions:
  - `generate_global_scaler()`: Fits scaler on combined training data
  - Handles outliers, infinities, and missing values
  - Feature selection and preprocessing
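The cleaning and scaling behaviour listed above can be illustrated with NumPy alone. This is a sketch under stated assumptions, not the code in `scaler_extraction.py`: `clean_features` and `fit_standardizer` are hypothetical helpers, median imputation is an assumed choice, and the z-score computed here is the same transform scikit-learn's `StandardScaler` applies (mean removal and division by the population standard deviation).

```python
import numpy as np

OUTLIER_THRESHOLD = 3  # standard deviations, as in the configuration section


def clean_features(X, threshold=OUTLIER_THRESHOLD):
    """Replace inf/NaN with column medians, then clip to mean +/- threshold * std."""
    X = np.array(X, dtype=float)   # np.array makes a copy, the input is untouched
    X[~np.isfinite(X)] = np.nan    # treat +/-inf like missing values
    medians = np.nanmedian(X, axis=0)
    rows, cols = np.nonzero(np.isnan(X))
    X[rows, cols] = medians[cols]  # assumed imputation strategy: column median
    mean, std = X.mean(axis=0), X.std(axis=0)
    return np.clip(X, mean - threshold * std, mean + threshold * std)


def fit_standardizer(X):
    """Return per-feature mean and std; constant columns get std 1 to avoid 0/0."""
    mean, std = X.mean(axis=0), X.std(axis=0)
    std = np.where(std == 0, 1.0, std)
    return mean, std


raw = [[1.0, float("inf")], [2.0, float("nan")], [3.0, 4.0], [4.0, 6.0]]
cleaned = clean_features(raw)
mean, std = fit_standardizer(cleaned)
scaled = (cleaned - mean) / std
```

Fitting the statistics once on the combined training data and reusing them at test time is what keeps preprocessing consistent across the pipeline.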
Purpose (`src/data_processing/preprocess_features.py`): Feature preprocessing utilities
- Function: `preprocess_features()`
- Loads fitted scaler and applies consistent preprocessing
- Handles infinities, outliers, and missing value imputation
- Used across training and inference pipelines
Purpose (`src/data_processing/drop_features.py`): Feature selection and filtering
- Function: `drop_unused_features()`
- Removes non-informative or problematic features
- Configurable feature exclusion list
- Maintains consistency across train/test phases
Purpose (`src/models/gan_training.py`): GAN model training for data augmentation
- Input: Balanced training data per attack type
- Output: Trained Generator and Discriminator models
- Architecture:
- Generator: Dense layers with BatchNormalization and ReLU activation
- Discriminator: Dense layers with LeakyReLU and Dropout
- Training: 10,000 epochs with Adam optimizer and Binary Cross-Entropy loss
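A Keras sketch of the architecture described above may help make it concrete. Only the layer types (Dense, BatchNormalization, ReLU, LeakyReLU, Dropout), the latent dimension of 100, the Adam optimizer, and the binary cross-entropy loss come from this document; the layer widths, the dropout rate, the sigmoid output, and `n_features=46` are placeholder assumptions, and `build_generator` / `build_discriminator` are hypothetical names rather than functions from `gan_training.py`.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_generator(latent_dim=100, n_features=46):
    """Map a latent noise vector to one synthetic feature row."""
    return keras.Sequential([
        keras.Input(shape=(latent_dim,)),
        layers.Dense(128),
        layers.BatchNormalization(),
        layers.ReLU(),
        layers.Dense(256),
        layers.BatchNormalization(),
        layers.ReLU(),
        # Assumed output activation: sigmoid keeps values in [0, 1],
        # which fits MinMax-scaled training data.
        layers.Dense(n_features, activation="sigmoid"),
    ], name="generator")


def build_discriminator(n_features=46):
    """Score a feature row as real (close to 1) or synthetic (close to 0)."""
    return keras.Sequential([
        keras.Input(shape=(n_features,)),
        layers.Dense(256),
        layers.LeakyReLU(),
        layers.Dropout(0.3),
        layers.Dense(128),
        layers.LeakyReLU(),
        layers.Dropout(0.3),
        layers.Dense(1, activation="sigmoid"),
    ], name="discriminator")


generator = build_generator()
discriminator = build_discriminator()
discriminator.compile(optimizer=keras.optimizers.Adam(), loss="binary_crossentropy")
```

In practice `n_features` would be set to the number of columns left after feature selection, so that the generator output and the discriminator input match the scaled training data.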
Purpose (`src/models/generate_samples_gan.py`): Synthetic data generation using trained GANs
- Input: Trained GAN models and scalers
- Output: Synthetic samples for data augmentation
- Features:
- Configurable sample generation per attack type
- Feature distribution preservation
- Metadata tracking and validation
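How many samples to generate per attack type is configurable; one simple policy, assumed here purely for illustration, is to top each category up to `CATEGORY_TARGET_SIZE`. The helper `synthetic_quota` is hypothetical, and the counts below are made-up numbers, not statistics from the dataset.

```python
CATEGORY_TARGET_SIZE = 500_000  # value from the configuration section


def synthetic_quota(real_counts, target=CATEGORY_TARGET_SIZE):
    """Return how many synthetic rows each category needs to reach the target."""
    return {cat: max(0, target - n) for cat, n in real_counts.items()}


# Illustrative counts only; these are not statistics from the dataset.
real_counts = {"DDoS": 500_000, "Web-based": 24_000, "Brute Force": 13_000}
quota = synthetic_quota(real_counts)
```

Categories already at or above the target get a quota of zero, so only underrepresented classes are augmented.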
Purpose (`src/models/train_classify_models.py`): Classification model training
- Input: Real + synthetic training data, test data
- Output: Trained DNN/CNN classifiers with performance metrics
- Architectures:
- DNN: Multi-layer perceptron with batch normalization and dropout
- CNN: 1D convolutional layers for temporal pattern recognition
- Features: Class weighting, early stopping, learning rate reduction
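Class weighting can be illustrated with the common "balanced" heuristic, weight = n_samples / (n_classes * class_count), which is also what scikit-learn's `compute_class_weight` uses in its `"balanced"` mode. Whether `train_classify_models.py` uses exactly this formula is an assumption; `balanced_class_weights` is a hypothetical helper and the labels are toy data.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n / (n_classes * count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Toy label distribution: rare classes receive larger weights.
labels = ["Benign"] * 6 + ["DDoS"] * 3 + ["Recon"] * 1
weights = balanced_class_weights(labels)
```

Keras `Model.fit` accepts such a mapping through its `class_weight` argument, keyed by integer class index, so the string labels would first be encoded as indices.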
Purpose (`src/analysis/unseen_test.py`): Model evaluation on unseen attack types
- Input: Trained classifier, unseen attack data
- Output: Classification reports, confusion matrices, F1 scores
- Evaluation: Tests model generalization on previously unseen attack categories
Purpose (`src/analysis/feature_importance_analysis.py`): Feature importance analysis using SHAP and permutation importance
- Input: Trained models, test data
- Output: Feature importance plots, SHAP values, analysis reports
- Methods: SHAP explainability, permutation importance, feature ranking
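Permutation importance, one of the methods named above, measures how much a score drops when a single feature column is shuffled. The pure-Python sketch below shows the idea on a toy classifier; it is not the analysis script itself, `toy_model` stands in for a trained classifier, and all names are hypothetical.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model predicts the given label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, n_repeats=5, seed=0):
    """Mean accuracy drop when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            # Rebuild the rows with only column j replaced by its shuffled copy.
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(model, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_model(row):
    """Toy stand-in for a trained classifier: it only looks at feature 0."""
    return int(row[0] > 0.5)


rows = [[i / 19, (i * 7) % 5] for i in range(20)]
labels = [toy_model(r) for r in rows]
importances = permutation_importance(toy_model, rows, labels)
```

Because the toy model ignores feature 1, shuffling it changes no prediction and its importance is exactly zero, while shuffling feature 0 lowers accuracy.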
Purpose (`src/analysis/time_test.py`): Performance timing and benchmarking
- Input: Various model configurations
- Output: Timing reports and performance comparisons
- Analysis: Training time, inference speed, memory usage
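Inference speed can be measured with a small timing loop such as the one below. It is a generic sketch, not the contents of `time_test.py`: `time_inference` is a hypothetical helper, and `dummy_predict` stands in for a model's predict call so the example runs without a trained classifier.

```python
import time


def time_inference(predict, batch, n_runs=20, warmup=3):
    """Return mean seconds per call and per sample for a predict callable."""
    for _ in range(warmup):
        predict(batch)  # warm-up calls are not timed
    start = time.perf_counter()
    for _ in range(n_runs):
        predict(batch)
    per_call = (time.perf_counter() - start) / n_runs
    return {
        "seconds_per_call": per_call,
        "seconds_per_sample": per_call / len(batch),
    }


def dummy_predict(batch):
    """Stand-in for model.predict so the sketch runs without a trained model."""
    return [sum(row) > 0 for row in batch]


stats = time_inference(dummy_predict, [[0.1, -0.2, 0.3]] * 256)
```

Warm-up runs matter for deep learning frameworks in particular, since the first calls often include one-off graph tracing or memory allocation costs.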
Purpose (`src/utils/constants.py`): Configuration constants and mappings
- `ATTACK_CATEGORY_MAP`: Maps specific attacks to general categories
- `EXCLUDED_FOR_UNSEEN_TEST`: Defines attacks reserved for evaluation
- Centralized configuration for consistent attack categorization
Purpose (`scripts/run_all.ps1`, `scripts/run_all.sh`): Complete pipeline execution
- Executes all pipeline steps in correct order
- Error handling and progress reporting
- Cross-platform compatibility (Windows/Linux/Mac)
numpy>=1.24.0 # Numerical computing
pandas>=2.0.0 # Data manipulation
scikit-learn>=1.3.0 # Machine learning utilities
tensorflow>=2.13.0 # Deep learning framework
joblib>=1.3.0 # Model serialization
imbalanced-learn>=0.11.0 # SMOTE and resampling
matplotlib>=3.7.0 # Visualization
seaborn>=0.12.0 # Statistical plotting
tqdm>=4.65.0 # Progress bars
tensorflow → Model training (GANs, DNNs, CNNs)
scikit-learn → Preprocessing, metrics, utilities
pandas → Data manipulation and I/O
numpy → Numerical operations
joblib → Model and scaler persistence
matplotlib/seaborn → Visualization and plotting
CATEGORY_TARGET_SIZE = 500_000 # Target samples per category
CATEGORY_TEST_CAP = 150_000 # Maximum test samples per category

latent_dim = 100 # Generator input dimension
batch_size = 128 # Training batch size
epochs = 10_000 # Training epochs
save_interval = 500 # Model checkpoint interval

outlier_threshold = 3 # Standard deviations for outlier clipping

The pipeline uses a hierarchical attack categorization:
| Specific Attack | Category | Description |
|---|---|---|
| DDoS-SYN_Flood, DDoS-UDP_Flood | DDoS | Distributed denial of service |
| DoS-HTTP_Flood, DoS-TCP_Flood | DoS | Single-source denial of service |
| Recon-PortScan, VulnerabilityScan | Recon | Network reconnaissance |
| SqlInjection, XSS | Web-based | Web application attacks |
| Mirai-greeth_flood | Mirai | Mirai botnet attacks |
| DictionaryBruteForce | Brute Force | Password attacks |
| DNS_Spoofing, MITM-ArpSpoofing | Spoofing | Network spoofing |
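In code, the table above amounts to a dictionary lookup. The excerpt below mirrors only the rows shown here, not the full contents of `constants.py`; `to_category` is a hypothetical helper, and raising on unknown labels is an assumed design choice.

```python
# Excerpt only: the entries mirror the table rows above, not the full constants file.
ATTACK_CATEGORY_MAP = {
    "DDoS-SYN_Flood": "DDoS",
    "DDoS-UDP_Flood": "DDoS",
    "DoS-HTTP_Flood": "DoS",
    "DoS-TCP_Flood": "DoS",
    "Recon-PortScan": "Recon",
    "VulnerabilityScan": "Recon",
    "SqlInjection": "Web-based",
    "XSS": "Web-based",
    "Mirai-greeth_flood": "Mirai",
    "DictionaryBruteForce": "Brute Force",
    "DNS_Spoofing": "Spoofing",
    "MITM-ArpSpoofing": "Spoofing",
}


def to_category(attack_label):
    """Map a specific attack label to its general category.

    Unknown labels raise instead of being silently assigned to a category.
    """
    try:
        return ATTACK_CATEGORY_MAP[attack_label]
    except KeyError:
        raise ValueError(f"No category defined for attack label: {attack_label!r}")


categories = sorted(set(ATTACK_CATEGORY_MAP.values()))
```

Keeping the mapping in one place means the balancing, training, and evaluation steps all agree on which specific attacks belong to which category.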
CIC_IOT_Dataset2023/
└── Balanced_CSV/
    ├── balance_stats.json # Balancing statistics
    ├── file_split_tracking.json # File split metadata
    └── [AttackType]/
        ├── train.csv # Training data
        └── test.csv # Test data
models/
└── gan/
    └── [AttackType]/
        ├── generator_final.keras # Trained generator
        ├── discriminator_final.keras # Trained discriminator
        └── loss_curve.png # Training loss plot
classifiers/
└── run_[timestamp]/
    ├── [model]_final.keras # Trained classifier
    ├── training_history.json # Training metrics
    ├── confusion_matrix.png # Performance visualization
    └── classification_report.txt # Detailed metrics
generated_data/
└── [AttackType]/
    ├── generated_samples.csv # Synthetic samples
    └── generation_summary.json # Generation metadata
scaler/
├── feature_scaler.joblib # Global scaler
└── [AttackType]/
    └── feature_scaler.joblib # Attack-specific scaler
- Balanced Datasets: Train/test splits for each attack type
- Trained Models: GAN generators/discriminators and classifiers
- Synthetic Data: Generated samples for data augmentation
- Scalers: Fitted preprocessors for consistent feature scaling
- Metrics: Performance reports, confusion matrices, loss curves
- Metadata: Tracking files for reproducibility and analysis
- Problem: Out of memory during GAN training
- Solution: Reduce batch size in `gan_training.py` or enable GPU memory growth
- Problem: Import errors for TensorFlow or other packages
- Solution: Ensure virtual environment is activated and run `pip install -r requirements.txt`
- Problem: Cannot find CSV files
- Solution: Verify dataset is placed in the `CIC_IOT_Dataset2023/CSV/` directory
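- Problem: TensorFlow not using GPU
- Solution: Ensure CUDA drivers are properly installed. The separate `tensorflow-gpu` package is deprecated; the `tensorflow` package listed in `requirements.txt` already includes GPU support on Linux. Native Windows GPU support ended with TensorFlow 2.10, so with `tensorflow>=2.13` on Windows, run the pipeline under WSL2 to use the GPU.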
- GPU Usage: Enable GPU acceleration for faster training
- Batch Size: Adjust batch sizes based on available memory
- Parallel Processing: Use multiprocessing for data loading
- Feature Selection: Remove irrelevant features to reduce dimensionality
- Monitor training loss curves for convergence
- Check data balance in output statistics
- Validate feature distributions in synthetic data
- Review classification metrics for reasonable performance
- Accuracy: 85-95% on seen attack types
- F1-Score (macro): 0.50-0.60, macro-averaged across categories
- F1-Score (weighted): 0.70-0.80, weighted-averaged across categories
- Generalization: 70-85% accuracy on unseen attack types
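The gap between the macro and weighted F1 ranges above is what one would expect when small categories are classified less well than large ones: macro averaging gives every class equal weight, while weighted averaging scales each class by its support. The toy example below, with made-up labels and hypothetical helper names, shows the effect.

```python
def f1_per_class(y_true, y_pred):
    """Return {class: (f1, support)} computed from paired label lists."""
    result = {}
    for cls in sorted(set(y_true)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        result[cls] = (2 * tp / denom if denom else 0.0, tp + fn)
    return result


def macro_and_weighted_f1(y_true, y_pred):
    """Macro F1 averages classes equally; weighted F1 averages by support."""
    scores = f1_per_class(y_true, y_pred)
    macro = sum(f1 for f1, _ in scores.values()) / len(scores)
    total = sum(support for _, support in scores.values())
    weighted = sum(f1 * support for f1, support in scores.values()) / total
    return macro, weighted


# Toy labels: the majority class is predicted well, the minority class poorly.
y_true = ["DDoS"] * 8 + ["Recon"] * 2
y_pred = ["DDoS"] * 8 + ["DDoS", "Recon"]
macro, weighted = macro_and_weighted_f1(y_true, y_pred)
```

Here the per-class F1 scores are 16/17 for DDoS and 2/3 for Recon, so the macro average (about 0.80) sits below the weighted average (about 0.89).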
- Data Balancing: 5-15 minutes
- Feature Scaling: 2-5 minutes
- GAN Training: 0.2-6 hours (varies by attack type and hardware)
- Synthetic Generation: 10-30 minutes
- Classifier Training: 10-120 minutes
This project is developed for research purposes with Norsk Regnesentral (NR).
This documentation provides a comprehensive overview of the IoT network traffic classification pipeline. For specific implementation details, refer to the individual Python files and their inline documentation.