SecureIoT–AI evaluates how DNNs and CNNs generalize to unseen IoT cyber-attacks under severe class imbalance using GAN-based augmentation, SMOTE, and class weighting.

KnuteLute/SecureIoT

SecureIoT: AI-based Cyber Threat Detection for IoT-based Critical Sectors

Created By: Knut Selstad at Norwegian Computing Center (Norsk regnesentral)

🤝 Contribution

I would like to thank everyone who has contributed to the project:

Sandeep Pirbhulal, Habtamu Abie, Riku Lehkonen and Ismail Ari.

This research is funded by the Research Council of Norway through the SFI Norwegian Centre for Cybersecurity in Critical Sectors (NORCICS), under project number #310105. It is also supported by the European Union’s HORIZON Research and Innovation Programme under grant agreement No 101120657, project ENFIELD (European Lighthouse to Manifest Trustworthy and Green AI).

GitHub Copilot with "Claude Sonnet 4" was used for coding purposes.

A comprehensive machine learning pipeline for classifying IoT network traffic attacks using deep learning models and GAN-based synthetic data generation for data augmentation.

🎯 Overview

This project implements a complete pipeline for IoT network traffic classification that includes:

  1. Data Balancing: Processes the CIC IoT Dataset 2023 and creates balanced training/test splits
  2. Feature Engineering: Extracts and scales features using StandardScaler
  3. GAN Training: Trains Generative Adversarial Networks for synthetic data generation
  4. Data Augmentation: Generates synthetic samples to balance dataset categories
  5. Classification: Trains deep neural networks (DNN/CNN) for attack classification
  6. Evaluation: Tests models on unseen attack types to evaluate generalization

Attack Categories

The pipeline classifies network traffic into the following categories:

  • Benign: Normal network traffic
  • DDoS: Distributed Denial of Service attacks
  • DoS: Denial of Service attacks
  • Recon: Reconnaissance attacks
  • Mirai: Mirai botnet attacks
  • Web-based: Web application attacks (SQL injection, XSS, etc.)
  • Brute Force: Dictionary-based attacks
  • Spoofing: DNS and ARP spoofing attacks

📁 Project Structure

SecureIoT/
├── 📄 Project Documentation
│   ├── README.md                # This documentation file
│   └── requirements.txt         # Python dependencies
│
├── 📁 Source Code (src/)
│   ├── 📁 data_processing/      # Data preprocessing pipeline
│   │   ├── balance_data.py      # Data balancing and train/test splitting
│   │   ├── scaler_extraction.py # Feature scaling and preprocessing
│   │   ├── preprocess_features.py # Feature preprocessing utilities
│   │   └── drop_features.py     # Feature selection utilities
│   │
│   ├── 📁 models/               # Model training and generation
│   │   ├── gan_training.py      # GAN model training for data augmentation
│   │   ├── generate_samples_gan.py # Synthetic data generation
│   │   └── train_classify_models.py # Classifier training (DNN/CNN)
│   │
│   ├── 📁 analysis/             # Model evaluation and analysis
│   │   ├── unseen_test.py       # Evaluation on unseen attack types
│   │   ├── feature_importance_analysis.py # Feature importance analysis
│   │   └── time_test.py         # Performance timing analysis
│   │
│   └── 📁 utils/                # Utility functions and constants
│       └── constants.py         # Attack mappings and configuration constants
│
├── 📁 Execution Scripts (scripts/)
│   ├── run_all.ps1              # PowerShell execution script (Windows)
│   └── run_all.sh               # Bash execution script (Linux/Mac)
│
├── 📁 Data Directories (Created during execution)
│   ├── CIC_IOT_Dataset2023/
│   │   ├── CSV/                 # Original dataset files
│   │   └── Balanced_CSV/        # Processed balanced datasets
│   ├── models/
│   │   └── gan/                 # Trained GAN models
│   ├── classifiers/             # Trained classification models
│   ├── generated_data/          # Synthetic data generated by GANs
│   ├── scaler/                  # Fitted scalers and preprocessors
│   └── plots/                   # Generated analysis plots and figures
│
└── 📁 Virtual Environment
    └── .venv/                   # Python virtual environment

🔄 Pipeline Workflow

graph TD
    A[Raw CSV Data] --> B[src/data_processing/balance_data.py]
    B --> C[Balanced Train/Test Sets]
    C --> D[src/data_processing/scaler_extraction.py]
    D --> E[Feature Scaler]
    C --> F[src/models/gan_training.py]
    E --> F
    F --> G[Trained GAN Models]
    G --> H[src/models/generate_samples_gan.py]
    H --> I[Synthetic Data]
    C --> J[src/models/train_classify_models.py]
    I --> J
    E --> J
    J --> K[Trained Classifiers]
    K --> L[src/analysis/unseen_test.py]
    L --> M[Evaluation Results]

Step-by-Step Process

  1. Data Balancing (src/data_processing/balance_data.py)

    • Processes raw CSV files from CIC IoT Dataset 2023
    • Creates balanced train/test splits per attack category
    • Handles file-level and row-level splitting strategies
    • Outputs balanced datasets and tracking metadata
  2. Feature Scaling (src/data_processing/scaler_extraction.py)

    • Aggregates all training data across attack types
    • Fits a global StandardScaler on combined features
    • Handles outliers, infinities, and missing values
    • Saves fitted scaler for consistent preprocessing
  3. GAN Training (src/models/gan_training.py)

    • Trains separate GAN models for each attack type
    • Uses Generator-Discriminator architecture
    • Implements MinMax scaling for GAN training data
    • Saves trained models and training loss curves
  4. Synthetic Data Generation (src/models/generate_samples_gan.py)

    • Uses trained GANs to generate synthetic samples
    • Balances dataset by augmenting underrepresented classes
    • Maintains feature distributions and relationships
    • Outputs synthetic data with metadata
  5. Classifier Training (src/models/train_classify_models.py)

    • Trains deep neural networks (DNN/CNN architectures)
    • Combines real and synthetic data for training
    • Implements class weighting and regularization
    • Saves trained models with performance metrics
  6. Unseen Attack Evaluation (src/analysis/unseen_test.py)

    • Evaluates models on previously unseen attack types
    • Tests generalization capability of trained classifiers
    • Provides detailed classification reports and confusion matrices
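Step 3 mentions MinMax scaling of the GAN training data. The sketch below illustrates that idea together with the inverse mapping needed to bring generated samples back to feature units. The function names and the default [0, 1] range are assumptions for illustration, not taken from gan_training.py, and plain Python is used so the example runs without the project's dependencies.

```python
from typing import List, Sequence, Tuple

Row = Sequence[float]


def fit_min_max(rows: Sequence[Row]) -> Tuple[List[float], List[float]]:
    """Return the per-column minima and maxima of the training rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def scale(rows, mins, maxs, low=0.0, high=1.0):
    """Map each column linearly from [min, max] to [low, high]."""
    scaled = []
    for row in rows:
        out = []
        for x, lo, hi in zip(row, mins, maxs):
            span = hi - lo
            # A constant column has no spread; pin it to the lower bound.
            unit = (x - lo) / span if span else 0.0
            out.append(low + unit * (high - low))
        scaled.append(out)
    return scaled


def unscale(rows, mins, maxs, low=0.0, high=1.0):
    """Invert `scale`, e.g. to convert generator output to feature units."""
    return [
        [lo + (x - low) / (high - low) * (hi - lo)
         for x, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]
```

The minima and maxima should be fitted on real training data only and then reused unchanged when synthetic samples are mapped back.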

🚀 Installation & Setup

Prerequisites

  • Python 3.11 or higher
  • Windows (PowerShell) or Linux/Mac (Bash)
  • At least 8GB RAM (16GB recommended for GAN training)
  • GPU support recommended for faster training

Setup Instructions

  1. Clone the repository and navigate to project directory

    cd SecureIoT/
  2. Create and activate virtual environment

    # Windows
    python -m venv .venv
    .venv\Scripts\activate
    
    # Linux/Mac
    python3 -m venv .venv
    source .venv/bin/activate
  3. Install dependencies

    pip install -r requirements.txt
  4. Place dataset files

    • Download CIC IoT Dataset 2023 (updated 2024)
    • Extract CSV files to CIC_IOT_Dataset2023/CSV/ directory
    • Ensure folder structure matches the expected format

🎮 Usage

Quick Start - Full Pipeline

Windows (PowerShell):

.\scripts\run_all.ps1

Linux/Mac (Bash):

./scripts/run_all.sh

Step-by-Step Execution

# Step 1: Balance and split data
python -m src.data_processing.balance_data

# Step 2: Extract and fit feature scaler
python -m src.data_processing.scaler_extraction

# Step 3: Train GAN models (time-intensive)
python -m src.models.gan_training

# Step 4: Generate synthetic data
python -m src.models.generate_samples_gan

# Step 5: Train classification models
python -m src.models.train_classify_models

# Step 6: Evaluate on unseen attacks (the model path must be set manually in the script first)
python -m src.analysis.unseen_test

Individual Analysis Scripts

# Feature importance analysis
python -m src.analysis.feature_importance_analysis

# Performance timing analysis
python -m src.analysis.time_test

Classifier Training Options

# Train DNN classifier with synthetic data
python -m src.models.train_classify_models

Note: All scripts should be run from the project root directory (SecureIoT/) to ensure proper module imports and file paths.

📄 File Descriptions

Data Processing Pipeline (src/data_processing/)

balance_data.py

Purpose: Data preprocessing and balancing

  • Input: Raw CSV files from CIC IoT Dataset 2023
  • Output: Balanced train/test splits, metadata files
  • Key Functions:
    • find_files_for_attack_type(): Discovers CSV files for each attack type
    • File-level vs row-level splitting strategies
    • Category-based balancing with configurable target sizes
  • Configuration: CATEGORY_TARGET_SIZE = 500,000 samples per category

scaler_extraction.py

Purpose: Feature engineering and scaling

  • Input: Balanced training datasets
  • Output: Fitted StandardScaler and feature metadata
  • Key Functions:
    • generate_global_scaler(): Fits scaler on combined training data
    • Handles outliers, infinities, and missing values
    • Feature selection and preprocessing
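To make the "global scaler" idea concrete, here is a minimal standard-library sketch of standardization fitted once on the rows of all attack types combined. It is not the project's generate_global_scaler(); the helper names and the dictionary layout are hypothetical. It uses the population standard deviation and replaces zero spread with 1.0, which is also how scikit-learn's StandardScaler behaves.

```python
from statistics import fmean, pstdev


def fit_global_scaler(rows_by_attack):
    """Fit per-column mean and scale on the rows of all attack types combined."""
    combined = [row for rows in rows_by_attack.values() for row in rows]
    columns = list(zip(*combined))
    means = [fmean(col) for col in columns]
    # Population standard deviation; constant columns get a scale of 1.0
    # so that the transform never divides by zero.
    scales = [pstdev(col) or 1.0 for col in columns]
    return means, scales


def transform(rows, means, scales):
    """Standardize rows with statistics fitted on the training data."""
    return [
        [(x - m) / s for x, m, s in zip(row, means, scales)]
        for row in rows
    ]
```

Fitting once and reusing the same statistics for training, synthetic, and test data is what keeps preprocessing consistent across the pipeline.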

preprocess_features.py

Purpose: Feature preprocessing utilities

  • Function: preprocess_features()
  • Loads fitted scaler and applies consistent preprocessing
  • Handles infinities, outliers, and missing value imputation
  • Used across training and inference pipelines

drop_features.py

Purpose: Feature selection and filtering

  • Function: drop_unused_features()
  • Removes non-informative or problematic features
  • Configurable feature exclusion list
  • Maintains consistency across train/test phases

Model Training Pipeline (src/models/)

gan_training.py

Purpose: GAN model training for data augmentation

  • Input: Balanced training data per attack type
  • Output: Trained Generator and Discriminator models
  • Architecture:
    • Generator: Dense layers with BatchNormalization and ReLU activation
    • Discriminator: Dense layers with LeakyReLU and Dropout
  • Training: 10,000 epochs with Adam optimizer and Binary Cross-Entropy loss
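The binary cross-entropy objective can be written out directly. The sketch below assumes the standard GAN formulation with the non-saturating generator loss; whether gan_training.py uses exactly this variant is not stated here, and the function names are illustrative.

```python
import math

EPS = 1e-7  # keeps log() away from zero


def bce(predictions, target):
    """Mean binary cross-entropy of predicted probabilities against one label."""
    total = 0.0
    for p in predictions:
        p = min(max(p, EPS), 1.0 - EPS)
        total -= target * math.log(p) + (1.0 - target) * math.log(1.0 - p)
    return total / len(predictions)


def discriminator_loss(d_real, d_fake):
    """Real samples should score 1, generated samples should score 0."""
    return bce(d_real, 1.0) + bce(d_fake, 0.0)


def generator_loss(d_fake):
    """The generator improves when the discriminator scores its samples as real."""
    return bce(d_fake, 1.0)
```

A discriminator that outputs 0.5 for every sample yields a discriminator loss of 2·ln 2, a common reference point when reading the saved loss curves.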

generate_samples_gan.py

Purpose: Synthetic data generation using trained GANs

  • Input: Trained GAN models and scalers
  • Output: Synthetic samples for data augmentation
  • Features:
    • Configurable sample generation per attack type
    • Feature distribution preservation
    • Metadata tracking and validation

train_classify_models.py

Purpose: Classification model training

  • Input: Real + synthetic training data, test data
  • Output: Trained DNN/CNN classifiers with performance metrics
  • Architectures:
    • DNN: Multi-layer perceptron with batch normalization and dropout
    • CNN: 1D convolutional layers for temporal pattern recognition
  • Features: Class weighting, early stopping, learning rate reduction
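Class weighting can be illustrated with the common "balanced" heuristic, where each class weight is n_samples / (n_classes * class_count). This is a sketch under the assumption that the project uses a formula of this kind; the function name is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency in `labels`."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        cls: n_samples / (n_classes * count)
        for cls, count in counts.items()
    }
```

With three Benign samples and one Recon sample, Recon receives weight 2.0 and Benign about 0.67, so errors on the rare class cost more during training. When passing such weights to Keras via `class_weight`, the keys must be integer class indices rather than label strings.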

Analysis and Evaluation (src/analysis/)

unseen_test.py

Purpose: Model evaluation on unseen attack types

  • Input: Trained classifier, unseen attack data
  • Output: Classification reports, confusion matrices, F1 scores
  • Evaluation: Tests model generalization on previously unseen attack categories

feature_importance_analysis.py

Purpose: Feature importance analysis using SHAP and permutation importance

  • Input: Trained models, test data
  • Output: Feature importance plots, SHAP values, analysis reports
  • Methods: SHAP explainability, permutation importance, feature ranking
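Permutation importance is simple enough to sketch without any library: shuffle one feature column, re-score the model, and record how much accuracy drops. The code below is an illustrative stand-in, not the implementation in feature_importance_analysis.py; `predict` is any hypothetical callable that maps one row to a label.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows for which `predict` returns the true label."""
    hits = sum(predict(row) == label for row, label in zip(rows, labels))
    return hits / len(rows)


def permutation_importance(predict, rows, labels, n_repeats=5, seed=0):
    """Mean accuracy drop per feature when that feature's column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in rows]
            rng.shuffle(shuffled)
            # Rebuild the rows with only this one column permuted.
            permuted = [
                list(row[:col]) + [value] + list(row[col + 1:])
                for row, value in zip(rows, shuffled)
            ]
            drops.append(baseline - accuracy(predict, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances
```

A feature the model ignores scores exactly zero, while a feature the model relies on shows a positive drop.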

time_test.py

Purpose: Performance timing and benchmarking

  • Input: Various model configurations
  • Output: Timing reports and performance comparisons
  • Analysis: Training time, inference speed, memory usage
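Inference speed can be measured with a small harness like the one below. It is a generic sketch rather than the contents of time_test.py: `predict` stands for any hypothetical callable that takes a batch, and the warm-up runs and use of the median are design choices made here to reduce noise from one-off start-up costs.

```python
import time
from statistics import median


def time_inference(predict, batch, n_runs=10, warmup=2):
    """Return the median wall-clock time per batch and per sample."""
    for _ in range(warmup):
        predict(batch)  # discard early runs (caches, lazy initialization)
    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()
        predict(batch)
        timings.append(time.perf_counter() - start)
    per_batch = median(timings)
    return {
        "seconds_per_batch": per_batch,
        "seconds_per_sample": per_batch / len(batch),
    }
```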

Utilities (src/utils/)

constants.py

Purpose: Configuration constants and mappings

  • ATTACK_CATEGORY_MAP: Maps specific attacks to general categories
  • EXCLUDED_FOR_UNSEEN_TEST: Defines attacks reserved for evaluation
  • Centralized configuration for consistent attack categorization
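The role of EXCLUDED_FOR_UNSEEN_TEST can be sketched as a simple partition of attack types. The set contents below are placeholders chosen for illustration; the actual values are defined in src/utils/constants.py, and the helper name is hypothetical.

```python
# Placeholder values for illustration only; see src/utils/constants.py
# for the attacks the project actually holds out.
EXCLUDED_FOR_UNSEEN_TEST = frozenset({"DDoS-UDP_Flood", "XSS"})


def split_seen_unseen(attack_types, excluded=EXCLUDED_FOR_UNSEEN_TEST):
    """Partition attack types into training-visible and held-out groups."""
    seen = [a for a in attack_types if a not in excluded]
    unseen = [a for a in attack_types if a in excluded]
    return seen, unseen
```

Only the `seen` attacks would feed GAN and classifier training, while the `unseen` ones are reserved for the generalization test.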

Execution Scripts (scripts/)

run_all.ps1 / run_all.sh

Purpose: Complete pipeline execution

  • Executes all pipeline steps in correct order
  • Error handling and progress reporting
  • Cross-platform compatibility (Windows/Linux/Mac)
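For readers who prefer a single cross-platform entry point, the same behavior can be approximated in Python. The sketch below is not part of the repository. It runs the modules listed under Step-by-Step Execution in order and stops at the first failure. Whether the shell scripts include step 6 is not stated in this README, and the `runner` parameter is an assumption added to make the function testable.

```python
import subprocess
import sys

PIPELINE_MODULES = [
    "src.data_processing.balance_data",
    "src.data_processing.scaler_extraction",
    "src.models.gan_training",
    "src.models.generate_samples_gan",
    "src.models.train_classify_models",
    "src.analysis.unseen_test",
]


def run_pipeline(modules=PIPELINE_MODULES, runner=subprocess.run):
    """Run each module with `python -m`; return the modules that succeeded."""
    completed = []
    for index, module in enumerate(modules, start=1):
        print(f"[{index}/{len(modules)}] python -m {module}")
        result = runner([sys.executable, "-m", module])
        if result.returncode != 0:
            print(f"Step failed with exit code {result.returncode}: {module}")
            break  # later steps depend on this one's output
        completed.append(module)
    return completed
```

As with the shell scripts, this would need to be invoked from the project root so that the `src` package resolves.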

📦 Dependencies

Core Libraries

numpy>=1.24.0          # Numerical computing
pandas>=2.0.0          # Data manipulation
scikit-learn>=1.3.0    # Machine learning utilities
tensorflow>=2.13.0     # Deep learning framework

Supporting Libraries

joblib>=1.3.0          # Model serialization
imbalanced-learn>=0.11.0  # SMOTE and resampling
matplotlib>=3.7.0      # Visualization
seaborn>=0.12.0        # Statistical plotting
tqdm>=4.65.0          # Progress bars

Dependency Graph

tensorflow → Model training (GANs, DNNs, CNNs)
scikit-learn → Preprocessing, metrics, utilities
pandas → Data manipulation and I/O
numpy → Numerical operations
joblib → Model and scaler persistence
matplotlib/seaborn → Visualization and plotting

⚙️ Configuration

Key Parameters

Data Balancing (src/data_processing/balance_data.py)

CATEGORY_TARGET_SIZE = 500_000    # Target samples per category
CATEGORY_TEST_CAP = 150_000       # Maximum test samples per category
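How these two constants might interact is sketched below for a single category. The 30% test fraction, the seed, and the function name are assumptions made for illustration; balance_data.py may apply the target and the cap in a different order or split at file level instead.

```python
import random

CATEGORY_TARGET_SIZE = 500_000
CATEGORY_TEST_CAP = 150_000


def balance_and_split(rows, target_size=CATEGORY_TARGET_SIZE,
                      test_cap=CATEGORY_TEST_CAP, test_fraction=0.3, seed=42):
    """Downsample one category to `target_size`, then split train/test."""
    rng = random.Random(seed)
    if len(rows) > target_size:
        sampled = rng.sample(rows, target_size)  # oversized category
    else:
        sampled = list(rows)  # small categories keep every row
    rng.shuffle(sampled)
    # The test set never exceeds the cap, whatever the fraction yields.
    n_test = min(int(len(sampled) * test_fraction), test_cap)
    return sampled[n_test:], sampled[:n_test]
```

Categories that stay below the target after this step are the ones the GAN augmentation stage is meant to fill.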

GAN Training (src/models/gan_training.py)

latent_dim = 100          # Generator input dimension
batch_size = 128          # Training batch size
epochs = 10_000           # Training epochs
save_interval = 500       # Model checkpoint interval

Feature Processing (src/data_processing/scaler_extraction.py)

outlier_threshold = 3     # Standard deviations for outlier clipping
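The threshold can be read as "clip every value to mean ± 3 standard deviations". A minimal sketch for one feature column follows. Imputing non-finite values with the median is an assumption (the README only says that infinities and missing values are handled), and the function name is hypothetical.

```python
import math
from statistics import fmean, median, pstdev

OUTLIER_THRESHOLD = 3


def clean_column(values, threshold=OUTLIER_THRESHOLD):
    """Replace inf/NaN with the median, then clip to mean +/- threshold * std."""
    finite = [v for v in values if math.isfinite(v)]
    fill = median(finite)  # assumed imputation strategy
    filled = [v if math.isfinite(v) else fill for v in values]
    mean, std = fmean(filled), pstdev(filled)
    low, high = mean - threshold * std, mean + threshold * std
    return [min(max(v, low), high) for v in filled]
```

Because the mean and standard deviation are computed with the outliers still present, a single pass only pulls extreme values inward and does not remove their influence entirely.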

Attack Category Mapping

The pipeline uses a hierarchical attack categorization:

| Specific Attack | Category | Description |
|---|---|---|
| DDoS-SYN_Flood, DDoS-UDP_Flood | DDoS | Distributed denial of service |
| DoS-HTTP_Flood, DoS-TCP_Flood | DoS | Single-source denial of service |
| Recon-PortScan, VulnerabilityScan | Recon | Network reconnaissance |
| SqlInjection, XSS | Web-based | Web application attacks |
| Mirai-greeth_flood | Mirai | Mirai botnet attacks |
| DictionaryBruteForce | Brute Force | Password attacks |
| DNS_Spoofing, MITM-ArpSpoofing | Spoofing | Network spoofing |
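In code, this hierarchy amounts to a dictionary lookup. The sketch below contains only the example attacks named in this section; the real ATTACK_CATEGORY_MAP in src/utils/constants.py likely covers more labels, and the two helper functions are hypothetical.

```python
# Subset of the mapping, limited to the attacks named in this README.
ATTACK_CATEGORY_MAP = {
    "DDoS-SYN_Flood": "DDoS",
    "DDoS-UDP_Flood": "DDoS",
    "DoS-HTTP_Flood": "DoS",
    "DoS-TCP_Flood": "DoS",
    "Recon-PortScan": "Recon",
    "VulnerabilityScan": "Recon",
    "SqlInjection": "Web-based",
    "XSS": "Web-based",
    "Mirai-greeth_flood": "Mirai",
    "DictionaryBruteForce": "Brute Force",
    "DNS_Spoofing": "Spoofing",
    "MITM-ArpSpoofing": "Spoofing",
}


def category_of(attack):
    """Return the general category for a specific attack label."""
    try:
        return ATTACK_CATEGORY_MAP[attack]
    except KeyError:
        raise KeyError(f"No category mapping for attack label: {attack!r}") from None


def attacks_by_category(mapping=ATTACK_CATEGORY_MAP):
    """Invert the mapping: category -> list of specific attacks."""
    grouped = {}
    for attack, category in mapping.items():
        grouped.setdefault(category, []).append(attack)
    return grouped
```

Raising on unknown labels, rather than silently assigning a default category, makes it easier to notice when a new attack file has no mapping.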

📊 Output Structure

Generated Directories

CIC_IOT_Dataset2023/
└── Balanced_CSV/
    ├── balance_stats.json           # Balancing statistics
    ├── file_split_tracking.json     # File split metadata
    └── [AttackType]/
        ├── train.csv                # Training data
        └── test.csv                 # Test data

models/
└── gan/
    └── [AttackType]/
        ├── generator_final.keras    # Trained generator
        ├── discriminator_final.keras # Trained discriminator
        └── loss_curve.png          # Training loss plot

classifiers/
└── run_[timestamp]/
    ├── [model]_final.keras         # Trained classifier
    ├── training_history.json       # Training metrics
    ├── confusion_matrix.png        # Performance visualization
    └── classification_report.txt   # Detailed metrics

generated_data/
└── [AttackType]/
    ├── generated_samples.csv       # Synthetic samples
    └── generation_summary.json     # Generation metadata

scaler/
├── feature_scaler.joblib           # Global scaler
└── [AttackType]/
    └── feature_scaler.joblib       # Attack-specific scaler

Key Output Files

  1. Balanced Datasets: Train/test splits for each attack type
  2. Trained Models: GAN generators/discriminators and classifiers
  3. Synthetic Data: Generated samples for data augmentation
  4. Scalers: Fitted preprocessors for consistent feature scaling
  5. Metrics: Performance reports, confusion matrices, loss curves
  6. Metadata: Tracking files for reproducibility and analysis

🔧 Troubleshooting

Common Issues

Memory Issues

  • Problem: Out of memory during GAN training
  • Solution: Reduce batch size in gan_training.py or enable GPU memory growth

Missing Dependencies

  • Problem: Import errors for TensorFlow or other packages
  • Solution: Ensure virtual environment is activated and run pip install -r requirements.txt

Dataset Path Issues

  • Problem: Cannot find CSV files
  • Solution: Verify dataset is placed in CIC_IOT_Dataset2023/CSV/ directory

GPU Configuration

  • Problem: TensorFlow not using GPU
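  • Solution: Ensure compatible NVIDIA drivers, CUDA, and cuDNN are installed. The separate tensorflow-gpu package is deprecated; GPU support is part of the main tensorflow package. Note that TensorFlow releases after 2.10 do not support GPUs on native Windows, so use WSL2 or Linux there.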

Performance Optimization

  1. GPU Usage: Enable GPU acceleration for faster training
  2. Batch Size: Adjust batch sizes based on available memory
  3. Parallel Processing: Use multiprocessing for data loading
  4. Feature Selection: Remove irrelevant features to reduce dimensionality

Validation

  • Monitor training loss curves for convergence
  • Check data balance in output statistics
  • Validate feature distributions in synthetic data
  • Review classification metrics for reasonable performance

📈 Expected Results

Performance Metrics

  • Accuracy: 85-95% on seen attack types
  • F1-Score (macro): 0.50-0.60, averaged equally across categories
  • F1-Score (weighted avg): 0.70-0.80, averaged across categories in proportion to their support
  • Generalization: 70-85% accuracy on unseen attack types
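The gap between the macro and weighted F1 ranges above is typical under class imbalance: macro averaging counts every category equally, so weak minority classes pull it down, while weighted averaging is dominated by the large classes. The toy example below, which uses made-up labels and is not project output, shows the two definitions side by side.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """F1 score for every class that appears in the labels or predictions."""
    scores = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[cls] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of the per-class F1 scores, weighted by true-label support."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[c] * support[c] for c in scores) / len(y_true)
```

For eight DDoS and two Recon samples where one Recon sample is misclassified as DDoS, the per-class scores are 16/17 and 2/3, giving a macro F1 of about 0.80 and a weighted F1 of about 0.89.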

Training Times (Approximate)

  • Data Balancing: 5-15 minutes
  • Feature Scaling: 2-5 minutes
  • GAN Training: 0.2-6 hours (varies by attack type and hardware)
  • Synthetic Generation: 10-30 minutes
  • Classifier Training: 10-120 minutes

📄 License

This project is developed for research purposes with Norsk Regnesentral (NR).


This documentation provides a comprehensive overview of the IoT network traffic classification pipeline. For specific implementation details, refer to the individual Python files and their inline documentation.
