This repository contains data, feature extraction code, and evaluation scripts for Website Fingerprinting (WF) under both Closed-World and Open-World scenarios.
The primary goals of this project are:
- To provide a structured dataset of monitored and unmonitored traffic traces.
- To extract handcrafted traffic features (Full / Robust / Basic).
- To evaluate the robustness of traffic classification under various WF defense settings (split, padding, domain shift).
Download Link: Google Drive - MLTor Dataset
We provide serialized datasets for both closed-world and open-world experiments. Each dataset contains monitored (known sites) and unmonitored (unknown sites) traffic traces.
Please download the files from the link above and organize them into a dataset directory as follows:
```
dataset/openworld/mon_standard.pkl
dataset/closeworld/unmon_standard10.pkl
dataset/closeworld/unmon_standard10_3000.pkl
```
After downloading and organizing the datasets, you must update the file paths in the notebooks to match your local environment.
In each notebook (Closed.ipynb, Open_binary.ipynb, Open_multi.ipynb), locate the CONFIGURATION section at the top and modify the PATHS dictionary:
```python
PATHS = {
    # Update these paths to match your local file location
    "mon": "./your/local/path/dataset/openworld/mon_standard.pkl",
    "unmon": "./your/local/path/dataset/closeworld/unmon_standard10_3000.pkl"
}
```

Ensure that the paths correctly point to where you saved the .pkl files on your machine.
All datasets are serialized using the Python pickle format.
| File | Description |
|---|---|
| `mon_standard.pkl` | Monitored traffic from 95 websites (≈19k traces) |
| `unmon_standard10.pkl` | Unmonitored traffic (≈10k traces) |
| `unmon_standard10_3000.pkl` | Reduced unmonitored subset (≈3k traces) |
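Since every file is a plain Python pickle, a quick sanity check after downloading is straightforward. A minimal sketch, assuming the directory layout shown above (the exact structure of the loaded object depends on how the traces were serialized, so inspect it before wiring it into the notebooks):

```python
import pickle

# Example path; adjust to wherever you placed the files (see the PATHS dictionary above).
with open("dataset/openworld/mon_standard.pkl", "rb") as f:
    mon_data = pickle.load(f)

# Inspect the top-level object first.
print(type(mon_data), len(mon_data))
```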
Goal: Classify traffic traces into one of 95 monitored websites (Classes 0-94).
Key Metrics: Accuracy, Macro-F1.
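Both metrics are standard and can be computed with scikit-learn; a minimal sketch with dummy labels (in the notebook the values come from the test split):

```python
from sklearn.metrics import accuracy_score, f1_score

# Dummy true/predicted site labels (0-94), purely for illustration.
y_test = [0, 3, 3, 94, 7]
y_pred = [0, 3, 1, 94, 7]

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Macro-F1:", f1_score(y_test, y_pred, average="macro"))
```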
You can reproduce the experimental results by running the main Jupyter Notebook.
Closed.ipynb

Configuration:
Locate the Configuration section at the beginning of the notebook and ensure the SCENARIO variable is set as follows:
```python
SCENARIO = 'closed'
```

Simply run the entire notebook after setting this variable.
- Data Loading: Automatically loads monitored traffic data. (Note: unmonitored data is skipped to save memory.)
- Model Optimization (RF vs XGB):
- Compares Random Forest and XGBoost under various correlation thresholds (e.g., 0.95, 0.99).
- Automatically selects the best model based on Accuracy.
- Feature Selection: Applies `EnhancedPreprocessor` to remove highly correlated features using the optimal threshold (see the sketch after this list).
- Final Evaluation: Outputs the Best Model, Optimal Threshold, Accuracy, and Macro-F1 score.
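The correlation-based pruning idea can be illustrated with a minimal sketch; this is not the repository's `EnhancedPreprocessor`, just a toy version of the same step:

```python
import numpy as np
import pandas as pd

def drop_correlated_features(X: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Drop one feature from every pair whose absolute correlation exceeds the threshold."""
    corr = X.corr().abs()
    # Keep only the upper triangle so each feature pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return X.drop(columns=to_drop)

# Toy data: 'f2' is a near-duplicate of 'f1' and should be removed at threshold 0.95.
rng = np.random.default_rng(0)
a = rng.normal(size=100)
X = pd.DataFrame({"f1": a, "f2": a + 1e-3 * rng.normal(size=100), "f3": rng.normal(size=100)})
print(drop_correlated_features(X).columns.tolist())  # expected: ['f1', 'f3']
```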
The console will display the model selection process and final results:
```
================================================================================
CLOSED-WORLD MODEL SELECTION (Primary Metric: Accuracy)
================================================================================
SELECTED MODEL: Random Forest
• Corr Threshold: 0.95
• Accuracy: 0.8412
• Macro F1: 0.8350
================================================================================
```
Goal: Determine whether a given web traffic trace corresponds to a Monitored website (Known) or an Unmonitored website (Unknown). This is achieved by training a model on 95 Monitored classes and using Open-Set Rejection (Confidence Scoring) to reject Unknown samples.
Label Assignment:
- Monitored Websites: `1` (internally 0-94 for multi-class training)
- Unmonitored Websites: `-1`
Key Metrics:
- ROC-AUC: Primary criterion for model selection (Measures overall discriminative power).
- TPR (Recall): Measures the rate of correctly identifying Known/Monitored traffic.
- FPR: Measures the rate of incorrectly classifying Unknown/Unmonitored traffic as Known (Lower is better).
- Precision: Of the traces predicted as Monitored, the fraction that are truly Monitored.
- TNR: Measures the rate of correctly rejecting Unknown/Unmonitored traffic (TNR = 1 - FPR).
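These quantities follow directly from the `{-1, 1}` labels and model confidences; a small illustrative sketch (not the notebook's exact code):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Dummy ground truth (1 = monitored, -1 = unmonitored), predictions, and confidences.
y_true = np.array([ 1,  1,  1, -1, -1,  1, -1, -1])
y_pred = np.array([ 1,  1, -1, -1,  1,  1, -1, -1])
conf   = np.array([0.9, 0.8, 0.4, 0.3, 0.7, 0.95, 0.2, 0.1])  # score that a trace is monitored

tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[-1, 1]).ravel()
tpr = tp / (tp + fn)        # monitored traffic correctly accepted
fpr = fp / (fp + tn)        # unmonitored traffic wrongly accepted
precision = tp / (tp + fp)  # of everything accepted, how much is truly monitored
print(tpr, fpr, precision, roc_auc_score(y_true, conf))
```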
You can reproduce the experimental results by running the main Jupyter Notebook.
Open_binary.ipynb

Configuration:
Locate the Configuration section at the beginning of the notebook and ensure the SCENARIO variable is set as follows:
```python
SCENARIO = 'open_binary'
```

- Data Loading: Automatically loads monitored and unmonitored traffic data.
- Binary Label Construction:
  - Monitored traffic → `1`
  - Unmonitored traffic (value defined in `CURRENT_CONFIG['unmon_label']`) → `-1`
- Train/Test Split:
  - The full dataset (X, y_binary) is split into training (X_train, y_train) and testing (X_test, y_test) sets using stratified sampling (`stratify=y_binary`) to preserve class balance, with `test_size` and `random_state` taken from `CURRENT_CONFIG`.
- Nested Evaluation Loop: A results list (`results_binary`) is created and filled with results for every combination of:
  - Correlation Thresholds — from `CURRENT_CONFIG['corr_th']`
  - Confidence Threshold Percentiles — from `CURRENT_CONFIG['threshold_percentiles']`
  - Models — Random Forest (RF) and XGBoost (XGB)

  This produces a full grid of experiments.
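Conceptually the grid is just three nested loops. A minimal sketch (the concrete values below are assumptions; the notebook reads them from `CURRENT_CONFIG`):

```python
# Hypothetical grid values; the notebook takes these from CURRENT_CONFIG.
corr_thresholds = [1.0, 0.99, 0.95]
threshold_percentiles = [1, 3, 5]
models = ["RF", "XGB"]

results_binary = []
for corr_th in corr_thresholds:
    # preprocessing and model training happen here, once per corr_th
    for model_name in models:
        for th_pct in threshold_percentiles:
            # evaluation for this (corr_th, model, th_pct) combination
            results_binary.append({"corr_th": corr_th, "model": model_name, "th_pct": th_pct})

print(len(results_binary))  # 3 * 2 * 3 = 18 experiments in this toy grid
```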
- Preprocessing (per Correlation Threshold): For each `corr_th` value:
  - An `EnhancedPreprocessor(correlation_threshold=corr_th)` is created.
  - `X_train` → `fit_transform`, `X_test` → `transform`.
  - This removes highly correlated features according to the specified threshold.
- Model Training (Once per Correlation Threshold): For each correlation threshold, the notebook trains:
  - Random Forest (RF)
    - Trained on `X_train_prep` with binary labels `y_train`.
  - XGBoost (XGB)
    - Uses `CustomLabelEncoder` to convert the binary labels {-1, 1} into encoded form {0, 1} for XGBoost's `binary:logistic` objective (see the sketch after this list).
    - Trained on preprocessed features and encoded labels.
  - Note: Each model is trained only once per correlation threshold; threshold tuning happens afterward.
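The binary label encoding needed for XGBoost can be sketched as follows; this is a stand-in for the repository's `CustomLabelEncoder`, shown only to illustrate the mapping:

```python
import numpy as np

class SimpleLabelEncoder:
    """Map arbitrary labels (e.g., {-1, 1}) to contiguous integers {0, 1} and back."""

    def fit_transform(self, y):
        self.classes_ = np.unique(y)                      # e.g., array([-1, 1])
        self.mapper = {c: i for i, c in enumerate(self.classes_)}
        return np.array([self.mapper[v] for v in y])

    def inverse_transform(self, y_enc):
        return self.classes_[np.asarray(y_enc)]

le = SimpleLabelEncoder()
print(le.fit_transform([-1, 1, 1, -1]))  # [0 1 1 0]
print(le.inverse_transform([1, 0]))      # [ 1 -1]
```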
- Confidence-Based Threshold Tuning & Rejection: For each threshold percentile (`th_pct`):
  - Step A — Compute Confidence
    - RF: `rf_proba = rf_model.predict_proba(X_test_prep)`
    - XGB: `xgb_proba = xgb_model.predict_proba(X_test_prep)`
    - Confidence for each sample: `conf = np.max(proba, axis=1)`
  - Step B — Compute Decision Threshold
    - `threshold = np.percentile(conf, th_pct)`
  - Step C — Apply Rejection Rule
    - If `confidence < threshold`, reassign the prediction to `-1` (unmonitored).

  This implements an open-world rejection mechanism: if the model is not confident enough, the trace is classified as unmonitored (see the consolidated sketch below).
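Put together, the rejection step amounts to the following sketch (illustrative only; in the notebook `proba` comes from the trained RF or XGB model and `th_pct` from `CURRENT_CONFIG`):

```python
import numpy as np

# Dummy predicted probabilities for five test traces over the binary classes [-1, 1].
proba = np.array([[0.10, 0.90],
                  [0.45, 0.55],
                  [0.80, 0.20],
                  [0.05, 0.95],
                  [0.50, 0.50]])
classes = np.array([-1, 1])

pred = classes[np.argmax(proba, axis=1)]  # raw class predictions
conf = np.max(proba, axis=1)              # confidence of each prediction

th_pct = 3                                # example threshold percentile
threshold = np.percentile(conf, th_pct)   # decision threshold

# Rejection rule: low-confidence traces are reassigned to the unmonitored class (-1).
final_pred = np.where(conf < threshold, -1, pred)
print(threshold, final_pred)
```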
- Final Model Selection:
  - The final model is automatically selected based on the highest ROC-AUC score achieved across all configurations.
  - Outputs the Best Model, Optimal Correlation Threshold, Optimal Threshold Percentile, ROC-AUC, TPR, FPR, and Precision.
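Picking the winner then reduces to a single pass over the results list; a sketch with dummy entries (in the notebook the list is `results_binary`, and the key names here are assumptions):

```python
# Dummy grid results, purely for illustration.
results = [
    {"model": "RF",  "corr_th": 0.95, "th_pct": 3, "roc_auc": 0.952},
    {"model": "XGB", "corr_th": 1.00, "th_pct": 3, "roc_auc": 0.968},
]

best = max(results, key=lambda r: r["roc_auc"])  # highest ROC-AUC wins
print(best)
```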
The console will display the model selection process and final results:
```
================================================================================
OPEN-WORLD BINARY MODEL SELECTION (Primary: ROC-AUC)
================================================================================
• Model : XGB
• Corr Threshold : 1.0000
• Threshold Percentile : 3.0000
• Roc Auc : 0.9676
• Tpr : 0.9863
• Fpr : 0.2733
• Precision : 0.9581
================================================================================
```
Goal:
- Determine whether an incoming traffic trace belongs to a Monitored website (Known, classes 0–94) or an Unmonitored website (Unknown, -1).
- If the trace is accepted as Monitored, classify it into one of the 95 monitored website classes.
Key Metrics:
- Binary Detection Metrics
  - ROC-AUC — Primary model-selection metric
  - Precision — Secondary selection metric
  - TPR — True Positive Rate
  - FPR — False Positive Rate
  - TNR — True Negative Rate
- 95-Class Identification Metrics (computed only for samples that are accepted)
  - Monitored Accuracy
  - Monitored Macro-F1
- Overall Metrics (evaluated on all test samples)
  - Overall Accuracy
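The distinction between monitored-only and overall metrics can be sketched as follows; the masking shown is one plausible reading ("accepted" = not rejected as `-1`), so check the notebook for the exact definition:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Dummy combined test labels/predictions: 0-94 for monitored sites, -1 for unmonitored.
y_true = np.array([ 3, 17, 42, -1, -1,  8])
y_pred = np.array([ 3, -1, 42, -1,  8,  8])  # one monitored trace rejected, one unmonitored accepted

overall_acc = accuracy_score(y_true, y_pred)  # evaluated on all test samples

# 95-class metrics: only monitored samples that were accepted (assumption, see lead-in).
mask = (y_true != -1) & (y_pred != -1)
mon_acc = accuracy_score(y_true[mask], y_pred[mask])
mon_f1 = f1_score(y_true[mask], y_pred[mask], average="macro")
print(overall_acc, mon_acc, mon_f1)
```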
Run the notebook:
Open_multi.ipynb
Configuration:
Locate the Configuration section at the beginning of the notebook and ensure the SCENARIO variable is set as follows:
```python
SCENARIO = 'open_multi'
```

- Data Loading
  - The notebook loads:
    - Monitored dataset (95 classes)
    - Unmonitored dataset (label = `-1`)
  - Important:
    - Training uses only Monitored samples
    - Unmonitored samples are used only for testing
- Train/Test Split

  ```python
  # Monitored split (train + test)
  X_mon_train, X_mon_test, y_mon_train, y_mon_test = train_test_split(..., stratify=y_mon)
  # Unmonitored split (test only)
  X_unmon_train, X_unmon_test, ... = train_test_split(...)
  ```
- Note: Only X_unmon_test is used for evaluation. The unmonitored train split is not used because the model is trained exclusively on monitored samples.
- Nested Evaluation Loop
  - For each combination of:
    - Correlation Threshold ∈ `[1.0, 0.99, 0.98, 0.95, 0.9]`
    - Threshold Percentile ∈ `[10, 15, 20, 25, 30]`
    - Model ∈ `{RF, XGB}`

    the notebook runs:
    - Preprocessing
    - Model training (monitored only)
    - Open-set rejection
    - Evaluation
  - All results go into `results_open`.
- Preprocessing — Per Correlation Threshold

  ```python
  prep = EnhancedPreprocessor(correlation_threshold=corr_th)
  X_mon_train_prep = prep.fit_transform(X_mon_train)
  X_mon_test_prep = prep.transform(X_mon_test)
  X_unmon_test_prep = prep.transform(X_unmon_test)
  ```
The preprocessor performs:
- Correlation-based feature pruning
- StandardScaler normalization
- Fit using only Monitored training data
- Stores `feature_names`, ensuring proper column ordering at transform time
- Model Training
  - Two models are trained once per correlation threshold; models are never trained on unmonitored data.
  - Random Forest

    ```python
    rf_model = RandomForestClassifier(**CURRENT_CONFIG['rf'])
    rf_model.fit(X_mon_train_prep, y_mon_train)
    ```

  - XGBoost

    ```python
    le = CustomLabelEncoder()
    y_mon_train_enc = le.fit_transform(y_mon_train)
    xgb_model = XGBClassifier(
        objective='multi:softprob',
        num_class=len(le.mapper),
        **CURRENT_CONFIG['xgb']
    )
    xgb_model.fit(X_mon_train_prep, y_mon_train_enc)
    ```
- Open-Set Rejection (Confidence Thresholding)
  - Note: This implementation computes the threshold using the monitored test split, not a separate validation set. The code does not create or use a dedicated validation set.
  - For each percentile (`threshold_pct`):
    - A. Compute a threshold from monitored confidences (taken from the monitored test split, per the note above)

      ```python
      y_mon_conf = np.max(y_mon_proba, axis=1)
      threshold = np.percentile(y_mon_conf, threshold_pct)
      ```

    - B. Apply rejection separately

      Monitored:

      ```python
      y_mon_pred = np.where(
          y_mon_conf >= threshold,
          y_mon_pred_classes,
          -1
      )
      ```

      Unmonitored:

      ```python
      y_unmon_conf = np.max(y_unmon_proba, axis=1)
      y_unmon_pred = np.where(
          y_unmon_conf >= threshold,
          y_unmon_pred_classes,  # argmax-based predicted classes
          -1
      )
      ```

    - C. Combine

      ```python
      y_pred_all = np.concatenate([y_mon_pred, y_unmon_pred])
      y_true_all = np.concatenate([y_mon_test, y_unmon_test])
      ```

  - This implements Max-Softmax Open-Set Recognition (OSR).
- Evaluation
  - Metrics are computed directly inside the open-multi evaluation loop and stored in `results_open`.
  - Includes:
    - Binary Detection: TPR, FPR, TNR, Precision, ROC-AUC
    - Overall Metrics: Overall Accuracy
    - Monitored 95-Class Metrics: Monitored Accuracy, Monitored Macro-F1
  - Each result is appended to `results_open`.
  - Threshold Stability: Among configurations with similar ROC-AUC and Precision, mid-range percentiles (e.g., 15–30%) are preferred to avoid unstable extremes.
- Final Model Selection
  - The best model is selected using:
    - ROC-AUC (Primary)
    - Precision (Secondary)
    - Threshold Stability (Tertiary)
  - The final best row is selected accordingly using:

    ```python
    FINAL_MODEL_INFO = select_final_model_strict(
        'open_multi',
        df_open,
    )
    ```
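As an illustration of this strict selection order (not the repository's `select_final_model_strict` itself), the tie-breaking can be expressed as a sort over the results table; column names below are assumptions that mirror the metrics above:

```python
import pandas as pd

# Dummy results grid, purely for illustration.
df_open = pd.DataFrame([
    {"model": "RF",  "corr_th": 0.99, "th_pct": 30, "roc_auc": 0.840, "precision": 0.971},
    {"model": "XGB", "corr_th": 1.00, "th_pct": 10, "roc_auc": 0.840, "precision": 0.955},
    {"model": "RF",  "corr_th": 0.95, "th_pct": 20, "roc_auc": 0.835, "precision": 0.980},
])

# Tertiary criterion: prefer mid-range percentiles (15-30%) as a proxy for threshold stability.
df_open["stable"] = df_open["th_pct"].between(15, 30).astype(int)

best = df_open.sort_values(["roc_auc", "precision", "stable"], ascending=False).iloc[0]
print(best)
```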
- 95-Class Direct Classifier (Baseline)
  - The updated implementation also evaluates a monitored-only 95-class classifier:
    - Trained on: Monitored training data only
    - Tested on: Monitored test + Unmonitored test
    - Performs NO open-set rejection
  - This shows how poorly a closed-world classifier performs when unmonitored samples appear.
The console will display the model selection process and final results:
```
================================================================================
OPEN-WORLD MULTI-CLASS MODEL SELECTION
Primary: ROC-AUC | Secondary: Precision | Tertiary: Threshold Stability
================================================================================
SELECTED MODEL: RF
• Corr Threshold: 0.99
• Threshold Percentile: 30.0%
• ROC-AUC: 0.8403
• Precision: 0.9712
• TPR: 0.7000
• FPR: 0.1317
• Overall Acc: 0.7050
================================================================================
```
This module evaluates robustness against traffic analysis defenses.
Download Link: Google Drive - Defense Dataset
The dataset is stored in .cell format within compressed archives.
Note: Please extract the .zip files within your code execution environment.
| Type | File Name | Classes | Total Instances | Note |
|---|---|---|---|---|
| Monitored | `mon_50.zip` | 50 (0–49) | 10,000 | Balanced (200/class) |
| Unmonitored | `unmon_5000.zip` | 1 (-1) | 5,000 | Single class |
Each instance contains multiple variations (defense simulations):
- Keys: `['split_0', 'split_1', 'split_2', 'split_3', 'split_4', 'join']`
- Shape: $(N, 3)$, where $N$ is the packet count
- Features: `[Time, Direction, Signed Size]`
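Given one parsed instance (a dict of variants, each an $(N, 3)$ array of `[Time, Direction, Signed Size]`), simple per-variant statistics fall out directly. The instance below is synthetic, since the actual `.cell` loading code lives in the notebook:

```python
import numpy as np

def synth_trace(n=50):
    """Build a fake (n, 3) trace: sorted times, +/-1 directions, signed packet sizes."""
    times = np.sort(np.random.rand(n))
    directions = np.random.choice([-1, 1], size=n)
    sizes = directions * np.random.randint(60, 1500, size=n)  # sign follows direction
    return np.column_stack([times, directions, sizes])

# Synthetic stand-in for one instance with all defense variants.
instance = {key: synth_trace() for key in
            ["split_0", "split_1", "split_2", "split_3", "split_4", "join"]}

for key, trace in instance.items():
    times, directions, sizes = trace[:, 0], trace[:, 1], trace[:, 2]
    print(key,
          "packets:", len(trace),
          "outgoing:", int((directions > 0).sum()),
          "total bytes:", int(np.abs(sizes).sum()))
```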
| Scenario | Description |
|---|---|
| Closed-World | Standard 50-class classification |
| Open-Binary | Detect whether a trace is from a monitored site |
| Open-Multi | Detect + classify monitored sites among unknown ones |
Traffic defenses are simulated through split-domain datasets and robust feature reduction:
- join → split: models trained on normal traffic, tested on defended (domain-shifted) traces.
- split → split: models trained and tested on defended traffic.
You can reproduce the experimental results by running the main Jupyter Notebook.
break_WF_defense.ipynb

This notebook covers feature extraction, preprocessing, and evaluation for all three scenarios (Closed, Open-Binary, Open-Multi). You can run these steps sequentially in the provided notebook/script.
Prerequisites:
Ensure the dataset files (mon_50.zip, unmon_5000.zip) are located in the data directory as defined in the notebook.
Before running any specific experiments, load the raw data and extract features.
Evaluate performance on 50 monitored sites using Multi-class Classification.
Step 1: Create Train/Test Splits
- Generate stratified train/test sets for both 'Join' (undefended) and 'Split' (defended) datasets.
```python
# Returns dictionaries containing X_train, X_test, y_train, y_test for each split
full_split_datasets, summary = create_split_trained_datasets(mon_data_full)
```

Step 2: Run Evaluation Scenarios
- Run both Join-Trained (Baseline) and Split-Trained (Robustness) scenarios.
```python
# Returns results for Scenario 1 (df_s1) and Scenario 2 (df_s2)
df_s1, df_s2 = run_all_scenarios(mon_data_full, full_split_datasets, models_to_closed)
```

Step 3: Visualization
- Compare performance across feature sets (Full vs. Robust vs. Basic).
```python
# Generate Bar plots and Line charts for Macro-F1 & Accuracy
plot_feature_set_comparison(df_s1_full, df_s2_full, ...)
plot_macro_f1_lines(df_join_all, df_split_all)
```

Evaluate the ability to distinguish Monitored vs. Unmonitored traffic.
Step 1: Run Scenarios (Auto-Threshold)
- Train the model and automatically find the optimal confidence threshold for detection.
```python
# Find optimal threshold (ROC-AUC, TPR, FPR, etc.)
df_auto_full = run_auto_threshold_binary_all(mon_data_full, unmon_data_full, models_to_open_binary)

# Run evaluation with fixed thresholds
df_fixed_full = run_fixed_threshold_binary_all(mon_data_full, unmon_data_full, models_to_open_binary, th_full)
```

Step 2: Visualization
- Visualize detection performance and trade-offs.
```python
# Plot F1-Score, ROC-AUC, and TPR vs. FPR curves
plot_open_binary_all(df_fixed_full, df_fixed_robust, df_fixed_basic)
```

Evaluate model performance on open-set multi-class classification, which jointly measures detection (monitored vs. unmonitored) and class-level identification.
Step 1: Run Scenarios (Auto-Threshold) Two primary setups are evaluated:
- Join → Split: Train on undefended (Join) traffic, test on defended (Split) traffic → tests domain-shift robustness.
- Split → Split: Train and test on defended traffic → tests in-defense adaptability.
```python
# Run Open-Multi evaluation for all feature sets
df_multi_full = run_open_multi_pipeline(mon_data_full, unmon_data_full, models_to_open_multi)
df_multi_robust = run_open_multi_pipeline(mon_data_robust, unmon_data_robust, models_to_open_multi)
df_multi_basic = run_open_multi_pipeline(mon_data_basic, unmon_data_basic, models_to_open_multi)
```

Step 2: Visualization
- Compare detection and classification metrics across thresholds and feature sets.
```python
# Plot Detection-F1 and Class-F1 by Threshold
plot_open_multi_f1(df_multi_full, df_multi_robust, df_multi_basic)

# Plot ROC-AUC trends by Feature Set
plot_open_multi_auc(df_multi_full, df_multi_robust, df_multi_basic)
```
- Deep Fingerprinting (Sirinam et al., CCS 2018) — Baseline deep model.
- Subverting Website Fingerprinting Defenses with Robust Traffic Representation (Shen et al., 2023) — Inspiration for TAM representations.
Please address any questions to the authors/developers:
- Eunhyeon Kwon (keh54110@gmail.com)
- Hyewon Kim (hhongyeahh@gmail.com)
- Yoonhyung Park (pyoon820@gmail.com)
- Jeongin Heo (jeongin0822@gmail.com)
- Seoyeon Ahn (aliceasy504@gmail.com)