This PySpark notebook performs network forensic analysis using unsupervised clustering to identify patterns and potential anomalies in network connection logs. Here's what it does, step by step (illustrative code sketches of the main stages follow the list):
- Imports modules for PySpark SQL, data types, machine learning, vectorization, and user-defined functions.
- Defines a custom schema for reading a tab-separated network log file ("bigger.log") with fields like timestamps, IPs, ports, protocol, bytes, and packet counts.
- Reads the log file into a Spark DataFrame `df` and replaces all missing values with zeros, producing `df2`.
- Defines a user-defined function (`toInt`) that converts string fields (such as IPs, protocol, ports, and byte counts) to integers by concatenating their ASCII values, making them usable as numeric machine-learning features.
- Registers the UDF, then applies it across the relevant columns to create integerized columns such as `iorigp`, `irespp`, `iproto`, `iorigbytes`, `irespbytes`, `iorigpkts`, and `iorigipbytes`.
- Uses `VectorAssembler` to combine the selected integer feature columns into a single `features` vector column for ML.
- The resulting DataFrame, `router`, is ready for clustering.
- Trains a KMeans model (k=7 clusters, fixed seed) on the feature vectors.
- Generates cluster predictions for each connection record and appends the predicted cluster to the DataFrame.
- Groups the predictions by cluster label to create a count of records per cluster.
- Converts the groupby result to a Pandas DataFrame and plots a bar chart with Plotly, showing the distribution of connection records across clusters.
- Displays the cluster count table for deeper inspection.
- Filters the records assigned to cluster #2 (as an example "suspect" cluster).
- Selects fields such as timestamp, connection UID, original host, and responding host for further manual inspection or investigation.
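The sketches below walk through these stages. They are minimal, hedged examples: the column names and schema are assumptions loosely modeled on Zeek/Bro conn.log fields, and the notebook's actual names and API usage may differ.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("network-forensics").getOrCreate()

# Assumed column names, loosely following Zeek/Bro conn.log conventions.
conn_schema = StructType([
    StructField("ts", StringType(), True),             # timestamp
    StructField("uid", StringType(), True),            # connection UID
    StructField("id_orig_h", StringType(), True),      # originating host IP
    StructField("id_orig_p", StringType(), True),      # originating port
    StructField("id_resp_h", StringType(), True),      # responding host IP
    StructField("id_resp_p", StringType(), True),      # responding port
    StructField("proto", StringType(), True),          # protocol
    StructField("orig_bytes", StringType(), True),     # bytes sent by originator
    StructField("resp_bytes", StringType(), True),     # bytes sent by responder
    StructField("orig_pkts", StringType(), True),      # packets sent by originator
    StructField("orig_ip_bytes", StringType(), True),  # IP-level bytes from originator
])

# Read the tab-separated log with the explicit schema, then replace missing values
# with zeros (string "0", since the raw fields are read here as strings).
df = spark.read.csv("bigger.log", sep="\t", schema=conn_schema)
df2 = df.fillna("0")
```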
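A minimal sketch of the `toInt` UDF as described above: it concatenates the ASCII code of each character and parses the result as an integer. The input column names are the assumed ones from the previous sketch.

```python
from pyspark.sql.functions import col, udf
from pyspark.sql.types import LongType

def to_int(value):
    """Concatenate the ASCII code of every character and parse the result as an integer."""
    if value is None:
        return 0
    return int("".join(str(ord(ch)) for ch in value))

# Note: for long strings (e.g. full IP addresses) the concatenated number can exceed
# the 64-bit range of LongType, so this is only safe for short fields such as ports,
# protocol names, and small counts.
toInt = udf(to_int, LongType())

df3 = (df2
       .withColumn("iorigp", toInt(col("id_orig_p")))
       .withColumn("irespp", toInt(col("id_resp_p")))
       .withColumn("iproto", toInt(col("proto")))
       .withColumn("iorigbytes", toInt(col("orig_bytes")))
       .withColumn("irespbytes", toInt(col("resp_bytes")))
       .withColumn("iorigpkts", toInt(col("orig_pkts")))
       .withColumn("iorigipbytes", toInt(col("orig_ip_bytes"))))
```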
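The `VectorAssembler` step then packs the integerized columns into a single `features` vector; the exact feature list here mirrors the columns above and is an assumption.

```python
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(
    inputCols=["iorigp", "irespp", "iproto", "iorigbytes",
               "irespbytes", "iorigpkts", "iorigipbytes"],
    outputCol="features",
)

# 'router' now carries a 'features' vector column alongside the original fields.
router = assembler.transform(df3)
```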
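A sketch of the clustering step with k=7 and a fixed seed, as described; the specific seed value is an assumption.

```python
from pyspark.ml.clustering import KMeans

kmeans = KMeans(k=7, seed=42, featuresCol="features", predictionCol="prediction")
model = kmeans.fit(router)

# Adds a 'prediction' column holding the cluster label for each connection record.
predictions = model.transform(router)
```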
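A sketch of the cluster-size summary and bar chart; `plotly.express` is assumed here, though the notebook may use a different Plotly API.

```python
import plotly.express as px

# Count records per cluster and inspect the table.
cluster_counts = predictions.groupBy("prediction").count().orderBy("prediction")
cluster_counts.show()

# Convert the small aggregate to Pandas and plot the cluster-size distribution.
pdf = cluster_counts.toPandas()
fig = px.bar(pdf, x="prediction", y="count",
             labels={"prediction": "cluster", "count": "connection records"})
fig.show()
```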
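Finally, a sketch of the drill-down into the example "suspect" cluster (#2), keeping the investigation fields under the assumed schema.

```python
from pyspark.sql.functions import col

# Pull out the records assigned to cluster #2 and keep the fields needed for
# manual inspection: timestamp, connection UID, originating host, responding host.
suspects = (predictions
            .filter(col("prediction") == 2)
            .select("ts", "uid", "id_orig_h", "id_resp_h"))
suspects.show(20, truncate=False)
```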
- Purpose: To cluster massive network logs and identify group-level patterns or potential anomalies for security or operational forensics.
- Workflow:
- Loads tab-separated connection logs as a Spark DataFrame with strict schema.
- Converts key fields to numeric for ML compatibility.
- Performs clustering with K-Means.
- Analyzes and visualizes the size of each cluster.
- Filters and extracts suspicious or interesting subgroups for further analysis.
- Result: Delivers a scalable, ML-driven pipeline for exploring and labeling network behavior, highlighting potentially unusual clusters for further review.