This PySpark notebook performs network forensic analysis using unsupervised clustering to identify patterns and potential anomalies in network connection logs. Here's what it does, step by step (illustrative code sketches of the main stages follow the list):
- Imports modules for PySpark SQL, data types, machine learning, vectorization, and user-defined functions.
- Defines a custom schema for reading a tab-separated network log file ("bigger.log") with fields like timestamps, IPs, ports, protocol, bytes, and packet counts.
- Reads the log file into a Spark DataFrame `df` and replaces all missing values with zeros, producing `df2`.
- Defines a user-defined function (`toInt`) that converts string fields (such as IPs, protocol, ports, and byte counts) to integers by concatenating their ASCII values, making them usable as numeric machine-learning features.
- Registers the UDF, then applies it across the relevant columns to create integerized columns such as `iorigp`, `irespp`, `iproto`, `iorigbytes`, `irespbytes`, `iorigpkts`, and `iorigipbytes`.
- Uses `VectorAssembler` to combine the selected integer feature columns into a single `features` vector column for ML.
- The resulting DataFrame, `router`, is ready for clustering.
- Trains a KMeans model (k=7 clusters, fixed seed) on the feature vectors.
- Generates cluster predictions for each connection record and appends the predicted cluster to the DataFrame.
- Groups the predictions by cluster label to create a count of records per cluster.
- Converts the groupby result to a Pandas DataFrame and plots a bar chart with Plotly, showing the distribution of connection records across clusters.
- Displays the cluster count table for deeper inspection.
- Filters the records assigned to cluster #2 (as an example "suspect" cluster).
- Selects fields such as timestamp, connection UID, original host, and responding host for further manual inspection or investigation.
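The sketches below walk through these stages. They are minimal, hedged examples: the column names and schema are assumptions loosely modeled on Zeek/Bro conn.log fields, and the notebook's actual names and API usage may differ.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("network-forensics").getOrCreate()

# Assumed column names, loosely following Zeek/Bro conn.log conventions.
conn_schema = StructType([
    StructField("ts", StringType(), True),             # timestamp
    StructField("uid", StringType(), True),            # connection UID
    StructField("id_orig_h", StringType(), True),      # originating host IP
    StructField("id_orig_p", StringType(), True),      # originating port
    StructField("id_resp_h", StringType(), True),      # responding host IP
    StructField("id_resp_p", StringType(), True),      # responding port
    StructField("proto", StringType(), True),          # protocol
    StructField("orig_bytes", StringType(), True),     # bytes sent by originator
    StructField("resp_bytes", StringType(), True),     # bytes sent by responder
    StructField("orig_pkts", StringType(), True),      # packets sent by originator
    StructField("orig_ip_bytes", StringType(), True),  # IP-level bytes from originator
])

# Read the tab-separated log with the explicit schema, then replace missing values
# with zeros (string "0", since the raw fields are read here as strings).
df = spark.read.csv("bigger.log", sep="\t", schema=conn_schema)
df2 = df.fillna("0")
```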
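A minimal sketch of the `toInt` UDF as described above: it concatenates the ASCII code of each character and parses the result as an integer. The input column names are the assumed ones from the previous sketch.

```python
from pyspark.sql.functions import col, udf
from pyspark.sql.types import LongType

def to_int(value):
    """Concatenate the ASCII code of every character and parse the result as an integer."""
    if value is None:
        return 0
    return int("".join(str(ord(ch)) for ch in value))

# Note: for long strings (e.g. full IP addresses) the concatenated number can exceed
# the 64-bit range of LongType, so this is only safe for short fields such as ports,
# protocol names, and small counts.
toInt = udf(to_int, LongType())

df3 = (df2
       .withColumn("iorigp", toInt(col("id_orig_p")))
       .withColumn("irespp", toInt(col("id_resp_p")))
       .withColumn("iproto", toInt(col("proto")))
       .withColumn("iorigbytes", toInt(col("orig_bytes")))
       .withColumn("irespbytes", toInt(col("resp_bytes")))
       .withColumn("iorigpkts", toInt(col("orig_pkts")))
       .withColumn("iorigipbytes", toInt(col("orig_ip_bytes"))))
```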
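The `VectorAssembler` step then packs the integerized columns into a single `features` vector; the exact feature list here mirrors the columns above and is an assumption.

```python
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(
    inputCols=["iorigp", "irespp", "iproto", "iorigbytes",
               "irespbytes", "iorigpkts", "iorigipbytes"],
    outputCol="features",
)

# 'router' now carries a 'features' vector column alongside the original fields.
router = assembler.transform(df3)
```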
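A sketch of the clustering step with k=7 and a fixed seed, as described; the specific seed value is an assumption.

```python
from pyspark.ml.clustering import KMeans

kmeans = KMeans(k=7, seed=42, featuresCol="features", predictionCol="prediction")
model = kmeans.fit(router)

# Adds a 'prediction' column holding the cluster label for each connection record.
predictions = model.transform(router)
```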
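A sketch of the cluster-size summary and bar chart; `plotly.express` is assumed here, though the notebook may use a different Plotly API.

```python
import plotly.express as px

# Count records per cluster and inspect the table.
cluster_counts = predictions.groupBy("prediction").count().orderBy("prediction")
cluster_counts.show()

# Convert the small aggregate to Pandas and plot the cluster-size distribution.
pdf = cluster_counts.toPandas()
fig = px.bar(pdf, x="prediction", y="count",
             labels={"prediction": "cluster", "count": "connection records"})
fig.show()
```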
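Finally, a sketch of the drill-down into the example "suspect" cluster (#2), keeping the investigation fields under the assumed schema.

```python
from pyspark.sql.functions import col

# Pull out the records assigned to cluster #2 and keep the fields needed for
# manual inspection: timestamp, connection UID, originating host, responding host.
suspects = (predictions
            .filter(col("prediction") == 2)
            .select("ts", "uid", "id_orig_h", "id_resp_h"))
suspects.show(20, truncate=False)
```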
- Purpose: To cluster massive network logs and identify group-level patterns or potential anomalies for security or operational forensics.
- Workflow:
- Loads tab-separated connection logs as a Spark DataFrame with strict schema.
- Converts key fields to numeric for ML compatibility.
- Performs clustering with K-Means.
- Analyzes and visualizes the size of each cluster.
- Filters and extracts suspicious or interesting subgroups for further analysis.
- Result: Delivers a scalable, ML-driven pipeline for exploring and labeling network behavior, highlighting potentially unusual clusters for further review.