werowe/cyberSecurityAnalytics

This PySpark notebook performs network forensic analysis using unsupervised clustering to identify patterns and potential anomalies in network connection logs. Here's what it does, step by step:


1. Data Loading and Schema Definition

  • Imports modules for PySpark SQL, data types, machine learning, vectorization, and user-defined functions.
  • Defines a custom schema for reading a tab-separated network log file ("bigger.log") with fields like timestamps, IPs, ports, protocol, bytes, and packet counts.
  • Reads the log file into a Spark DataFrame, df, and replaces all missing values with zeros, producing df2 (see the sketch below).
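The notebook's exact code isn't reproduced on this page, but a minimal sketch of the loading step might look like the following. The file name "bigger.log" and the names df and df2 come from the notebook; the column names (ts, uid, origh, origp, resph, respp, proto, origbytes, respbytes, origpkts, origipbytes) are assumptions patterned on Zeek/Bro conn.log fields.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("cyberSecurityAnalytics").getOrCreate()

# Hypothetical schema patterned on Zeek/Bro conn.log; the notebook's actual
# field names may differ. Everything is read as a string and converted later.
schema = StructType([
    StructField("ts", StringType(), True),           # timestamp
    StructField("uid", StringType(), True),          # connection UID
    StructField("origh", StringType(), True),        # originating host IP
    StructField("origp", StringType(), True),        # originating port
    StructField("resph", StringType(), True),        # responding host IP
    StructField("respp", StringType(), True),        # responding port
    StructField("proto", StringType(), True),        # protocol
    StructField("origbytes", StringType(), True),    # bytes sent by originator
    StructField("respbytes", StringType(), True),    # bytes sent by responder
    StructField("origpkts", StringType(), True),     # packets sent by originator
    StructField("origipbytes", StringType(), True),  # IP-level bytes from originator
])

# Read the tab-separated log with the strict schema, then zero-fill gaps.
df = spark.read.csv("bigger.log", sep="\t", schema=schema)
df2 = df.na.fill("0")
```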

2. Feature Engineering

  • Defines a user-defined function (toInt) that converts string fields (such as IPs, protocol, ports, and byte counts) to numbers by concatenating the ASCII code of each character, giving the machine learning stage purely numeric features.
  • Registers the UDF, then applies it across the relevant columns to create integer-encoded columns such as iorigp, irespp, iproto, iorigbytes, irespbytes, iorigpkts, and iorigipbytes (sketched below).
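The exact UDF body isn't shown in this summary, so the following is a sketch of the ASCII-concatenation idea as described: the string "80" becomes 5648, since ord("8") is 56 and ord("0") is 48. This sketch returns LongType rather than a 32-bit integer because concatenated codes overflow quickly, and the intermediate name df3 is also an assumption.

```python
from pyspark.sql.functions import col
from pyspark.sql.types import LongType

def toInt(s):
    """Concatenate the ASCII code of each character: "80" -> 5648."""
    return int("".join(str(ord(c)) for c in s)) if s else 0

# Register the UDF; the returned handle is usable from the DataFrame API.
# LongType avoids 32-bit overflow for longer strings, though a very long
# field could still exceed it.
toIntUDF = spark.udf.register("toInt", toInt, LongType())

# Apply the UDF to each field used as an ML feature.
df3 = (df2.withColumn("iorigp", toIntUDF(col("origp")))
          .withColumn("irespp", toIntUDF(col("respp")))
          .withColumn("iproto", toIntUDF(col("proto")))
          .withColumn("iorigbytes", toIntUDF(col("origbytes")))
          .withColumn("irespbytes", toIntUDF(col("respbytes")))
          .withColumn("iorigpkts", toIntUDF(col("origpkts")))
          .withColumn("iorigipbytes", toIntUDF(col("origipbytes"))))
```

Note that this encoding is lossy and non-metric (nearby integers don't imply similar strings), which is worth keeping in mind when interpreting cluster distances.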

3. Vectorization

  • Uses VectorAssembler to combine selected integer feature columns into a single features vector column for ML.
  • The resulting DataFrame, router, is ready for clustering (see the sketch below).
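Assuming the seven integer-encoded columns above, the assembly step is short:

```python
from pyspark.ml.feature import VectorAssembler

# Combine the integer feature columns into a single "features" vector,
# continuing from df3 in the previous sketch.
assembler = VectorAssembler(
    inputCols=["iorigp", "irespp", "iproto", "iorigbytes",
               "irespbytes", "iorigpkts", "iorigipbytes"],
    outputCol="features",
)
router = assembler.transform(df3)
```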

4. Clustering Analysis (K-Means)

  • Trains a KMeans model (k=7 clusters, fixed seed) on the feature vectors.
  • Generates a cluster prediction for each connection record and appends the predicted label to the DataFrame, as sketched below.
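A sketch of the clustering step with pyspark.ml: k=7 comes from the notebook, while the specific seed value here is an assumption (the notebook fixes a seed, but its value isn't stated in this summary).

```python
from pyspark.ml.clustering import KMeans

# k=7 clusters with a fixed seed for reproducible assignments.
kmeans = KMeans(k=7, seed=1, featuresCol="features", predictionCol="prediction")
model = kmeans.fit(router)

# transform() appends a "prediction" column holding each record's cluster.
predictions = model.transform(router)
```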

5. Cluster Analysis and Visualization

  • Groups the predictions by cluster label to create a count of records per cluster.
  • Converts the grouped result to a pandas DataFrame and plots a bar chart with Plotly, showing how connection records are distributed across clusters.
  • Displays the cluster-count table for deeper inspection (see the sketch below).
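A sketch of the aggregation and chart, assuming the plotly.express API (the notebook may use Plotly's lower-level graph_objects instead). Only the small per-cluster counts are collected to pandas; the full log stays in Spark.

```python
import plotly.express as px

# Count records per cluster; this result is tiny (one row per cluster).
counts = predictions.groupBy("prediction").count().orderBy("prediction")
pdf = counts.toPandas()

# Bar chart of cluster sizes.
fig = px.bar(pdf, x="prediction", y="count",
             labels={"prediction": "cluster", "count": "connection records"})
fig.show()

# The raw cluster-count table for deeper inspection.
counts.show()
```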

6. Anomaly Identification

  • Filters the records assigned to cluster #2 (as an example "suspect" cluster).
  • Selects fields such as the timestamp, connection UID, originating host, and responding host for further manual inspection or investigation, as in the sketch below.
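A sketch of the filter, reusing the hypothetical column names from the schema sketch above (ts, uid, origh, resph); the notebook's actual names may differ.

```python
from pyspark.sql.functions import col

# Pull the records assigned to the example "suspect" cluster for manual review.
suspects = (predictions
            .filter(col("prediction") == 2)
            .select("ts", "uid", "origh", "resph"))
suspects.show(truncate=False)
```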

Summary

  • Purpose: To cluster massive network logs and identify group-level patterns or potential anomalies for security or operational forensics.
  • Workflow:
    • Loads tab-separated connection logs as a Spark DataFrame with strict schema.
    • Converts key fields to numeric for ML compatibility.
    • Performs clustering with K-Means.
    • Analyzes and visualizes the size of each cluster.
    • Filters and extracts suspicious or interesting subgroups for further analysis.
  • Result: Delivers a scalable, ML-driven pipeline for exploring and labeling network behavior, highlighting potentially unusual clusters for further review.
