diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
index 85dcac3..db04b09 100644
--- a/.github/workflows/ci.yml
+++ b/.github/workflows/ci.yml
@@ -25,6 +25,11 @@ jobs:
           python-version: 3.11
           auto-activate: false
 
+      - name: Install system dependencies for OpenCV
+        run: |
+          sudo apt-get update
+          sudo apt-get install -y libtheora-dev
+
       - name: Run pre-commit (linting/formatting/mypy etc)
         shell: bash -l {0}
         run: |
diff --git a/Analysis_And_Thoughts.md b/Analysis_And_Thoughts.md
index 25444a1..17dd68f 100644
--- a/Analysis_And_Thoughts.md
+++ b/Analysis_And_Thoughts.md
@@ -1,167 +1,57 @@
-#### Do we need to normalize the images to ImageNet's normalization values during training (fine-tuning) and inference?
-
-If you’re fine-tuning a pretrained model (like a ResNet from `torchvision.models`), then yes — you pretty much have to normalize with ImageNet’s mean/std (or at least something very close) during both training and inference.
-
-Why?
-
-That pretrained model was trained on ImageNet with inputs normalized to `mean=[0.485, 0.456, 0.406]` and `std=[0.229, 0.224, 0.225]`.
-
-The convolutional filters (especially in the early layers) are tuned to pick up patterns assuming pixel values are centered around 0 and scaled accordingly.
-
-If you skip normalization and just feed `[0, 1]`-scaled tensors, the activations are shifted and scaled incorrectly, which leads to much worse performance.
-
-⚖️ The “no way around it” part:
-
-If you don’t want to normalize, the only real option is to train from scratch on your dataset (starting with random weights). Then the model can learn directly from your unnormalized inputs.
-
-But if you’re using pretrained weights for transfer learning, normalization is required to benefit from that pretrained knowledge.
-
-✨ TL;DR:
-
-Fine-tuning a pretrained model → must normalize.
-
-Training from scratch → normalization is optional, but recommended (better stability).
-
----
-
-#### What's the `AdamW` optimizer?
-
-AdamW is basically Adam with weight decay done the right way.
-
-🔍 The backstory
-
-Adam is an adaptive optimizer that adjusts the learning rate for each parameter based on past gradients.
-
-If you want L2 regularization (to prevent overfitting), you normally add weight decay.
-
-In original Adam, adding L2 was done by simply adding λ * w to the gradient — but that interacts badly with Adam’s internal moving averages and scaling, making the regularization effect weaker and inconsistent.
-
-💡 What AdamW changes
-
-AdamW decouples weight decay from the gradient update.
-
-It applies the Adam update first, based on gradients.
-
-Then it shrinks the weights directly by a small factor:
-
-$$
-w \leftarrow w - \eta \cdot \lambda \cdot w
-$$
-
-This makes the regularization effect predictable and stable, especially for large models.
-
-📌 TL;DR
-
-Adam + naive L2 = kinda messy.
-AdamW = Adam + proper, decoupled weight decay.
-Used in most modern deep learning setups (Transformers, ViTs, large CNNs).
-If you’re doing vision or NLP today, AdamW is usually your default unless you have a reason not to.
-
----
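👉 Putting the two notes above together, a rough fine-tuning setup would look something like the sketch below. The 9-class head matches the number of wafer-map defect classes used in this repo, but the crop sizes, learning rate, and weight decay are illustrative defaults rather than values taken from the actual training code:

```python
import torch
import torchvision.transforms as T
from torchvision import models

# ImageNet statistics that the pretrained ResNet weights expect
imagenet_transform = T.Compose(
    [
        T.Resize(256),
        T.CenterCrop(224),
        T.ToTensor(),  # scales pixel values to [0, 1]
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ]
)

# Pretrained backbone with a fresh classification head (9 defect classes assumed)
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = torch.nn.Linear(model.fc.in_features, 9)

# AdamW applies weight decay directly to the weights, decoupled from the
# adaptive gradient step, instead of folding lambda * w into the gradient.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)
```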
-#### If I do `model.eval()` then is it freezing the weights?
-
-Not exactly — `model.eval()` does not freeze weights in the sense of making them untrainable.
-
-What it does: it switches the model into evaluation mode.
-
-This changes the behavior of certain layers:
-Dropout → turned off (always passes all activations through)
-BatchNorm → uses running statistics instead of batch statistics
-
-No gradients are disabled yet — so if you call `loss.backward()` afterwards, gradients will still be computed and the weights can still change.
-
-So:
-
-`model.eval()` = inference behavior (but weights still trainable if you try).
-`with torch.no_grad()` = no gradients computed.
-`requires_grad=False` = actual freezing of weights.
-
----
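👉 The three options line up like this in code (a small torchvision ResNet is used purely as a stand-in model):

```python
import torch
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# 1) eval(): inference-time behavior only (dropout off, BatchNorm uses running
#    stats). The weights are still trainable if you call backward() and step().
model.eval()

# 2) no_grad(): no gradients are computed at all, which saves memory and compute;
#    usually combined with eval() for validation and inference.
with torch.no_grad():
    _ = model(torch.randn(1, 3, 224, 224))

# 3) requires_grad=False: actual freezing. These parameters receive no gradients
#    and are never updated by the optimizer.
for param in model.parameters():
    param.requires_grad = False
```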
-#### If I take a pretrained ResNet model, it's not necessary that I resize my input images to 224x224, right? That's only needed if I use the pretrained weights. But if I train it from scratch with images of size, say, 500x500, then I can infer only on images of that size. Does 224 come from the ImageNet dataset that was used to produce the pretrained model?
-
-1. Why 224×224 for pretrained ResNets?
-
-The standard ImageNet training pipeline resizes/crops images to 224×224.
-
-Pretrained ResNet weights assume that input size.
-
-The network’s convolution + pooling layers are flexible with spatial size, but the very last fully connected (FC) layer is tied to the feature map size that comes out of the conv stack.
-
-So if you take the pretrained model as-is, you must give it 224×224 (or something close, like 256 crops) so that the tensor shapes line up.
-
-2. What if you train from scratch?
-
-If you start training a ResNet from scratch with 500×500 inputs, the network will happily learn at that size.
-
-At inference, you can then feed 500×500 images and it will work fine — as long as you don’t change the size.
-
-BUT you can’t mix — if you trained on 500×500 and then try 224×224 at inference, the FC layer dimensions won’t match.
-
-3. Global Average Pooling trick
-
-Modern variants (like torchvision’s ResNet implementations) often use `AdaptiveAvgPool2d((1, 1))` before the FC layer.
-
-This adapts to any input size, reducing the spatial dimensions to 1×1 before the FC layer.
-
-This means:
-
-You can fine-tune a pretrained ResNet on larger images (e.g., 512×512).
-
-It won’t crash at inference time with different sizes.
-
-But — distribution shift is real: a ResNet pretrained on 224×224 and then fed 512×512 might not generalize as well unless you fine-tune it.
-
-TL;DR 🚀
-
-Pretrained ResNet (ImageNet weights) → stick to 224×224 (or retrain/fine-tune if you want a different size).
-
-Training from scratch → pick any size you want, but inference must match that size.
-
-Adaptive pooling makes the model size-flexible, but the statistical expectations (like receptive fields, learned filters) still favor the training size.
-
-##### But if I'm doing fine-tuning/transfer learning on a model that was originally trained on 224×224 images and I'm now using images of size 500×500, and it has `AdaptiveAvgPool2d()`, does it still work?
-
-Yup! That’s one of the big perks of `AdaptiveAvgPool2d` — it makes transfer learning with different input sizes possible. 🎯
-
-Let’s walk through what happens if you fine-tune a ResNet pretrained on 224×224 and then feed it 500×500 images:
-
-🔎 Flow through the network
-
-The convolution + pooling stack doesn’t care about fixed sizes — it just keeps reducing spatial dimensions as you go deeper.
-
-For 224×224 → you end up with a 7×7 feature map before the global pooling.
-
-For 500×500 → you end up with roughly a 16×16 feature map.
-
-Then `AdaptiveAvgPool2d((1, 1))` collapses whatever size you have (7×7, 16×16, 10×10, it doesn’t matter) down to 1×1 per channel.
-
-That gives you a fixed 2048-dimensional vector (for ResNet-50/101/152), no matter what the input size was.
-
-The FC layer then works fine, since it always gets the same 2048 features.
-
-⚠️ Caveats
-
-The pretrained weights were optimized for 224×224 crops.
-
-Feeding 500×500 changes the **effective receptive field**: each filter now “sees” a different scale of information.
-
-The model might not perform optimally unless you fine-tune it on your 500×500 dataset.
-
-Training will be slower and use more memory, since the early layers now process bigger feature maps.
-
-✅ Practical recipe
-
-If you’re fine-tuning:
-
-Load the pretrained ResNet (224×224 weights).
-
-Replace the final FC layer with one matching your number of classes.
-
-Train on your 500×500 images — the adaptive pooling makes it work.
-
-Consider using data augmentation with crops/resizing so the model isn’t too sensitive to one specific size.
-
-👉 So the short answer: Yes, it works thanks to `AdaptiveAvgPool2d`. But performance may improve if you fine-tune at your new resolution instead of just freezing the pretrained backbone.
+#### Class percentage exploration
+
+Class percentages for overall dataset:
+{
+    'Center_0': 77.62068965517241,
+    'Donut_1': 0.0,
+    'Edge-Loc_2': 12.241379310344827,
+    'Edge-Ring_3': 0.8620689655172413,
+    'Loc_4': 5.931034482758621,
+    'Random_5': 1.8275862068965518,
+    'Scratch_6': 0.7931034482758621,
+    'Near-full_7': 0.7241379310344828,
+    'none_8': 0.0
+}
+
+
+Class percentages for train dataset:
+{
+    'Center_0': 77.70935960591133,
+    'Donut_1': 0.0,
+    'Edge-Loc_2': 12.31527093596059,
+    'Edge-Ring_3': 0.7389162561576355,
+    'Loc_4': 6.280788177339902,
+    'Random_5': 1.5394088669950738,
+    'Scratch_6': 0.6773399014778325,
+    'Near-full_7': 0.7389162561576355,
+    'none_8': 0.0
+}
+
+
+Class percentages for validation dataset:
+{
+    'Center_0': 77.87356321839081,
+    'Donut_1': 0.0,
+    'Edge-Loc_2': 12.068965517241379,
+    'Edge-Ring_3': 1.0057471264367817,
+    'Loc_4': 4.885057471264368,
+    'Random_5': 2.586206896551724,
+    'Scratch_6': 0.8620689655172413,
+    'Near-full_7': 0.7183908045977011,
+    'none_8': 0.0
+}
+
+Class percentages for test dataset:
+{
+    'Center_0': 77.06896551724138,
+    'Donut_1': 0.0,
+    'Edge-Loc_2': 12.241379310344827,
+    'Edge-Ring_3': 1.0344827586206897,
+    'Loc_4': 6.206896551724138,
+    'Random_5': 1.7241379310344827,
+    'Scratch_6': 1.0344827586206897,
+    'Near-full_7': 0.6896551724137931,
+    'none_8': 0.0
+}
+
+Even though we did not explicitly stratify the train/val/test split by class, the class percentages in each split still closely match the overall dataset. I think this is a result of the dataset being large enough.
diff --git a/nanodefectnet/scripts/data_preprocess.py b/nanodefectnet/scripts/data_preprocess.py
index ebd1852..c0c33ea 100644
--- a/nanodefectnet/scripts/data_preprocess.py
+++ b/nanodefectnet/scripts/data_preprocess.py
@@ -1,5 +1,5 @@
 import os
-from typing import Optional
+from typing import Optional, Dict
 
 import numpy as np
 import pandas as pd
@@ -215,6 +215,28 @@ def save_wafer_images(df: pd.DataFrame, split_name: str, output_dir: str) -> Non
 
 
+def compute_class_percentages(df: pd.DataFrame) -> Dict:
+    """
+    Computes the percentage of each class in the provided wafer map data.
+
+    Args:
+        df (pd.DataFrame): DataFrame containing wafer map data.
+
+    Returns:
+        Dict: A dictionary with class names as keys and their percentages as values.
+ """ + class_percentages = {} + num_classes = len(FAILURE_TYPE_TO_ID) + + for i in range(num_classes): # For each wafermap defect class + class_mask = df[df["failureNum"] == i] + class_percentages[f"{ID_TO_FAILURE_TYPE[i]}_{i}"] = ( + len(class_mask) / len(df) * 100 + ) + + return class_percentages + + def make_train_val_test_split( df: pd.DataFrame, root_processed_dataset_path: str ) -> None: @@ -251,6 +273,22 @@ def make_train_val_test_split( LOGGER.info(f"Validation data percentage: {len(df_val) / len(df) * 100:.2f}%") LOGGER.info(f"Test data percentage: {len(df_test) / len(df) * 100:.2f}%") + LOGGER.debug( + f"Class percentages for overall dataset: {compute_class_percentages(df)}" + ) + LOGGER.debug(f"\n\n") + LOGGER.debug( + f"Class percentages for train dataset: {compute_class_percentages(df_train)}" + ) + LOGGER.debug(f"\n\n") + LOGGER.debug( + f"Class percentages for validation dataset: {compute_class_percentages(df_val)}" + ) + LOGGER.debug(f"\n\n") + LOGGER.debug( + f"Class percentages for test dataset: {compute_class_percentages(df_test)}" + ) + save_wafer_images(df_train, "train", root_processed_dataset_path) save_wafer_images(df_val, "val", root_processed_dataset_path) save_wafer_images(df_test, "test", root_processed_dataset_path)