Goal: obtain a smooth 2D latent representation of MNIST images that clusters naturally by digit, without using the labels (self-supervised learning, SSL).
Performed a few simple experiments focusing on the following questions:
- What augmentations are conducive to a meaningful latent space?
- Contrastive loss vs reconstructive loss
- Various other parameters: temperature, batch size, regularisations on latent space, ...
Evaluation: visual inspection of the latent space and kNN accuracy on a validation set with k = 1, 5, 20.
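A minimal sketch of the kNN evaluation, assuming the frozen encoder's embeddings and labels are available as NumPy arrays (the function name `knn_accuracy` is ours, not from the experiments):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier


def knn_accuracy(train_z, train_y, val_z, val_y, ks=(1, 5, 20)):
    """Classify each validation embedding by majority vote of its k nearest
    training embeddings and report accuracy for each k."""
    accs = {}
    for k in ks:
        clf = KNeighborsClassifier(n_neighbors=k)
        clf.fit(train_z, train_y)
        accs[k] = clf.score(val_z, val_y)
    return accs
```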
- Contrastive vs. reconstructive loss: contrastive loss works much better. It uses the latent space more efficiently and gives better kNN accuracy.
- Interpretation: reconstruction requires encoding far more information in the latent space than is necessary for class identity, which is what we care about.
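The notes don't pin down the exact contrastive objective; as one common instantiation, here is a minimal NT-Xent (SimCLR-style) sketch in PyTorch, where `temperature` is the temperature parameter mentioned above and `z1`, `z2` are embeddings of two augmented views of the same batch:

```python
import torch
import torch.nn.functional as F


def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent / InfoNCE loss over a batch of positive pairs (z1[i], z2[i])."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)            # (2N, d)
    sim = z @ z.t() / temperature             # cosine similarities
    n = z1.shape[0]
    sim.fill_diagonal_(float("-inf"))         # never treat a sample as its own candidate
    # The positive for sample i is its augmented twin at i + n (mod 2n).
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```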
- Subtle augmentations that mimic "natural" variation between digits work better than more "artificial" ones.
- Interpretation: data augmentation allows us to guide the model to learn invariances. If the invariances correspond to within-label variance, the representation gravitates towards clustering by labels.
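The exact augmentations used aren't listed here; a sketch of what a "subtle" pipeline could look like with torchvision's `RandomAffine`, with purely illustrative parameter values:

```python
from torchvision import transforms

subtle_augment = transforms.Compose([
    transforms.RandomAffine(
        degrees=10,              # small rotations, like natural handwriting slant
        translate=(0.05, 0.05),  # only slight shifts; strong translations hurt (see below)
        scale=(0.9, 1.1),        # mild size variation
        shear=5,                 # gentle shear, similar to stroke-angle variation
    ),
    transforms.ToTensor(),
])
```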
- L2 regularisation improves kNN performance.
- Interpretation: L2 regularisation forces the embeddings to cluster more tightly, improving the global coherence of the representation.
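One reading of "L2 regularisation on the latent space" is a penalty on the squared norm of the embeddings themselves, added to the contrastive loss; the helper name and weighting below are illustrative only:

```python
import torch


def embedding_l2_penalty(z: torch.Tensor) -> torch.Tensor:
    """Mean squared L2 norm of the (pre-normalisation) embeddings."""
    return z.pow(2).sum(dim=1).mean()


# In the training step, reusing the nt_xent_loss sketch above:
# loss = nt_xent_loss(z1, z2, temperature) \
#        + lambda_l2 * (embedding_l2_penalty(z1) + embedding_l2_penalty(z2))
```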
Clear clusters of similar digits. For example, the "1" (orange) cluster is well separated. Local alignment is strong, though the global structure still has room for improvement.
Original images are shown in the left-most column; remaining columns display applied augmentations.
Strong positional translations didn't work well, presumably because MNIST images are already well-centered, so the generated variation does not reflect natural within-digit variation and is not useful for our goal.
- Might be interesting to find ways to perturb the latent space itself, ideally leading to more global coherence; unclear how, though.
- Maybe a VAE-inspired distribution embedding per image, instead of a single point? (Rough sketch below.)
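A very rough sketch of the "distribution per image" idea, assuming PyTorch: the encoder head outputs a mean and log-variance, and training samples an embedding via the reparameterisation trick. All names are hypothetical and this is not something that was implemented in these experiments.

```python
import torch
import torch.nn as nn


class GaussianHead(nn.Module):
    """Maps encoder features to a 2D Gaussian per image instead of a single point."""

    def __init__(self, feature_dim, latent_dim=2):
        super().__init__()
        self.mu = nn.Linear(feature_dim, latent_dim)
        self.log_var = nn.Linear(feature_dim, latent_dim)

    def forward(self, h):
        mu, log_var = self.mu(h), self.log_var(h)
        z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()  # reparameterisation trick
        # KL divergence to N(0, I) could serve as the latent-space regulariser.
        kl = 0.5 * (mu.pow(2) + log_var.exp() - 1.0 - log_var).sum(dim=1).mean()
        return z, kl
```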




