diff --git a/_posts/2023-11-11-plankton-representation.md b/_posts/2023-11-11-plankton-representation.md
new file mode 100644
index 00000000..b6cff043
--- /dev/null
+++ b/_posts/2023-11-11-plankton-representation.md
@@ -0,0 +1,130 @@
+---
+layout: distill
+title: Comparing Clustering in Representation Space for Plankton Images
+description: Plankton imaging devices are becoming a key method for gathering in-situ data about plankton communities. These instruments can produce millions of images, so automated processing is necessary for extracting information from the images, making the application of machine learning to plankton data a key step in advancing the study of ocean biogeochemistry. In this project I will explore how the representation of plankton images in the latent space differs between classic supervised learning, the unsupervised contrastive learning method described in SimCLR, and a modified supervised contrastive learning method.
+date: 2023-11-11
+htmlwidgets: true
+
+# Anonymize when submitting
+# authors:
+#   - name: Anonymous
+
+authors:
+  - name: Barbara Duckworth
+    url: "https://github.com/barbara42"
+    affiliations:
+      name: MIT
+
+
+# must be the exact same name as your blogpost
+bibliography: 2023-11-11-plankton-representation.bib
+
+# Add a table of contents to your post.
+#   - make sure that TOC names match the actual section names
+#     for hyperlinks within the post to work correctly.
+toc:
+  - name: Equations
+  - name: Images and Figures
+    subsections:
+    - name: Interactive Figures
+  - name: Citations
+  - name: Footnotes
+  - name: Code Blocks
+  - name: Layouts
+  - name: Other Typography?
+
+# Below is an example of injecting additional post-specific styles.
+# This is used in the 'Layouts' section of this post.
+# If you use this post as a template, delete this _styles block.
+_styles: >
+  .fake-img {
+    background: #bbb;
+    border: 1px solid rgba(0, 0, 0, 0.1);
+    box-shadow: 0 0px 4px rgba(0, 0, 0, 0.1);
+    margin-bottom: 12px;
+  }
+  .fake-img p {
+    font-family: monospace;
+    color: white;
+    text-align: left;
+    margin: 12px 0;
+    text-align: center;
+    font-size: 16px;
+  }
+---
+# Introduction
+
+Plankton are fundamental to ocean ecosystems, serving as the primary producers in marine food webs and playing a critical role in our planet's carbon cycle. Understanding their distribution and behavior is vital for assessing ocean health and predicting environmental changes.
+
+Marine imaging devices, such as the Imaging FlowCytobot, produce millions of data points during expeditions, making machine learning methods necessary for analyzing their output. This project aims to advance this analysis by exploring representation learning for plankton images.
+
+Representation learning, where the system learns to automatically identify and capture the most relevant features of the data, can provide insight into morphological patterns that are not immediately obvious, and improve classification of the organisms.
+
+Using a new dataset from a 2017 North Pacific research cruise, I will compare the effectiveness of classic supervised learning, unsupervised contrastive learning as described in SimCLR, and a modified supervised contrastive learning method.
+
+This exploration may offer insights into the latent space representation of plankton imagery, enhancing our understanding of these crucial organisms and improving the efficiency of ecological data processing.
+
+# Dataset
+
+{% include figure.html path="assets/img/2023-11-11-plankton-representation/fig-IFCB-examples.png" class="img-fluid" %}
+
+Data for this project was gathered using the Imaging FlowCytobot (IFCB), an in-situ automated submersible imaging flow cytometer. It generates images of particles in aquatic samples that are between 10 and 200 microns in size.
These particles might be detritus, organisms, or beads used for calibration. In this dataset, gathered during the Gradients research cruise in the North Pacific in 2017, there are 168,406 images. Each image has been classified by a taxonomic expert, Fernanda Freitas, as part of Angelicque White's research group at the University of Hawaii. There are 170 unique classes, with a handful of classes dominating the data. The distribution of samples per class can be seen in the figure below.
+
+{% include figure.html path="assets/img/2023-11-11-plankton-representation/fig-dataset-class-distribution.png" class="img-fluid" %}
+
+# Methodologies
+
+## Baseline classifier
+
+To lighten the workload of the project, I will use a codebase developed by a team at WHOI that is designed to work with IFCB data: the [WHOI IFCB classifier](https://github.com/WHOIGit/ifcb_classifier). This CNN classifier offers several well-known architectures as the backbone, and I will use ResNet. The WHOI IFCB classifier first resizes all images, and I will use the standard class balancing and augmentation techniques it provides.
+
+## SimCLR
+
+SimCLR is an unsupervised contrastive learning method for visual data. The labels of the dataset are ignored, and each image gets a "positive" partner generated from a set of augmentations. All other images in a batch are then assumed to be "negatives". Positive pairs are pulled together and pushed apart from the negatives in representation space using the NT-Xent (normalized temperature-scaled cross entropy) loss function. For the project, the base encoder will be ResNet and the subsequent projection head will be a 2-layer MLP, with all other settings based on the defaults described in the SimCLR paper.
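As a rough illustration of the loss named above, here is a minimal NumPy sketch of NT-Xent. It assumes a batch of 2N embeddings in which rows 2m and 2m+1 are the two augmented views of the same image; that row layout, the function name `nt_xent_loss`, and the default temperature of 0.5 are choices made for this example, not details of the project code. Following the SimCLR paper, only each embedding's similarity with itself is left out of the denominator.

```python
import numpy as np


def nt_xent_loss(z, tau=0.5):
    """Mean NT-Xent loss over 2N embeddings.

    Assumption for this sketch: rows 2m and 2m+1 of `z` are a positive pair.
    """
    z = np.asarray(z, dtype=float)
    # Cosine similarity: L2-normalize the rows, then take dot products.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = (z @ z.T) / tau
    n = sim.shape[0]
    # Exclude self-similarity (k == i) from the denominator.
    np.fill_diagonal(sim, -np.inf)
    # Index of each row's positive partner: 0 <-> 1, 2 <-> 3, ...
    pos = np.arange(n) ^ 1
    # Numerically stable log-sum-exp over all k != i.
    row_max = sim.max(axis=1, keepdims=True)
    log_denom = row_max[:, 0] + np.log(np.exp(sim - row_max).sum(axis=1))
    # -log(exp(s_pos) / sum_k exp(s_k)) = log_denom - s_pos
    loss = log_denom - sim[np.arange(n), pos]
    return loss.mean()
```

A batch whose positive pairs point in the same direction should score lower than one whose pairs are orthogonal, which is a quick sanity check before plugging a loss like this into training.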
+
+{% include figure.html path="assets/img/2023-11-11-plankton-representation/fig-simCLR-diagram.png" class="img-fluid" %}
+
+The NT-Xent loss function is defined as follows:
+
+$$
+\ell_{i,j} = -\log \left( \frac{\exp(\text{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\text{sim}(z_i, z_k)/\tau)} \right)
+$$
+
+
+where $z_i$ and $z_j$ form a positive pair, $z_k$ ranges over the other $2N - 1$ embeddings in the batch (the positive and all negatives), and $\tau$ is the temperature parameter.
+
+
+## Supervised Contrastive Learning
+
+Instead of just using data augmentation to create a positive pair $x_j$ for image $x_i$, the positive match for image $x_i$ will be chosen from the pool of images with the same class label as $x_i$. Similarly, the negative matches will not be the full set of data, but rather only images from classes that $x_i$ does not belong to.
+
+The resulting supervised normalized temperature-scaled cross entropy loss function (SNT-Xent) is defined as follows:
+
+$$
+\ell_{i,j} = -\log \left( \frac{\exp(\text{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{C}_{[i, k]} \exp(\text{sim}(z_i, z_k)/\tau)} \right)
+$$
+
+where $\mathbb{C}_{[i, k]}$ is an indicator that evaluates to 1 if the class $C_k$ of image $x_k$ differs from the class $C_i$ of image $x_i$, and to 0 otherwise:
+
+$$
+\mathbb{C}_{[i, k]} =
+
+\begin{cases}
+1 & \text{if } C_i \neq C_k \\
+0 & \text{if } C_i = C_k \\
+\end{cases}
+$$
+
+I am choosing not to pull all images from the same class together due to computational constraints. Instead, $z_j$ is randomly chosen from the pool of images with the same class as $z_i$.
+
+## Representation Space Analysis
+
+Latent space representations will be extracted from each of the models at the layer preceding the last fully connected layer.
+
+I will use t-SNE (t-Distributed Stochastic Neighbor Embedding) to visualize the feature vectors.
+
+I will then use K-means clustering and compare the emergent groups to the labels in the dataset.
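One way the comparison between K-means groups and expert labels could be quantified is cluster purity: the fraction of images whose label matches the majority label of their cluster. The sketch below assumes the feature vectors have already been extracted into an array. The helpers `kmeans` (a plain Lloyd's algorithm) and `cluster_purity` are hypothetical names written for this example, and purity is only one possible metric, not necessarily the one the project will use.

```python
from collections import Counter

import numpy as np


def kmeans(features, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns one cluster index per row of `features`."""
    features = np.asarray(features, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialize the centers from k distinct data points.
    centers = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest center (squared Euclidean distance).
        dists = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        # Move each center to the mean of its points; leave empty clusters in place.
        new_centers = np.array([
            features[assign == c].mean(axis=0) if np.any(assign == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign


def cluster_purity(assign, labels):
    """Fraction of points whose label is the majority label of their cluster."""
    majority_total = 0
    for c in set(assign):
        members = [lab for a, lab in zip(assign, labels) if a == c]
        majority_total += Counter(members).most_common(1)[0][1]
    return majority_total / len(labels)
```

With a heavily imbalanced dataset, purity can look high simply because a few classes dominate, so it would likely need to be read alongside per-class results.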
+
+# Acknowledgments
+
+ChatGPT was used to create visualizations, write some paragraphs which were then edited, and generate LaTeX equations.
\ No newline at end of file
diff --git a/assets/bibliography/2023-11-11-plankton-representation.bib b/assets/bibliography/2023-11-11-plankton-representation.bib
new file mode 100644
index 00000000..cdcbf5e6
--- /dev/null
+++ b/assets/bibliography/2023-11-11-plankton-representation.bib
@@ -0,0 +1,159 @@
+@article{falkowski_role_1994,
+  title = {The role of phytoplankton photosynthesis in global biogeochemical cycles},
+  volume = {39},
+  issn = {1573-5079},
+  url = {https://doi.org/10.1007/BF00014586},
+  doi = {10.1007/BF00014586},
+  abstract = {Phytoplankton biomass in the world's oceans amounts to only ∽1–2\% of the total global plant carbon, yet these organisms fix between 30 and 50 billion metric tons of carbon annually, which is about 40\% of the total. On geological time scales there is profound evidence of the importance of phytoplankton photosynthesis in biogeochemical cycles. It is generally assumed that present phytoplankton productivity is in a quasi steady-state (on the time scale of decades). However, in a global context, the stability of oceanic photosynthetic processes is dependent on the physical circulation of the upper ocean and is therefore strongly influenced by the atmosphere. The net flux of atmospheric radiation is critical to determining the depth of the upper mixed layer and the vertical fluxes of nutrients. These latter two parameters are keys to determining the intensity, and spatial and temporal distributions of phytoplankton blooms. Atmospheric radiation budgets are not in steady-state. Driven largely by anthropogenic activities in the 20th century, increased levels of IR- absorbing gases such as CO2, CH4 and CFC's and NOx will potentially increase atmospheric temperatures on a global scale. The atmospheric radiation budget can affect phytoplankton photosynthesis directly and indirectly.
Increased temperature differences between the continents and oceans have been implicated in higher wind stresses at the ocean margins. Increased wind speeds can lead to higher nutrient fluxes. Throughout most of the central oceans, nitrate concentrations are sub-micromolar and there is strong evidence that the quantum efficiency of Photosystem II is impaired by nutrient stress. Higher nutrient fluxes would lead to both an increase in phytoplankton biomass and higher biomass-specific rates of carbon fixation. However, in the center of the ocean gyres, increased radiative heating could reduce the vertical flux of nutrients to the euphotic zone, and hence lead to a reduction in phytoplankton carbon fixation. Increased desertification in terrestrial ecosystems can lead to increased aeolean loadings of essential micronutrients, such as iron. An increased flux of aeolean micronutrients could fertilize nutrient-replete areas of the open ocean with limiting trace elements, thereby stimulating photosynthetic rates. The factors which limit phytoplankton biomass and photosynthesis are discussed and examined with regard to potential changes in the Earth climate system which can lead the oceans away from steady-state. While it is difficult to confidently deduce changes in either phytoplankton biomass or photosynthetic rates on decadal time scales, time-series analysis of ocean transparency data suggest long-term trends have occurred in the North Pacific Ocean in the 20th century. 
However, calculations of net carbon uptake by the oceans resulting from phytoplankton photosynthesis suggest that without a supply of nutrients external to the ocean, carbon fixation in the open ocean is not presently a significant sink for excess atmospheric CO2.},
+  language = {en},
+  number = {3},
+  urldate = {2024-01-06},
+  journal = {Photosynthesis Research},
+  author = {Falkowski, Paul G.},
+  month = mar,
+  year = {1994},
+  keywords = {biogeochemical cycles, oceans, photoacclimation, Photosystem II nutrient limitation, phytoplankton quantum efficiency of photosynthesis},
+  pages = {235--258},
+  file = {Full Text PDF:/Users/birdy/Zotero/storage/6ECKTMBR/Falkowski - 1994 - The role of phytoplankton photosynthesis in global.pdf:application/pdf},
+}
+
+@article{treguer_influence_2018,
+  title = {Influence of diatom diversity on the ocean biological carbon pump},
+  volume = {11},
+  copyright = {2017 Springer Nature Limited},
+  issn = {1752-0908},
+  url = {https://www.nature.com/articles/s41561-017-0028-x},
+  doi = {10.1038/s41561-017-0028-x},
+  abstract = {Diatoms sustain the marine food web and contribute to the export of carbon from the surface ocean to depth. They account for about 40\% of marine primary productivity and particulate carbon exported to depth as part of the biological pump. Diatoms have long been known to be abundant in turbulent, nutrient-rich waters, but observations and simulations indicate that they are dominant also in meso- and submesoscale structures such as fronts and filaments, and in the deep chlorophyll maximum. Diatoms vary widely in size, morphology and elemental composition, all of which control the quality, quantity and sinking speed of biogenic matter to depth. In particular, their silica shells provide ballast to marine snow and faecal pellets, and can help transport carbon to both the mesopelagic layer and deep ocean.
Herein we show that the extent to which diatoms contribute to the export of carbon varies by diatom type, with carbon transfer modulated by the Si/C ratio of diatom cells, the thickness of the shells and their life strategies; for instance, the tendency to form aggregates or resting spores. Model simulations project a decline in the contribution of diatoms to primary production everywhere outside of the Southern Ocean. We argue that we need to understand changes in diatom diversity, life cycle and plankton interactions in a warmer and more acidic ocean in much more detail to fully assess any changes in their contribution to the biological pump.},
+  language = {en},
+  number = {1},
+  urldate = {2024-01-06},
+  journal = {Nature Geoscience},
+  author = {Tréguer, Paul and Bowler, Chris and Moriceau, Brivaela and Dutkiewicz, Stephanie and Gehlen, Marion and Aumont, Olivier and Bittner, Lucie and Dugdale, Richard and Finkel, Zoe and Iudicone, Daniele and Jahn, Oliver and Guidi, Lionel and Lasbleiz, Marine and Leblanc, Karine and Levy, Marina and Pondaven, Philippe},
+  month = jan,
+  year = {2018},
+  note = {Number: 1
+Publisher: Nature Publishing Group},
+  keywords = {Biogeochemistry, Carbon cycle, Marine biology, Ocean sciences},
+  pages = {27--37},
+  file = {Full Text PDF:/Users/birdy/Zotero/storage/VLLSAC7H/Tréguer et al.
- 2018 - Influence of diatom diversity on the ocean biologi.pdf:application/pdf},
+}
+
+@article{olson_submersible_2007,
+  title = {A submersible imaging-in-flow instrument to analyze nano-and microplankton: {Imaging} {FlowCytobot}},
+  volume = {5},
+  issn = {1541-5856},
+  shorttitle = {A submersible imaging-in-flow instrument to analyze nano-and microplankton},
+  url = {https://onlinelibrary.wiley.com/doi/abs/10.4319/lom.2007.5.195},
+  doi = {10.4319/lom.2007.5.195},
+  abstract = {A fundamental understanding of the interaction between physical and biological factors that regulate plankton species composition requires, first of all, detailed and sustained observations. Only now is it becoming possible to acquire these types of observations, as we develop and deploy instruments that can continuously monitor individual organisms in the ocean. Our research group can measure and count the smallest phytoplankton cells using a submersible flow cytometer (FlowCytobot), in which optical properties of individual suspended cells are recorded as they pass through a focused laser beam. However, FlowCytobot cannot efficiently sample or identify the much larger cells (10 to {\textgreater}100 µm) that often dominate the plankton in coastal waters. Because these larger cells often have recognizable morphologies, we have developed a second submersible flow cytometer, with imaging capability and increased water sampling rate (typically, 5 mL seawater analyzed every 20 min), to characterize these nano- and microplankton. Like the original, Imaging FlowCytobot can operate unattended for months at a time; it obtains power from and communicates with a shore laboratory, so we can monitor results and modify sampling procedures when needed. Imaging FlowCytobot was successfully tested for 2 months in Woods Hole Harbor and is presently deployed alongside FlowCytobot at the Martha's Vineyard Coastal Observatory.
These combined approaches will allow continuous long-term observations of plankton community structure over a wide range of cell sizes and types, and help to elucidate the processes and interactions that control the life cycles of individual species.},
+  language = {en},
+  number = {6},
+  urldate = {2022-10-20},
+  journal = {Limnology and Oceanography: Methods},
+  author = {Olson, Robert J. and Sosik, Heidi M.},
+  year = {2007},
+  note = {\_eprint: https://onlinelibrary.wiley.com/doi/pdf/10.4319/lom.2007.5.195},
+  pages = {195--203},
+  file = {Full Text:/Users/birdy/Zotero/storage/JP4F24FI/Olson and Sosik - 2007 - A submersible imaging-in-flow instrument to analyz.pdf:application/pdf;Snapshot:/Users/birdy/Zotero/storage/9E2FX4K7/lom.2007.5.html:text/html},
+}
+
+@article{sosik_automated_2007,
+  title = {Automated taxonomic classification of phytoplankton sampled with imaging-in-flow cytometry},
+  volume = {5},
+  issn = {1541-5856},
+  url = {https://onlinelibrary.wiley.com/doi/abs/10.4319/lom.2007.5.204},
+  doi = {10.4319/lom.2007.5.204},
+  abstract = {High-resolution photomicrographs of phytoplankton cells and chains can now be acquired with imaging-in-flow systems at rates that make manual identification impractical for many applications. To address the challenge for automated taxonomic identification of images generated by our custom-built submersible Imaging FlowCytobot, we developed an approach that relies on extraction of image features, which are then presented to a machine learning algorithm for classification. Our approach uses a combination of image feature types including size, shape, symmetry, and texture characteristics, plus orientation invariant moments, diffraction pattern sampling, and co-occurrence matrix statistics. Some of these features required preprocessing with image analysis techniques including edge detection after phase congruency calculations, morphological operations, boundary representation and simplification, and rotation.
For the machine learning strategy, we developed an approach that combines a feature selection algorithm and use of a support vector machine specified with a rigorous parameter selection and training approach. After training, a 22-category classifier provides 88\% overall accuracy for an independent test set, with individual category accuracies ranging from 68\% to 99\%. We demonstrate application of this classifier to a nearly uninterrupted 2-month time series of images acquired in Woods Hole Harbor, including use of statistical error correction to derive quantitative concentration estimates, which are shown to be unbiased with respect to manual estimates for random subsamples. Our approach, which provides taxonomically resolved estimates of phytoplankton abundance with fine temporal resolution (hours for many species), permits access to scales of variability from tidal to seasonal and longer.},
+  language = {en},
+  number = {6},
+  urldate = {2022-10-20},
+  journal = {Limnology and Oceanography: Methods},
+  author = {Sosik, Heidi M. and Olson, Robert J.},
+  year = {2007},
+  note = {\_eprint: https://onlinelibrary.wiley.com/doi/pdf/10.4319/lom.2007.5.204},
+  pages = {204--216},
+  file = {Full Text:/Users/birdy/Zotero/storage/6E2WBCLM/Sosik and Olson - 2007 - Automated taxonomic classification of phytoplankto.pdf:application/pdf;Snapshot:/Users/birdy/Zotero/storage/UIZ3PRI8/lom.2007.5.html:text/html},
+}
+
+@misc{bengio_representation_2014,
+  title = {Representation {Learning}: {A} {Review} and {New} {Perspectives}},
+  shorttitle = {Representation {Learning}},
+  url = {http://arxiv.org/abs/1206.5538},
+  abstract = {The success of machine learning algorithms generally depends on data representation, and we hypothesize that this is because different representations can entangle and hide more or less the different explanatory factors of variation behind the data.
Although specific domain knowledge can be used to help design representations, learning with generic priors can also be used, and the quest for AI is motivating the design of more powerful representation-learning algorithms implementing such priors. This paper reviews recent work in the area of unsupervised feature learning and deep learning, covering advances in probabilistic models, auto-encoders, manifold learning, and deep networks. This motivates longer-term unanswered questions about the appropriate objectives for learning good representations, for computing representations (i.e., inference), and the geometrical connections between representation learning, density estimation and manifold learning.},
+  urldate = {2024-01-06},
+  publisher = {arXiv},
+  author = {Bengio, Yoshua and Courville, Aaron and Vincent, Pascal},
+  month = apr,
+  year = {2014},
+  note = {arXiv:1206.5538 [cs]},
+  keywords = {Computer Science - Machine Learning},
+  file = {arXiv.org Snapshot:/Users/birdy/Zotero/storage/EYA2B52H/1206.html:text/html;Full Text PDF:/Users/birdy/Zotero/storage/2CFSH6MT/Bengio et al. - 2014 - Representation Learning A Review and New Perspect.pdf:application/pdf},
+}
+
+@misc{white_gradients2-mgl1704-ifcb-abundance_2020-04-01_v10_2020,
+  title = {Gradients2-{MGL1704}-{IFCB}-{Abundance}\_2020-04-01\_v1.0},
+  url = {https://zenodo.org/records/4267140},
+  doi = {10.5281/zenodo.4267140},
+  abstract = {Cruise: Gradients 2, MGL1704 Project Name: Simons Foundation, Gradient NPSG Dataset Description: The Imaging FlowCytoBot (IFCB) is an in situ automated imaging flow cytometer that generates images of particles suspended in seawater, in this case from the underway uncontaminated seawater system aboard the R/V Langseth (intake 5m).
The IFCB uses a recycled sheath fluid (0.2 µm filtered seawater) to align and drive particles individually towards a light source (red laser, 4.5 mW) in order to detect and identify single or colonial cells using a combination of optical properties (red fluorescence and light scattering intensities) and high resolution images (3.2 pixels per micron) by a mounted camera. Both optical properties are used to trigger targeted image acquisition of suspended particles in the size range {\textless}4 to 100 μm. The instrument continuously samples (few seconds) from  {\textasciitilde}5 ml aliquots from the intake, and processes all particles contained in that volume for the next 20 mins.  Images corresponding to the "Sample" variable are available on the cruise's dashboard (http://ifcb-data.soest.hawaii.edu/IFCB\_NPTZ).  This dataset is for the abundance of imaged cells by genus.  For each sample, the total number of cells classified to the genus-level by a random forest algorithm (Sosik and Olson, 2007 doi:10.4319/lom.2007.5.204) is counted and divided by the corresponding volume analyzed ({\textasciitilde}5 mL). Note that we used all the images collected during Gradients 2.0 to train the random forest algorithm and that classification is therefore highly accurate for this dataset. 
Using 7 µm calibration beads, we estimated that the error of cell concentration due to cell detection during sample acquisition averages 11 ± 10 \%, independently of concentrations in the range 1-10000 cell/mL.},
+  urldate = {2024-01-06},
+  publisher = {Zenodo},
+  author = {White, Angelicque},
+  month = apr,
+  year = {2020},
+  file = {Zenodo Snapshot:/Users/birdy/Zotero/storage/456EVCSQ/4267140.html:text/html},
+}
+
+@article{he_deep_2016,
+  title = {Deep {Residual} {Learning} for {Image} {Recognition}},
+  url = {https://openaccess.thecvf.com/content_cvpr_2016/html/He_Deep_Residual_Learning_CVPR_2016_paper.html},
+  urldate = {2024-01-06},
+  author = {He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian},
+  year = {2016},
+  pages = {770--778},
+  file = {Full Text PDF:/Users/birdy/Zotero/storage/RYXX2LD8/He et al. - 2016 - Deep Residual Learning for Image Recognition.pdf:application/pdf},
+}
+
+@article{chen_simple_2020,
+  title = {A {Simple} {Framework} for {Contrastive} {Learning} of {Visual} {Representations}},
+  url = {http://arxiv.org/abs/2002.05709},
+  abstract = {This paper presents SimCLR: a simple framework for contrastive learning of visual representations. We simplify recently proposed contrastive self-supervised learning algorithms without requiring specialized architectures or a memory bank. In order to understand what enables the contrastive prediction tasks to learn useful representations, we systematically study the major components of our framework. We show that (1) composition of data augmentations plays a critical role in defining effective predictive tasks, (2) introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations, and (3) contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning.
By combining these findings, we are able to considerably outperform previous methods for self-supervised and semi-supervised learning on ImageNet. A linear classifier trained on self-supervised representations learned by SimCLR achieves 76.5\% top-1 accuracy, which is a 7\% relative improvement over previous state-of-the-art, matching the performance of a supervised ResNet-50. When fine-tuned on only 1\% of the labels, we achieve 85.8\% top-5 accuracy, outperforming AlexNet with 100X fewer labels.},
+  urldate = {2024-01-04},
+  publisher = {arXiv},
+  author = {Chen, Ting and Kornblith, Simon and Norouzi, Mohammad and Hinton, Geoffrey},
+  month = jun,
+  year = {2020},
+  note = {arXiv:2002.05709 [cs, stat]},
+  keywords = {Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Statistics - Machine Learning},
+  file = {arXiv.org Snapshot:/Users/birdy/Zotero/storage/QSFZN5RI/2002.html:text/html;Chen et al. - 2020 - A Simple Framework for Contrastive Learning of Vis.pdf:/Users/birdy/Zotero/storage/UA8HJXR5/Chen et al. - 2020 - A Simple Framework for Contrastive Learning of Vis.pdf:application/pdf},
+}
+
+@article{vanderMaaten2008tsne,
+  title={Visualizing Data using t-SNE},
+  author={van der Maaten, Laurens and Hinton, Geoffrey},
+  journal={Journal of Machine Learning Research},
+  volume={9},
+  pages={2579--2605},
+  year={2008}
+}
+
+
+@article{kanungo_efficient_2002,
+  title = {An efficient k-means clustering algorithm: analysis and implementation},
+  volume = {24},
+  issn = {1939-3539},
+  shorttitle = {An efficient k-means clustering algorithm},
+  url = {https://ieeexplore.ieee.org/document/1017616},
+  doi = {10.1109/TPAMI.2002.1017616},
+  abstract = {In k-means clustering, we are given a set of n data points in d-dimensional space R/sup d/ and an integer k and the problem is to determine a set of k points in Rd, called centers, so as to minimize the mean squared distance from each data point to its nearest center.
A popular heuristic for k-means clustering is Lloyd's (1982) algorithm. We present a simple and efficient implementation of Lloyd's k-means clustering algorithm, which we call the filtering algorithm. This algorithm is easy to implement, requiring a kd-tree as the only major data structure. We establish the practical efficiency of the filtering algorithm in two ways. First, we present a data-sensitive analysis of the algorithm's running time, which shows that the algorithm runs faster as the separation between clusters increases. Second, we present a number of empirical studies both on synthetically generated data and on real data sets from applications in color quantization, data compression, and image segmentation.},
+  number = {7},
+  urldate = {2024-01-06},
+  journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
+  author = {Kanungo, T. and Mount, D.M. and Netanyahu, N.S. and Piatko, C.D. and Silverman, R. and Wu, A.Y.},
+  month = jul,
+  year = {2002},
+  note = {Conference Name: IEEE Transactions on Pattern Analysis and Machine Intelligence},
+  pages = {881--892},
+  file = {IEEE Xplore Abstract Record:/Users/birdy/Zotero/storage/RC5LGQ6V/1017616.html:text/html;IEEE Xplore Full Text PDF:/Users/birdy/Zotero/storage/NUTPDLG4/Kanungo et al.
- 2002 - An efficient k-means clustering algorithm analysi.pdf:application/pdf},
+}
+
diff --git a/assets/img/2023-11-11-plankton-representation/fig-IFCB-examples.png b/assets/img/2023-11-11-plankton-representation/fig-IFCB-examples.png
new file mode 100644
index 00000000..e76daf7b
Binary files /dev/null and b/assets/img/2023-11-11-plankton-representation/fig-IFCB-examples.png differ
diff --git a/assets/img/2023-11-11-plankton-representation/fig-dataset-class-distribution.png b/assets/img/2023-11-11-plankton-representation/fig-dataset-class-distribution.png
new file mode 100644
index 00000000..887f9476
Binary files /dev/null and b/assets/img/2023-11-11-plankton-representation/fig-dataset-class-distribution.png differ
diff --git a/assets/img/2023-11-11-plankton-representation/fig-simCLR-diagram.png b/assets/img/2023-11-11-plankton-representation/fig-simCLR-diagram.png
new file mode 100644
index 00000000..00cc5ce7
Binary files /dev/null and b/assets/img/2023-11-11-plankton-representation/fig-simCLR-diagram.png differ