
Conversation

@carsen-stringer (Collaborator)

we did not add anything to the glossary, happy to help with this once all the chapters are in

@carsen-stringer (Collaborator, Author)

fyi @ScientistRachel added the outline of chapter 5 on this branch (I did not edit Chapter 5)

@opp1231 (Member), Jan 9, 2026

We really like all the GIFs! However, for this and all GIFs, we need to figure out a technical issue where Quarto does not render them correctly on the website. For now, would it be possible to replace these with still images until we have a chance to fix that issue?

Member

See previous comment on GIFs.

Member

See previous comment on GIFs.

In computer vision, many neural network architectures have been designed for different visual tasks. These architectures take as input an image, which often has multiple channels - in the case of natural images there are three input channels for red/green/blue. The network processes the image and outputs various quantities depending on the task. For example, in object recognition, the network outputs a single probability vector indicating the likelihood that each possible object (cat, dog, etc.) is in the image.

Starting prompt for this chapter: Chapter 4 introduces architectures and loss models, defining them and providing examples through two practical case studies: image restoration and segmentation. Although this chapter will include code snippets/exercises, the presentation of essential concepts should communicate the philosophy behind the choice of a model for non-programmers.
![AlexNet architecture, adapted from [@krizhevsky2012imagenet]. This early neural network first demonstrated the capabilities of deep learning when trained on large datasets.](4-architectures/alexnet.PNG){#fig-alexnet width=90%}
Member

Re: Figures 1 and 2. For those people who have never seen any architecture diagram before, they may not grasp the meaning of these figures. Adding a sentence or two of context for those individuals would be helpful. Alternatively, would you prefer to move these figures deeper into the chapter where the concepts are explained more in-depth?

### Linear layers (or fully-connected layers, or dense layers)

Early research focused on the computational properties of perceptrons, defined as linear weighted sums of inputs followed by a nonlinear activation function [@rumelhart1986learning]. A neural network layer consists of a collection of such perceptrons, each with its own input weights. A multilayer perceptron (MLP) is the simplest example of a deep neural network -- which we will discuss later -- and consists of a sequence of such layers applied in series, each on the output of the previous one. The linear layer performs a matrix multiplication of a “weights” matrix ($W$) with the input $\vec{x}$, followed by the addition of a vector of “bias” terms ($\vec{b}$). The size of the weights matrix is the number of outputs by the number of inputs, and the length of the bias vector is the number of outputs.
Member

Please break this section down into multiple steps. In doing so, please define fully-connected and dense.

$$ \vec{y} = W\vec{x} + \vec{b} = \left[\sum_{j=1}^{n_\text{in}} W_{ij} x_{j} + b_{i}\right]_{i=1}^{n_\text{out}}$$
where $n_\text{in}$ is the number of inputs and $n_\text{out}$ is the number of outputs.
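As a rough sketch of these shapes in code (assuming PyTorch; the layer sizes are illustrative, not taken from the chapter):

```python
import torch
import torch.nn as nn

# a linear (fully-connected / dense) layer with 5 inputs and 3 outputs
layer = nn.Linear(in_features=5, out_features=3)

print(layer.weight.shape)  # torch.Size([3, 5]): number of outputs by number of inputs
print(layer.bias.shape)    # torch.Size([3]): one bias term per output

x = torch.randn(5)  # an input vector with 5 features
y = layer(x)        # computes W @ x + b
print(y.shape)      # torch.Size([3])
```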

The activation function ($f$) is the nonlinearity applied to each output of the linear layer, such as a ReLU nonlinearity which sets the minimum value of the output to zero. The nonlinearity allows the network to compute more complicated functions of the input than would be possible with a simple linear model [@cybenko1989approximation].
Member

Suggested change
The activation function ($f$) is the nonlinearity applied to each output of the linear layer, such as a ReLU nonlinearity which sets the minimum value of the output to zero. The nonlinearity allows the network to compute more complicated functions of the input than would be possible with a simple linear model [@cybenko1989approximation].
The activation function ($f$) is the nonlinearity applied to each output of the linear layer, such as a ReLU nonlinearity which sets the minimum value of the output to zero. In this case, the input--real microscopy data--is complex. The nonlinearity allows the network to compute more complicated functions of the input than would be possible with a simple linear model [@cybenko1989approximation].
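As a quick illustration of the ReLU described here (a minimal sketch; the numbers are made up):

```python
import torch

x = torch.tensor([-2.0, -0.5, 0.0, 1.5, 3.0])
print(torch.relu(x))  # tensor([0., 0., 0., 1.5, 3.]): negative values are clipped to zero
```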

### Convolutional layers

Instead of linear layers, convolutional layers are often used in vision tasks as a parameter-efficient alternative [@lecun1995convolutional]. A convolutional layer slides a small two-dimensional filter across the input image, computing the dot product between the filter and the input at each image position. It also includes the addition of a vector of bias terms. For 2D image processing, we use two-dimensional convolutions, but one- or three-dimensional convolutions may be used for 1D or 3D data, respectively. The output of the 2D convolution operation at position $(x,y)$ can be written as follows for an example grayscale image $I$ with a single input channel, where the filter $W$ is of size $(2k+1, 2k+1)$ and the bias is $b$:
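One standard way to write this operation (as cross-correlation, the convention used in deep learning frameworks), with $C(x,y)$ denoting the output at position $(x,y)$, is:

$$ C(x, y) = b + \sum_{i=-k}^{k} \sum_{j=-k}^{k} W_{i,j}\, I(x+i,\, y+j) $$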
@opp1231 (Member), Jan 9, 2026

Please break down the math here into multiple steps.


![Toy illustration of convolutional operation from this [article](https://towardsdatascience.com/light-on-math-machine-learning-intuitive-guide-to-convolution-neural-networks-e3f054dd5daa). In practice, convolutional kernels are more complicated than simple template detectors.](4-architectures/conv_happy.gif){#fig-convill width=70%}

This is an example with a single input and a single output channel, but in general a 2D convolutional layer has multiple input and output channels. Each output **channel** is the result of a 2D convolutional kernel applied to the input. In the gif below, the input is in blue, the kernel is in gray, and the output is in green. The number of units in the output channel depends on a *stride* parameter. In the gif below, the stride is 1 because the input image is sampled at each position. A stride of 2 would mean skipping over every other input position both vertically and horizontally. In most applications, especially with small kernel sizes, a stride of 1 is used for convolution.
Member

Suggested change
This is an example with a single input and a single output channel, but in general a 2D convolutional layer has multiple input and output channels. Each output **channel** is the result of a 2D convolutional kernel applied to the input. In the gif below, the input is in blue, the kernel is in gray, and the output is in green. The number of units in the output channel depends on a *stride* parameter. In the gif below, the stride is 1 because the input image is sampled at each position. A stride of 2 would mean skipping over every other input position both vertically and horizontally. In most applications, especially with small kernel sizes, a stride of 1 is used for convolution.
This is an example with a single input and a single output channel, but in general a 2D convolutional layer has multiple input and output channels. Each output **channel** is the result of a 2D convolutional kernel applied to the input. In the animation below, the input is in blue, the kernel is in gray, and the output is in green. The number of pixels in the output channel depends on a *stride* parameter. In the animation below, the stride is 1 because the input image is sampled at each position. A stride of 2 would mean skipping over every other input position both vertically and horizontally. In most applications, especially with small kernel sizes, a stride of 1 is used for convolution.
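A short sketch of these ideas in PyTorch (the channel counts and image size are arbitrary examples):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 64, 64)  # a batch of 1 image with 3 input channels, 64 x 64 pixels

# 3 input channels -> 6 output channels, 3 x 3 kernel, stride 1, padding 1 (= K//2)
conv_stride1 = nn.Conv2d(3, 6, kernel_size=3, stride=1, padding=1)
print(conv_stride1(x).shape)  # torch.Size([1, 6, 64, 64]): same height/width as the input

# stride 2 skips every other position, halving the output height and width
conv_stride2 = nn.Conv2d(3, 6, kernel_size=3, stride=2, padding=1)
print(conv_stride2(x).shape)  # torch.Size([1, 6, 32, 32])
```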

![Convolutional operation with padding from this [github](https://github.com/vdumoulin/conv_arithmetic). This illustrates the creation of a single channel output from a single channel input. In practice, both inputs and outputs have multiple channels and all combinations of input and output need to be calculated and summed accordingly.](4-architectures/same_padding_no_strides.gif){#fig-convpad width=40%}

::: {.callout-note}
If the kernel size *K* is odd and you set the `padding=K//2` (floor[K/2]) and `stride=1` as shown in @fig-convpad, you get a **channel** of units that is the same size as the input.
:::
Member

Suggested change
If the kernel size *K* is odd and you set the `padding=K//2` (floor[K/2]) and `stride=1` as shown in @fig-convpad, you get a **channel** of units that is the same size as the input.
If the kernel size *K* is odd and you set the `padding=K//2` (floor[K/2]) and `stride=1` as shown in @fig-convpad, you get a **channel** of pixels that is the same size as the input.

Member

Neither padding nor stride have been introduced up until this point.


A convolutional layer operates under two main assumptions: 1) the computation only requires local features that are within the spatial extent of the filter operation; and 2) it is not necessary to perform different computations at different positions in the image, and thus the same filter operation can be convolutionally applied across all positions in the image. When these assumptions are acceptable, a convolutional layer can reduce the number of parameters substantially compared to linear layers.
Member

Suggested change
A convolutional layer operates under two main assumptions: 1) the computation only requires local features that are within the spatial extent of the filter operation; and 2) it is not necessary to perform different computations at different positions in the image, and thus the same filter operation can be convolutionally applied across all positions in the image. When these assumptions are acceptable, a convolutional layer can reduce the number of parameters substantially compared to linear layers.
A convolutional layer operates under two main assumptions: 1) the computation only requires local features that are within the spatial extent of the filter operation; and 2) it is not necessary to perform different computations at different positions in the image, and thus the same filter operation can be convolutionally applied across all positions in the image. In other words, for a microscopy image, the biology of interest needs to fit within your filter size and the same processing steps must apply to the whole field of view. When these assumptions are acceptable, a convolutional layer can reduce the number of parameters substantially compared to linear layers.



Taking our example from above, let’s estimate the number of parameters with filters/kernels of size 3 by 9 by 9 pixels, where 3 is the number of input channels and 9 is the size in pixel space, the “kernel size”. The numbers of input and output images are called “channels”, similar to the red/green/blue channels for RGB images. If we define the layer to have 6 output channels, this requires 6 of these kernels, giving 1458 parameters in the kernels plus 6 bias terms, for 1464 parameters in total. As you can see, the number of parameters is now independent of the size of the input in pixels, and a dramatic reduction from the nearly 1 billion parameters for the dense linear layer example above.
Member

The explanation of "channels" would be especially helpful if it were included earlier in the text.


Member

Please break this into intermediate steps.
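A quick way to verify the parameter count above (a PyTorch sketch mirroring the 3-input-channel, 6-output-channel, 9 x 9 kernel example):

```python
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=6, kernel_size=9)
n_kernel = conv.weight.numel()  # 6 * 3 * 9 * 9 = 1458 kernel parameters
n_bias = conv.bias.numel()      # 6 bias terms, one per output channel
print(n_kernel, n_bias, n_kernel + n_bias)  # 1458 6 1464
```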


The activations of a single convolutional layer, as described, have a receptive field size equivalent to its kernel size: each activation only receives information from pixels within the kernel size. However, objects in images are often larger than the kernel size. To increase the amount of spatial information used for computation, pooling layers are introduced between convolutional layers. A pooling layer consists of a sliding-window operation, just like the convolutional layer, but in this case a maximum or average operation is computed within the window, independently for each input channel. To reduce the size of the image, this pooling operation is applied with a “stride”: to downsample the image by a factor of two, we may use a pooling size of 2 with a stride also set to 2, like in the example below.
Member

We know this isn't strictly a mathematical discussion, but breaking this down into smaller pieces would help explain concepts such as: receptive field size, pooling layers, sliding windows, etc..



![Illustration of max-pooling with a kernel size of 2 and stride of 2, from [here](https://github.com/dvgodoy/PyTorchStepByStep).](4-architectures/pooling.png){#fig-pool width=90%}
Member

Suggested change
![Illustration of max-pooling with a kernel size of 2 and stride of 2, from [here](https://github.com/dvgodoy/PyTorchStepByStep).](4-architectures/pooling.png){#fig-pool width=90%}
![Illustration of max-pooling with a kernel size of 2 and stride of 2 on an image with one channel, one z-slice and 6 pixels in x and y. The output only has 3 pixels in x and y due to the pooling step. Example from [here](https://github.com/dvgodoy/PyTorchStepByStep).](4-architectures/pooling.png){#fig-pool width=90%}
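A minimal sketch of max-pooling with kernel size 2 and stride 2, matching the figure (input values are arbitrary):

```python
import torch
import torch.nn as nn

x = torch.arange(36, dtype=torch.float32).reshape(1, 1, 6, 6)  # 1 channel, 6 x 6 pixels
pool = nn.MaxPool2d(kernel_size=2, stride=2)
y = pool(x)
print(y.shape)   # torch.Size([1, 1, 3, 3]): downsampled by a factor of 2
print(y[0, 0])   # the maximum value within each non-overlapping 2 x 2 window
```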


If we apply several convolutional and pooling layers, we end up with an output which is smaller in resolution than the input. We can see example activations across convolutional, pooling and linear layers in a network trained to classify MNIST [@lecun1998mnist] digits:
Member

Suggested change
If we apply several convolutional and pooling layers, we end up with an output which is smaller in resolution than the input. We can see example activations across convolutional, pooling and linear layers in a network trained to classify MNIST [@lecun1998mnist] digits:
If we apply several convolutional and pooling layers, we end up with an output which is smaller in resolution than the input. We can see example activations across convolutional, pooling and linear layers in a network trained to classify MNIST [@lecun1998mnist] digits (a dataset of handwritten numbers often used to test ML ideas):

![Activations across convolutional and pooling layers from this [demo](https://adamharley.com/nn_vis/cnn/2d.html). Lighter blue indicates a higher activation.](4-architectures/mnist.PNG){#fig-mnist}
Member

Suggested change
![Activations across convolutional and pooling layers from this [demo](https://adamharley.com/nn_vis/cnn/2d.html). Lighter blue indicates a higher activation.](4-architectures/mnist.PNG){#fig-mnist}
![Each layer of a small, illustrative network where the output channels of each layer are shown as images. Lighter blue indicates higher activation at each layer. The final output layer represents the probability of the input image corresponding to a value between 0 and 9. In this example, the network correctly identifies the input as the number 3. To see this in action, check out the interactive demonstration [here](https://adamharley.com/nn_vis/cnn/2d.html).](4-architectures/mnist.PNG){#fig-mnist}

### U-nets

U-nets were introduced by Ronneberger, Fischer, and Brox [-@ronneberger2015u] (@fig-unet). They share some similarities with feature pyramid networks [@lin2017feature], but u-nets are more frequently used in biological applications so we will focus on them. U-nets have an encoder-decoder structure, like an autoencoder [@hinton2006reducing], with the encoder consisting of convolutional layers and pooling layers (downsampling), and the decoder consisting of convolutional layers and upsampling or strided conv-transpose layers. End-to-end, u-nets typically produce an output at the same spatial resolution as the input.
Member

Please break this down into intermediate explanations.

The downsampling results in a loss of fine spatial information. To recover this information, the output of the convolutional layers in the encoder is concatenated with the activations from the decoder at each spatial scale using skip connections (“copy” operation in @fig-unet). This preserves the higher resolution details, which is important for precise segmentations and pixel-wise predictions.

Conventional u-nets [@ronneberger2015u] have two convolutional layers per block and a small kernel size of 3 in each layer. The downsampling after each block is often set to a factor of 2. Because the kernel size is small, the only way to have large receptive field sizes is through several downsampling blocks. If we want the network to learn complicated tasks across many diverse images, then we need it to have a large “capacity”. This can be achieved by adding more weights to the network, for example by increasing the number of channels in the convolutional layers and/or by adding more convolutional layers in each block and/or by increasing the number of downsampling and upsampling stages [@stringer2021].
Member

Replace "block" with "spatial scale" or otherwise relate to this concept.

### Vision transformers

Vision transformers are modern architectures that are replacing convolutional networks in many applications. They are not as parameter-efficient as convolutional neural networks - for example the Cellpose segmentation u-net has 6.6 million parameters while ViT-H (“vision-transformer-huge”) has 632 million parameters [@dosovitskiy2020image]. They are still much more efficient than dense linear layers due to special architecture choices (see below), and they introduce a new type of operation called “self-attention”. Transformers avoid overfitting this large set of parameters through training on very large amounts of data. Even though they have many more parameters than standard convolutional networks, they are not too much slower because most of the operations within the transformer are matrix multiplications which are fast on newer GPUs with tensor cores (e.g. from [nvidia](https://www.nvidia.com/en-us/data-center/tensor-cores/)). With more parameters, they have a larger capacity than standard convolutional neural networks to learn from large training datasets.
Member

This is potentially silly, but it is possible the Janelia lawyers will not appreciate linking to a specific product. Is it possible to refer to "tensor cores" without linking to NVIDIA directly?


The vision transformer divides the input image into patches, e.g. 16 by 16 pixels each. In the first layer of the transformer the patches are transformed into the embedding space, using a linear operation that is often implemented using strided convolutions. This embedding space is generally several times larger than the number of pixels in the patch; for example the embedding space in ViT-H is 1280. These patch embeddings are input to the transformer encoder, which consists of many blocks ($L$). Each transformer block has a self-attention block and an MLP block. In the self-attention block, the attention matrix is computed as pairwise interactions between all patches, enabling sharing of information across the entire image.
Member

Please either break this down into intermediate explanatory steps (especially self-attention), or consider removing it.
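A rough sketch of the patch-embedding and self-attention steps using standard PyTorch modules (the patch size, embedding width, and head count below are placeholders, not the ViT-H values):

```python
import torch
import torch.nn as nn

img = torch.randn(1, 3, 224, 224)            # one RGB image

# patch embedding: a strided convolution turns each 16 x 16 patch into a vector
patch_embed = nn.Conv2d(3, 256, kernel_size=16, stride=16)
patches = patch_embed(img)                   # (1, 256, 14, 14)
tokens = patches.flatten(2).transpose(1, 2)  # (1, 196, 256): one token per patch

# self-attention: every patch attends to every other patch
attn = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
out, weights = attn(tokens, tokens, tokens)
print(out.shape)      # torch.Size([1, 196, 256])
print(weights.shape)  # torch.Size([1, 196, 196]): pairwise interactions between patches
```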

![Activations across convolutional and pooling layers from this [demo](https://adamharley.com/nn_vis/cnn/2d.html). Lighter blue indicates a higher activation.](4-architectures/mnist.PNG){#fig-mnist}
Member

This is a very cool demonstration!


## Loss functions

In a standard image classification network, the output is a vector with length equal to the number of classes in the task. Each of the entries in this vector represents the predicted probability of the class, and the predicted label for the image is chosen as the index of the vector with the largest entry. For each training image we have a ground-truth label for the class. How closely the network output matches the label is measured by the loss, which is defined as a function of the vector output of the network and the ground-truth label. A lower loss means we matched the ground-truth data better. The gradients of the network are computed automatically via back-propagation, and an optimizer is specified to modify the parameters in order to minimize the loss (described more below).
Member

This is great to include!


In segmentation and biological classification tasks, as mentioned before, the output is often the same size as the input in pixels and the loss is computed per-pixel. There will thus be multiple outputs of this size, each one corresponding to a class, like the entries in the vector for overall image classification. In the original u-net paper, the loss was defined using two classes, "not cell" and "cell" [@ronneberger2015u].

![Binary cross-entropy loss, computed from the output of a u-net trained to classify cell/not cell (predicted cell probability).](4-architectures/bceloss.PNG){#fig-bce width=70%}
Member

A walk-through of this figure in the text would add a lot of context for it.
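As a concrete toy example of the per-pixel loss described here, assuming the network outputs a cell-probability map and the ground truth is a binary cell/not-cell mask (all values below are made up):

```python
import torch
import torch.nn as nn

# predicted cell probabilities and ground-truth labels for a 4 x 4 patch of pixels
pred = torch.rand(1, 1, 4, 4)                    # values between 0 and 1
target = (torch.rand(1, 1, 4, 4) > 0.5).float()  # 1 = cell, 0 = not cell

loss = nn.BCELoss()(pred, target)  # binary cross-entropy averaged over pixels
print(loss.item())
```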

To create segmentations for each cell, a threshold is defined on the cell probability and any pixels above the threshold that are connected to each other are formed into objects. This threshold is defined using a validation set - images that are not used for training or testing - to help ensure the threshold generalizes to the held-out test images. The predicted segmentations with this loss function often contain several merges, because cells can often touch each other and the connected components of the image will combine the touching cells into single components.
Member

These concepts are first explained in Section 4.3. Perhaps consider moving that explanation up or cross-referencing that section here.


Now that we have defined a loss function, we want to minimize the loss $\ell$ by updating the weights in the network. For a problem like segmentation, we will need images with ground-truth segmentation labels - this labeling can be done using tools like [Ilastik](https://www.ilastik.org/), [ImageJ](https://imagej.net/software/imagej/), [Paintera](https://github.com/saalfeldlab/paintera), or [Napari](https://napari.org/stable/). Once ground-truth labeling is done on some images, training can be attempted. We will want to use most of the ground-truth labeled images for training, making up a training set, and leave a small subset (like 10%) for testing the performance of the network. For some algorithms, we may also need a validation set for setting post-processing parameters like the cell probability threshold, in which case we can reserve 10-15% of the training set images for validation.

Minimizing the loss is done via gradient descent, in which the weights are updated by the gradient of the loss with respect to each parameter, scaled by the learning rate $\alpha$:
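In standard notation (and consistent with the $\vec{v}$ used in the momentum update below), this update can be written as:

$$ \vec{v}_t = \alpha\, \frac{d L(\vec{w}_t)}{d \vec{w}_t}, \qquad \vec{w}_{t+1} = \vec{w}_t - \vec{v}_t $$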
Member

Please break this down into intermediate steps.


Moving the parameters in the negative direction of the gradient reduces the loss for the given images or data points over which the loss is computed. We could compute the loss and gradients over all images in the training set, but this would take too long, so in practice the loss is computed in batches of a few to a few hundred images - the number of images in a batch is called the *batch size*. The optimization algorithm for updating the weights in batches is called stochastic gradient descent (SGD). This is often faster than full-dataset gradient descent because it updates the parameters many times on a single pass through the training set (called an “epoch”). Also, the stochasticity induced by the random sampling step in SGD effectively adds some noise in the search for a good minimum of the loss function, which may be useful for avoiding local minima.
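A bare-bones sketch of this loop in PyTorch (the model, data, and batch size are stand-ins):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.Conv2d(8, 1, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # lr is the learning rate alpha
loss_fn = nn.BCEWithLogitsLoss()

# stand-in training set: 64 single-channel images with binary pixel labels
images = torch.randn(64, 1, 32, 32)
labels = (torch.rand(64, 1, 32, 32) > 0.5).float()

batch_size = 8
for epoch in range(3):                  # one epoch = one pass through the training set
    perm = torch.randperm(len(images))  # random sampling gives the "stochastic" in SGD
    for i in range(0, len(images), batch_size):
        idx = perm[i:i + batch_size]
        optimizer.zero_grad()
        loss = loss_fn(model(images[idx]), labels[idx])
        loss.backward()                 # gradients via back-propagation
        optimizer.step()                # one weight update per batch
```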

It can also be beneficial to include momentum, with some value $\beta$ between zero and one, which pushes weight updates along the same direction they have been updating in the past. The updated version of $\vec{v}$ in this case is
Member

Suggested change
It can also be beneficial to include momentum, with some value $\beta$ between zero and one, which pushes weight updates along the same direction they have been updating in the past. The updated version of $\vec{v}$ in this case is
It can also be beneficial to include momentum, with some value $\beta$ between zero and one, which pushes weight updates along the same direction they have been updating in the past, to avoid spurious changes due to noise. The updated version of $\vec{v}$ in this case is

Member

Or some other worded motivation for momentum

It can also be beneficial to include momentum, with some value $\beta$ between zero and one, which pushes weight updates along the same direction they have been updating in the past. The updated version of $\vec{v}$ in this case is
$$ \vec{v}_t = \beta \vec{v}_{t-1} + \alpha\, \frac{d L(\vec{w}_t)}{d \vec{w}_t} $$

Different weights in the network may have differently scaled gradients, and thus a single learning rate may not work well. The Adam optimizer uses a moving average of both the first and second moment of the gradient for rescaling the weight updates while including a momentum term [@kingma2014adam]. This optimizer works better than standard SGD in many cases, and requires less fine-tuning to find good hyperparameter values. In addition to using an optimizer like Adam, it may be helpful to use a learning rate schedule which reduces the learning rate towards the end of training to enable smaller steps for fine-tuning the final weights [@loshchilov2016sgdr]. Sometimes a validation set is used to re-instantiate the best weights, as evaluated on the validation set, before a decrease in the learning rate [@prechelt1998automatic].
Member

Please expand the separate steps here.
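A sketch of how these pieces are typically combined in PyTorch (the model, data, and hyperparameter values are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.Conv2d(8, 1, 1))
loss_fn = nn.BCEWithLogitsLoss()

# Adam rescales each parameter's update using moving averages of its gradient and squared gradient
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# a cosine schedule lowers the learning rate towards the end of training
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=20)

images = torch.randn(32, 1, 32, 32)
labels = (torch.rand(32, 1, 32, 32) > 0.5).float()

for epoch in range(20):
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    optimizer.step()
    scheduler.step()
    print(epoch, scheduler.get_last_lr())  # the learning rate decreases over epochs
```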



During fitting it is important to monitor the training loss and the validation loss. With an appropriate learning rate that is not too large, the training loss should always decrease. The loss on held-out validation data should also ideally decrease over training. If not, then the network is overfitting to the training set: the weights are becoming specifically tuned for the training set examples and no longer generalize to held-out data.
Member

This is great to mention.


![Example training loss and validation loss across epochs.](4-architectures/trainloss.png){#fig-trainloss width=40%}

To avoid overfitting, regularization is often used. Most commonly in computer vision problems, weight decay is used for regularization, which is closely related to L2 regularization. This operation reduces the weights by a small fraction $\lambda$ at each optimization step:
Member

We don't think the intended audience is likely familiar with regularization at all. We think this is a good change to generalize this explanation.
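In symbols, this shrinkage is commonly written as $\vec{w} \leftarrow (1 - \lambda)\,\vec{w}$, applied alongside the gradient step. In PyTorch it is exposed through the `weight_decay` argument of the optimizers, for example:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
# weight_decay sets the strength of the shrinkage (in AdamW it is applied directly to
# the weights, scaled by the learning rate, rather than added to the gradient)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
```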

* Tutorials from [pytorch](https://docs.pytorch.org/tutorials/)
* CNN Explainer [demo](https://poloclub.github.io/cnn-explainer/) with activations across layers, and interactive visualizations for padding and stride [@wang2020cnn]
* Illustration of how momentum works from [Gabriel Goh](https://distill.pub/2017/momentum/)

Member

A concluding section or paragraph would be helpful.

@opp1231 (Member) left a comment

Hi Carsen!

As we discussed previously, we have added notes for places where intermediate steps or further explanations would be helpful. Please take a look, and of course, feel free to reach out to us if anything is unclear.

One general comment is we would like to ensure that we have permission to re-use the images that are borrowed throughout. If you could confirm the access, that would be great.

Additionally, this is a list of terms that we will add to the glossary based on your chapter. You are welcome to provide a definition, or we will pull the definition from your chapter:

  • Neural Network
  • Architecture
  • Natural Image
  • Probability Vector
  • Skip Connection
  • Down sampling
  • Up sampling
  • Perceptron
  • Activation Function
  • Linear Layer
  • Bias
  • ReLU
  • Non-linearity
  • Convolutional filter/kernel
  • Receptive Field
  • Pooling Layer
  • Sliding Window
  • Stride
  • Padding
  • Encoder-Decoder Structure
  • Autoencoder
  • Self-attention
  • Patches
  • Embedding Space
  • Auxiliary Variables
  • Back propagation
  • Optimizer
  • Stardist
  • CellPose
  • Gradient Descent
  • Regularization

Glossary entries that are/will be in other chapters, but you are welcome to chime in on if you'd like:

  • Convolution
  • Foundation Model
  • Channels: Disambiguate fluorescence channels from convolutional outputs
  • Validation Images
  • Test Images
  • Overfitting

Thank you for your dedicated effort to this chapter. We really appreciate your insight and assistance.

Best,
Owen and Rachel
