
Conversation

@carsen-stringer (Collaborator)

we did not add anything to the glossary, happy to help with this once all the chapters are in

@carsen-stringer (Collaborator, Author)

fyi @ScientistRachel added the outline of chapter 5 on this branch (I did not edit Chapter 5)

@opp1231 (Member), Jan 9, 2026

We really like all the GIFs! However, for this and all GIFs, we need to figure out a technical issue where Quarto does not render them correctly on the website. For now, would it be possible to replace these with still images until we have a chance to fix that issue?

Member

See previous comment on GIFs.

Member

See previous comment on GIFs.

In computer vision, many neural network architectures have been designed for different visual tasks. These architectures take as input an image, which often has multiple channels - in the case of natural images there are three input channels for red/green/blue. The network processes the image and outputs various quantities depending on the task. For example, in object recognition, the network outputs a single probability vector indicating the likelihood that each possible object (cat, dog, etc.) is in the image.

Starting prompt for this chapter: Chapter 4 introduces architectures and loss models, defining them and providing examples through two practical case studies: image restoration and segmentation. Although this chapter will include code snippets/exercises, the presentation of essential concepts should communicate the philosophy behind the choice of a model for non-programmers.
![AlexNet architecture, adapted from [@krizhevsky2012imagenet]. This early neural network first demonstrated the capabilities of deep learning when trained on large datasets.](4-architectures/alexnet.PNG){#fig-alexnet width=90%}
Member

Re: Figures 1 and 2. For those people who have never seen any architecture diagram before, they may not grasp the meaning of these figures. Adding a sentence or two of context for those individuals would be helpful. Alternatively, would you prefer to move these figures deeper into the chapter where the concepts are explained more in-depth?

### Linear layers (or fully-connected layers, or dense layers)

Early research focused on the computational properties of perceptrons, defined as linear weighted sums of inputs followed by a nonlinear activation function [@rumelhart1986learning]. A neural network layer consists of a collection of such perceptrons, each with its own input weights. A multilayer perceptron (MLP) is the simplest example of a deep neural network -- which we will discuss later -- and consists of a sequence of such layers applied in series, each on the output of the previous one. The linear layer performs a matrix multiplication of a “weights” matrix ($W$) with the input $\vec{x}$, followed by the addition of a vector of “bias” terms ($\vec{b}$). The size of the weights matrix is the number of outputs by the number of inputs, and the length of the bias vector is the number of outputs.
Member

Please break this section down into multiple steps. In doing so, please define fully-connected and dense.

$$ \vec{y} = W\vec{x} + \vec{b} = \left[\sum_{j=1}^{n_\text{in}} W_{ij} x_{j} + b_{i}\right]_{i=1}^{n_\text{out}}$$
where $n_\text{in}$ is the number of inputs and $n_\text{out}$ is the number of outputs.
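As a rough sketch of these shapes in code (assuming PyTorch; the layer sizes are illustrative, not taken from the chapter):

```python
import torch
import torch.nn as nn

# a linear (fully-connected / dense) layer with 5 inputs and 3 outputs
layer = nn.Linear(in_features=5, out_features=3)

print(layer.weight.shape)  # torch.Size([3, 5]): number of outputs by number of inputs
print(layer.bias.shape)    # torch.Size([3]): one bias term per output

x = torch.randn(5)  # an input vector with 5 features
y = layer(x)        # computes W @ x + b
print(y.shape)      # torch.Size([3])
```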

The activation function ($f$) is the nonlinearity applied to each output of the linear layer, such as a ReLU nonlinearity which sets the minimum value of the output to zero. The nonlinearity allows the network to compute more complicated functions of the input than would be possible with a simple linear model [@cybenko1989approximation].
Member

Suggested change
The activation function ($f$) is the nonlinearity applied to each output of the linear layer, such as a ReLU nonlinearity which sets the minimum value of the output to zero. The nonlinearity allows the network to compute more complicated functions of the input than would be possible with a simple linear model [@cybenko1989approximation].
The activation function ($f$) is the nonlinearity applied to each output of the linear layer, such as a ReLU nonlinearity which sets the minimum value of the output to zero. In this case, the input--real microscopy data--is complex. The nonlinearity allows the network to compute more complicated functions of the input than would be possible with a simple linear model [@cybenko1989approximation].
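As a quick illustration of the ReLU described here (a minimal sketch; the numbers are made up):

```python
import torch

x = torch.tensor([-2.0, -0.5, 0.0, 1.5, 3.0])
print(torch.relu(x))  # tensor([0., 0., 0., 1.5, 3.]): negative values are clipped to zero
```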

### Convolutional layers

Instead of linear layers, convolutional layers are often used in vision tasks as a parameter-efficient alternative [@lecun1995convolutional]. A convolutional layer slides a small two-dimensional filter across the input image, computing the dot product between the filter and the input at each image position. It also includes the addition of a vector of bias terms. For 2D image processing, we use two-dimensional convolutions, but one- or three-dimensional convolutions may be used for 1D or 3D data, respectively. The output of the 2D convolution operation at position $(x,y)$ can be written as follows for an example grayscale image $I$ with a single input channel, where the filter $W$ is of size $(2k+1, 2k+1)$ and the bias is $b$:
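One standard way to write this operation (as cross-correlation, the convention used in deep learning frameworks), with $C(x,y)$ denoting the output at position $(x,y)$, is:

$$ C(x, y) = b + \sum_{i=-k}^{k} \sum_{j=-k}^{k} W_{i,j}\, I(x+i,\, y+j) $$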
@opp1231 (Member), Jan 9, 2026

Please break down the math here into multiple steps.


![Toy illustration of convolutional operation from this [article](https://towardsdatascience.com/light-on-math-machine-learning-intuitive-guide-to-convolution-neural-networks-e3f054dd5daa). In practice, convolutional kernels are more complicated than simple template detectors.](4-architectures/conv_happy.gif){#fig-convill width=70%}

This is an example with a single input and a single output channel, but in general a 2D convolutional layer has multiple input and output channels. Each output **channel** is the result of a 2D convolutional kernel applied to the input. In the gif below, the input is in blue, the kernel is in gray, and the output is in green. The number of units in the output channel depends on a *stride* parameter. In the gif below, the stride is 1 because the input image is sampled at each position. A stride of 2 would mean skipping over every other input position both vertically and horizontally. In most applications, especially with small kernel sizes, a stride of 1 is used for convolution.
Member

Suggested change
This is an example with a single input and a single output channel, but in general a 2D convolutional layer has multiple input and output channels. Each output **channel** is the result of a 2D convolutional kernel applied to the input. In the gif below, the input is in blue, the kernel is in gray, and the output is in green. The number of units in the output channel depends on a *stride* parameter. In the gif below, the stride is 1 because the input image is sampled at each position. A stride of 2 would mean skipping over every other input position both vertically and horizontally. In most applications, especially with small kernel sizes, a stride of 1 is used for convolution.
This is an example with a single input and a single output channel, but in general a 2D convolutional layer has multiple input and output channels. Each output **channel** is the result of a 2D convolutional kernel applied to the input. In the animation below, the input is in blue, the kernel is in gray, and the output is in green. The number of pixels in the output channel depends on a *stride* parameter. In the animation below, the stride is 1 because the input image is sampled at each position. A stride of 2 would mean skipping over every other input position both vertically and horizontally. In most applications, especially with small kernel sizes, a stride of 1 is used for convolution.
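A short sketch of these ideas in PyTorch (the channel counts and image size are arbitrary examples):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 64, 64)  # a batch of 1 image with 3 input channels, 64 x 64 pixels

# 3 input channels -> 6 output channels, 3 x 3 kernel, stride 1, padding 1 (= K//2)
conv_stride1 = nn.Conv2d(3, 6, kernel_size=3, stride=1, padding=1)
print(conv_stride1(x).shape)  # torch.Size([1, 6, 64, 64]): same height/width as the input

# stride 2 skips every other position, halving the output height and width
conv_stride2 = nn.Conv2d(3, 6, kernel_size=3, stride=2, padding=1)
print(conv_stride2(x).shape)  # torch.Size([1, 6, 32, 32])
```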

![Convolutional operation with padding from this [github](https://github.com/vdumoulin/conv_arithmetic). This illustrates the creation of a single channel output from a single channel input. In practice, both inputs and outputs have multiple channels and all combinations of input and output need to be calculated and summed accordingly.](4-architectures/same_padding_no_strides.gif){#fig-convpad width=40%}

::: {.callout-note}
If the kernel size *K* is odd and you set the `padding=K//2` (floor[K/2]) and `stride=1` as shown in @fig-convpad, you get a **channel** of units that is the same size as the input.
:::
Member

Suggested change
If the kernel size *K* is odd and you set the `padding=K//2` (floor[K/2]) and `stride=1` as shown in @fig-convpad, you get a **channel** of units that is the same size as the input.
If the kernel size *K* is odd and you set the `padding=K//2` (floor[K/2]) and `stride=1` as shown in @fig-convpad, you get a **channel** of pixels that is the same size as the input.

Member

Neither padding nor stride have been introduced up until this point.


A convolutional layer operates under two main assumptions: 1) the computation only requires local features that are within the spatial extent of the filter operation; and 2) it is not necessary to perform different computations at different positions in the image, and thus the same filter operation can be convolutionally applied across all positions in the image. When these assumptions are acceptable, a convolutional layer can reduce the number of parameters substantially compared to linear layers.
Member

Suggested change
A convolutional layer operates under two main assumptions: 1) the computation only requires local features that are within the spatial extent of the filter operation; and 2) it is not necessary to perform different computations at different positions in the image, and thus the same filter operation can be convolutionally applied across all positions in the image. When these assumptions are acceptable, a convolutional layer can reduce the number of parameters substantially compared to linear layers.
A convolutional layer operates under two main assumptions: 1) the computation only requires local features that are within the spatial extent of the filter operation; and 2) it is not necessary to perform different computations at different positions in the image, and thus the same filter operation can be convolutionally applied across all positions in the image. In other words, for a microscopy image, the biology of interest needs to fit within your filter size and the same processing steps must apply to the whole field of view. When these assumptions are acceptable, a convolutional layer can reduce the number of parameters substantially compared to linear layers.



Taking our example from above, let’s estimate the number of parameters with filters/kernels of size 3 by 9 by 9 pixels, where 3 is the number of input channels and 9 is the size in pixel space, the “kernel size”. The numbers of input and output images are called “channels”, similar to the red/green/blue channels for RGB images. If we define the layer to have 6 output channels, this requires 6 of these kernels, giving 1458 parameters in the kernels plus 6 bias terms, for 1464 parameters in total. As you can see, the number of parameters is now independent of the size of the input in pixels, and a dramatic reduction from the nearly 1 billion parameters for the dense linear layer example above.
Member

The explanation of "channels" would be especially helpful if it were included earlier in the text.


Member

Please break this into intermediate steps.
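A quick way to verify the parameter count above (a PyTorch sketch mirroring the 3-input-channel, 6-output-channel, 9 x 9 kernel example):

```python
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=6, kernel_size=9)
n_kernel = conv.weight.numel()  # 6 * 3 * 9 * 9 = 1458 kernel parameters
n_bias = conv.bias.numel()      # 6 bias terms, one per output channel
print(n_kernel, n_bias, n_kernel + n_bias)  # 1458 6 1464
```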


The activations of a single convolutional layer, as described, have a receptive field size equivalent to its kernel size: each activation only receives information from pixels within the kernel size. However, objects in images are often larger than the kernel size. To increase the amount of spatial information used for computation, pooling layers are introduced between convolutional layers. A pooling layer consists of a sliding-window operation, just like the convolutional layer, but in this case a maximum or average operation is computed within the window, independently for each input channel. To reduce the size of the image, this pooling operation is applied with a “stride”: to downsample the image by a factor of two, we may use a pooling size of 2 with a stride also set to 2, like in the example below.
Member

We know this isn't strictly a mathematical discussion, but breaking this down into smaller pieces would help explain concepts such as: receptive field size, pooling layers, sliding windows, etc..



![Illustration of max-pooling with a kernel size of 2 and stride of 2, from [here](https://github.com/dvgodoy/PyTorchStepByStep).](4-architectures/pooling.png){#fig-pool width=90%}
Member

Suggested change
![Illustration of max-pooling with a kernel size of 2 and stride of 2, from [here](https://github.com/dvgodoy/PyTorchStepByStep).](4-architectures/pooling.png){#fig-pool width=90%}
![Illustration of max-pooling with a kernel size of 2 and stride of 2 on an image with one channel, one z-slice and 6 pixels in x and y. The output only has 3 pixels in x and y due to the pooling step. Example from [here](https://github.com/dvgodoy/PyTorchStepByStep).](4-architectures/pooling.png){#fig-pool width=90%}
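A minimal sketch of max-pooling with kernel size 2 and stride 2, matching the figure (input values are arbitrary):

```python
import torch
import torch.nn as nn

x = torch.arange(36, dtype=torch.float32).reshape(1, 1, 6, 6)  # 1 channel, 6 x 6 pixels
pool = nn.MaxPool2d(kernel_size=2, stride=2)
y = pool(x)
print(y.shape)   # torch.Size([1, 1, 3, 3]): downsampled by a factor of 2
print(y[0, 0])   # the maximum value within each non-overlapping 2 x 2 window
```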


If we apply several convolutional and pooling layers, we end up with an output which is smaller in resolution than the input. We can see example activations across convolutional, pooling and linear layers in a network trained to classify MNIST [@lecun1998mnist] digits:
Member

Suggested change
If we apply several convolutional and pooling layers, we end up with an output which is smaller in resolution than the input. We can see example activations across convolutional, pooling and linear layers in a network trained to classify MNIST [@lecun1998mnist] digits:
If we apply several convolutional and pooling layers, we end up with an output which is smaller in resolution than the input. We can see example activations across convolutional, pooling and linear layers in a network trained to classify MNIST [@lecun1998mnist] digits (a dataset of handwritten numbers often used to test ML ideas):

![Activations across convolutional and pooling layers from this [demo](https://adamharley.com/nn_vis/cnn/2d.html). Lighter blue indicates a higher activation.](4-architectures/mnist.PNG){#fig-mnist}
Member

Suggested change
![Activations across convolutional and pooling layers from this [demo](https://adamharley.com/nn_vis/cnn/2d.html). Lighter blue indicates a higher activation.](4-architectures/mnist.PNG){#fig-mnist}
![Each layer of a small, illustrative network where the output channels of each layer are shown as images. Lighter blue indicates higher activation at each layer. The final output layer represents the probability of the input image corresponding to a value between 0 and 9. In this example, the network correctly identifies the input as the number 3. To see this in action, check out the interactive demonstration [here](https://adamharley.com/nn_vis/cnn/2d.html).](4-architectures/mnist.PNG){#fig-mnist}

### U-nets

U-nets were introduced by Ronneberger, Fischer, and Brox [-@ronneberger2015u] (@fig-unet). They share some similarities with feature pyramid networks [@lin2017feature], but u-nets are more frequently used in biological applications so we will focus on them. U-nets have an encoder-decoder structure, like an autoencoder [@hinton2006reducing], with the encoder consisting of convolutional layers and pooling layers (downsampling), and the decoder consisting of convolutional layers and upsampling or strided conv-transpose layers. End-to-end, u-nets typically produce an output at the same spatial resolution as the input.
Member

Please break this down into intermediate explanations.

The downsampling results in a loss of fine spatial information. To recover this information, the output of the convolutional layers in the encoder is concatenated with the activations from the decoder at each spatial scale using skip connections (“copy” operation in @fig-unet). This preserves the higher resolution details, which is important for precise segmentations and pixel-wise predictions.

Conventional u-nets [@ronneberger2015u] have two convolutional layers per block and a small kernel size of 3 in each layer. The downsampling after each block is often set to a factor of 2. Because the kernel size is small, the only way to have large receptive field sizes is through several downsampling blocks. If we want the network to learn complicated tasks across many diverse images, then we need it to have a large “capacity”. This can be achieved by adding more weights to the network, for example by increasing the number of channels in the convolutional layers and/or by adding more convolutional layers in each block and/or by increasing the number of downsampling and upsampling stages [@stringer2021].
Member

Replace "block" with "spatial scale" or otherwise relate to this concept.

### Vision transformers

Vision transformers are modern architectures that are replacing convolutional networks in many applications. They are not as parameter-efficient as convolutional neural networks - for example the Cellpose segmentation u-net has 6.6 million parameters while ViT-H (“vision-transformer-huge”) has 632 million parameters [@dosovitskiy2020image]. They are still much more efficient than dense linear layers due to special architecture choices (see below), and they introduce a new type of operation called “self-attention”. Transformers avoid overfitting this large set of parameters through training on very large amounts of data. Even though they have many more parameters than standard convolutional networks, they are not too much slower because most of the operations within the transformer are matrix multiplications which are fast on newer GPUs with tensor cores (e.g. from [nvidia](https://www.nvidia.com/en-us/data-center/tensor-cores/)). With more parameters, they have a larger capacity than standard convolutional neural networks to learn from large training datasets.
Member

This is potentially silly, but it is possible the Janelia lawyers will not appreciate linking to a specific product. Is it possible to refer to "tensor cores" without linking to NVIDIA directly?


The vision transformer divides the input image into patches, e.g. 16 by 16 pixels each. In the first layer of the transformer the patches are transformed into the embedding space, using a linear operation that is often implemented using strided convolutions. This embedding space is generally several times larger than the number of pixels in the patch; for example the embedding space in ViT-H is 1280. These patch embeddings are input to the transformer encoder, which consists of many blocks ($L$). Each transformer block has a self-attention block and an MLP block. In the self-attention block, the attention matrix is computed as pairwise interactions between all patches, enabling sharing of information across the entire image.
Member

Please either break this down into intermediate explanatory steps (especially self-attention), or consider removing it.
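A rough sketch of the patch-embedding and self-attention steps using standard PyTorch modules (the patch size, embedding width, and head count below are placeholders, not the ViT-H values):

```python
import torch
import torch.nn as nn

img = torch.randn(1, 3, 224, 224)            # one RGB image

# patch embedding: a strided convolution turns each 16 x 16 patch into a vector
patch_embed = nn.Conv2d(3, 256, kernel_size=16, stride=16)
patches = patch_embed(img)                   # (1, 256, 14, 14)
tokens = patches.flatten(2).transpose(1, 2)  # (1, 196, 256): one token per patch

# self-attention: every patch attends to every other patch
attn = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
out, weights = attn(tokens, tokens, tokens)
print(out.shape)      # torch.Size([1, 196, 256])
print(weights.shape)  # torch.Size([1, 196, 196]): pairwise interactions between patches
```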

![Activations across convolutional and pooling layers from this [demo](https://adamharley.com/nn_vis/cnn/2d.html). Lighter blue indicates a higher activation.](4-architectures/mnist.PNG){#fig-mnist}
Member

This is a very cool demonstration!


## Loss functions

In a standard image classification network, the output is a vector with length equal to the number of classes in the task. Each of the entries in this vector represents the predicted probability of the class, and the predicted label for the image is chosen as the index of the vector with the largest entry. For each training image we have a ground-truth label for the class. How closely the network output matches the label is measured by the loss, which is defined as a function of the vector output of the network and the ground-truth label. A lower loss means we matched the ground-truth data better. The gradients of the network are computed automatically via back-propagation, and an optimizer is specified to modify the parameters in order to minimize the loss (described more below).
Member

This is great to include!


In segmentation and biological classification tasks, as mentioned before, the output is often the same size as the input in pixels and the loss is computed per-pixel. There will thus be multiple outputs of this size, each one corresponding to a class, like the entries in the vector for overall image classification. In the original u-net paper, the loss was defined using two classes, "not cell" and "cell" [@ronneberger2015u].

![Binary cross-entropy loss, computed from the output of a u-net trained to classify cell/not cell (predicted cell probability).](4-architectures/bceloss.PNG){#fig-bce width=70%}
Member

A walk-through of this figure in the text would add a lot of context for it.
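As a concrete toy example of the per-pixel loss described here, assuming the network outputs a cell-probability map and the ground truth is a binary cell/not-cell mask (all values below are made up):

```python
import torch
import torch.nn as nn

# predicted cell probabilities and ground-truth labels for a 4 x 4 patch of pixels
pred = torch.rand(1, 1, 4, 4)                    # values between 0 and 1
target = (torch.rand(1, 1, 4, 4) > 0.5).float()  # 1 = cell, 0 = not cell

loss = nn.BCELoss()(pred, target)  # binary cross-entropy averaged over pixels
print(loss.item())
```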

To create segmentations for each cell, a threshold is defined on the cell probability and any pixels above the threshold that are connected to each other are formed into objects. This threshold is defined using a validation set - images that are not used for training or testing - to help ensure the threshold generalizes to the held-out test images. The predicted segmentations with this loss function often contain several merges, because cells can often touch each other and the connected components of the image will combine the touching cells into single components.
Member

These concepts are first explained in Section 4.3. Perhaps consider moving that explanation up or cross-referencing that section here.


Now that we have defined a loss function, we want to minimize the loss $\ell$ by updating the weights in the network. For a problem like segmentation, we will need images with ground-truth segmentation labels - this labeling can be done using tools like [Ilastik](https://www.ilastik.org/), [ImageJ](https://imagej.net/software/imagej/), [Paintera](https://github.com/saalfeldlab/paintera), or [Napari](https://napari.org/stable/). Once ground-truth labeling is done on some images, training can be attempted. We will want to use most of the ground-truth labeled images for training, making up a training set, and leave a small subset (like 10%) for testing the performance of the network. For some algorithms, we may also need a validation set for setting post-processing parameters like the cell probability threshold, in which case we can reserve 10-15% of the training set images for validation.

Minimizing the loss is done via gradient descent, in which the weights are updated by the gradient of the loss with respect to each parameter, scaled by the learning rate $\alpha$:
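In standard notation (and consistent with the $\vec{v}$ used in the momentum update below), this update can be written as:

$$ \vec{v}_t = \alpha\, \frac{d L(\vec{w}_t)}{d \vec{w}_t}, \qquad \vec{w}_{t+1} = \vec{w}_t - \vec{v}_t $$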
Member

Please break this down into intermediate steps.


Moving the parameters in the negative direction of the gradient reduces the loss for the given images or data points over which the loss is computed. We could compute the loss and gradients over all images in the training set, but this would take too long, so in practice the loss is computed in batches of a few to a few hundred images - the number of images in a batch is called the *batch size*. The optimization algorithm for updating the weights in batches is called stochastic gradient descent (SGD). This is often faster than full-dataset gradient descent because it updates the parameters many times on a single pass through the training set (called an “epoch”). Also, the stochasticity induced by the random sampling step in SGD effectively adds some noise in the search for a good minimum of the loss function, which may be useful for avoiding local minima.
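A bare-bones sketch of this loop in PyTorch (the model, data, and batch size are stand-ins):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.Conv2d(8, 1, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # lr is the learning rate alpha
loss_fn = nn.BCEWithLogitsLoss()

# stand-in training set: 64 single-channel images with binary pixel labels
images = torch.randn(64, 1, 32, 32)
labels = (torch.rand(64, 1, 32, 32) > 0.5).float()

batch_size = 8
for epoch in range(3):                  # one epoch = one pass through the training set
    perm = torch.randperm(len(images))  # random sampling gives the "stochastic" in SGD
    for i in range(0, len(images), batch_size):
        idx = perm[i:i + batch_size]
        optimizer.zero_grad()
        loss = loss_fn(model(images[idx]), labels[idx])
        loss.backward()                 # gradients via back-propagation
        optimizer.step()                # one weight update per batch
```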

It can also be beneficial to include momentum, with some value $\beta$ between zero and one, which pushes weight updates along the same direction they have been updating in the past. The updated version of $\vec{v}$ in this case is
Member

Suggested change
It can also be beneficial to include momentum, with some value $\beta$ between zero and one, which pushes weight updates along the same direction they have been updating in the past. The updated version of $\vec{v}$ in this case is
It can also be beneficial to include momentum, with some value $\beta$ between zero and one, which pushes weight updates along the same direction they have been updating in the past, to avoid spurious changes due to noise. The updated version of $\vec{v}$ in this case is

Member

Or some other worded motivation for momentum

It can also be beneficial to include momentum, with some value $\beta$ between zero and one, which pushes weight updates along the same direction they have been updating in the past. The updated version of $\vec{v}$ in this case is
$$ \vec{v}_t = \beta \vec{v}_{t-1} + \alpha\, \frac{d L(\vec{w}_t)}{d \vec{w}_t} $$

Different weights in the network may have differently scaled gradients, and thus a single learning rate may not work well. The Adam optimizer uses a moving average of both the first and second moment of the gradient for rescaling the weight updates while including a momentum term [@kingma2014adam]. This optimizer works better than standard SGD in many cases, and requires less fine-tuning to find good hyperparameter values. In addition to using an optimizer like Adam, it may be helpful to use a learning rate schedule which reduces the learning rate towards the end of training to enable smaller steps for fine-tuning the final weights [@loshchilov2016sgdr]. Sometimes a validation set is used to re-instantiate the best weights, as evaluated on the validation set, before a decrease in the learning rate [@prechelt1998automatic].
Member

Please expand the separate steps here.
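A sketch of how these pieces are typically combined in PyTorch (the model, data, and hyperparameter values are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.Conv2d(8, 1, 1))
loss_fn = nn.BCEWithLogitsLoss()

# Adam rescales each parameter's update using moving averages of its gradient and squared gradient
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# a cosine schedule lowers the learning rate towards the end of training
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=20)

images = torch.randn(32, 1, 32, 32)
labels = (torch.rand(32, 1, 32, 32) > 0.5).float()

for epoch in range(20):
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    optimizer.step()
    scheduler.step()
    print(epoch, scheduler.get_last_lr())  # the learning rate decreases over epochs
```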



During fitting it is important to monitor the training loss and the validation loss. With an appropriate learning rate that is not too large, the training loss should always decrease. The loss on held-out validation data should also ideally decrease over training. If not, then the network is overfitting to the training set: the weights are becoming specifically tuned for the training set examples and no longer generalize to held-out data.
Member

This is great to mention.


![Example training loss and validation loss across epochs.](4-architectures/trainloss.png){#fig-trainloss width=40%}

To avoid overfitting, regularization is often used. Most commonly in computer vision problems, weight decay is used for regularization, which is closely related to L2 regularization. This operation reduces the weights by a small fraction $\lambda$ at each optimization step:
Member

We don't think the intended audience is likely familiar with regularization at all. We think this is a good change to generalize this explanation.
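In symbols, this shrinkage is commonly written as $\vec{w} \leftarrow (1 - \lambda)\,\vec{w}$, applied alongside the gradient step. In PyTorch it is exposed through the `weight_decay` argument of the optimizers, for example:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
# weight_decay sets the strength of the shrinkage (in AdamW it is applied directly to
# the weights, scaled by the learning rate, rather than added to the gradient)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
```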

* Tutorials from [pytorch](https://docs.pytorch.org/tutorials/)
* CNN Explainer [demo](https://poloclub.github.io/cnn-explainer/) with activations across layers, and interactive visualizations for padding and stride [@wang2020cnn]
* Illustration of how momentum works from [Gabriel Goh](https://distill.pub/2017/momentum/)

Member

A concluding section or paragraph would be helpful.

@opp1231 (Member) left a comment

Hi Carsen!

As we discussed previously, we have added notes for places where intermediate steps or further explanations would be helpful. Please take a look, and of course, feel free to reach out to us if anything is unclear.

One general comment is we would like to ensure that we have permission to re-use the images that are borrowed throughout. If you could confirm the access, that would be great.

Additionally, this is a list of terms that we will add to the glossary based on your chapter. You are welcome to provide a definition, or we will pull the definition from your chapter:

  • Neural Network
  • Architecture
  • Natural Image
  • Probability Vector
  • Skip Connection
  • Down sampling
  • Up sampling
  • Perceptron
  • Activation Function
  • Linear Layer
  • Bias
  • ReLU
  • Non-linearity
  • Convolutional filter/kernel
  • Receptive Field
  • Pooling Layer
  • Sliding Window
  • Stride
  • Padding
  • Encoder-Decoder Structure
  • Autoencoder
  • Self-attention
  • Patches
  • Embedding Space
  • Auxiliary Variables
  • Back propagation
  • Optimizer
  • Stardist
  • CellPose
  • Gradient Descent
  • Regularization

Glossary entries that are/will be in other chapters, but you are welcome to chime in on if you'd like:

  • Convolution
  • Foundation Model
  • Channels: Disambiguate fluorescence channels from convolutional outputs
  • Validation Images
  • Test Images
  • Overfitting

Thank you for your dedicated effort to this chapter. We really appreciate your insight and assistance.

Best,
Owen and Rachel
