# CNNs

Tags | CS 231N |
---|

# Building the CNN

The key different in CNNs is that we assume the inputs are images, and implicitly, that pixels close to each other have similar meanings. Pixels further away will be less relevant, so a fully-connected network seems wasteful.

Neurons in the CNN will be connected to only a few neurons in its activation (after the convolution operation), and we slowly collapse the image down into a vector. Each layer transforms a 3D tensor into another 3D tensor. This is different than a neural network, which transforms vectors. Tensors make more sense for images, because they do have inherent spatial structure.

## Where we were

We had a linear classifier first. This is equivalent to making one template per *class*. When we started using neural networks, we added more templates per class.

We can extract features from an image, like a color histogram or a histogram of oriented gradients. You can also do a `bag of words`

, in which you have a code book of random patches, and then you sweep a window through the image and you generate a histogram of how similar the windows were to each patch of images. These all add low-dim information, which can be combined into one feature vector. This does perform better, but it raises the question: can we learn filters automatically?

Well, you say: yes. We’ve already done that with the neural network and even the linear classifiers. However, these classifiers fail to use the spatial nature of the image to their advantage.

# Convolution layer

At a very broad overview, a convolution layer will slide a filter across an image (or representation) and it will take the dot product betwen the filter and the current window in the image. Lastly, it will add a bias vector to each of the pixels generated by the convolution. So mathematically, it looks something like this, where $I’$ is the reduced image (to the window size)

Therefore, each neuron is connected to only a small patch of neurons (or pixels) in the previous layer. This allows for efficient backprop and reduced structural complexity.

We can do multiple filters, which allow for greater expressiveness.

## Shapes

We typically deal with a $3 \times W \times W$ image, and our convolutions are therefore $3\times F \times F$ for a single filter. Of course, we can add $k$ filters, which is $k \times 3 \times F \times F$. For batches, you have an input of $n\times 3 \times F \times F$, and your weights are still $k \times 3 \times F \times F$..

## Asymmetry of the convolution

Here’s something crucial to understanding the convolution. The height and the width of the “image” (it can be a true image or an intermediate convolutional representation) are subject to the narrowed receptive field. In other words, each neuron only considers a small square in the image.

However, the *depth* of the image is *not* subject to a reduced field. In other words, if we were to take a depth “core” at a single pixel and its corresponding output “core”, each point depends on all of the filters. In other words, if we take the core at one pixel, all the pixels in that core will be connected to all the filters that created it; it’s not a 1:1 mapping between filters and activations. We can also think of the convnet as a (rough) pixel-by-pixel fully connected network with shared weights, if this is helpful. We are looking ahead, but this is why 1X1 convolutions exist.

We refer to this “core” as the `depth column`

or `fibre`

. The size of the depth column depends on the number of filters you use, so it is a hyperparameter.

## Strides

You can think of this receptive field as moving around as you move from neuron to neuron in the activation.

A common stride is 1, meaning that if you move one neuron over in the activation, the window moves one pixel. Without padding (more on this later), if your window as size $k$ and the image size $n\times n$, then the activation will be size $(n - k + 1) \times (n-k + 1)$.

Other strides include $2$ and above, although higher strides are less common. However, the higher the stride, the smaller the activation height & width. Here’s a nice diagram of how the connections change with different strides

## Padding

To control the size of the activation, we pad the edges such that the image and the activation width and height are the same. This often saves a lot of headaches, because before the padding, you would need to make sure that the strides would yield a good number.

How does padding work? Well a window of size $F$ and stride size $S$, you would need to pad

This is just the number of elements from the “center” of the window to the side.

For F = 3 and S = 1, we have P = 1. For F = 5, S = 1, we have P = 2. For F = 7, S = 1, we have P = 3. And so on.

## Receptive fields

The connectivity is known as the `receptive field`

. For the first layer, it is just the filter size. For subsequent layers, the receptive field in the original image grows larger (as more thing are condensed). The general formula is that each layer of, our receptive field increases by $F - 1$. With $k$ layers, you have $1 + k(F - 1)$. This is to be expected

## The general formula

This is often a frustrating question, but here’s a way we can figure out the output dimensions given the stride $S$, the input size $W$, padding $P$, and the filter size $F$:

Note that if we plug in the padding definition for same-size activations, we would get

## Sharing parameters

So if we go back to this:

We’ve already lost some complexity by having each neuron depend on a small window of the input. We can make it *even* more space-efficient by sharing weights. In other words, *all* neurons in this particular layer share the same convolutional filter weights. The key insight here is that image objects can shift, so the location in the image doesn’t tell us too much about what to expect. Therefore, we can use the same weights.

This is why we refer to the weights as `filters`

or `kernels`

.

Sometimes, however, there *is* a specific structure and you can just remove the weight-sharing aspect of the convolution. For example, if you are given pictures of centered faces, you would benefit from having each neuron determining its own filter. this is known as a `locally-connected layer`

.

With sharing, you have $F^2D$ weights per filter, and without parameter sharing, you have $F^2W^2D$.

## Review of dimensions

Weights are $[out, in, F, F]$

The $F$ is the size of the filter, which is convolved with the image. `out`

is the number of weight sets we have, `in`

is the stack size on the weights. You convolve by taking the dot product of $[in, F, F]$ with the appropriate chunk in the input. Each output channel is therefore independent of the other output channels, but it is wholly dependent on each input channels.

# Special convolutions

## 1 x 1 convolution

Sometimes we see a $1\times 1$ convolution, i.e. a window that is literally a single pixel. This sounds dumb, but remember that the depth of an activation is fully connected. So you can imagine this as a mapping from depth $d$ to $d’$ that is fully connected. As a practical example, a 1x1 convolution could map something like $32 \times 32 \times 30$ to $32 \times 32 \times 90$.

## Dilated convolution

Sometimes we also see a `dialated convolution`

. This is basically a convolution window but you put gaps in the window. You increase the *effective* receptive field without increasing the complexity.

## Upsampling (transposed convolution)

The idea is basically the input is multiplied by the filter (scalar multiplication), and then the scaled filter is tiled across the image. If there are overlaps, you sum those values. Below is an example of a 1d transposed convolution.

The reason why we call this a `transposed convolution`

is because you essentially transpose the weights. Below is a 1d example

the tl;dr is that through this, you can turn a small activation in to a larger activation through a learned upsampling process.

# Pooling layer

Thanks to padding and small-sized windows, each convolution doesn’t shrink the width and the height much. So, if we want to end up with a reasonably sized output, we need to `pool`

the image.

The pooling layer is essentially a downsampling layer. It might take a $k\times k$ tile and find the average, or the maximum, and then replace this whole tile with that one pixel.

## Backpropagation

If we are using maximum, then the gradient is shuttled to only one out the $k^2$ pixels in the window. If it’s the sum, then the gradient is distributed evenly.

## Unpooling

You can just do nearest neighbors, or bed of nails. NN will be good for the gradient, but these all seem rather primitive

You can do `max unpooling`

, which means that you save the locations of the maximums and you just put the maximums back where they belong

## Pooling controversy

People are trying to move away from pooling, and instead they are trying to move to layers of greater strides that reduce the width and height. It has been found to perform better.

# Normalization layer

Sometimes, you get bad inputs to a layer, which may require very different magnitude weights or a very large bias. To fix this, you can normalize.

Normalization is added after fully-connected or convolutional layers, before non-linearity (because non-linearity can succumb to poor scaling.

## Batchnorm on FC

The idea is that for each channel, we normalize across the batch.

For fully-connected networks, this means that we take the mean across the batch and the sigma across the batch dimension $N$. If $X$ is $n\times d$, then $\mu, \sigma$ would be $1\times d$.

Sometimes, normalization can be too strong. We can fix this by adding in learnable parameters $\gamma, \beta$ for each element, where $\gamma$ is the scaling and the $\beta$ is the offset. This allows a reconstruction of some of the original non-normalization

See how if $\gamma = \sigma, \beta = \mu$, we would reconstruct the original input.

The parameters $\mu$ and $\sigma$ are derived from the training batch. However, during testing, the layer keeps track of a running average of $\mu, \sigma$ that is used instead. Just make sure to turn the layer from training to test mode, or else you will get weird results.

## Gradient through batchnorm, layer norm

For simplicity, let us consider a single feature across a batch. You can compute the batchnorm through the simple chain rule

You can use the computational graph, but eventually you’ll get something like

which simplifies down to

(note that we don’t include the $\beta, \gamma$ here for simplicity sake.

You can always add another dimension to do the batchnorm with more than one feature, because the features are independent anyways.

You can also get a more complicated version of batchnorm that builds the computation graph and then uses traditional backprop to compute the gradient. It ends up being the same thing. The computation graph looks like this:

## Extending Batchnorm to convolution

It’s actually pretty simple. You can imagine picking a dimension to “save”, i.e. to prevent the normalization operation from touching. Batchnorm saves the channel, so each channel gets its own mean and STD.

As a dimensionality check, $X$ is shape $N\times C\times H \times W$, and the norm will define $\mu, \sigma \in 1 \times C \times 1 \times 1$.

## Other forms of normalization

A `layer normalization`

will take the average of the data and leave the batch alone. In other words, for a FC network and an input of $N\times D$, we’d take the mean and STD across the second axis, yielding $\mu, \sigma \in N \times 1$.

For convolutions, it’s the same deal; we keep the $N$, which means that $\mu, \sigma \in N \times 1 \times 1 \times 1$

`instance normalization`

saves both axes, which means that it’s only relevant for convolutions. Your mean and std are shape $N\times C \times 1 \times 1$.

`group normalization`

is when you select some channels to perform an average across but not everything.

## Nice visual

# Converting FC to Convolution

Because CNNs are just block of locally connected networks, you can implement convolution through a FC network (not efficiently, though!)

You can also turn a FC network into a convolution layer. This is actually a pretty straightforward, degenerate case. If you had some input volume $7 \times 7 \times 512$ and you wanted an output of $4096$ values, then you make $4096$ filters of size $7$, which will output a $1\times 1\times 4096$ activation.

## Why is this helpful?

Well, it’s a subtle but crucial advantage. It allows for robustness to the image size! If you put in a larger image to a model that uses convolutions as FC’s, then the model still runs. It just outputs a bunch of predictions at once.

These predictions can be understood as the predictions made by the original network (meant for smaller images) sliding over the larger image like a normal convolution would, and generating predictions for each of these crops. Therefore, the number of predictions can be seen as an ensemble, and you can just take their majority or average to get a good prediction!

The sliding network performs strides depending on how large the image is. If you wanted to make smaller strides, you can feed it the same image with a slight offset along both width and height.

# Fine Tuning

If you have a different visual task, you can consider training from scratch. You can also try `fine-tuning`

a pretrained network. The idea here is that the pretrained model will probably have some generalizable structures at the beginning of the network, like edge detection, etc. Therefore we just need to modify the later few layers to be task specific.

If you have very little data, then use a linear classifier on top of some final layer.

If you have a lot of data, finetune a few of the original model’s layers, and if you have even more data, you can finetune a larger number of layers.

However, training from scratch often still works; it just takes longer. You need to be careful when you’re fine-tuning. Typically you should use a learning rate that is 10 times lower than standard training LR.

# Implementing Convolutions

## Convolutions as matrix multiplications

We do this from one insight: the dot product is exactly the filter operation! So we just need to “linearize” the image somehow. We do this through an `im2col`

algorithm, and it essentially extracts all the “views” in the image and linearizes them such that each ‘view’ is a column. Let’s call this $X$.

We stretch out each filter into a row, to create a row-matrix $W$.

Then, we simply compute $WX$, and this evaluates the dot product between every filter and receptive field location. The output will give us the value of the convolution at each point, across each filter

To understand this, let’s look at a quick example.

- Suppose you had a $227 \times 227 \times 3$ image. and your filter is size $11 \times 11 \times 3 \times 96$ (i.e. 96 filters). This means that your viewing window is size $363$

- At a stride of $4$, you can calculate that you have $55$ locations lengthwise and heightwise, yielding $3025$ total location. This means that your image matrix is $363 \times 3025$

- Now, your filter matrix is $96 \times 363$

- The convolution operation yields $96 \times 3025$, which is the number of filters times the number of views. Nice!

## Backpropagation: intro

If you implement convolution as a matrix multiplication, then the backprop should be simple. But in the naive implementation where you are literally taking inner products over the different parts of the image, we actually have a very nice emergent pattern.

## Backpropagation: getting $dL/dW$

It’s best to try a one-dimensional convolution to get the feel (i.e. a vector filter and a vector $x$). You will start to notice that $dL/dw$ is just $X$ convoluted with $dL/dY$. This is assuming that there is no batches and only one output channel.

## Backpropagation: getting $dL/dX$

You will also notice that $dL/dX$ is just a padded version of $W$ convoluted with a modified version $dL/dY$, with the same assumptions. The key modification is that you rotate $dL/dY$ by 180 degrees. There’s no intuitive theory for why this is the case; you will just see the pattern emerge.

As for that $W$, you basically add as much padding as you need to get the output of the convolution equal to the dimensions of $X$. Mathematically speaking, it is just the size of $X$ subtracted by the size of $W$ (the edge size, not the size of the matrix). At the end, if you padded your $X$ originally, you will get a matrix the size of the padding. You will just need to disregard the gradients beyond the edges of the original image; this is totally fine.

## Backpropagation: dealing with high strides

If you are dealing with higher stride sizes, just know that the *strides* in forward space turn into *dilations in backward space* i.e. convolutions with holes in between each element of the filter. For example, if we have a stride of 2, our filter turns from

The dialation is needed for the $dL/dY$ in compuations of $dL/dW$ as well as $dL/dX$.

## Backpropagation: adding channels and batches

So far, we’ve been talking about the abstract 2d environment where both $X$ and $Y$ and $W$ are 2d matrices. In real life, this is not the case.

For some context: $dL/dY$ have the form $[batch, out, H, W]$, $X$ has the form $[batch, in, H, W]$, and $W$ has the form $[out, in, F, F]$. This helps us formulate what we intend to do with batches and depths.

Let’s start with $dL/dW$. We want to fill this 4D tensor with the gradients. Each batch contributes to this gradient through summation (there’s no slice in $dL/dW$ for batch). So, we can compute batches separately. this reduces the dimension on $X$ to a 3d tensor.

We are getting there, but there’s more work to be done. The first axis remaining on the $dL/dY$ and the first axis on the $W$ are both output channels and there is an independent set of weights for each output channel. Therefore, you can compute $dL/dW$ separately for each output channel. Thsi means that we need to fill a 3D tensor of $dL/dW$ each time, using a 2D $dL/dY$ convoluted with some 3D tensor $X$.

Here, we note that the per-channel $W$ is also a 3d tensor. because we are not convoluting in the third dimension, this dimension can be separated. In other words, you can imagine splitting $W$ into individual 2D filters and splitting $X$ into individual windows that the filters sweep over. Then, you do 2D convolutions with the 2D $dL/dY$ that we previously calculated. The convolution uses the same $dL/dY$ but uses different slices of $X$ of the input channels. At the end, we stack them together to get $dL/dW$, which at this point is a 3D matrix. We stack the output channels to get the 4D tensor, and we sum across batches to get our final result.

So we’ve just gotten the approach for $dL/dW$. Now, we can work on getting $dL/dX$. We can still decompose things by batch, because we know that $X$ is sliced by batch. Our eventual goal is to convolute a padded $W$ with a rotated version of $dL/dY$.

Like before, we can separate each output channel and compute it separately. But note that $dL/dX$ doesn’t have slices for output channels, because there’s a divergence effect: each element of $X$ contributes to multiple output channels. The gradients flow back to one spot. Therefore, we will compute individual output channel gradients but we will perform a vector sum onto $dL/dX$ to mark each of its contributions.

At this point, we are iterating through batches and output channels, which means that we are left with a 2D matrix for $dL/dY$. This you can modify with dilations and rotation.

Because we are again iterating through output channels, we have a 3D matrix for $W$. Again, we come across the realization that each of the depth layers of $W$ influence the depth layers of $X$ independently, which means that the gradient WRT $X$ can also be segmented by depth layer. Therefore, we can imagine splitting $W$ again into these stacks of 2D filters and we can perform the convolutions of the modified $dL/dY$ . At this point, we have a set of 2D matrices that we stack to form the 3rd dimension of input channels. Now, we sum all contributions (i.e. other 3D gradient tensors) from the different output channels. Note that this doesn’t add another dimension but it is critical for gradient flow. Finally, we stack across batches to get the finished $dL/dX$.

This is very tricky! There’s a lot of moving parts.