CNNs

Tags: CS 231N

Building the CNN

The key difference in CNNs is that we assume the inputs are images, and implicitly, that pixels close to each other have related meanings. Pixels farther away are less relevant, so a fully-connected network seems wasteful.

Neurons in the CNN will be connected to only a few neurons in its activation (after the convolution operation), and we slowly collapse the image down into a vector. Each layer transforms a 3D tensor into another 3D tensor. This is different from a fully-connected network, which transforms vectors. Tensors make more sense for images, because images have inherent spatial structure.

Where we were

We had a linear classifier first. This is equivalent to making one template per class. When we started using neural networks, we added more templates per class.

We can extract features from an image, like a color histogram or a histogram of oriented gradients. You can also do a bag of words, in which you have a code book of random patches, and then you sweep a window through the image and generate a histogram of how similar each window is to each patch in the code book. These all produce low-dimensional features, which can be combined into one feature vector. This does perform better, but it raises the question: can we learn filters automatically?

Well, you say: yes. We’ve already done that with the neural network and even the linear classifiers. However, these classifiers fail to use the spatial nature of the image to their advantage.

Convolution layer

At a very broad overview, a convolution layer slides a filter across an image (or intermediate representation) and takes the dot product between the filter and the current window in the image. Lastly, it adds a bias to each of the pixels generated by the convolution. So mathematically, it looks something like this, where $I'$ is the image cropped to the window size:

A_{ij} = w \cdot I' + b

Therefore, each neuron is connected to only a small patch of neurons (or pixels) in the previous layer. This allows for efficient backprop and reduced structural complexity.

We can do multiple filters, which allow for greater expressiveness.

Shapes

We typically deal with a $3 \times W \times W$ image, so our convolutions are $3 \times F \times F$ for a single filter. Of course, we can use $k$ filters, which gives weights of shape $k \times 3 \times F \times F$. For batches, you have an input of $n \times 3 \times W \times W$, and your weights are still $k \times 3 \times F \times F$.

Asymmetry of the convolution

Here’s something crucial to understanding the convolution. The height and the width of the “image” (it can be a true image or an intermediate convolutional representation) are subject to the narrowed receptive field. In other words, each neuron only considers a small square in the image.

However, the depth of the image is not subject to a reduced field. In other words, if we take a depth “core” at a single pixel and its corresponding output “core”, each point in the output core depends on all of the filters and on the full depth of the input; it’s not a 1:1 mapping between filters and activations. We can also think of the convnet as a (rough) pixel-by-pixel fully-connected network with shared weights, if this is helpful. We are looking ahead, but this is why $1 \times 1$ convolutions exist.

We refer to this “core” as the depth column or fibre. The size of the depth column depends on the number of filters you use, so it is a hyperparameter.

Note that every depth column in the activation depends fully on all the filters and on the full depth of the original image. In this way, along the depth dimension we effectively have a fully-connected network.

Strides

You can think of this receptive field as moving around as you move from neuron to neuron in the activation.

A common stride is 1, meaning that if you move one neuron over in the activation, the window moves one pixel in the input. Without padding (more on this later), if your window has size $k$ and the image size is $n \times n$, then the activation will be size $(n - k + 1) \times (n - k + 1)$.

Other strides are 2 and above, although higher strides are less common. The higher the stride, the smaller the activation height and width. A diagram of how the connections change with different strides makes this clear.

Padding

To control the size of the activation, we pad the edges such that the image and the activation width and height are the same. This often saves a lot of headaches, because without padding, you would need to make sure that the stride yields a whole number of window positions.

How does padding work? For a window of size $F$ and stride $S$, you would need to pad by

P = \frac{F - S}{2}

This is just the number of elements from the “center” of the window to the side.

For F = 3 and S = 1, we have P = 1. For F = 5, S = 1, we have P = 2. For F = 7, S = 1, we have P = 3. And so on.

Receptive fields

The connectivity is known as the receptive field. For the first layer, it is just the filter size. For subsequent layers, the receptive field in the original image grows larger (as more things are condensed). The general formula is that with each layer, the receptive field increases by $F - 1$. With $k$ layers, you have a receptive field of $1 + k(F - 1)$. This is to be expected.

The general formula

This is often a frustrating question, but here’s a way we can figure out the output dimensions given the stride $S$, the input size $W$, the padding $P$, and the filter size $F$:

\frac{W - F + 2P}{S} + 1

This is very important! It’s annoying, but it’s really helpful. As a memorization trick, you can say “fireless wood and two people in the sea”.

Note that if we plug in the padding definition for same-size activations, we would get

\frac{W - F + F - S}{S} + 1 = \frac{W - S}{S} + 1 = \frac{W}{S}

which is exactly $W$ when the stride is 1.
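As a sanity check, here’s a tiny (hypothetical) helper that implements this formula, plus the divisibility requirement mentioned earlier:

```python
def conv_output_size(W, F, P, S):
    """Output spatial size of a convolution: (W - F + 2P) / S + 1.

    Raises if the stride does not evenly tile the padded input, which is
    the "good number" check mentioned above.
    """
    span = W - F + 2 * P
    if span % S != 0:
        raise ValueError(f"stride {S} does not tile input {W} "
                         f"with filter {F} and padding {P}")
    return span // S + 1

# Same-size output with F=3, S=1, P=1 on a 32-wide image,
# and the AlexNet-style 227 -> 55 case used later in these notes.
assert conv_output_size(32, 3, 1, 1) == 32
assert conv_output_size(227, 11, 0, 4) == 55
```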

Sharing parameters

So let’s go back to the convolution layer’s connectivity.

We’ve already lost some complexity by having each neuron depend on a small window of the input. We can make it even more space-efficient by sharing weights. In other words, all neurons in this particular layer share the same convolutional filter weights. The key insight here is that image objects can shift, so the location in the image doesn’t tell us too much about what to expect. Therefore, we can use the same weights.

This is why we refer to the weights as filters or kernels.

Sometimes, however, there is a specific structure and you can just remove the weight-sharing aspect of the convolution. For example, if you are given pictures of centered faces, you would benefit from having each neuron determine its own filter. This is known as a locally-connected layer.

With sharing, you have $F^2 D$ weights per filter (where $D$ is the input depth); without parameter sharing, you have roughly $F^2 W^2 D$, since every spatial location gets its own filter.

Review of dimensions

Weights are $[out, in, F, F]$.

The $F$ is the size of the filter, which is convolved with the image; out is the number of filter sets we have, and in is the depth of the input stack. You convolve by taking the dot product of an $[in, F, F]$ filter with the matching chunk of the input. Each output channel is therefore independent of the other output channels, but it depends on every input channel.
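A minimal sketch of this forward pass, assuming stride 1 and no padding (NumPy; the nested loops are for clarity, not speed):

```python
import numpy as np

def conv_forward_naive(x, w, b):
    """Naive convolution forward pass.

    x: input of shape (N, C_in, H, W)
    w: filters of shape (C_out, C_in, F, F)
    b: biases of shape (C_out,)
    Returns output of shape (N, C_out, H - F + 1, W - F + 1).
    """
    N, C_in, H, W = x.shape
    C_out, _, F, _ = w.shape
    H_out, W_out = H - F + 1, W - F + 1
    out = np.zeros((N, C_out, H_out, W_out))
    for n in range(N):
        for co in range(C_out):
            for i in range(H_out):
                for j in range(W_out):
                    # dot product between one [C_in, F, F] filter and the
                    # matching [C_in, F, F] chunk of the input, plus a bias
                    window = x[n, :, i:i + F, j:j + F]
                    out[n, co, i, j] = np.sum(window * w[co]) + b[co]
    return out
```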

Special convolutions

1 x 1 convolution

Sometimes we see a $1 \times 1$ convolution, i.e. a window that is literally a single pixel. This sounds dumb, but remember that the depth of an activation is fully connected. So you can imagine this as a mapping from depth $d$ to $d'$ that is fully connected. As a practical example, a $1 \times 1$ convolution could map something like $32 \times 32 \times 30$ to $32 \times 32 \times 90$.
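A quick sketch of that view, using the $32 \times 32 \times 30 \to 32 \times 32 \times 90$ example (NumPy; the shapes and names are just for illustration):

```python
import numpy as np

# A 1x1 convolution is a fully-connected map applied independently at every
# pixel: reshape the depth columns into rows and matrix-multiply.
N, C_in, H, W = 2, 30, 32, 32
C_out = 90
x = np.random.randn(N, C_in, H, W)
w = np.random.randn(C_out, C_in)          # 1x1 filters, one row per output channel

cols = x.transpose(0, 2, 3, 1).reshape(-1, C_in)    # (N*H*W, C_in) depth columns
out = (cols @ w.T).reshape(N, H, W, C_out).transpose(0, 3, 1, 2)
assert out.shape == (N, C_out, H, W)      # 32x32x30 -> 32x32x90 per image
```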

Dilated convolution

Sometimes we also see a dilated convolution. This is basically a convolution window with gaps in it. You increase the effective receptive field without increasing the complexity.

Upsampling (transposed convolution)

The idea is basically that each input element is multiplied by the filter (scalar multiplication), and then the scaled filters are tiled across the output. Where they overlap, you sum the values. A 1D sketch of this is given below.

The reason why we call this a transposed convolution is that, written as a matrix multiplication, you essentially multiply by the transpose of the forward convolution’s weight matrix.

The tl;dr is that through this, you can turn a small activation into a larger activation through a learned upsampling process.
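Here’s a minimal 1D sketch of that scale-tile-and-sum description, with the stride as a parameter (NumPy; the shapes are just for illustration):

```python
import numpy as np

def transposed_conv1d(x, w, stride=2):
    """Naive 1D transposed convolution.

    Each input element scales the filter; the scaled copies are placed
    `stride` apart and overlapping values are summed.
    """
    n, k = len(x), len(w)
    out = np.zeros((n - 1) * stride + k)
    for i, xi in enumerate(x):
        out[i * stride:i * stride + k] += xi * w
    return out

# x = [a, b], w = [1, 2, 3], stride 2 -> [a, 2a, 3a + b, 2b, 3b]
print(transposed_conv1d(np.array([1.0, 10.0]), np.array([1.0, 2.0, 3.0])))
# [ 1.  2. 13. 20. 30.]
```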

Pooling layer

Thanks to padding and small-sized windows, each convolution doesn’t shrink the width and the height much. So, if we want to end up with a reasonably sized output, we need to pool the image.

The pooling layer is essentially a downsampling layer. It might take a $k \times k$ tile and find the average, or the maximum, and then replace this whole tile with that one pixel.

Backpropagation

If we are using the maximum, then the gradient is shuttled to only one of the $k^2$ pixels in the window. If we are using the average (or sum), then the gradient is distributed evenly.
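A minimal sketch of both directions for max pooling, assuming a single 2D slice and non-overlapping $k \times k$ tiles (NumPy):

```python
import numpy as np

def maxpool_forward_backward(x, dout, k=2):
    """Max pooling over non-overlapping k x k tiles of a 2D array, plus the
    backward pass that routes each upstream gradient to the argmax pixel.
    """
    H, W = x.shape
    out = np.zeros((H // k, W // k))
    dx = np.zeros_like(x)
    for i in range(H // k):
        for j in range(W // k):
            tile = x[i * k:(i + 1) * k, j * k:(j + 1) * k]
            out[i, j] = tile.max()
            # the gradient is shuttled to only one of the k^2 pixels
            r, c = np.unravel_index(tile.argmax(), tile.shape)
            dx[i * k + r, j * k + c] = dout[i, j]
    return out, dx
```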

Unpooling

You can just do nearest-neighbor unpooling or “bed of nails” unpooling. Nearest neighbor treats the gradient reasonably, but these all seem rather primitive.

You can do max unpooling, which means you save the locations of the maximums during pooling and then put the values back where they belong.

Pooling controversy

People are trying to move away from pooling and instead use convolutions with larger strides to reduce the width and height. This has been found to perform better.

Normalization layer

Sometimes, you get bad inputs to a layer, which may require very different magnitude weights or a very large bias. To fix this, you can normalize.

Normalization is added after fully-connected or convolutional layers and before the non-linearity (because the non-linearity can succumb to poor scaling).

Batchnorm and layernorm are a form of regularization, too: they prevent one point or one feature from playing center stage. Therefore, you should be careful about combining them with a strong regularizer. Batchnorm also helps stabilize training by reducing the network’s sensitivity to the scale of its weights.

Batchnorm on FC

The idea is that for each channel, we normalize across the batch.

For fully-connected networks, this means that we take the mean and the sigma across the batch dimension $N$. If $X$ is $n \times d$, then $\mu, \sigma$ would be $1 \times d$.

\hat x_{ij} = \frac{x_{ij} - \mu_j}{\sqrt{\sigma_j^2 + \epsilon}}

Sometimes, normalization can be too strong. We can fix this by adding learnable parameters $\gamma, \beta$ for each feature, where $\gamma$ is the scaling and $\beta$ is the offset. This allows the network to reconstruct some of the original, un-normalized signal.

y_{ij} = \gamma_j \hat x_{ij} + \beta_j

See how if $\gamma = \sigma, \beta = \mu$, we would reconstruct the original input.

The parameters $\mu$ and $\sigma$ are derived from the training batch. However, during testing, the layer keeps track of a running average of $\mu, \sigma$ that is used instead. Just make sure to switch the layer from training to test mode, or else you will get weird results.
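A minimal sketch of the fully-connected case, assuming we track the running statistics in place (NumPy; names like `running_mu` are just for illustration):

```python
import numpy as np

def batchnorm_fc_forward(x, gamma, beta, running_mu, running_var,
                         momentum=0.9, eps=1e-5, training=True):
    """Batchnorm over an (N, D) input: per-feature statistics across the batch.

    In training mode the batch statistics are used and folded into the running
    averages; in test mode the running averages are used instead.
    """
    if training:
        mu = x.mean(axis=0)                 # shape (D,)
        var = x.var(axis=0)                 # shape (D,)
        running_mu[:] = momentum * running_mu + (1 - momentum) * mu
        running_var[:] = momentum * running_var + (1 - momentum) * var
    else:
        mu, var = running_mu, running_var
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```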

This $\gamma, \beta$ deal exists with layernorm too, but because layernorm’s statistics are computed per example rather than per feature, the per-feature $\gamma, \beta$ can’t “undo” the layernorm effect the way they can with batchnorm.

Gradient through batchnorm, layer norm

For simplicity, let us consider a single feature across a batch. You can compute the batchnorm gradient through the simple chain rule

\frac{dL}{dx_i} = \sum_j^N \frac{dL}{dy_j}\frac{dy_j}{dx_i}

If you expand this (by hand or with a computational graph), you’ll get a long expression that simplifies down to something like

\frac{dL}{dx_i} = \frac{1}{N\sqrt{\sigma^2 + \epsilon}}\left( N\,\frac{dL}{d\hat x_i} - \sum_{j=1}^N \frac{dL}{d\hat x_j} - \hat x_i \sum_{j=1}^N \frac{dL}{d\hat x_j}\,\hat x_j \right)

(note that we don’t include $\beta, \gamma$ here for simplicity’s sake; with them, $\frac{dL}{d\hat x_j} = \gamma\,\frac{dL}{dy_j}$).

You can always add another dimension to do the batchnorm with more than one feature, because the features are independent anyways.

You can also compute the batchnorm gradient by building the full computation graph and using traditional node-by-node backprop. It ends up giving the same thing.
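Here’s a minimal NumPy sketch of the simplified formula above, assuming the forward pass cached the normalized values $\hat x$ and the per-feature variance:

```python
import numpy as np

def batchnorm_backward(dy, x_hat, gamma, var, eps=1e-5):
    """Backward pass for batchnorm over an (N, D) input, given the cached
    normalized values x_hat and per-feature variance from the forward pass.
    """
    dgamma = np.sum(dy * x_hat, axis=0)
    dbeta = np.sum(dy, axis=0)
    dx_hat = dy * gamma
    # the simplified formula discussed above, applied per feature
    dx = (dx_hat - dx_hat.mean(axis=0)
          - x_hat * np.mean(dx_hat * x_hat, axis=0)) / np.sqrt(var + eps)
    return dx, dgamma, dbeta
```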

Extending Batchnorm to convolution

It’s actually pretty simple. You can imagine picking a dimension to “save”, i.e. to prevent the normalization operation from touching. Batchnorm saves the channel, so each channel gets its own mean and STD.

As a dimensionality check, $X$ is of shape $N \times C \times H \times W$, and the norm will define $\mu, \sigma \in 1 \times C \times 1 \times 1$.

Other forms of normalization

A layer normalization will take the average of the data and leave the batch alone. In other words, for a FC network and an input of $N \times D$, we’d take the mean and STD across the second axis, yielding $\mu, \sigma \in N \times 1$.

For convolutions, it’s the same deal; we keep the $N$, which means that $\mu, \sigma \in N \times 1 \times 1 \times 1$.

instance normalization saves both the batch and channel axes, which means that it’s only relevant for convolutions. Your mean and std are shape $N \times C \times 1 \times 1$.

group normalization is when you average across a group of channels, but not all of them.
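A quick sketch of which axes each variant averages over, for an $N \times C \times H \times W$ input (NumPy; only the means are shown, the standard deviations work the same way):

```python
import numpy as np

x = np.random.randn(8, 16, 32, 32)   # (N, C, H, W)

# batchnorm: per-channel stats, averaged over batch and space
mu_bn = x.mean(axis=(0, 2, 3), keepdims=True)     # (1, C, 1, 1)
# layernorm: per-sample stats, averaged over channels and space
mu_ln = x.mean(axis=(1, 2, 3), keepdims=True)     # (N, 1, 1, 1)
# instance norm: per-sample, per-channel stats
mu_in = x.mean(axis=(2, 3), keepdims=True)        # (N, C, 1, 1)
# group norm: split channels into groups, average within each group
G = 4
xg = x.reshape(8, G, 16 // G, 32, 32)
mu_gn = xg.mean(axis=(2, 3, 4), keepdims=True)    # (N, G, 1, 1, 1)
```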


Converting FC to Convolution

Because CNNs are just blocks of locally-connected networks, you can implement a convolution through a FC network (not efficiently, though!).

You can also turn a FC layer into a convolution layer. This is actually a pretty straightforward, degenerate case. If you had some input volume of $7 \times 7 \times 512$ and you wanted an output of $4096$ values, then you make $4096$ filters of size $7 \times 7$ (each spanning all $512$ channels), which will output a $1 \times 1 \times 4096$ activation.
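A small sketch of the equivalence, using the (hypothetical) $7 \times 7 \times 512 \to 4096$ shapes from above: the FC weight matrix is just the conv filters flattened.

```python
import numpy as np

x = np.random.randn(512, 7, 7)                       # one input volume
w_fc = np.random.randn(4096, 512 * 7 * 7)            # FC weight matrix
w_conv = w_fc.reshape(4096, 512, 7, 7)               # same numbers as filters

fc_out = w_fc @ x.reshape(-1)                        # (4096,)
conv_out = np.einsum('ochw,chw->o', w_conv, x)       # one 7x7 "window" -> (4096,)
assert np.allclose(fc_out, conv_out)
```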

Why is this helpful?

Well, it’s a subtle but crucial advantage. It allows for robustness to the image size! If you put in a larger image to a model that uses convolutions as FC’s, then the model still runs. It just outputs a bunch of predictions at once.

These predictions can be understood as the predictions made by the original network (meant for smaller images) sliding over the larger image like a normal convolution would, and generating predictions for each of these crops. Therefore, the number of predictions can be seen as an ensemble, and you can just take their majority or average to get a good prediction!

The equivalent sliding window moves with a stride determined by the network’s total downsampling (its strides and pooling), not by the image size. If you want smaller effective strides, you can feed the network the same image several times with slight offsets along width and height.

Fine Tuning

If you have a different visual task, you can consider training from scratch. You can also try fine-tuning a pretrained network. The idea here is that the pretrained model will probably have some generalizable structures at the beginning of the network, like edge detection, etc. Therefore we just need to modify the last few layers to be task-specific.

If you have very little data, then use a linear classifier on top of some final layer.

If you have a lot of data, finetune a few of the original model’s layers, and if you have even more data, you can finetune a larger number of layers.

However, training from scratch often still works; it just takes longer. You need to be careful when you’re fine-tuning. Typically you should use a learning rate that is 10 times lower than standard training LR.
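As a rough sketch (assuming PyTorch and a torchvision ResNet; the exact weight-loading flag varies by torchvision version), freezing the backbone and training a new head looks something like this:

```python
import torch
import torch.nn as nn
from torchvision import models

# Hypothetical fine-tuning setup; treat this as a sketch, not a recipe.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)  # pretrained backbone
for p in model.parameters():
    p.requires_grad = False                     # freeze the early, generic layers

num_classes = 10                                # hypothetical new task
model.fc = nn.Linear(model.fc.in_features, num_classes)   # new, trainable head

# Only the new head trains here; unfreeze later blocks too if you have more
# data, and keep the learning rate roughly 10x lower than a from-scratch run.
optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-3, momentum=0.9)
```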

Implementing Convolutions

Convolutions as matrix multiplications

We do this from one insight: the dot product is exactly the filter operation! So we just need to “linearize” the image somehow. We do this through an im2col algorithm, which extracts all the “views” in the image and linearizes them so that each view is a column. Let’s call this matrix $X$.

We stretch out each filter into a row, to create a row matrix $W$.

Then, we simply compute $WX$, which evaluates the dot product between every filter and every receptive field location. The output gives us the value of the convolution at each point, for each filter.

To understand this, let’s look at a quick example.

  1. Suppose you had a $227 \times 227 \times 3$ image, and your filter bank has size $11 \times 11 \times 3 \times 96$ (i.e. 96 filters). This means that each viewing window has $11 \cdot 11 \cdot 3 = 363$ values.
  2. At a stride of $4$, you can calculate that you have $55$ locations lengthwise and heightwise, yielding $3025$ total locations. This means that your image matrix is $363 \times 3025$.
  3. Now, your filter matrix is $96 \times 363$.
  4. The matrix multiplication yields $96 \times 3025$, which is the number of filters by the number of views. Nice! (A sketch of this is shown below.)
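Here’s a minimal im2col sketch that reproduces those numbers (NumPy; loops for clarity):

```python
import numpy as np

def im2col(x, F, stride):
    """Stack every F x F receptive-field "view" of x (shape C x H x W) as a
    column, so that a convolution becomes a single matrix multiply.
    """
    C, H, W = x.shape
    H_out = (H - F) // stride + 1
    W_out = (W - F) // stride + 1
    cols = np.zeros((C * F * F, H_out * W_out))
    for i in range(H_out):
        for j in range(W_out):
            patch = x[:, i * stride:i * stride + F, j * stride:j * stride + F]
            cols[:, i * W_out + j] = patch.reshape(-1)
    return cols

# The example above: 227x227x3 image, 96 filters of size 11x11x3, stride 4.
x = np.random.randn(3, 227, 227)
w = np.random.randn(96, 3, 11, 11)
X = im2col(x, F=11, stride=4)          # (363, 3025)
W_mat = w.reshape(96, -1)              # (96, 363)
out = W_mat @ X                        # (96, 3025), one row per filter
assert out.shape == (96, 3025)
```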

Backpropagation: intro

If you implement convolution as a matrix multiplication, then the backprop should be simple. But in the naive implementation where you are literally taking inner products over the different parts of the image, we actually have a very nice emergent pattern.

Backpropagation: getting $dL/dW$

It’s best to try a one-dimensional convolution to get a feel for it (i.e. a vector filter and a vector $x$). You will start to notice that $dL/dW$ is just $X$ convolved with $dL/dY$. This assumes there are no batches and only one output channel.

Backpropagation: getting $dL/dX$

You will also notice that $dL/dX$ is just a padded version of $W$ convolved with a modified version of $dL/dY$, under the same assumptions. The key modification is that you rotate $dL/dY$ by 180 degrees. The cleanest way to see why is to write out the chain rule on a small example and watch the pattern emerge.

As for that $W$, you add as much padding as you need to make the output of the convolution equal to the dimensions of $X$. Mathematically, the padding on each side is the size of $X$ minus the size of $W$ (the edge length, not the number of elements). If you padded your $X$ in the forward pass, you will get a gradient the size of the padded input; just disregard the gradients that fall beyond the edges of the original image, which is totally fine.
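A minimal 1D sketch, assuming stride 1, a single channel, and no batching, that checks both patterns numerically:

```python
import numpy as np

def conv1d(a, b):
    """Valid 1D sliding dot product: slide b over a, output size len(a)-len(b)+1."""
    n, k = len(a), len(b)
    return np.array([np.dot(a[i:i + k], b) for i in range(n - k + 1)])

x = np.random.randn(8)
w = np.random.randn(3)
y = conv1d(x, w)
dy = np.random.randn(len(y))            # pretend upstream gradient dL/dY

# dL/dW: convolve the input with the upstream gradient
dw = conv1d(x, dy)

# dL/dX: pad W by (len(x) - len(w)) on each side, convolve with flipped dL/dY
w_pad = np.pad(w, len(x) - len(w))
dx = conv1d(w_pad, dy[::-1])

# numerical check against the chain rule written out elementwise
dx_ref = np.zeros_like(x)
dw_ref = np.zeros_like(w)
for i in range(len(y)):
    dx_ref[i:i + len(w)] += dy[i] * w
    dw_ref += dy[i] * x[i:i + len(w)]
assert np.allclose(dw, dw_ref) and np.allclose(dx, dx_ref)
```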

Backpropagation: dealing with high strides

If you are dealing with larger stride sizes, just know that strides in the forward pass turn into dilations in the backward pass, i.e. convolutions with holes in between each element. For example, with a stride of 2, a $2 \times 2$ array is dilated as

\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 0 & 2 \\ 0 & 0 & 0 \\ 3 & 0 & 4 \end{bmatrix}

The dilation is applied to $dL/dY$ in the computations of both $dL/dW$ and $dL/dX$.
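A tiny sketch of that zero-insertion, matching the $2 \times 2$ example above (NumPy):

```python
import numpy as np

def dilate(dy, stride):
    """Insert (stride - 1) zeros between the elements of a 2D upstream
    gradient, undoing the "skipping" that the forward stride introduced.
    """
    H, W = dy.shape
    out = np.zeros(((H - 1) * stride + 1, (W - 1) * stride + 1))
    out[::stride, ::stride] = dy
    return out

print(dilate(np.array([[1., 2.], [3., 4.]]), stride=2))
# [[1. 0. 2.]
#  [0. 0. 0.]
#  [3. 0. 4.]]
```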

Backpropagation: adding channels and batches

So far, we’ve been talking about the abstract 2D setting where $X$, $Y$, and $W$ are all 2D matrices. In real life, this is not the case.

For some context: $dL/dY$ has the form $[batch, out, H, W]$, $X$ has the form $[batch, in, H, W]$, and $W$ has the form $[out, in, F, F]$. This helps us formulate what we intend to do with batches and depths.

Let’s start with $dL/dW$. We want to fill this 4D tensor with gradients. Each batch element contributes to this gradient through summation (there’s no slice in $dL/dW$ for the batch). So, we can compute each batch element separately and sum; this reduces $X$ to a 3D tensor.

We are getting there, but there’s more work to be done. The first remaining axis on $dL/dY$ and the first axis on $W$ are both output channels, and there is an independent set of weights for each output channel. Therefore, you can compute $dL/dW$ separately for each output channel. This means that we need to fill a 3D slice of $dL/dW$ each time, using a 2D $dL/dY$ convolved with the 3D tensor $X$.

Here, we note that the per-channel $W$ is also a 3D tensor. Because we are not convolving along the third (depth) dimension, that dimension can be separated. In other words, you can imagine splitting $W$ into individual 2D filters and splitting $X$ into the individual 2D slices those filters sweep over. Then, you do 2D convolutions with the 2D $dL/dY$ that we previously isolated: each convolution uses the same $dL/dY$ but a different input-channel slice of $X$. At the end, we stack these to get the per-output-channel $dL/dW$, which at this point is a 3D tensor. We stack the output channels to get the 4D tensor, and we sum across batch elements to get our final result.

So we’ve just gotten the approach for $dL/dW$. Now, we can work on getting $dL/dX$. We can still decompose things by batch element, because $X$ is sliced by batch. Our eventual goal is to convolve a padded $W$ with a rotated version of $dL/dY$.

Like before, we can separate each output channel and compute it separately. But note that $dL/dX$ doesn’t have slices for output channels, because there’s a divergence effect: each element of $X$ contributes to multiple output channels, and the gradients all flow back to that one spot. Therefore, we compute a gradient for each output channel individually, but we sum these contributions into $dL/dX$.

At this point, we are iterating through batch elements and output channels, which means that we are left with a 2D matrix for $dL/dY$. This is what you modify with dilation and rotation.

Because we are iterating through output channels, we have a 3D tensor for $W$. Again, we come across the realization that each depth layer of $W$ influences the corresponding depth layer of $X$ independently, which means the gradient with respect to $X$ can also be segmented by depth layer. So we can imagine splitting $W$ into its stack of 2D filters and performing the 2D convolutions with the modified $dL/dY$. At this point, we have a set of 2D matrices that we stack to form the third dimension of input channels. Now, we sum all contributions (i.e. the other 3D gradient tensors) from the different output channels; this doesn’t add another dimension, but it is critical for correct gradient flow. Finally, we stack across batch elements to get the finished $dL/dX$.

This is very tricky! There are a lot of moving parts.
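To keep it grounded, here’s a naive reference implementation for the stride-1, no-padding case; it accumulates the chain rule directly with loops rather than using the decomposed 2D convolutions described above, but the summations over batch elements (for $dL/dW$) and over output channels (for $dL/dX$) are the same ones:

```python
import numpy as np

def conv_backward_naive(dout, x, w):
    """Naive gradients for the stride-1, no-padding convolution forward pass.

    dout: upstream gradient, shape (N, C_out, H_out, W_out)
    x:    input,             shape (N, C_in, H, W)
    w:    filters,           shape (C_out, C_in, F, F)
    """
    N, C_out, H_out, W_out = dout.shape
    F = w.shape[2]
    dx = np.zeros_like(x)
    dw = np.zeros_like(w)
    db = dout.sum(axis=(0, 2, 3))
    for n in range(N):              # batch elements: summed into dw, sliced in dx
        for co in range(C_out):     # output channels: sliced in dw, summed into dx
            for i in range(H_out):
                for j in range(W_out):
                    window = x[n, :, i:i + F, j:j + F]
                    dw[co] += dout[n, co, i, j] * window
                    dx[n, :, i:i + F, j:j + F] += dout[n, co, i, j] * w[co]
    return dx, dw, db
```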