What is a CNN?

A CNN is a neural network which classifies images by performing kernel convolutions in its hidden layers.

Convolutions

A convolution is a mathematical operation on two functions that produces a third function, which can be seen as a modified version of one of the originals: it gives the integral of the pointwise multiplication of the two functions as a function of how far one of them is translated. [2]

Convolution is defined as: $$ (f * g )(t) = \int_{0}^{t} f(\tau)\, g(t - \tau)\, d\tau\ \mathrm{for} \ \ f, g : [0, \infty) \to \mathbb{R} $$ where \( (f * g )(t) \) represents the convolution of \(f\) on \(g\). For discrete functions this is analogous to: $$ (f * g)[n]\ \stackrel{\mathrm{def}}{=}\ \sum_{m=-\infty}^\infty f[m]\, g[n - m] $$
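The discrete definition above can be implemented directly. A minimal sketch in Python (the function name is illustrative); for finite signals it agrees with numpy's built-in `np.convolve`:

```python
import numpy as np

def discrete_convolution(f, g):
    """(f * g)[n] = sum_m f[m] * g[n - m], for finite discrete signals."""
    n_out = len(f) + len(g) - 1
    out = np.zeros(n_out)
    for n in range(n_out):
        for m in range(len(f)):
            if 0 <= n - m < len(g):  # skip terms where g is undefined
                out[n] += f[m] * g[n - m]
    return out

f = np.array([1.0, 2.0, 3.0])
g = np.array([0.0, 1.0, 0.5])
print(discrete_convolution(f, g))  # same result as np.convolve(f, g)
```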

CNN

So, putting those together, a CNN tries to find features in a new image by performing a convolution of the feature matrix over the image. This results in a new map showing where in the image the feature is found.
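Sliding a feature over an image can be sketched in a few lines of numpy. (Note that CNN libraries typically compute cross-correlation, i.e. the kernel is not flipped, as below; the image and feature values here are made up for illustration.)

```python
import numpy as np

def feature_map(image, feature):
    """Slide a small feature matrix over an image (valid cross-correlation)
    and record the match score at each position."""
    ih, iw = image.shape
    fh, fw = feature.shape
    out = np.zeros((ih - fh + 1, iw - fw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + fh, x:x + fw] * feature)
    return out

# A vertical-edge feature lights up where the image has a vertical edge.
img = np.array([[0, 0, 1, 1],
                [0, 0, 1, 1],
                [0, 0, 1, 1]], dtype=float)
edge = np.array([[-1.0, 1.0],
                 [-1.0, 1.0]])
print(feature_map(img, edge))  # strongest response along the middle column
```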

Each layer will use various nodes, which are really feature maps, from the previous layer to compute more complex features.

Typically, in the hidden layers, there are 3 types of layers - convolutional layers, pooling layers, and activation layers.

  • Convolutional Layers - this is where the convolutions happen.
  • Pooling layers - used to take large images and make them smaller while preserving the most important information.
  • Activation layers - a layer which performs an activation process to ensure the feature does not get lost at deeper layers. Often this is ReLU (Rectified Linear Units), which keeps positive values unchanged and sets negative values to 0. [3]

One final type of layer is a fully connected layer. A CNN will have at least one of these at the end and these are responsible for converting the output of the convolution, pooling and activation layers into votes for a particular class.

Inside a CNN

As mentioned above, a CNN consists of an input layer, a series of hidden layers, one or more fully connected layers and an output layer. Below we can see that a CNN will have more than one of each of the hidden layers; however, they may not always be stacked as convolution, ReLU, pooling, because there are often times when you wish to perform two convolutions in a row without changing the size of the image.

The output of the CNN is a set of probabilities of the input belonging to each class. The highest value is taken, and it is said that the CNN is, in this example, 92% sure it is an X.
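Turning raw class votes into probabilities is typically done with a softmax, and the predicted class is the argmax. A minimal sketch (the vote values are hypothetical):

```python
import numpy as np

def softmax(votes):
    """Turn raw class votes into probabilities that sum to 1."""
    e = np.exp(votes - np.max(votes))  # subtract the max for numerical stability
    return e / e.sum()

votes = np.array([1.0, 3.5, 0.2])  # hypothetical scores for 3 classes
probs = softmax(votes)
print(probs)           # probabilities summing to 1
print(probs.argmax())  # index of the most likely class
```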

Looking inside the hidden layers of a CNN [3]

Training

We have explained how a CNN classifies an image, but how does it know which features to look for and what the weights should be for each output?
We need to train the network, and this is done using labelled data and backpropagation.

Backpropagation

To train the CNN, we first start with an untrained CNN, where every pixel in all features and every weight in the fully connected layers are set to random values.
Next, each image is fed in one by one and the error between the CNN's output and the label is calculated. The lower this error, the better our CNN is at classifying that particular image. The features and the weights are adjusted up or down according to the amount of error, and the values which produce a lower error are kept.
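The "adjust according to the error" step is gradient descent. A toy illustration on a single weight rather than a full CNN (all values here are made up: we fit one weight w so that the prediction w * x matches the label):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=100)
labels = 3.0 * x          # "labelled data": the true weight is 3
w = rng.normal()          # start from a random weight, as an untrained CNN does
lr = 0.5                  # learning rate

for _ in range(200):
    pred = w * x
    error = pred - labels
    grad = 2 * np.mean(error * x)  # derivative of mean squared error w.r.t. w
    w -= lr * grad                 # nudge the weight to reduce the error

print(round(w, 3))  # converges close to the true weight, 3.0
```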

As more and more images are fed in, the CNN becomes better at classification; individual differences between images of the same class are quickly forgotten, and features which are prevalent throughout are kept.

Performing the actual backpropagation for the features requires calculating partial derivatives of the error with respect to the last convolutional layer to update its features, and then passing another partial derivative back to the previous layer.

Using CNNs for multiple object classification

CNNs are very good at image classification, but a lot of medical image analysis does not require classifying a single image; rather, it requires classifying objects or regions of an image in order to highlight lesions.

There has been a lot of research into how best to do this. CNNs had traditionally been used for whole-image classification; however, in 2014 a paper [4] on object detection using CNNs described a technique which enabled CNNs to accurately isolate objects in images. The architecture was titled R-CNN (Regions with CNNs) [5].

R-CNN

R-CNN uses a process called selective search to create regions of different sizes it wishes to check for objects.

After creating the proposals, R-CNN makes each region square by warping it, runs it through a modified version of AlexNet to determine whether it is a valid region, and then uses an SVM to perform the classification.

Finally, R-CNN will try to improve the bounding boxes using linear regression.

There are 2 main reasons why R-CNN can be slow:

  • Every region proposal has to be passed through the CNN to determine if it is valid
  • There are 3 models which need training (CNN for features, classifier and regression model)

2015 saw the author of R-CNN release a new paper on an architecture called Fast R-CNN, which improved on R-CNN and solved both of the problems above.

How did he achieve this?

Solving the first issue involved making a single pass of the image through the CNN and then obtaining the features for a region by selecting that region on the CNN's feature map, reducing the number of passes to one.
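The single-pass trick can be sketched in numpy. All names and sizes below are illustrative: the "feature map" stands in for the output of one CNN pass, and the cropping plus fixed-size pooling is a crude version of Fast R-CNN's RoI pooling.

```python
import numpy as np

# Pretend this is the CNN's feature map from one pass over the whole image.
shared_fmap = np.arange(64, dtype=float).reshape(8, 8)

def roi_features(fmap, roi, out_size=2):
    """Crop a region (y0, x0, y1, x1) from the shared feature map and
    max-pool it to a fixed out_size x out_size grid."""
    y0, x0, y1, x1 = roi
    crop = fmap[y0:y1, x0:x1]
    h, w = crop.shape
    ys = np.array_split(np.arange(h), out_size)  # row bins
    xs = np.array_split(np.arange(w), out_size)  # column bins
    return np.array([[crop[np.ix_(yb, xb)].max() for xb in xs] for yb in ys])

# Two proposals of different sizes both reduce to fixed 2x2 features,
# so they can be fed to the same classifier without re-running the CNN.
print(roi_features(shared_fmap, (0, 0, 4, 4)))
print(roi_features(shared_fmap, (2, 2, 8, 6)))
```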

Solving the second issue meant jointly training the CNN, classifier and regressor as a single model. Now only one network needs to be trained, as you can see in the image above.

In 2016, an improvement to how region proposals are made led to Faster R-CNN. In 2017, Mask R-CNN improved on this and enabled object segmentation as well as classification.

Impact on Medical Image analysis

Quantitative assessment of lesions provides valuable information for analysis, but it requires accurate lesion segmentation in multi-modal 3D images, which is hard. Variability in location, size, shape, and frequency makes it difficult to devise effective segmentation rules. [6]

A lot of 3D analysis is done by applying a 2D CNN to each individual slice of the 3D volume. The results from this are promising.

Fully 3D CNNs are computationally expensive due to the 3D convolutions; therefore inference (gaining meaning from the data) is very slow. However, in contrast to the 2D/3D hybrids, fully 3D CNNs can use a technique called dense inference, which greatly reduces the time to meaningful inference.

A CNN's performance is influenced by the data it is trained on. A common approach is training on equally sampled images from each class; however, this can bias the classifier towards rare classes and lead to over-segmentation. To counter this, two-stage training has been introduced, where a second CNN is trained on a distribution similar to the real one but oversampling pixels which were incorrectly classified by the first. This can lead to overfitting.