Classify STL-10 images using CNN


Convolutional Neural Networks (CNNs) are variants of multi-layer perceptrons (MLPs) that are inspired by biology. From Hubel and Wiesel's early work on the cat's visual cortex, we know that there exists a complex arrangement of cells within the visual cortex. These cells are sensitive to small sub-regions of the input space, called receptive fields, and are tiled in such a way as to cover the entire visual field. The cells act as filters that are local in input space and are thus well suited to exploiting the strong spatially local correlation present in natural images.


Figure 1: Hubel and Wiesel’s early work on cat’s visual cortex

Another motivation for using CNNs is computational cost. In the sparse auto-encoder, one design choice we made was to "fully connect" all hidden units to all the inputs. On relatively small images (e.g. 28x28 images for the MNIST dataset), it was computationally feasible to learn features on the entire image. However, with larger images (e.g. 96x96 images), learning features that span the entire image (fully connected networks) is very computationally expensive: you would have about $10^{4}$ input units, and assuming you want to learn just 100 features, you would have on the order of $10^{6}$ parameters to learn. The feed-forward and back-propagation computations would also be roughly 10 times slower than for 28x28 images, since a 96x96 image has about 12 times as many pixels.
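The parameter counts above can be checked with a few lines of arithmetic. This is only a sketch: it assumes single-channel (grayscale) images and counts input-to-hidden weights only, ignoring biases. The function name `fully_connected_weights` is made up for illustration.

```python
def fully_connected_weights(image_side: int, hidden_units: int) -> int:
    """Input-to-hidden weights when every hidden unit sees every pixel
    of a square, single-channel image (biases not counted)."""
    return image_side * image_side * hidden_units


small = fully_connected_weights(28, 100)  # 784 inputs, 100 hidden units
large = fully_connected_weights(96, 100)  # 9216 inputs, 100 hidden units

print(small)                     # 78400
print(large)                     # 921600
print(round(large / small, 1))   # 11.8
```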


Figure 2: Fully Connected Networks


Figure 3: Locally Connected Networks (shared weights)

Thus, one simple solution to this problem would be to restrict the connections between the hidden units and the input units, allowing each hidden unit to connect to only a small subset of the input units. Specifically, each hidden unit will connect to only a small contiguous region of pixels in the input.
More precisely, having learned features over small (say 8x8) patches, sampled randomly from the larger image, we can then apply this learned 8x8 feature detector anywhere in the image. Specifically, we can take the learned 8x8 features and convolve them with the larger image, thus obtaining a different feature activation value at each location in the image.
Let's start by learning features over small 8x8 patches sampled randomly from the larger images. Here are some sample images from the reduced STL-10 dataset:


Figure 4: Reduced STL-10 dataset. Contains 4 classes: airplane, car, cat, and dog.
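Sampling random 8x8 patches from the larger images might look like the sketch below. It assumes grayscale images stored as a NumPy array of shape (num_images, 64, 64) and uses random data in place of the real dataset. The helper `sample_patches` is hypothetical and not part of any library.

```python
import numpy as np


def sample_patches(images, patch_dim, num_patches, rng):
    """Cut `num_patches` square patches at random positions from random images."""
    num_images, height, width = images.shape
    patches = np.empty((num_patches, patch_dim, patch_dim))
    for p in range(num_patches):
        i = rng.integers(num_images)              # which image
        r = rng.integers(height - patch_dim + 1)  # top row of the patch
        c = rng.integers(width - patch_dim + 1)   # left column of the patch
        patches[p] = images[i, r:r + patch_dim, c:c + patch_dim]
    return patches


rng = np.random.default_rng(0)
images = rng.random((10, 64, 64))  # stand-in for real images
patches = sample_patches(images, patch_dim=8, num_patches=1000, rng=rng)
print(patches.shape)  # (1000, 8, 8)
```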

Here are the features learned over the 8x8 patches, using a sparse auto-encoder with a linear decoder on the output layer:


Figure 5: 400 features learned over 8×8 patches.

We now convolve each of these 400 filters/features with the whole image to produce a convolved image, or feature map:


Figure (a): Image 3 convolved with 2nd feature


Figure (b): Image 3 convolved with 3rd feature


Figure (c): Image 39 convolved with 22nd feature


Figure (d): Image 39 convolved with 45th feature

Figure 6: Convolved images/feature maps

The input image is 64x64 pixels and is convolved with 400 8x8 features/filters, yielding a (64-8+1) x (64-8+1) x 400, i.e. 57 x 57 x 400, array. Each of the 400 57x57 matrices is known as a feature map.
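The shape arithmetic can be illustrated with a small NumPy sketch. It makes several simplifying assumptions: the image is grayscale, random arrays stand in for the real image and the learned features, the filter is slid over the image without flipping (strictly a cross-correlation), and no bias or activation function is applied. `convolve_valid` is a hypothetical helper name.

```python
import numpy as np


def convolve_valid(image, filt):
    """Slide `filt` over `image` with no padding and no flipping,
    so the output side is image_dim - patch_dim + 1."""
    image_dim = image.shape[0]
    patch_dim = filt.shape[0]
    out_dim = image_dim - patch_dim + 1
    out = np.zeros((out_dim, out_dim))
    # Accumulate one filter weight at a time over all output positions.
    for i in range(patch_dim):
        for j in range(patch_dim):
            out += filt[i, j] * image[i:i + out_dim, j:j + out_dim]
    return out


rng = np.random.default_rng(0)
image = rng.random((64, 64))                # stand-in for one image
filters = rng.standard_normal((400, 8, 8))  # stand-in for learned features

feature_maps = np.stack([convolve_valid(image, f) for f in filters])
print(feature_maps.shape)  # (400, 57, 57)
```

Each entry of a feature map is the sum of the elementwise product between the filter and the 8x8 image window at that location.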

Here are some convolved images/feature maps:


Figure 7: Convolved Features of Car (Image 2).


Figure 8: Convolved Features of Dog (Image 4).

After obtaining features using convolution, we would next like to use them for classification. In theory, one could use all the extracted features with a classifier such as a softmax classifier, but this can be computationally challenging. Consider for instance our 64x64 pixel images and the 400 features we have learned over 8x8 inputs. Each convolution results in an output of size (64 − 8 + 1) * (64 − 8 + 1) = 3249, and since we have 400 features, this results in a vector of $57^{2}$ * 400 = 1,299,600 features per example. Learning a classifier with more than a million input features can be unwieldy, and can also be prone to over-fitting.

To address this, one natural approach is to aggregate statistics of these features at various locations. For example, one could compute the mean (or max) value of a particular feature over a region of the image. These summary statistics are much lower in dimension (compared to using all of the extracted features) and can also improve results (less over-fitting). This operation of aggregation is called pooling, or sometimes mean pooling or max pooling (depending on the pooling operation applied).

The following image shows how pooling is done over 4 non-overlapping regions of the image.


Figure 9: Pooling

If one chooses the pooling regions to be contiguous areas in the image and only pools features generated from the same (replicated) hidden units, then these pooling units will be translation invariant. This means that the same (pooled) feature will be active even when the image undergoes (small) translations. Translation-invariant features are often desirable; in many tasks (e.g., object detection, audio recognition), the label of the example (image) is the same even when the image is translated. For example, if you were to take an MNIST digit and translate it left or right, you would want your classifier to still accurately classify it as the same digit regardless of its final position.
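A toy example makes the "small translations" caveat concrete. The sketch below uses max pooling on a made-up 6x6 feature map with a single active location. The helper `max_pool` is hypothetical and assumes the map side is a multiple of the pooling size.

```python
import numpy as np


def max_pool(feature_map, pool_dim):
    """Max over non-overlapping pool_dim x pool_dim regions."""
    h, w = feature_map.shape
    blocks = feature_map.reshape(h // pool_dim, pool_dim, w // pool_dim, pool_dim)
    return blocks.max(axis=(1, 3))


feature_map = np.zeros((6, 6))
feature_map[1, 1] = 1.0                          # one active location

shifted_small = np.roll(feature_map, 1, axis=1)  # moved 1 pixel right
shifted_large = np.roll(feature_map, 2, axis=1)  # moved 2 pixels right

pooled = max_pool(feature_map, 3)
print(np.array_equal(pooled, max_pool(shifted_small, 3)))  # True
print(np.array_equal(pooled, max_pool(shifted_large, 3)))  # False
```

The one-pixel shift keeps the activation inside the same 3x3 pooling region, so the pooled output is unchanged. The two-pixel shift moves it into the neighbouring region, which is why the invariance only holds for small translations.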

So after we obtain our convolved features, as described earlier, we decide on the size of the pooling region. In our case, we chose a pooling dimension of 19x19 pixels. Since 57 / 19 = 3, the resulting pooled features have dimension 3x3x400. Here are some pooled features computed from the convolved features of Image 2:


Figure 10: Pooled Feature of 2nd feature map (Image 2)


Figure 11: Pooled Feature of 3rd feature map (Image 2)
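Mean pooling over non-overlapping 19x19 regions can be sketched as follows. Random values stand in for the real 57x57 feature maps, and `mean_pool` is a hypothetical helper that drops any rows or columns that do not fill a complete region (none are dropped here, since 57 is a multiple of 19).

```python
import numpy as np


def mean_pool(feature_maps, pool_dim):
    """Average each feature map over non-overlapping pool_dim x pool_dim regions."""
    num_maps, height, width = feature_maps.shape
    rows, cols = height // pool_dim, width // pool_dim
    trimmed = feature_maps[:, :rows * pool_dim, :cols * pool_dim]
    blocks = trimmed.reshape(num_maps, rows, pool_dim, cols, pool_dim)
    return blocks.mean(axis=(2, 4))


rng = np.random.default_rng(0)
feature_maps = rng.random((400, 57, 57))  # stand-in for convolved features

pooled = mean_pool(feature_maps, pool_dim=19)
print(pooled.shape)  # (400, 3, 3)
print(pooled.size)   # 3600
```

Pooling reduces each example from 57 * 57 * 400 = 1,299,600 values to 3 * 3 * 400 = 3,600 values.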

Now we will use the pooled features to train a softmax classifier that maps them to the class labels. Training takes a few minutes.
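As a rough illustration of this step, here is a minimal softmax classifier trained with plain batch gradient descent. This is a sketch under stated assumptions rather than the training procedure actually used: the optimizer, learning rate, weight decay, and epoch count are arbitrary choices, and small synthetic data stands in for the flattened 3x3x400 pooled features. The names `train_softmax` and `predict` are hypothetical.

```python
import numpy as np


def softmax(scores):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def train_softmax(X, y, num_classes, lr=0.01, weight_decay=1e-4, epochs=300):
    """Fit weights W and biases b by gradient descent on the cross-entropy loss."""
    num_examples, num_features = X.shape
    W = np.zeros((num_features, num_classes))
    b = np.zeros(num_classes)
    Y = np.eye(num_classes)[y]  # one-hot labels
    for _ in range(epochs):
        P = softmax(X @ W + b)  # predicted class probabilities
        W -= lr * (X.T @ (P - Y) / num_examples + weight_decay * W)
        b -= lr * (P - Y).mean(axis=0)
    return W, b


def predict(X, W, b):
    """Return the most probable class for each row of X."""
    return np.argmax(X @ W + b, axis=1)


# Synthetic stand-in data: 4 classes, 36 features per example (assumed sizes).
rng = np.random.default_rng(0)
centers = 2.0 * rng.standard_normal((4, 36))
y = rng.integers(0, 4, size=200)
X = centers[y] + rng.standard_normal((200, 36))

W, b = train_softmax(X, y, num_classes=4)
accuracy = np.mean(predict(X, W, b) == y)
print(accuracy)
```

With real data, each row of `X` would be one image's pooled features flattened into a vector of length 3 * 3 * 400 = 3600.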

We then evaluate the classifier on the test set: we convolve the test images with the learned features, mean pool the resulting feature maps, and classify the pooled features. This gives an accuracy of 80%.

Here are some of the classifier's predictions. As you might notice, there are some misclassified examples, which look similar to others in the same category:


Figure (a): Category 1 - Airplanes


Figure (b): Category 2 - Cars


Figure (c): Category 3 - Cats


Figure (d): Category 4 - Dogs

Figure 12: Result from trained classifier. Accuracy of 80% with some misclassified images.