Friday, February 23, 2018

Convolutional Neural Network (CNN)

Disclaimer: most of the content in this blog reflects my own speculation on how human vision works; I have not checked whether it is supported by the literature.  I am enthusiastic about CNNs; these speculations have greatly boosted my own understanding of CNNs, so I hope they can be useful to you as well, even if some of them turn out to be nonsense.  Bear in mind that the purpose of this blog series is to examine the "why" of deep learning; you can certainly find many "how-to" tutorials on CNNs elsewhere.


Introduction


Given a cat photo, we can recognize the cat absolutely effortlessly.  How does it work?

First, the brain does not seem to rely on an object-model approach, i.e., a cat model consisting of an assembly of geometric shapes representing eyes, ears, whiskers, tail, paws, etc.  In fact, the object-model approach is intractable: you would need a model for a sitting cat, a sleeping cat, a cat with its paws hidden, and so on, which seems far too complicated to be feasible [1].  Moreover, in this approach the brain would also need to contain models of millions of other objects, then apply all of them to the input image to determine which model is the closest match.  Such a modelling approach cannot really explain how we see cats in the photos below, as they would not match any reasonable cat model well.

Figure 1. Cat photos that do not match any cat model well.

Second, computer vision experts have spent decades constructing image features that are robust against translation, rotation and scaling [2], since our brains seem to have no problem recognizing an object under such transformations.  Do our brains use similar tricks, representing objects through some mathematically elegant transformations?  It does not seem so; at the very least, our visual recognition system is not very good at correcting rotations.  The two faces below are $180^\circ$ rotations of each other.  If our brains transformed them into some rotation-invariant representation, we would expect to see the second face while staring at the first, but we do not.

Figure 2.  We cannot see one face while looking at the other, as our brains do not transform the image into a rotation-invariant representation (source).

Third, we do not seem to see details; we see concepts instead.  Look at the scenery photo below for two seconds, then cover it up.  I assume you can tell it is about a beach, and you probably also saw some other objects such as the mountain, ocean, bridge, and sky.  Now, can you tell whether there are many people on the beach?  Is there a lagoon on the left?  Are there wild flowers on the hill slope?  Are there people in the ocean?  Are there clouds in the sky?  All those pixels were sent to our brains as input, yet we cannot answer these questions: we "looked at" those pixels but did not "see" them.  This suggests that the output of our vision system is probably a handful of abstract concepts, and most of the details are never passed on to the memory or reasoning regions of the brain.  This is great, as we immediately see a beach, without first having to identify roads, people, waves, lagoons and bridges and then follow some complicated decision-making rules to conclude that this is a beach.  We simply see the meaning of each image without reasoning.  Only in movies can an agent name everything in a room after a few seconds; photographic memory does not seem to be how normal people see things, otherwise the brain would be drained by too many objects and have a hard time capturing the key event.  This means that in computer vision, details are often not as important as you might expect.

Figure 3.  Beautiful Torrey Pines beach at Del Mar, San Diego (source).

Computer vision is the field where deep learning really shines, as it provides a feasible approach to seeing objects in a way no previous machine learning technology was capable of.  Personally, I am convinced this might more or less be how our own vision system works, as it helps explain many vision phenomena, including the examples above and some further examples discussed at the end of this blog.  Here, we will look at how a specialized neural network (NN) called the convolutional neural network (CNN) works.


The Rationale for CNN



Visual recognition for ImageNet is basically a function $f(x)$, where $x$ is the input image and the output is the object id(s) representing the concepts in the image.  We mentioned in the previous chapter that a fully-connected NN (FNN) can be used to model any function, such as the FNN in Figure 4 (left).  In this example, not all pixels of the input image are relevant, so the NN can be simplified into a localized NN (where the weights for non-dog pixel inputs are zeroed out; Figure 4, right) with the dog at the center of its input.

Figure 4.  FNN for image object recognition is a localized NN, as only the pixels belonging to the object contribute to the recognition.

An image can contain multiple instances of dogs.  To recognize all of them, we only need to slide the localized NN over the input image; wherever there is a dog, the NN produces a response signal.  This operation of sliding a function over an image to obtain a new response image (Figure 5) is known as convolution in computer vision.  For the ImageNet task, we only care about whether the image contains dogs at all, without needing to know how many there are or where they are.  Therefore, our NN has to be translation invariant, i.e., $f(x - \delta) = f(x)$, which can easily be achieved by applying the $\max$ operator to the output convolved image (see the sketch after Figure 5).


Figure 5.  Applying the same localized NN across the input image allows it to recognize multiple instances, producing a signal at each matched location.
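
To make this concrete, here is a minimal sketch (NumPy assumed; the 2x2 "dog" pattern is purely hypothetical) of sliding a small detector over an image and taking the global max, which makes the detection translation invariant:

```python
# Minimal sketch: slide a small "localized NN" (a filter) over an image
# and take the global max -- the response map changes when the pattern
# moves, but the max (the detection signal) stays the same.
import numpy as np

def convolve2d(image, kernel):
    """Valid-mode 2D sliding-window correlation."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

pattern = np.ones((2, 2))                          # a toy "dog" template (hypothetical)

img_a = np.zeros((8, 8)); img_a[1:3, 1:3] = 1.0    # pattern near the top-left
img_b = np.zeros((8, 8)); img_b[5:7, 4:6] = 1.0    # same pattern, shifted

# The two response maps differ, but the global max is identical: 4.0 and 4.0.
print(convolve2d(img_a, pattern).max(), convolve2d(img_b, pattern).max())
```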


In traditional computer vision, we construct a template (a dog model) and slide it over the input image; the output convolved image then shows bright spots where the template matches the input.  However, it is unrealistic to have one dog template (model) that can recognize all dogs: not only do dogs look very different, but perhaps only part of the body is exposed, or it is a highly distorted cartoon dog, etc.  A more robust way is to model a dog as a collection of loosely connected body parts.  Suppose we have a multi-dimensional function $\mathbf{g}(x)$ whose dimensions aim to recognize dog eyes, nose, ears, spot pattern and tail, respectively.  Each dimension is a convolution function as described above and produces an output image, which we call a feature map.  Each dimension uses a convolution filter, which is basically a weight matrix describing the pattern that feature is looking for, and the intensity of the output feature map represents the likelihood (not strictly in the probability sense, but monotonically reflecting the likelihood) that the particular body part is present at the corresponding location.  We then assign weights to each dimension and produce a final output indicating whether a dog is present, i.e., $f(x) = \mathbf{w}(\delta)\cdot \mathbf{g}(x-\delta)$, where $\delta$ accounts for the relative location shifts among the signals from the different feature maps.

What is important here is that this weighting (i.e., convolution) operation does not need to be applied to the raw image pixels; it can be applied to previous feature maps, whose pixels encode the presence of smaller body parts.  To see this, say we aim to construct a feature representing a dog face, which consists of preliminary parts such as eyes, nose and ears arranged in some loose geometric configuration.  The feature maps for eyes, nose and ears have already been generated, and they become the input images for detecting the dog-face feature; each input feature map (nose or eye) is a separate input channel (just as the raw image has three color channels), and they are convolved together to produce a new dog-face feature map.  The convolution relies on a weight matrix, which can implement loose logic, such as the eyes being within one fuzzy region and the nose in another fuzzy location.  Filters sitting on top of lower-level feature maps can therefore encode a very flexible face model, unlike a traditional template-based match (Figure 6).

Figure 6.  One convolution weight matrix (left) can encode very different faces.  In the weight matrix, red and orange indicate the possible locations of the eyes, blue the nose, and green the mouth.  This matrix matches the three very different faces on the right equally well, which is not something the traditional template-based method is capable of.


Similarly, the feature maps for dog face, legs and body need to be further combined to produce an even higher-level feature map representing, say, a sitting dog.  In the other direction of the feature hierarchy, the dog-eye feature map can be decomposed into simpler image feature maps of circles, color patches, lines, curves, corners, etc.  So in the end we are looking at a multi-layer hierarchical convolutional neural network, where each layer operates on small sections of its input (either the raw input image or the output feature maps of a previous layer), performs convolution, and produces output feature maps encoding whether the feature it is looking for is present at specific locations.  At the top layer, we obtain a collection of feature maps, each telling us whether a certain high-level body part is observed with high likelihood; the weighted combination of this information lets us decide whether the object of interest is present.  This way, a dog can be detected even if a significant portion of its body parts is missing, or the relative geometry of its parts is highly distorted.

Another practical requirement for object recognition is the ability to scale down an image.  Although the faces in Figure 7 differ in many details, those details are irrelevant for our purpose of recognizing them all as faces.  So if we scale the image down into a low-resolution representation, where the irrelevant details are smoothed out, it becomes much easier to match the smaller image with a convolution filter encoding a face.  The process of taking a higher-resolution image and scaling it down into a lower-resolution one is called pixel pooling, i.e., pooling multiple pixels into one pixel.  Pooling offers a few advantages.  First, it gets rid of details, so that a face filter (weights) only requires two eyes on top, a nose below, and a mouth below that; it does not mandate the exact look of the eyes, nose or mouth, leaving a lot of flexibility in the filter elements.  Second, pooling brings the underlying parts closer together in space, so that the filter combining these parts into a higher-level object can be much smaller.  Third, it is computationally much cheaper to compute and store, as fewer pixels are involved, without compromising our goal of object recognition.  The most often used pooling mode in CNNs is max pooling, where the maximum value of the pooled pixels is retained as the output (see the sketch after Figure 7).

Figure 7. Scaling down a complex image helps remove irrelevant details for a recognition task.
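
Here is a minimal sketch (NumPy assumed) of 2x2 max pooling on a small, hypothetical feature map:

```python
# 2x2 max pooling: halve the resolution of a feature map while keeping
# the strongest response within each 2x2 block.
import numpy as np

def max_pool_2x2(feature_map):
    h, w = feature_map.shape
    assert h % 2 == 0 and w % 2 == 0, "expects even dimensions for simplicity"
    # Group pixels into 2x2 blocks and take the maximum of each block.
    return feature_map.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.array([[0, 1, 0, 0],
                 [0, 0, 0, 2],
                 [3, 0, 0, 0],
                 [0, 0, 0, 0]], dtype=float)

print(max_pool_2x2(fmap))
# [[1. 2.]
#  [3. 0.]]
```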

In the traditional template-matching strategy of image analysis, the recognition of individual parts and the relative geometric locations of those parts are encoded simultaneously within the template at the pixel level.  This seriously restricts the variation of the objects a template can represent.  In the CNN approach, the feature hierarchy allows room for variation within individual features (e.g., there can be multiple eye features of different styles) as well as for spatial variation among them, and pooling provides even more room for such "disfiguring".  As a result, a CNN no longer recognizes objects at the pixel level; it incorporates tolerance and, in a sense, represents objects at the conceptual level.  I think this might be the main reason behind the superb performance of CNNs in real-life object recognition.

CNN for Face Recognition

With convolution and pooling layers as building blocks, we are ready to tackle what used to be the most difficult computer vision challenge: classifying photos according to the object(s) they contain (the ImageNet challenge).  Let us look at how a CNN can be applied to recognize a face.
Figure 8. Architecture of a CNN that recognizes a face in a $16 \times 16$ image.

As shown in Figure 8, our input (A) is a black-and-white face of size $16 \times 16$.  The first hidden layer (B) is actually a group of feature maps, each of size $16 \times 16$, and each is the result of a $2 \times 2$ convolution across the input image (i.e., each map is the heatmap produced by sliding a particular neuron across the original image).  The weight matrix $w$ for one of the features (blue) reads (the matrix is shown at the lower-left corner of the feature maps):

$$w = \begin{bmatrix} 0 & 0 \\ 1 & 1 \end{bmatrix},$$

where 1 stands for black and 0 for white.  It basically detects a top-facing edge (a black object on a white background).  The resultant heatmap is depicted below the maps of group B in Figure 8.  The convolution sum is passed through a non-linear function, typically the ReLU function.  ReLU has little effect in our specific example, as we only consider positive weights, but weights can be negative in real applications, and there is a bias parameter $b$ that suppresses signals that are too weak (remember ReLU = $\max(0, signal)$).  Similarly, we can build three other features to detect bottom-facing (orange), right-facing (green) and left-facing (purple) edges; their heatmaps are shown in Figure 8 as well.  There would be other feature maps in this group capturing other shapes, but let us focus on these four in this example.
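
As a small check of the filter above, here is a sketch (NumPy assumed; a tiny $6 \times 6$ image stands in for the $16 \times 16$ face) of convolving $w$ over an image containing a black square and applying ReLU with a bias:

```python
# Convolving w = [[0, 0], [1, 1]] over a white (0) image with a black (1)
# square responds wherever the filter's bottom row lands on two black pixels;
# a bias of -1 before ReLU suppresses weaker, partial matches.
import numpy as np

def convolve2d(image, kernel):
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

relu = lambda x: np.maximum(0, x)

img = np.zeros((6, 6)); img[2:4, 2:4] = 1.0   # a black 2x2 square (an "eye")
w = np.array([[0.0, 0.0],
              [1.0, 1.0]])                    # top-facing edge detector
b = -1.0                                      # bias suppressing single-pixel matches

print(relu(convolve2d(img, w) + b))           # activations where the square is found
```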

The image is big, so let us downsample it.  The next step, C, is a max-pooling step.  Here we shrink each $2 \times 2$ area into one pixel, whose value is the maximum of the four original pixels.  This produces four new feature maps in the second hidden layer group C; each feature map is now only one quarter the size of those in the previous layer B, a 75% saving in computational resources.  There is no cross-talk among the feature maps in this pooling step; all we do is cut the resolution in half, feature by feature, which also brings the response signals 50% closer to each other.

Next comes another convolution layer, D.  The difference here is that the convolution is a 3D convolution: the weights cover not only the width and height of a sub-region in the previous layer C, but also its depth, i.e., they mix signals from all the feature maps of the previous layer under that 2D sub-region.  A convolution filter corresponding to the $2 \times 2 \times 4$ weight pattern shown in Figure 8 produces a new output feature map in group D representing an eye feature.  It looks for an activation pattern in layer C consisting of a 2-pixel top-facing, a 2-pixel bottom-facing, a 2-pixel left-facing and a 2-pixel right-facing edge.  In other words, this neuron detects a black square; it produces strong activation signals on its output heatmap wherever an eye (represented by a black square) is present.  Notice that the color for the right eye is a bit lighter, as it is not a perfect match (though still a good match) to the eye filter, so the activation is weaker.

Figure 9. Weights (filter) associated with the eye feature.

Similarly, layer D also contains a feature map representing where noses are, and another for mouths.  We then do another max pooling, which pulls the detected eyes, nose and mouth closer to each other; now they all fall within a $3 \times 3$ neighborhood.  Layer F contains a new feature representing a face, whose $3 \times 3 \times 3$ filter convolves the signals from the eye, nose and mouth feature maps.  It generates an activated pixel in the output layer G, telling us a face has been detected.  Here we see that an important role of pooling is to bring lower-level features closer together, so that a higher-level feature can be constructed with a much smaller convolution window.  If our CNN contained no pooling layers, a face filter would require a convolution matrix of size $16 \times 16$, which would use too many parameters and contain unnecessary details.

The face filter is a 3D weight matrix, which can be written as:

$$w_{face} = \begin{bmatrix} w_e \\ w_n \\ w_m \end{bmatrix}, \\
w_e = \begin{bmatrix} 1 & 0 & 1  \\ 0 & 0 & 0 \\ 0&0&0 \end{bmatrix}, \\
w_n = \begin{bmatrix} 0 & 0 & 0  \\ 0 & 1 & 0 \\ 0&0&0 \end{bmatrix}, \\
w_m = \begin{bmatrix} 0 & 0 & 0  \\ 0 & 0 & 0 \\ 0&1&0 \end{bmatrix}.
$$
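
Here is a minimal sketch (NumPy assumed; the pooled feature maps are hypothetical) of applying this $3 \times 3 \times 3$ face filter at one location: stack the eye, nose and mouth feature maps as channels and take the dot product with $(w_e, w_n, w_m)$ over the $3 \times 3$ window.

```python
# A window containing two eyes on top, a nose in the middle and a mouth
# at the bottom scores 4 under the face filter defined above.
import numpy as np

w_e = np.array([[1, 0, 1], [0, 0, 0], [0, 0, 0]], dtype=float)
w_n = np.array([[0, 0, 0], [0, 1, 0], [0, 0, 0]], dtype=float)
w_m = np.array([[0, 0, 0], [0, 0, 0], [0, 1, 0]], dtype=float)
w_face = np.stack([w_e, w_n, w_m])           # shape (3, 3, 3): channel, height, width

# Hypothetical pooled feature maps (channel order: eye, nose, mouth), 3x3 each.
eye   = np.array([[1, 0, 1], [0, 0, 0], [0, 0, 0]], dtype=float)
nose  = np.array([[0, 0, 0], [0, 1, 0], [0, 0, 0]], dtype=float)
mouth = np.array([[0, 0, 0], [0, 0, 0], [0, 1, 0]], dtype=float)
window = np.stack([eye, nose, mouth])

face_activation = np.sum(w_face * window)    # the 3D convolution at this location
print(face_activation)                       # 4.0 -> a face is detected here
```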

So we have used convolution and max pooling to build a CNN capable of producing an activated output when shown a face.  If the input image were twice as large and contained a tiling of four faces, our output layer would contain four activated pixels on the face feature map, each signaling a face.

Now imagine there are many other feature maps within each hidden layer, capturing all kinds of interesting features: simple features such as edges and diagonal lines, medium features such as rectangles and circles, more sophisticated features such as wheels or the front of a car, and finally rather complex features representing a face, car, bike, dog, etc.  We can see how simple features in the earlier hidden layers are combined (depth-convolved) to produce moderately complex features, which are further combined to create more and more complex features.  The last layer of the CNN can contain thousands of high-level features and their activation status.  Those features then serve as the input to an FNN, which typically ends with a softmax layer producing a probability for each object class.  The top-scoring classes represent the objects we see in the image (we use softmax because this is a multi-class classification problem; see the previous blog).
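
A minimal sketch of that final step (NumPy assumed; the feature vector and weight matrix are hypothetical placeholders, not learned values):

```python
# The last-layer feature activations feed a fully connected layer whose
# outputs are turned into class probabilities by softmax.
import numpy as np

def softmax(z):
    z = z - z.max()                    # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

features = np.array([0.9, 0.1, 1.2])   # hypothetical high-level feature activations
W = np.random.randn(1000, 3) * 0.01    # hypothetical weights: 1000 classes x 3 features
probs = softmax(W @ features)
print(probs.argmax(), probs.sum())     # top-scoring class id; probabilities sum to 1
```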

In practice, we certainly do not hand-construct the filters ourselves.  The parameters of the CNN filters are initialized with random values; we then simply feed the CNN lots of input images together with their true labels.  The loss function is a cross-entropy term quantifying the prediction accuracy.  Training and parameter optimization allow the CNN (including the FNN layers) to determine the optimal filters, i.e., features, for each layer.  As the CNN is trained on many faces (long, round, male, female, etc.), there will be many neurons representing different faces, so we do not rely on one face feature to recognize all faces.
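
A hedged training sketch (PyTorch assumed; `model`, `train_loader` and the hyper-parameters are placeholders, not code from any particular paper):

```python
# Filters start random and are learned by minimizing a cross-entropy loss
# over labeled images.
import torch
import torch.nn as nn

def train(model, train_loader, epochs=10, lr=1e-3):
    criterion = nn.CrossEntropyLoss()                    # cross-entropy loss
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for epoch in range(epochs):
        for images, labels in train_loader:              # images and their true labels
            optimizer.zero_grad()
            loss = criterion(model(images), labels)      # prediction error
            loss.backward()                              # back-propagate gradients
            optimizer.step()                             # update the filters
    return model
```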

Although the features found by a CNN are rarely as clean as in our face-detection example, they generally resemble such concepts.  An extremely nice video is available here; it should convince you that some CNN neurons (each representing a feature) indeed correspond to rather meaningful objects, e.g., an edge, a face, or text.


VGGNet for ImageNet


The huge success of CNNs is reflected in their ability to classify the images provided by ImageNet into 1000 classes, the well-known ImageNet competition.  State-of-the-art CNN classifiers have already surpassed the human error rate of 5.1% [3], so this task is considered a solved problem.  VGGNet (Figure 10) was one of the top performers in the 2014 ImageNet competition [4]; we discuss it because its architecture is very similar to our face-recognition example, i.e., very easy to understand.




Figure 10. VGG16 architecture. (source)

It takes an input image of size $224 \times 224 \times 3$, does two $3 \times 3$ convolutions followed by a max pooling, and then repeats this cycle a few times, gradually doubling the number of features from 64 to 128, then 256, and finally 512.  Let us walk through the statistics in Figure 10: the input image has a side of 224 pixels, the size is preserved by convolution, then halved by POOL2 to 112, and the cycle of convolution and pooling continues until the last pooling layer outputs $7 \times 7 \times 512$; that is, we identify 512 features, each corresponding to a likelihood heatmap of size $7 \times 7$.  All these neurons become the input features of a 3-layer FNN that finally generates 1000 probability values for the 1000 object classes.   Why use two $3 \times 3$ convolution layers in sequence instead of just one larger convolution layer?  Two stacked $3 \times 3$ filters basically play the role of a single $5 \times 5$ filter; however, the former has only 20 parameters while the latter has 26, and two layers can model more non-linearity than one.  The saving is even more significant when a $7 \times 7$ filter is replaced with three $3 \times 3$ filters, so CNNs tend to use multiple small filters rather than one large filter (see the quick count below).  Strictly speaking, we are actually doing 3D convolutions: the first filter is $3 \times 3 \times 3$, given a three-channel color input image, and the second filter, sitting after the first hidden layer, is $3 \times 3 \times 64$.
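
The quick count behind the "20 vs 26" remark (per input channel, one output channel, one bias per filter; just a sanity check, not VGG code):

```python
# Stacking small filters reaches the same receptive field with fewer parameters.
two_3x3   = 2 * (3 * 3 + 1)   # 20 parameters, stacked receptive field = 5x5
one_5x5   = 5 * 5 + 1         # 26 parameters
three_3x3 = 3 * (3 * 3 + 1)   # 30 parameters, stacked receptive field = 7x7
one_7x7   = 7 * 7 + 1         # 50 parameters
print(two_3x3, one_5x5, three_3x3, one_7x7)
```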

In addition, notice that VGGNet contains a total of 138 million parameters, of which 102 million are used by the first fully connected layer alone.  The convolution layers are thus very efficient in the number of parameters they consume, since they use small filters, whereas the fully connected layers are expensive to optimize.  Microsoft's ResNet [5] does a better job of extracting CNN features and needs only a single fully connected layer at the end for classification, making it far more efficient.  ResNet was the champion of the 2015 competition, with an error rate of 3.57%!
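
A back-of-the-envelope check of where the 102 million comes from (VGG16's first fully connected layer has 4096 units):

```python
# The last pooling layer outputs 7 x 7 x 512 values, all wired into 4096 neurons.
fc1_params = 7 * 7 * 512 * 4096 + 4096   # weights + biases
print(fc1_params)                        # 102,764,544 -- roughly 102 million
```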

Networks such as VGGNet take a long time to train.  In our own projects we often cannot afford such long training times or the CPU/GPU resources, so instead we take advantage of a technique called transfer learning.  Say we are tasked with classifying a subset of ImageNet photos as either dog or cat, instead of the original 1000 classes.  VGGNet was trained to separate 1000 classes and is therefore not optimized for our simpler problem.  Why?  Suppose bone is one of the 1000 classes; VGGNet cannot use bone features to support dog recognition, as it was also tasked with distinguishing bone from dog.  With only two classes, each can take advantage of features representing other classes to boost its confidence.  But training VGGNet from scratch would be too costly.  Since almost all the key features required to distinguish a dog from a cat are already captured in the CNN portion of VGGNet, we can freeze that part and only replace the last three fully connected layers with one fully connected layer producing a single output, representing the probability of the dog class.  The new network can then be optimized much more quickly (a sketch follows).
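
A hedged transfer-learning sketch (PyTorch/torchvision assumed; the exact pretrained-weights argument varies with the torchvision version):

```python
# Reuse the pretrained VGG16 convolutional features, freeze them, and replace
# the fully connected head with a single layer for the two-class problem.
import torch.nn as nn
from torchvision import models

model = models.vgg16(weights="IMAGENET1K_V1")    # pretrained CNN + FC layers
for param in model.features.parameters():        # freeze the convolutional part
    param.requires_grad = False

# Replace the three FC layers with one small layer; only this layer is trained.
# Its single output, passed through a sigmoid, is the probability of "dog".
model.classifier = nn.Linear(512 * 7 * 7, 1)
```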


Miscellaneous Topics


Here we discuss some less technical topics of interest and see how the CNN picture can help us better understand certain recognition phenomena.

Objects of Different Scales


The CNN described above can deal with object translation, but not with scaling.  If our CNN was trained only on faces of a single scale and we then feed it faces at other scales, it is likely to fail.  If the eyes are much bigger than what it saw during training, the CNN is still looking at just a pupil by its last layer, so the eye feature map never activates in the earlier layers, and by the same argument the face neuron is never activated either.  If the input face is too small, it already collapses into tiny $2 \times 2$ eye-, nose- and mouth-like features at a layer earlier than where those features are expected, so it is missed as well.  To overcome this, some people use a multi-scale CNN [6] consisting of multiple independent CNNs, each trained with input images at a different scale, whose outputs are combined for the final classification.  You can picture this approach as shrinking and magnifying the input image into multiple copies and then sending them to different people, each specializing in large or small faces; one of them will probably recognize the object.

Instead of using multiple CNNs, we could probably train one CNN using an image augmentation technique: we turn each training image into multiple images of varying resolutions, then use all of them to train a single CNN.  Trained on objects of different scales, the CNN will automatically construct features at varying scales and be able to handle the scaling challenge.  This certainly implies we need many more feature maps in such a CNN: a large eye feature and a small eye feature, and so on.  The human brain contains 86 billion neurons, comparable to the roughly 300 billion stars in the Milky Way [7], so it can afford a rather gigantic network to cope with the scaling issue.  Considering that a three-year-old child's eyes have already consumed hundreds of millions of images, the human CNN does not lack training data [1], so this could work in reality (a possible augmentation sketch follows).
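
One possible way to implement such scale augmentation (PyTorch/torchvision assumed; the transform choices and parameters are illustrative, not prescriptive):

```python
# Random rescaling and cropping turns one training image into many images at
# different scales, so a single CNN sees objects both small and large.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.3, 1.0)),  # random scale, then crop to 224x224
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
# `augment` would be passed as the `transform` of an image dataset / DataLoader.
```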

Why can a CNN trained this way recognize both a tiny face and a large face?  A tiny face might be recognized by a small-face feature map in one of the early layers of our augmentation-trained CNN, so how does this activation survive the many convolution and pooling layers that follow?  It is at least mathematically possible.  First, max pooling differs from average pooling: an activated signal on a heatmap does not disappear after pooling.  Second, an activation can survive subsequent convolutions if the convolution filter happens to be a trivial one, for instance:

$$w = \begin{bmatrix} 0 & 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix}$$

is a $5 \times 5$ filter that preserves the signals of a feature map in the previous layer.  Thus a CNN trained on small faces can construct a neuron capturing a tiny face at one of the earlier layers and then build an identity neuron in each of the following layers to simply propagate the face activation all the way to the output layer.  The human brain likely consists of CNNs with a mixed layering structure, so a tiny-face neuron could be wired directly to the output layer.  ResNet introduces shortcut connections, which probably also make it efficient at relaying early-detected features to the back of the network.
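
A quick numerical check of the identity-filter claim (NumPy and SciPy assumed, with zero padding at the borders):

```python
# Convolving any feature map with the trivial 5x5 filter above (a single 1 at
# its center) returns the feature map unchanged, so an early activation can be
# relayed through later layers untouched.
import numpy as np
from scipy.signal import convolve2d

identity = np.zeros((5, 5)); identity[2, 2] = 1.0     # the trivial filter w above

fmap = np.random.rand(10, 10)                         # any feature map
relayed = convolve2d(fmap, identity, mode="same")
print(np.allclose(relayed, fmap))                     # True
```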


Illusion - Empirical Activation


In the CNN model, the last feature layer contains thousands of features, each giving an activation that represents how strongly the feature is present in the input.  Our brain only sees a few concepts, which might just be an unordered or loosely ordered collection of such activated features; e.g., the concept of a beach scene is a list of features such as beach, sea and sky.  On the left side of Figure 11, we easily see a smiling president.  This might be because a "smiling" feature neuron is activated when its underlying features, such as a smiling mouth and smiling cheeks, are activated.  A "face" neuron is also activated.  Although our CNN does not rotate the image $180^\circ$, we have seen enough upside-down faces that we have an upside-down-face neuron for such cases.  With both neurons activated, we see the concept "smiling" + "face" = "smiling face"; nothing alarms our brain.  In contrast, the right-side photo looks weird, although we all know it is simply the same photo as on the left.  Here our upright-face neuron is activated, but none of the typical facial-expression neurons is strongly activated, at least not those normally associated with the president.  So we see an "eerie face".  We recognize the president, since enough president-specific features are present, but we have reservations, because some features that are expected to be activated stay silent.

Figure 11. Smiles and face are recognized independently.

Figure 12 shows a study in which researchers found that the time it takes to recognize a person is shortest with the normal photo, longer with one eye moved up, and longest with both eyes moved up [8].  This seems rather straightforward to explain with a CNN.  With the normal photo, many feature neurons fire and collectively let us see Elvis or JFK easily.  When one eye is moved, the half of the face with the moved eye fails to activate some feature neurons, so our concept of Elvis/JFK lacks some of its necessary feature members and we are unsure whom we see.  Our eyes then explore the photo and focus on the normal half of the face, which immediately activates the necessary feature neurons, and the Elvis/JFK concept is seen.  With both eyes moved, our eyes wander around while the Elvis/JFK face neuron still does not fire.  We then focus on the top portion and ignore the bottom, then focus on the bottom half and ignore the top; the Elvis/JFK concept is probably the closest match to either output, so we go along with that decision.  I can feel myself performing a decision-making (reasoning) step here, whereas no decision making is needed for the normal photo.
Figure 12. Research by Cooper and Wojan (source)
In Figure 13, the two images certainly look like two roads extending in different directions, but in fact they are identical photos.  What might have happened is that we have seen many scenes in our lives similar to the Flatiron Building in New York City (Figure 14), where roads branch out in a "V" shape.  If there is a "V"-shape feature neuron in our CNN, that same neuron fires strongly when the two photos in Figure 13 are viewed side by side.  Our brain has no way to suppress or ignore this "V" firing, and this feature contributes strongly to the concept that we are looking at two different roads.


Figure 13. Two identical photos look very different.

Figure 14. The Flatiron Building in New York City.

CNNs are also the hero behind many of the best-known AI systems, such as AlphaGo/AlphaGo Zero.  In bioinformatics, CNNs can also be applied to recognize one-dimensional patterns, such as binding motifs in DNA sequences; in medical applications, they can identify 3D tumor patterns in MRI or CT scans.

Let us look at a bioinformatics application next, where CNN is used to identify motifs for DNA/RNA-binding proteins (next blog).


Reference


1. https://www.ted.com/talks/fei_fei_li_how_we_re_teaching_computers_to_understand_pictures#t-370742
2. https://en.wikipedia.org/wiki/Scale-invariant_feature_transform
3. http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/
4. http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture9.pdf
5. https://arxiv.org/pdf/1512.03385v1.pdf
6. https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btx069
7. https://www.nature.com/scitable/blog/brain-metrics/are_there_really_as_many
8. http://www.public.iastate.edu/~ecooper/Facepaper.html