Neural Networks and Deep Learning

Using neural nets to recognize handwritten digits

How the backpropagation algorithm works

Improving the way neural networks learn

A visual proof that neural nets can compute any function

Why are deep neural networks hard to train?

The human visual system is one of the wonders of the world. Consider the following sequence of handwritten digits:

Most people effortlessly recognize those digits as 504192. That ease is deceptive. In each hemisphere of our brain, humans have a primary visual cortex, also known as V1, containing 140 million neurons, with tens of billions of connections between them. And yet human vision involves not just V1, but an entire series of visual cortices - V2, V3, V4, and V5 - doing progressively more complex image processing. We carry in our heads a supercomputer, tuned by evolution over hundreds of millions of years, and superbly adapted to understand the visual world. Recognizing handwritten digits isn't easy. Rather, we humans are stupendously, astoundingly good at making sense of what our eyes show us. But nearly all that work is done unconsciously. And so we don't usually appreciate how tough a problem our visual systems solve.

The difficulty of visual pattern recognition becomes apparent if you attempt to write a computer program to recognize digits like those above. What seems easy when we do it ourselves suddenly becomes extremely difficult. Simple intuitions about how we recognize shapes - "a 9 has a loop at the top, and a vertical stroke in the bottom right" - turn out to be not so simple to express algorithmically. When you try to make such rules precise, you quickly get lost in a morass of exceptions and caveats and special cases. It seems hopeless.

Neural networks approach the problem in a different way. The idea is to take a large number of handwritten digits, known as training examples,

and then develop a system which can learn from those training examples. In other words, the neural network uses the examples to automatically infer rules for recognizing handwritten digits. Furthermore, by increasing the number of training examples, the network can learn more about handwriting, and so improve its accuracy. So while I've shown just 100 training digits above, perhaps we could build a better handwriting recognizer by using thousands or even millions or billions of training examples.

In this chapter we'll write a computer program implementing a neural network that learns to recognize handwritten digits. The program is just 74 lines long, and uses no special neural network libraries. But this short program can recognize digits with an accuracy over 96 percent, without human intervention. Furthermore, in later chapters we'll develop ideas which can improve accuracy to over 99 percent. In fact, the best commercial neural networks are now so good that they are used by banks to process cheques, and by post offices to recognize addresses.

We're focusing on handwriting recognition because it's an excellent prototype problem for learning about neural networks in general. As a prototype it hits a sweet spot: it's challenging - it's no small feat to recognize handwritten digits - but it's not so difficult as to require an extremely complicated solution, or tremendous computational power. Furthermore, it's a great way to develop more advanced techniques, such as deep learning. And so throughout the book we'll return repeatedly to the problem of handwriting recognition. Later in the book, we'll discuss how these ideas may be applied to other problems in computer vision, and also in speech, natural language processing, and other domains.

Of course, if the point of the chapter was only to write a computer
program to recognize handwritten digits, then the chapter would be
much shorter! But along the way we'll develop many key ideas about
neural networks, including two important types of artificial neuron
(the perceptron and the sigmoid neuron), and the standard learning
algorithm for neural networks, known as stochastic gradient descent.
Throughout, I focus on explaining *why* things are done the way
they are, and on building your neural networks intuition. That
requires a lengthier discussion than if I just presented the basic
mechanics of what's going on, but it's worth it for the deeper
understanding you'll attain. Amongst the payoffs, by the end of the
chapter we'll be in position to understand what deep learning is, and
why it matters.

What is a neural network? To get started, I'll explain a type of
artificial neuron called a *perceptron*.
Perceptrons were
developed
in the 1950s and 1960s by the scientist
Frank
Rosenblatt, inspired by earlier
work
by Warren
McCulloch and
Walter
Pitts. Today, it's more common to use other
models of artificial neurons - in this book, and in much modern work
on neural networks, the main neuron model used is one called the
*sigmoid neuron*. We'll get to sigmoid neurons shortly. But to
understand why sigmoid neurons are defined the way they are, it's
worth taking the time to first understand perceptrons.

So how do perceptrons work? A perceptron takes several binary inputs,

That's the basic mathematical model. A way you can think about the perceptron is that it's a device that makes decisions by weighing up evidence. Let me give an example. It's not a very realistic example, but it's easy to understand, and we'll soon get to more realistic examples. Suppose the weekend is coming up, and you've heard that there's going to be a cheese festival in your city. You like cheese, and are trying to decide whether or not to go to the festival. You might make your decision by weighing up three factors:

- Is the weather good?
- Does your boyfriend or girlfriend want to accompany you?
- Is the festival near public transit? (You don't own a car).

Now, suppose you absolutely adore cheese, so much so that you're happy
to go to the festival even if your boyfriend or girlfriend is
uninterested and the festival is hard to get to. But perhaps you
really loathe bad weather, and there's no way you'd go to the festival
if the weather is bad. You can use perceptrons to model this kind of
decision-making. One way to do this is to choose a weight

By varying the weights and the threshold, we can get different models
of decision-making. For example, suppose we instead chose a threshold
of *or* when both the
festival was near public transit *and* your boyfriend or
girlfriend was willing to join you. In other words, it'd be a
different model of decision-making. Dropping the threshold means
you're more willing to go to the festival.

Obviously, the perceptron isn't a complete model of human decision-making! But what the example illustrates is how a perceptron can weigh up different kinds of evidence in order to make decisions. And it should seem plausible that a complex network of perceptrons could make quite subtle decisions:

Incidentally, when I defined perceptrons I said that a perceptron has just a single output. In the network above the perceptrons look like they have multiple outputs. In fact, they're still single output. The multiple output arrows are merely a useful way of indicating that the output from a perceptron is being used as the input to several other perceptrons. It's less unwieldy than drawing a single output line which then splits.

Let's simplify the way we describe perceptrons. The condition *bias*,

I've described perceptrons as a method for weighing evidence to make
decisions. Another way perceptrons can be used is to compute the
elementary logical functions we usually think of as underlying
computation, functions such as `AND`

, `OR`

, and
`NAND`

. For example, suppose we have a perceptron with two
inputs, each with weight

`NAND`

gate!The `NAND`

example shows that we can use perceptrons to compute
simple logical functions.
In fact, we can use
networks of perceptrons to compute *any* logical function at all.
The reason is that the `NAND`

gate is universal for
computation, that is, we can build any computation up out of
`NAND`

gates. For example, we can use `NAND`

gates to
build a circuit which adds two bits,

`NAND`

gates by perceptrons with two inputs, each with weight
`NAND`

gate a little, just to make it easier to draw the arrows
on the diagram:
The adder example demonstrates how a network of perceptrons can be
used to simulate a circuit containing many `NAND`

gates. And
because `NAND`

gates are universal for computation, it follows
that perceptrons are also universal for computation.

The computational universality of perceptrons is simultaneously
reassuring and disappointing. It's reassuring because it tells us
that networks of perceptrons can be as powerful as any other computing
device. But it's also disappointing, because it makes it seem as
though perceptrons are merely a new type of `NAND`

gate.
That's hardly big news!

However, the situation is better than this view suggests. It turns
out that we can devise *learning
algorithms* which can
automatically tune the weights and biases of a network of artificial
neurons. This tuning happens in response to external stimuli, without
direct intervention by a programmer. These learning algorithms enable
us to use artificial neurons in a way which is radically different to
conventional logic gates. Instead of explicitly laying out a circuit
of `NAND`

and other gates, our neural networks can simply learn
to solve problems, sometimes problems where it would be extremely
difficult to directly design a conventional circuit.

Learning algorithms sound terrific. But how can we devise such algorithms for a neural network? Suppose we have a network of perceptrons that we'd like to use to learn to solve some problem. For example, the inputs to the network might be the raw pixel data from a scanned, handwritten image of a digit. And we'd like the network to learn weights and biases so that the output from the network correctly classifies the digit. To see how learning might work, suppose we make a small change in some weight (or bias) in the network. What we'd like is for this small change in weight to cause only a small corresponding change in the output from the network. As we'll see in a moment, this property will make learning possible. Schematically, here's what we want (obviously this network is too simple to do handwriting recognition!):

If it were true that a small change in a weight (or bias) causes only a small change in output, then we could use this fact to modify the weights and biases to get our network to behave more in the manner we want. For example, suppose the network was mistakenly classifying an image as an "8" when it should be a "9". We could figure out how to make a small change in the weights and biases so the network gets a little closer to classifying the image as a "9". And then we'd repeat this, changing the weights and biases over and over to produce better and better output. The network would be learning.

The problem is that this isn't what happens when our network contains
perceptrons. In fact, a small change in the weights or bias of any
single perceptron in the network can sometimes cause the output of
that perceptron to completely flip, say from

We can overcome this problem by introducing a new type of artificial
neuron called a *sigmoid* neuron.
Sigmoid neurons are similar to perceptrons, but modified so that small
changes in their weights and bias cause only a small change in their
output. That's the crucial fact which will allow a network of sigmoid
neurons to learn.

Okay, let me describe the sigmoid neuron. We'll depict sigmoid neurons in the same way we depicted perceptrons:

At first sight, sigmoid neurons appear very different to perceptrons. The algebraic form of the sigmoid function may seem opaque and forbidding if you're not already familiar with it. In fact, there are many similarities between perceptrons and sigmoid neurons, and the algebraic form of the sigmoid function turns out to be more of a technical detail than a true barrier to understanding.

To understand the similarity to the perceptron model, suppose

What about the algebraic form of

This shape is a smoothed out version of a step function:

If *be* a perceptron, since the output would be

If it's the shape of

How should we interpret the output from a sigmoid neuron? Obviously,
one big difference between perceptrons and sigmoid neurons is that
sigmoid neurons don't just output

**Sigmoid neurons simulating perceptrons, part I**

Suppose we take all the weights and biases in a network of perceptrons, and multiply them by a positive constant,c>0 . Show that the behaviour of the network doesn't change.**Sigmoid neurons simulating perceptrons, part II**

Suppose we have the same setup as the last problem - a network of perceptrons. Suppose also that the overall input to the network of perceptrons has been chosen. We won't need the actual input value, we just need the input to have been fixed. Suppose the weights and biases are such thatw⋅x+b≠0 for the inputx to any particular perceptron in the network. Now replace all the perceptrons in the network by sigmoid neurons, and multiply the weights and biases by a positive constantc>0 . Show that in the limit asc→∞ the behaviour of this network of sigmoid neurons is exactly the same as the network of perceptrons. How can this fail whenw⋅x+b=0 for one of the perceptrons?

In the next section I'll introduce a neural network that can do a pretty good job classifying handwritten digits. In preparation for that, it helps to explain some terminology that lets us name different parts of a network. Suppose we have the network:

The design of the input and output layers in a network is often
straightforward. For example, suppose we're trying to determine
whether a handwritten image depicts a "9" or not. A natural way to
design the network is to encode the intensities of the image pixels
into the input neurons. If the image is a

While the design of the input and output layers of a neural network is often straightforward, there can be quite an art to the design of the hidden layers. In particular, it's not possible to sum up the design process for the hidden layers with a few simple rules of thumb. Instead, neural networks researchers have developed many design heuristics for the hidden layers, which help people get the behaviour they want out of their nets. For example, such heuristics can be used to help determine how to trade off the number of hidden layers against the time required to train the network. We'll meet several such design heuristics later in this book.

Up to now, we've been discussing neural networks where the output from
one layer is used as input to the next layer. Such networks are
called *feedforward*
neural networks. This means there are no loops in the network -
information is always fed forward, never fed back. If we did have
loops, we'd end up with situations where the input to the

However, there are other models of artificial neural networks in which feedback loops are possible. These models are called recurrent neural networks. The idea in these models is to have neurons which fire for some limited duration of time, before becoming quiescent. That firing can stimulate other neurons, which may fire a little while later, also for a limited duration. That causes still more neurons to fire, and so over time we get a cascade of neurons firing. Loops don't cause problems in such a model, since a neuron's output only affects its input at some later time, not instantaneously.

Recurrent neural nets have been less influential than feedforward networks, in part because the learning algorithms for recurrent nets are (at least to date) less powerful. But recurrent networks are still extremely interesting. They're much closer in spirit to how our brains work than feedforward networks. And it's possible that recurrent networks can solve important problems which can only be solved with great difficulty by feedforward networks. However, to limit our scope, in this book we're going to concentrate on the more widely-used feedforward networks.

Having defined neural networks, let's return to handwriting recognition. We can split the problem of recognizing handwritten digits into two sub-problems. First, we'd like a way of breaking an image containing many digits into a sequence of separate images, each containing a single digit. For example, we'd like to break the image

into six separate images,

We humans solve this *segmentation
problem* with ease, but it's challenging
for a computer program to correctly break up the image. Once the
image has been segmented, the program then needs to classify each
individual digit. So, for instance, we'd like our program to
recognize that the first digit above,

is a 5.

We'll focus on writing a program to solve the second problem, that is, classifying individual digits. We do this because it turns out that the segmentation problem is not so difficult to solve, once you have a good way of classifying individual digits. There are many approaches to solving the segmentation problem. One approach is to trial many different ways of segmenting the image, using the individual digit classifier to score each trial segmentation. A trial segmentation gets a high score if the individual digit classifier is confident of its classification in all segments, and a low score if the classifier is having a lot of trouble in one or more segments. The idea is that if the classifier is having trouble somewhere, then it's probably having trouble because the segmentation has been chosen incorrectly. This idea and other variations can be used to solve the segmentation problem quite well. So instead of worrying about segmentation we'll concentrate on developing a neural network which can solve the more interesting and difficult problem, namely, recognizing individual handwritten digits.

To recognize individual digits we will use a three-layer neural network:

The input layer of the network contains neurons encoding the values of
the input pixels. As discussed in the next section, our training data
for the network will consist of many

The second layer of the network is a hidden layer. We denote the
number of neurons in this hidden layer by

The output layer of the network contains 10 neurons. If the first
neuron fires, i.e., has an output

You might wonder why we use *why* using

To understand why we do this, it helps to think about what the neural
network is doing from first principles. Consider first the case where
we use

It can do this by heavily weighting input pixels which overlap with the image, and only lightly weighting the other inputs. In a similar way, let's suppose for the sake of argument that the second, third, and fourth neurons in the hidden layer detect whether or not the following images are present:

As you may have guessed, these four images together make up the

So if all four of these hidden neurons are firing then we can conclude
that the digit is a *only* sort
of evidence we can use to conclude that the image was a