Backpropagation calculus | Deep Learning Chapter 4

Help fund future projects: https://www.patreon.com/3blue1brown
An equally valuable form of support is to share the videos.
Special thanks to these supporters: http://3b1b.co/nn3-thanks
Written/interactive form of this series: https://www.3blue1brown.com/topics/neural-networks

This one is a bit more symbol-heavy, and that's actually the point. The goal here is to represent in somewhat more formal terms the intuition for how backpropagation works in part 3 of the series, hopefully providing some connection between that video and other texts/code that you come across later.

For more on backpropagation:
http://neuralnetworksanddeeplearning.com/chap2.html
https://github.com/mnielsen/neural-networks-and-deep-learning
http://colah.github.io/posts/2015-08-Backprop/

Music by Vincent Rubinetti: https://vincerubinetti.bandcamp.com/album/the-music-of-3blue1brown

Thanks to these viewers for their contributions to translations:
Russian audio track: Влад Бурмистров
Hebrew: Omer Tuchfeld

Video timeline
0:00 - Introduction
0:38 - The Chain Rule in networks
3:56 - Computing relevant derivatives
4:45 - What do the derivatives mean?
5:39 - Sensitivity to weights/biases
6:42 - Layers with additional neurons
9:13 - Recap

3blue1brown is a channel about animating math, in all senses of the word animate.

Hosts: Grant Sanderson

📅November 03, 2017

⏱️00:10:17

🌐English

Watch the original video here: https://www.youtube.com/watch?v=tIeHLnjs5U8&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi&index=4

00:00:04 Grant Sanderson

The hard assumption here is that you've watched part 3, giving an intuitive walkthrough of the backpropagation algorithm. Here, we get a little more formal and dive into the relevant calculus. It's normal for this to be at least a little confusing, so the mantra to regularly pause and ponder certainly applies as much here as anywhere else.

00:00:21 Grant Sanderson

Our main goal is to show how people in machine learning commonly think about the chain rule from calculus in the context of networks, which has a different feel from how most introductory calculus courses approach the subject. For those of you uncomfortable with the relevant calculus, I do have a whole series on the topic.

00:00:39 Grant Sanderson

Let's start off with an extremely simple network, one where each layer has a single neuron in it. This network is determined by three weights and three biases, and our goal is to understand how sensitive the cost function is to these variables. That way, we know which adjustments to those terms will cause the most efficient decrease to the cost function. And we're just going to focus on the connection between the last two neurons.

00:01:05 Grant Sanderson

Let's label the activation of that last neuron $a^{(L)}$, with a superscript $L$ indicating which layer it's in, so the activation of the previous neuron is $a^{(L-1)}$. These are not exponents; they're just a way of indexing what we're talking about, since I want to save subscripts for different indices later on.

00:01:23 Grant Sanderson

Let's say that the value we want this last activation to be for a given training example is $y$—for example, $y$ might be $0$ or $1$. So the cost of this network for a single training example is $(a^{(L)} - y)^2$. We'll denote the cost of that one training example as $C_0$.

00:01:45 Grant Sanderson

As a reminder, this last activation is determined by a weight, which I'm going to call $w^{(L)}$, times the previous neuron's activation plus some bias, which I'll call $b^{(L)}$. And then you pump that through some special nonlinear function like the sigmoid or ReLU.

00:02:01 Grant Sanderson

It's actually going to make things easier for us if we give a special name to this weighted sum, like $z$, with the same superscript as the relevant activations. This is a lot of terms, and a way you might conceptualize it is that the weight, previous activation, and the bias all together are used to compute $z$, which in turn lets us compute $a$, which finally, along with a constant $y$, lets us compute the cost.
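
Written out with the notation introduced so far, the chain of dependencies looks like this. The symbol $\sigma$ is an assumption standing in for whichever nonlinearity is used, since the narration only names the sigmoid and ReLU as options:

$$z^{(L)} = w^{(L)} a^{(L-1)} + b^{(L)}, \qquad a^{(L)} = \sigma\!\left(z^{(L)}\right), \qquad C_0 = \left(a^{(L)} - y\right)^2$$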

00:02:27 Grant Sanderson

And of course, $a^{(L-1)}$ is influenced by its own weight and bias and such, but we're not going to focus on that right now. All of these are just numbers, right? And it can be nice to think of each one as having its own little number line.

00:02:41 Grant Sanderson

Our first goal is to understand how sensitive the cost function is to small changes in our weight $w^{(L)}$. Or phrased differently, what is the derivative of $C_0$ with respect to $w^{(L)}$? When you see this $\partial w^{(L)}$ term, think of it as meaning some tiny nudge to $w^{(L)}$, like a change by $0.01$, and think of this $\partial C_0$ term as meaning whatever the resulting nudge to the cost is. What we want is their ratio.

00:03:11 Grant Sanderson

Conceptually, this tiny nudge to $w^{(L)}$ causes some nudge to $z^{(L)}$, which in turn causes some nudge to $a^{(L)}$, which directly influences the cost. So we break things up by first looking at the ratio of a tiny change to $z^{(L)}$ to this tiny change in $w^{(L)}$, that is, the derivative of $z^{(L)}$ with respect to $w^{(L)}$.

00:03:33 Grant Sanderson

Likewise, you then consider the ratio of the change to $a^{(L)}$ to the tiny change in $z^{(L)}$ that caused it, as well as the ratio between the final nudge to $C_0$ and this intermediate nudge to $a^{(L)}$. This right here is the chain rule, where multiplying together these three ratios gives us the sensitivity of $C_0$ to small changes in $w^{(L)}$.
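
In symbols, the product of the three ratios just described reads as follows, writing the single-example cost as $C_0$ as defined earlier:

$$\frac{\partial C_0}{\partial w^{(L)}} = \frac{\partial z^{(L)}}{\partial w^{(L)}} \, \frac{\partial a^{(L)}}{\partial z^{(L)}} \, \frac{\partial C_0}{\partial a^{(L)}}$$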

00:03:56 Grant Sanderson

So on screen right now, there's a lot of symbols, and take a moment to make sure it's clear what they all are, because now we're going to compute the relevant derivatives.

00:04:07 Grant Sanderson

The derivative of $C_0$ with respect to $a^{(L)}$ works out to be $2(a^{(L)} - y)$. Notice this means its size is proportional to the difference between the network's output and the thing we want it to be. So if that output was very different, even slight changes stand to have a big impact on the final cost function.

00:04:27 Grant Sanderson

The derivative of $a^{(L)}$ with respect to $z^{(L)}$ is just the derivative of our sigmoid function, or whatever nonlinearity you choose to use. And the derivative of $z^{(L)}$ with respect to $w^{(L)}$ in this case comes out to be $a^{(L-1)}$.
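
As a sanity check on these three derivatives, here is a minimal Python sketch that multiplies them together and compares the result against a tiny numerical nudge to the weight. It assumes the sigmoid as the nonlinearity; the function names and the numeric values are hypothetical, chosen only for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    # Derivative of the sigmoid: sigma(z) * (1 - sigma(z)).
    s = sigmoid(z)
    return s * (1.0 - s)

def cost(w, b, a_prev, y):
    """C_0 = (a - y)^2 with a = sigmoid(w * a_prev + b)."""
    a = sigmoid(w * a_prev + b)
    return (a - y) ** 2

def dcost_dw(w, b, a_prev, y):
    """Chain rule: dz/dw * da/dz * dC_0/da."""
    z = w * a_prev + b
    a = sigmoid(z)
    dz_dw = a_prev              # derivative of z with respect to w
    da_dz = sigmoid_prime(z)    # derivative of the nonlinearity
    dc_da = 2.0 * (a - y)       # derivative of the squared error
    return dz_dw * da_dz * dc_da

# Hypothetical values, not taken from the video.
w, b, a_prev, y = 0.8, -0.3, 0.6, 1.0
analytic = dcost_dw(w, b, a_prev, y)

# Central finite difference: nudge w slightly and measure the change in cost.
eps = 1e-6
numeric = (cost(w + eps, b, a_prev, y) - cost(w - eps, b, a_prev, y)) / (2 * eps)

print(analytic, numeric)
```

The two printed numbers should agree to many decimal places, which is the "ratio of nudges" reading of the derivative made concrete.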

00:04:45 Grant Sanderson

Now, I don't know about you, but I think it's easy to get stuck head down in the formulas without taking a moment to sit back and remind yourself of what they all mean. In the case of this last derivative, the amount that the small nudge to the weight influenced the last layer depends on how strong the previous neuron is. Remember, this is where the "neurons that fire together, wire together" idea comes in.

00:05:09 Grant Sanderson

And all of this is the derivative with respect to $w^{(L)}$ only of the cost for a specific, single training example. Since the full cost function involves averaging together all those costs across many different training examples, its derivative requires averaging this expression over all training examples. And of course, that is just one component of the gradient vector, which itself is built up from the partial derivatives of the cost function with respect to all those weights and biases.
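
One way to write that averaging step, assuming $n$ training examples with per-example costs $C_0, C_1, \dots, C_{n-1}$ (the symbols $n$ and $C_k$ are notation introduced here for illustration), and a network whose layers are numbered $1$ through $L$:

$$\frac{\partial C}{\partial w^{(L)}} = \frac{1}{n} \sum_{k=0}^{n-1} \frac{\partial C_k}{\partial w^{(L)}}, \qquad \nabla C = \begin{bmatrix} \dfrac{\partial C}{\partial w^{(1)}} \\[2mm] \dfrac{\partial C}{\partial b^{(1)}} \\ \vdots \\ \dfrac{\partial C}{\partial w^{(L)}} \\[2mm] \dfrac{\partial C}{\partial b^{(L)}} \end{bmatrix}$$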

00:05:40 Grant Sanderson

But even though that's just one of the many partial derivatives we need, it's more than 50% of the work. The sensitivity to the bias, for example, is almost identical. We just need to change out this $\frac{\partial z}{\partial w}$ term for a $\frac{\partial z}{\partial b}$. And if you look at the relevant formula, that derivative comes out to be $1$.
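
Spelled out, with $\sigma'$ denoting the derivative of whichever nonlinearity is used (an assumed symbol), the bias version of the chain rule is:

$$\frac{\partial C_0}{\partial b^{(L)}} = \frac{\partial z^{(L)}}{\partial b^{(L)}} \, \frac{\partial a^{(L)}}{\partial z^{(L)}} \, \frac{\partial C_0}{\partial a^{(L)}} = 1 \cdot \sigma'\!\left(z^{(L)}\right) \cdot 2\left(a^{(L)} - y\right)$$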

00:06:06 Grant Sanderson

Also—and this is where the idea of propagating backwards comes in—you can see how sensitive this cost function is to the activation of the previous layer. Namely, this initial derivative in the chain rule expression, the sensitivity of $z$ to the previous activation, comes out to be the weight $w^{(L)}$.
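
In the same notation, again with $\sigma'$ as an assumed symbol for the derivative of the nonlinearity, the sensitivity to the previous activation is:

$$\frac{\partial C_0}{\partial a^{(L-1)}} = \frac{\partial z^{(L)}}{\partial a^{(L-1)}} \, \frac{\partial a^{(L)}}{\partial z^{(L)}} \, \frac{\partial C_0}{\partial a^{(L)}} = w^{(L)} \, \sigma'\!\left(z^{(L)}\right) \, 2\left(a^{(L)} - y\right)$$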

00:06:26 Grant Sanderson

And again, even though we're not going to be able to directly influence that previous layer activation, it's helpful to keep track of, because now we can just keep iterating this same chain rule idea backwards to see how sensitive the cost function is to previous weights and previous biases.
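
To make "iterating the chain rule backwards" concrete, here is a minimal Python sketch for the one-neuron-per-layer network with three weights and three biases. It assumes the sigmoid nonlinearity; the function names and parameter values are hypothetical and are not the video's code.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, weights, biases):
    """Return the weighted sums and activations; activations[0] is the input."""
    activations = [x]
    zs = []
    for w, b in zip(weights, biases):
        z = w * activations[-1] + b
        zs.append(z)
        activations.append(sigmoid(z))
    return zs, activations

def cost(x, y, weights, biases):
    _, activations = forward(x, weights, biases)
    return (activations[-1] - y) ** 2

def backward(x, y, weights, biases):
    """Gradients of C_0 = (a_last - y)^2 for every weight and bias."""
    zs, activations = forward(x, weights, biases)
    grad_w = [0.0] * len(weights)
    grad_b = [0.0] * len(biases)
    # Start from the sensitivity of the cost to the last activation.
    dc_da = 2.0 * (activations[-1] - y)
    for layer in reversed(range(len(weights))):
        s = sigmoid(zs[layer])
        dc_dz = dc_da * s * (1.0 - s)               # da/dz = sigmoid'(z)
        grad_w[layer] = dc_dz * activations[layer]  # dz/dw = previous activation
        grad_b[layer] = dc_dz                       # dz/db = 1
        dc_da = dc_dz * weights[layer]              # dz/da_prev = w, passed backwards
    return grad_w, grad_b

# Hypothetical parameters: three weights and three biases.
weights = [0.5, -1.2, 0.9]
biases = [0.1, 0.4, -0.2]
x, y = 0.7, 1.0
grad_w, grad_b = backward(x, y, weights, biases)
print(grad_w, grad_b)
```

Each pass through the loop reuses the sensitivity to the current activation, which is why keeping track of it pays off even though the activation itself cannot be changed directly.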

00:06:43 Grant Sanderson

And you might think this is an overly simple example, since all layers have one neuron, and things are going to get exponentially more complicated for a real network. But honestly, not that much changes when we give the layers multiple neurons; really, it's just a few more indices to keep track of.

00:06:59 Grant Sanderson

Rather than the activation of a given layer simply being $a^{(L)}$, it's also going to have a subscript indicating which neuron of that layer it is. Let's use the letter $k$ to index the layer $L-1$, and $j$ to index the layer $L$.

00:07:15 Grant Sanderson

For the cost, again we look at what the desired output is, but this time we add up the squares of the differences between these last layer activations and the desired output. That is, you take a sum over $(a_j^{(L)} - y_j)^2$.

00:07:33 Grant Sanderson

Since there's a lot more weights, each one has to have a couple more indices to keep track of where it is, so let's call the weight of the edge connecting this $k$-th neuron to the $j$-th neuron, $w_{jk}^{(L)}$. Those indices might feel a little backwards at first, but it lines up with how you'd index the weight matrix I talked about in the part 1 video.

00:07:53 Grant Sanderson

Just as before, it's still nice to give a name to the relevant weighted sum, like $z$, so that the activation of the last layer is just your special function, like the sigmoid, applied to $z$. You can see what I mean, where all of these are essentially the same equations we had before in the one-neuron-per-layer case—it's just that it looks a little more complicated.
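
With the indices $j$ and $k$ in place, the equations for the multi-neuron case can be written as follows, where $\sigma$ is an assumed symbol for the chosen nonlinearity and $b_j^{(L)}$ is the bias of neuron $j$ in layer $L$:

$$z_j^{(L)} = \sum_k w_{jk}^{(L)} a_k^{(L-1)} + b_j^{(L)}, \qquad a_j^{(L)} = \sigma\!\left(z_j^{(L)}\right), \qquad C_0 = \sum_j \left(a_j^{(L)} - y_j\right)^2$$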

00:08:15 Grant Sanderson

And indeed, the chain-ruled derivative expression describing how sensitive the cost is to a specific weight looks essentially the same. I'll leave it to you to pause and think about each of those terms if you want.
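
For reference, that expression is the following, with $\sigma'$ as an assumed symbol for the derivative of the nonlinearity. Only the $j$-th term of the cost depends on $w_{jk}^{(L)}$, so no sum appears:

$$\frac{\partial C_0}{\partial w_{jk}^{(L)}} = \frac{\partial z_j^{(L)}}{\partial w_{jk}^{(L)}} \, \frac{\partial a_j^{(L)}}{\partial z_j^{(L)}} \, \frac{\partial C_0}{\partial a_j^{(L)}} = a_k^{(L-1)} \, \sigma'\!\left(z_j^{(L)}\right) \, 2\left(a_j^{(L)} - y_j\right)$$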

00:08:28 Grant Sanderson

What does change here, though, is the derivative of the cost with respect to one of the activations in the layer $L-1$. In this case, the difference is that the neuron influences the cost function through multiple different paths. That is, on the one hand, it influences $a_0^{(L)}$, which plays a role in the cost function, but it also has an influence on $a_1^{(L)}$, which also plays a role in the cost function, and you have to add those up.
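
Adding up the contributions along every path gives a sum over the neurons $j$ of layer $L$ (again with $\sigma'$ as an assumed symbol for the derivative of the nonlinearity):

$$\frac{\partial C_0}{\partial a_k^{(L-1)}} = \sum_j \frac{\partial z_j^{(L)}}{\partial a_k^{(L-1)}} \, \frac{\partial a_j^{(L)}}{\partial z_j^{(L)}} \, \frac{\partial C_0}{\partial a_j^{(L)}} = \sum_j w_{jk}^{(L)} \, \sigma'\!\left(z_j^{(L)}\right) \, 2\left(a_j^{(L)} - y_j\right)$$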

00:08:59 Grant Sanderson

And that, well, that's pretty much it. Once you know how sensitive the cost function is to the activations in this second-to-last layer, you can just repeat the process for all the weights and biases feeding into that layer.
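
Here is a minimal pure-Python sketch of those derivatives for one fully connected layer, under the assumption of a sigmoid nonlinearity and a squared-error cost. The layer sizes, values, and function names are hypothetical; `W[j][k]` follows the $w_{jk}^{(L)}$ indexing, with $j$ for layer $L$ and $k$ for layer $L-1$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(W, b, a_prev):
    """z_j = sum_k W[j][k] * a_prev[k] + b[j];  a_j = sigmoid(z_j)."""
    z = [sum(W[j][k] * a_prev[k] for k in range(len(a_prev))) + b[j]
         for j in range(len(b))]
    return z, [sigmoid(zj) for zj in z]

def cost(W, b, a_prev, y):
    _, a = layer_forward(W, b, a_prev)
    return sum((aj - yj) ** 2 for aj, yj in zip(a, y))

def layer_gradients(W, b, a_prev, y):
    _, a = layer_forward(W, b, a_prev)
    # delta[j] = sigmoid'(z_j) * 2 * (a_j - y_j), using sigmoid'(z) = a * (1 - a).
    delta = [aj * (1.0 - aj) * 2.0 * (aj - yj) for aj, yj in zip(a, y)]
    # dC_0/dw_jk = a_k^(L-1) * delta_j
    grad_W = [[a_prev[k] * delta[j] for k in range(len(a_prev))]
              for j in range(len(b))]
    # dC_0/db_j = delta_j
    grad_b = list(delta)
    # Neuron k reaches the cost through every neuron j, so sum over j.
    grad_a_prev = [sum(W[j][k] * delta[j] for j in range(len(b)))
                   for k in range(len(a_prev))]
    return grad_W, grad_b, grad_a_prev

# Hypothetical layer: 3 neurons in layer L-1 feeding 2 neurons in layer L.
W = [[0.2, -0.5, 0.7],
     [0.4, 0.1, -0.3]]
b = [0.05, -0.1]
a_prev = [0.9, 0.3, 0.6]
y = [1.0, 0.0]
grad_W, grad_b, grad_a_prev = layer_gradients(W, b, a_prev, y)
print(grad_a_prev)
```

The list `grad_a_prev` plays the role that the sensitivity to the last layer's activations played before, so the same function could be applied again one layer further back.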

00:09:13 Grant Sanderson

So pat yourself on the back! If all of this makes sense, you have now looked deep into the heart of backpropagation, the workhorse behind how neural networks learn. These chain rule expressions give you the derivatives that determine each component in the gradient that helps minimize the cost of the network by repeatedly stepping downhill.

00:09:34 Grant Sanderson

If you sit back and think about all that, this is a lot of layers of complexity to wrap your mind around, so don't worry if it takes time for your mind to digest it all.
