The spelled-out intro to neural networks and backpropagation: building micrograd
Disclaimer: The transcript on this page is for the YouTube video titled "The spelled-out intro to neural networks and backpropagation: building micrograd" from "Andrej Karpathy". All rights to the original content belong to their respective owners. This transcript is provided for educational, research, and informational purposes only. This website is not affiliated with or endorsed by the original content creators or platforms.
Watch the original video here: https://www.youtube.com/watch?v=VMj-3S1tku0&list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ&index=1
Hello, my name is Andrej and I've been training deep neural networks for a bit more than a decade. In this lecture, I'd like to show you what neural network training looks like under the hood. So in particular, we are going to start with a blank Jupyter notebook and by the end of this lecture, we will define and train a neural net, and you'll get to see everything that goes on under the hood and exactly sort of how that works on an intuitive level.
Now specifically, what I would like to do is I would like to take you through building of micrograd. Now micrograd is this library that I released on GitHub about two years ago, but at the time I only uploaded the source code and you'd have to go in by yourself and really figure out how it works. So in this lecture, I will take you through it step by step and kind of comment on all the pieces of it. So what is micrograd and why is it interesting?
Good. Micrograd is basically an autograd engine. Autograd is short for automatic gradient. And really what it does is it implements backpropagation. Now, backpropagation is this algorithm that allows you to efficiently evaluate the gradient of some kind of a loss function with respect to the weights of a neural network.
And what that allows us to do then is we can iteratively tune the weights of that neural network to minimize the loss function and therefore improve the accuracy of the network. So backpropagation would be at the mathematical core of any modern deep neural network library like say PyTorch or JAX.
So the functionality of micrograd is, I think, best illustrated by an example. So if we just scroll down here, you'll see that micrograd basically allows you to build out mathematical expressions. And here what we are doing is we have an expression that we're building out where you have two inputs, $a$ and $b$. And you'll see that $a$ and $b$ are negative four and two, but we are wrapping those values into this `Value` object that we are going to build out as part of micrograd.
So this `Value` object will wrap the numbers themselves, and then we are going to build out a mathematical expression here where $a$ and $b$ are transformed into $c$, $d$, and eventually $e$, $f$, and $g$. And I'm showing some of the functionality of micrograd and the operations that it supports. So you can add two value objects, you can multiply them, you can raise them to a constant power, you can offset by one, negate, squash at zero, square, divide by constant, divide by it, etc.
And so we're building out an expression graph with these two inputs $a$ and $b$, and we're creating an output value of $g$. And micrograd will in the background build out this entire mathematical expression. So it will for example know that $c$ is also a `Value`; $c$ was a result of an addition operation and the child nodes of $c$ are $a$ and $b$ because it will maintain pointers to $a$ and $b$ value objects. So we'll basically know exactly how all of this is laid out.
And then not only can we do what we call the forward pass where we actually look at the value of $g$—of course that's pretty straightforward, we will access that using the `.data` attribute and so the output of the forward pass, the value of $g$, is 24.7 it turns out—but the big deal is that we can also take this $g$ value object and we can call `.backward()`. And this will basically initialize backpropagation at the node $g$.
And what backpropagation is going to do is it's going to start at $g$ and it's going to go backwards through that expression graph and it's going to recursively apply the chain rule from calculus. And what that allows us to do then is we're going to evaluate basically the derivative of $g$ with respect to all the internal nodes like $e$, $d$, and $c$, but also with respect to the inputs $a$ and $b$.
And then we can actually query this derivative of $g$ with respect to $a$, for example, that's `a.grad`. In this case it happens to be 138. And the derivative of $g$ with respect to $b$, that's `b.grad`, happens to be 645. And this derivative, we'll see soon, is very important information because it's telling us how $a$ and $b$ are affecting $g$ through this mathematical expression.
So in particular, `a.grad` is 138. So if we slightly nudge $a$ and make it slightly larger, 138 is telling us that $g$ will grow and the slope of that growth is going to be 138. And the slope of growth with respect to $b$ is going to be 645. So that's going to tell us about how $g$ will respond if $a$ and $b$ get tweaked a tiny amount in a positive direction.
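To make the meaning of those two slopes concrete, here is a minimal sketch that rebuilds the example with plain Python floats and estimates both derivatives by nudging the inputs. The exact expression is recalled from the micrograd README rather than stated in this transcript, so treat it as an assumption; `relu` and `g_of` are hypothetical helper names, and no micrograd code is used.

```python
# Plain-float re-creation of the example expression. The expression itself is
# an assumption (recalled from the micrograd README, not from this transcript).
def relu(x):
    """'Squash at zero': keep positive values, clamp negative ones to 0."""
    return x if x > 0 else 0.0

def g_of(a, b):
    c = a + b
    d = a * b + b**3
    c += c + 1
    c += 1 + c + (-a)
    d += d * 2 + relu(b + a)
    d += 3 * d + relu(b - a)
    e = c - d
    f = e**2
    g = f / 2.0
    g += 10.0 / f
    return g

a, b = -4.0, 2.0
h = 1e-6  # small nudge for the finite-difference estimate

g = g_of(a, b)
# Central differences: nudge one input while holding the other fixed.
dg_da = (g_of(a + h, b) - g_of(a - h, b)) / (2 * h)
dg_db = (g_of(a, b + h) - g_of(a, b - h)) / (2 * h)

print(f"g     = {g:.4f}")
print(f"dg/da = {dg_da:.2f}")
print(f"dg/db = {dg_db:.2f}")
```

The estimates land close to the 24.7, 138, and 645 quoted above, which is the same information `.backward()` produces in one pass instead of one nudge per input.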
Okay. Now you might be confused about what this expression is that we built out here. And this expression by the way is completely meaningless, I just made it up. I'm just flexing about the kinds of operations that are supported by micrograd. What we actually really care about are neural networks, but it turns out that neural networks are just mathematical expressions just like this one, and actually a bit less crazy even.
Neural networks are just a mathematical expression. They take the input data as an input and they take the weights of a neural network as an input and it's a mathematical expression. And the output are your predictions of your neural net or the loss function. We'll see this in a bit, but basically neural networks just happen to be a certain class of mathematical expressions.
But backpropagation is actually significantly more general. It doesn't actually care about neural networks at all. It only tells us about arbitrary mathematical expressions and then we happen to use that machinery for training of neural networks.
Now one more note I would like to make at this stage is that as you see here, micrograd is a scalar-valued autograd engine. So it's working on the level of individual scalars like negative four and two. And we're taking neural nets and we're breaking them down all the way to these atoms of individual scalars and all the little pluses and times and it's just excessive. And so obviously you would never be doing any of this in production. It's really just put down for pedagogical reasons because it allows us to not have to deal with these n-dimensional tensors that you would use in modern deep neural network libraries.
So this is really done so that you can understand backpropagation and the chain rule, and neural net training in general, in isolation. And then if you actually want to train bigger networks, you have to be using these tensors. But none of the math changes; this is done purely for efficiency.
We are basically taking scalar values, all the scalar values, we're packaging them up into tensors which are just arrays of these scalars. And then because we have these large arrays, we're making operations on those large arrays that allows us to take advantage of the parallelism in a computer. And all those operations can be done in parallel and then the whole thing runs faster. But really none of the math changes and that's done purely for efficiency.
So I don't think that it's pedagogically useful to be dealing with tensors from scratch, and I think that's why I fundamentally wrote micrograd, because you can understand how things work at the fundamental level and then you can speed it up later.
Okay, so here's the fun part. My claim is that micrograd is what you need to train your networks and everything else is just efficiency. So you'd think that micrograd would be a very complex piece of code and that turns out to not be the case. So if we just go to micrograd and you'll see that there's only two files here in micrograd. This is the actual engine, it doesn't know anything about neural nets, and this is the entire neural nets library on top of micrograd: `engine.py` and `nn.py`.
So the actual backpropagation autograd engine that gives you the power of neural networks is literally 100 lines of code of very simple Python, which we'll understand by the end of this lecture. And then `nn.py`, this neural network library built on top of the autograd engine, is like a joke. It's like we have to define what is a neuron and then we have to define what is the layer of neurons and then we define what is a multi-layer perceptron which is just a sequence of layers of neurons. And so it's just a total joke.
So basically, there's a lot of power that comes from only 150 lines of code. And that's all you need to understand to understand neural network training and everything else is just efficiency. And of course there's a lot to efficiency, but fundamentally that's all that's happening.
Okay, so now let's dive right in and implement micrograd step by step. The first thing I'd like to do is I'd like to make sure that you have a very good understanding intuitively of what a derivative is and exactly what information it gives you. So let's start with some basic imports that I copy-paste in every Jupyter notebook always, and let's define a function, a scalar-valued function $f(x)$ as follows.
So I just make this up randomly. I just want a scalar-valued function that takes a single scalar $x$ and returns a single scalar $y$. And we can call this function of course, so we can pass in say 3.0 and get 20 back.
Now we can also plot this function to get a sense of its shape. You can tell from the mathematical expression that this is probably a parabola, it's a quadratic. And so if we just create a set of scalar values that we can feed in using for example `arange` from negative five to five in steps of 0.25... So this `xs` is just from negative 5 to 5 not including 5 in steps of 0.25. And we can actually call this function on this NumPy array as well so we get a set of `ys` if we call `f` on `xs`. And these `ys` are basically also applying a function on every one of these elements independently.
And we can plot this using Matplotlib, so `plt.plot(xs, ys)` and we get a nice parabola. So previously here we fed in 3.0 somewhere here and we received 20 back which is here the y coordinate.
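A minimal sketch of the setup described above, assuming the function is the quadratic $3x^2 - 4x + 5$ that the lecture names when it turns to the derivative. Only NumPy is imported here; the plotting call is left out so the block has no Matplotlib dependency.

```python
import numpy as np

def f(x):
    # The quadratic used in the lecture: 3x^2 - 4x + 5
    return 3 * x**2 - 4 * x + 5

print(f(3.0))  # 20.0

xs = np.arange(-5, 5, 0.25)  # -5.0, -4.75, ..., 4.75 (5 itself is excluded)
ys = f(xs)                   # f is applied to every element independently

print(xs[:3], ys[:3])
```

Passing `xs` and `ys` to `plt.plot` would then draw the parabola.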
So now I'd like to think through: what is the derivative of this function at any single input point $x$? Right, so what is the derivative at different points $x$ of this function? Now if you remember back to your calculus class you've probably derived derivatives. So we take this mathematical expression $3x^2 - 4x + 5$ and you would write out on a piece of paper and you would apply the product rule and all the other rules and derive the mathematical expression of the derivative of the original function and then you could plug in different $x$'s and see what the derivative is.
We're not going to actually do that because no one in neural networks actually writes out the expression for the neural net. It would be a massive expression, it would be thousands, tens of thousands of terms. No one actually derives the derivative of course. And so we're not going to take this kind of symbolic approach. Instead, what I'd like to do is I'd like to look at the definition of derivative and just make sure that we really understand what derivative is measuring, what it's telling you about the function.
And so if we just look up derivative... Okay so this is not a very good definition of derivative, this is a definition of what it means to be differentiable. But if you remember from your calculus, it is the limit as $h$ goes to zero of $(f(x+h) - f(x)) / h$.
So basically what it's saying is if you slightly bump up... you're at some point $x$ that you're interested in, or $a$. And if you slightly bump up, you know you slightly increase it by small number $h$, how does the function respond? With what sensitivity does it respond? What is the slope at that point? Does the function go up or does it go down and by how much? And that's the slope of that function, the slope of that response at that point.
And so we can basically evaluate the derivative here numerically by taking a very small $h$. Of course the definition would ask us to take $h$ to zero; we're just going to pick a very small $h$, 0.001. And let's say we're interested in point 3.0. So we can look at $f(x)$ of course as 20. And now $f(x+h)$. So if we slightly nudge $x$ in a positive direction, how is the function going to respond?
And just looking at this, do you expect $f(x+h)$ to be slightly greater than 20 or do you expect to be slightly lower than 20? And since this 3 is here and this is 20, if we slightly go positively the function will respond positively. So you'd expect this to be slightly greater than 20.
And now by how much? It's telling you the sort of the strength of that slope, right, the size of the slope. So $f(x+h) - f(x)$, this is how much the function responded in the positive direction. And we have to normalize by the run, so we have the rise over run to get the slope. So this of course is just a numerical approximation of the slope because we have to make $h$ very very small to converge to the exact amount.
Now if I'm doing too many zeros, at some point I'm going to get an incorrect answer because we're using floating point arithmetic and the representations of all these numbers in computer memory is finite and at some point we get into trouble. So we can converge towards the right answer with this approach. But basically, at 3 the slope is 14.
And you can see that by taking $3x^2 - 4x + 5$ and differentiating it in our head. So the derivative of $3x^2 - 4x + 5$ would be $6x - 4$. And then we plug in $x=3$, so that's $18 - 4$, which is 14. So this is correct at 3.
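The effect of shrinking $h$ can be sketched as follows. The helper names `slope` and `dfdx` are hypothetical, and the particular list of step sizes is just an illustrative choice.

```python
def f(x):
    return 3 * x**2 - 4 * x + 5

def slope(x, h):
    """Forward-difference estimate of the derivative: rise over run."""
    return (f(x + h) - f(x)) / h

def dfdx(x):
    """Analytic derivative 6x - 4, for comparison."""
    return 6 * x - 4

x = 3.0
for h in (1e-1, 1e-3, 1e-6, 1e-9, 1e-17):
    print(f"h={h:g}  numeric={slope(x, h):.6f}  analytic={dfdx(x):.1f}")
```

For moderate step sizes the estimate approaches 14, but at the smallest step `x + h` is indistinguishable from `x` in double precision, so the estimate collapses to zero. That is the floating point trouble mentioned above.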
Now how about the slope at say negative 3? What would you expect for the slope? Now telling the exact value is really hard but what is the sign of that slope? So at negative three, if we slightly go in the positive direction at $x$, the function would actually go down and so that tells you that the slope would be negative. So we'll get a number slightly below $f(-3)$, which is 44. And so if we take the slope we expect something negative, and we get negative 22.
Okay. And at some point here of course the slope would be zero. Now for this specific function I looked it up previously and it's at point $2/3$. So at roughly $2/3$, that's somewhere here, this derivative would be zero. So basically at that precise point, if we nudge in a positive direction the function doesn't respond, this stays the same almost, and so that's why the slope is zero.
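The three points discussed so far can be checked side by side with the same rise-over-run estimate. This is a small sketch; `numeric_slope` is a hypothetical helper and the step size is an assumption.

```python
def f(x):
    return 3 * x**2 - 4 * x + 5

def numeric_slope(x, h=1e-6):
    # h is an assumed step size; any sufficiently small value behaves similarly
    return (f(x + h) - f(x)) / h

for x in (3.0, -3.0, 2 / 3):
    print(f"x={x:+.4f}  f(x)={f(x):.4f}  slope~{numeric_slope(x):+.4f}")
```

The sign of each estimate matches the intuition: positive at 3, negative at negative 3, and essentially zero at $2/3$, where the parabola bottoms out.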
Okay, now let's look at a bit more complex case. So we're going to start complexifying a bit. So now we have a function here with output variable $d$ that is a function of three scalar inputs $a$, $b$, and $c$. So $a$, $b$, and $c$ are some specific values, three inputs into our expression graph, and a single output $d$. And so if we just print $d$ we get four.
And now what I'd like to do is again look at the derivatives of $d$ with respect to $a$, $b$, and $c$ and think through again just the intuition of what this derivative is telling us. So in order to evaluate this derivative, we're going to get a bit hacky here. We're going to again have a very small value of $h$ and then we're going to fix the inputs at some values that we're interested in. So this is the point $(a, b, c)$ at which we're going to be evaluating the derivative of $d$ with respect to each of $a$, $b$, and $c$.
So there are the inputs and now we have $d1$ is that expression. And then we're going to for example look at the derivative of $d$ with respect to $a$. So we'll take $a$ and we'll bump it by $h$. And then we'll get $d2$ to be the exact same function. And now we're going to print $d1$, $d2$, and print slope. So the derivative or slope here will be of course $(d2 - d1) / h$. So $d2 - d1$ is how much the function increased when we bumped the specific input that we're interested in by a tiny amount, and this is then normalized by $h$ to get the slope.
So if I just run this we're going to print $d1$ which we know is four. Now $d2$ will be... $a$ will be bumped by $h$. So let's just think through a little bit what $d2$ will be printed out here. In particular, $d1$ will be four. Will $d2$ be a number slightly greater than four or slightly lower than four? And that's going to tell us the sign of the derivative.
So we're bumping $a$ by $h$. $b$ is minus three, $c$ is ten. So you can just intuitively think through this derivative and what it's doing. $a$ will be slightly more positive, but $b$ is a negative number. So if $a$ is slightly more positive, because $b$ is negative three, we're actually going to be adding less to $d$. So you'd actually expect that the value of the function will go down. So let's just see this.
Yeah, and so we went from 4 to about 3.9997. And that tells you that the slope will be a negative number, because we went down. And the exact amount of slope is negative 3.
And you can also convince yourself that negative 3 is the right answer analytically, because if you have $a \times b + c$ and you know your calculus, then differentiating $a \times b + c$ with respect to $a$ gives you just $b$. And indeed the value of $b$ is negative 3, which is the derivative that we have, so you can tell that that's correct.
So now if we do this with $b$. So if we bump $b$ by a little bit in a positive direction we'd get a different slope. So what is the influence of $b$ on the output $d$? So if we bump $b$ by a tiny amount in a positive direction, then because $a$ is positive, we'll be adding more to $d$. So now what is the sensitivity? What is the slope of that addition? And it might not surprise you that this should be 2. And why is it 2? Because differentiating $d$ with respect to $b$ would give us $a$. And the value of $a$ is two so that's also working well.
And then if $c$ gets bumped by a tiny amount $h$, then of course $a \times b$ is unaffected. And now $c$ becomes slightly higher. What does that do to the function? It makes it slightly higher because we're simply adding $c$. And it makes it slightly higher by the exact same amount that we added to $c$. And so that tells you that the slope is one. That will be the rate at which $d$ will increase as we increase $c$.
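The three bumps can be collected into one short sketch. The step size of 0.0001 is an assumption (this part of the lecture only says "very small"), and `d_of` is a hypothetical helper name.

```python
h = 0.0001  # assumed step size

def d_of(a, b, c):
    return a * b + c

a, b, c = 2.0, -3.0, 10.0
d1 = d_of(a, b, c)

# Bump one input at a time and measure rise over run.
slopes = {
    "a": (d_of(a + h, b, c) - d1) / h,  # analytically equal to b
    "b": (d_of(a, b + h, c) - d1) / h,  # analytically equal to a
    "c": (d_of(a, b, c + h) - d1) / h,  # analytically equal to 1
}

print("d1 =", d1)
for name, s in slopes.items():
    print(f"slope w.r.t. {name}: {s:.4f}")
```

Each numerical slope agrees with the analytic answer: negative 3 for $a$, 2 for $b$, and 1 for $c$.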
Okay, so we now have some intuitive sense of what this derivative is telling you about the function and we'd like to move to neural networks. Now as I mentioned, neural networks will be pretty massive mathematical expressions so we need some data structures that maintain these expressions and that's what we're going to start to build out now. So we're going to build out this `Value` object that I showed you in the readme page of micrograd.
So let me copy paste a skeleton of the first very simple Value object. So class `Value` takes a single scalar value that it wraps and keeps track of. And that's it. So we can for example do `Value(2.0)` and then we can get... we can look at its content and Python will internally use the `__repr__` function to return this string. So this is a `Value` object with data=2 that we're creating here.
Now what we'd like to do is we'd like to be able to have not just two values but we'd like to do `a + b`, right? We'd like to add them. So currently you would get an error because Python doesn't know how to add two Value objects, so we have to tell it. So here's addition. So you have to basically use these special double underscore methods in Python to define these operators for these objects.
So if we use this plus operator, Python will internally call `a.__add__(b)`. That's what will happen internally. And so `other` will be `b` and `self` will be `a`. And so we see that what we're going to return is a new Value object and it's just going to be wrapping the sum of their data. But remember, because `data` is an actual Python number, this operator here is just the typical floating point addition. It's not an addition of Value objects. And it will return a new `Value`.
So now `a + b` should work and it should print `Value(-1)` because that's two plus minus three. There we go.
Okay let's now implement multiply just so we can recreate this expression here. So multiply, I think it won't surprise you, will be fairly similar. So instead of add we're going to be using `mul`. And then here of course we want to do times. And so now we can create a `c` Value object which will be 10.0 and now we should be able to do `a * b`... well let's just do `a * b` first. That's `Value(-6)` now.
And by the way I skipped over this a little bit. Suppose that I didn't have the `__repr__` function here, then it's just that you'll get some kind of an ugly expression. So what `__repr__` is doing is it's providing us a way to print out a nicer looking expression in Python so we don't just have something cryptic, we actually see `Value(-6)`.
So this gives us `a * b` and then this we should now be able to add `c` to it because we've defined and told Python how to do `mul` and `add`. And so this will basically be equivalent to `a.__mul__(b)` and then this new Value object will be `.__add__(c)`. And so let's see if that worked. Yep so that worked well, that gave us four which is what we expect from before. And I believe we can just call them manually as well. There we go.
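Put together, the skeleton described so far might look like the following minimal sketch. The exact string returned by `__repr__` is an assumption; the transcript only says it shows the wrapped data.

```python
class Value:
    """Wraps a single scalar so arithmetic operators can be overloaded on it."""

    def __init__(self, data):
        self.data = data

    def __repr__(self):
        # Controls what Python prints for this object.
        return f"Value(data={self.data})"

    def __add__(self, other):
        # a + b calls a.__add__(b); the inner + is ordinary float addition.
        return Value(self.data + other.data)

    def __mul__(self, other):
        return Value(self.data * other.data)

a = Value(2.0)
b = Value(-3.0)
c = Value(10.0)

d = a * b + c                       # operator form
d_manual = a.__mul__(b).__add__(c)  # the same calls written out by hand

print(a + b, a * b, d, d_manual)
```

Both forms produce a `Value` wrapping 4.0, since the operators are just shorthand for the double underscore method calls.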
Okay so now what we are missing is the connective tissue of this expression. As I mentioned we want to keep these expression graphs so we need to know and keep pointers about what values produce what other values. So here for example we are going to introduce a new variable which we'll call `_children` and by default it will be an empty tuple. And then we're actually going to keep a slightly different variable in the class which we'll call `_prev` which will be the set of children. This is how I did it in the original micrograd looking at my code here, I can't remember exactly the reason, I believe it was efficiency but this `_children` will be a tuple for convenience but then when we actually maintain it in the class it will be just this set. Yeah I believe for efficiency.
So now when we are creating a value like this with a constructor, children will be empty and prev will be the empty set. But when we're creating a value through addition or multiplication, we're going to feed in the children of this value which in this case is self and other. So those are the children here.
So now we can do `d._prev` and we'll see that the children of `d` we now know are this `Value(-6)` and `Value(10)`, and this of course is the value resulting from `a * b` and the `c` value which is 10.
Now the last piece of information we don't know... so we know the children of every single value but we don't know what operation created this value. So we need one more element here, let's call it `_op`, and by default this is the empty string for leaves. And then we'll just maintain it here. And now the operation will be just a simple string: in the case of addition it's plus, in the case of multiplication it's times.
So now we not just have `d._prev`, we also have a `d._op`. And we know that `d` was produced by an addition of those two values and so now we have the full mathematical expression and we're building out this data structure and we know exactly how each value came to be by what expression and from what other values.
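With the children and the operation recorded, the class could be sketched like this. It follows the names given in the transcript (`_children`, `_prev`, `_op`); the `__repr__` format is an assumption.

```python
class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self._prev = set(_children)  # the Values this one was computed from
        self._op = _op               # '' for leaves, '+' or '*' otherwise

    def __repr__(self):
        return f"Value(data={self.data})"

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), '+')

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), '*')

a, b, c = Value(2.0), Value(-3.0), Value(10.0)
d = a * b + c

print(d._prev)       # the a*b result and c (set order is not guaranteed)
print(repr(d._op))   # '+'
print(repr(a._op))   # '' because a is a leaf
```

Every non-leaf value now carries enough information to walk the expression back to its inputs.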
Now because these expressions are about to get quite a bit larger, we'd like a way to nicely visualize these expressions that we're building out. So for that I'm going to copy paste a bunch of slightly scary code that's going to visualize these expression graphs for us. So here's the code and I'll explain it in a bit but first let me just show you what this code does.
Basically what it does is it creates a new function `draw_dot` that we can call on some root node and then it's going to visualize it. So if we call `draw_dot` on `d`, which is this final value here that is `a * b + c`, it creates something like this. So this is `d` and you see that this is `a * b` creating an intermediate value, and adding `c` gives us this output node `d`. So that's `draw_dot` of `d`.
And I'm not going to go through this in complete detail. You can take a look at Graphviz and its API. Graphviz is an open source graph visualization software. And what we're doing here is we're building out this graph in Graphviz API and you can basically see that trace is this helper function that enumerates all of the nodes and edges in the graph. So that just builds a set of all the nodes and edges and then we iterate for all the nodes and we create special node objects for them using `.node`. And then we also create edges using `.edge`.
And the only thing that's like slightly tricky here is you'll notice that I basically add these fake nodes which are these operation nodes. So for example, this node here is just like a plus node and I create these special op nodes here and I connect them accordingly. So these nodes of course are not actual nodes in the original graph, they're not actually a value object. The only value objects here are the things in squares; those are actual value objects or representations thereof. And these op nodes are just created in this `draw_dot` routine so that it looks nice.
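The graph-walking half of that visualization can be sketched without Graphviz. Below is a hedged version of a `trace` helper that collects nodes and edges; the rendering step (`.node`, `.edge`, and the fake op nodes) is deliberately left out, and the edge orientation `(child, parent)` is an assumption.

```python
class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self._prev = set(_children)
        self._op = _op

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), '+')

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), '*')

def trace(root):
    """Collect every node and every (child, parent) edge reachable from root."""
    nodes, edges = set(), set()

    def build(v):
        if v not in nodes:
            nodes.add(v)
            for child in v._prev:
                edges.add((child, v))
                build(child)

    build(root)
    return nodes, edges

a, b, c = Value(2.0), Value(-3.0), Value(10.0)
d = a * b + c

nodes, edges = trace(d)
print(len(nodes), "nodes,", len(edges), "edges")
```

A drawing routine would then loop over `nodes` and `edges`, and for any node with a non-empty `_op` insert an extra operation node between it and its children.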
Let's also add labels to these graphs just so we know what variables are where. So let's create a special label... or let's just do label equals empty by default and save it in each node. And then here we're going to do label as `a`, label is `b`, label `c`. And then let's create a special `e = a * b` and `e.label` will be `e`. And `d` will be `e + c` and `d.label` will be `d`.
Okay so nothing really changes, I just added this new `e` variable. And then here when we are printing this, I'm going to print the label here. And so now we have the label on the left here so it says $a$, $b$ creating $e$, and then $e + c$ creates $d$, just like we have it here.
And finally let's make this expression just one layer deeper. So `d` will not be the final output node. Instead after `d` we are going to create a new Value object called `f`—we're going to start running out of variables soon—`f` will be negative 2.0 and its label will of course just be `f`. And then `L`, capital `L`, will be the output of our graph and `L` will be `d * f`. Okay. So `L` will be negative eight is the output. So now we don't just draw `d`, we draw `L`. And somehow the label of `L` was undefined, oops, `L.label` has to be explicitly sort of given to it. There we go. So `L` is the output.
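The labeled, one-layer-deeper expression can be written out as the sketch below. Labels are set as described in the transcript; the `__repr__` format is an assumption.

```python
class Value:
    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self._prev = set(_children)
        self._op = _op
        self.label = label  # empty by default; results must be labeled by hand

    def __repr__(self):
        return f"Value(label={self.label!r}, data={self.data})"

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), '+')

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), '*')

a = Value(2.0, label='a')
b = Value(-3.0, label='b')
c = Value(10.0, label='c')
e = a * b; e.label = 'e'
d = e + c; d.label = 'd'
f = Value(-2.0, label='f')
L = d * f; L.label = 'L'  # without this line L.label stays ''

print(L)
```

The forward pass ends at `L` with data negative eight, matching the value quoted above.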
So let's quickly recap what we've done so far. We are able to build out mathematical expressions using only plus and times so far. They are scalar valued along the way and we can do this forward pass and build out a mathematical expression. So we have multiple inputs here $a$, $b$, $c$, and $f$ going into a mathematical expression that produces a single output $L$. And this here is visualizing the forward pass. So the output of the forward pass is negative eight, that's the value.
Now what we'd like to do next is we'd like to run backpropagation. And in backpropagation we are going to start here at the end and we're going to reverse and calculate the gradient along all these intermediate values. And really what we're computing for every single value here is the derivative of $L$ with respect to that node.
So the derivative of $L$ with respect to $L$ is just one. And then we're going to derive what is the derivative of $L$ with respect to $f$, with respect to $d$, with respect to $c$, with respect to $e$, with respect to $b$, and with respect to $a$.
And in the neural network setting you'd be very interested in the derivative of basically this loss function $L$ with respect to the weights of a neural network. And here of course we have just these variables $a$, $b$, $c$, and $f$, but some of these will eventually represent the weights of a neural net and so we'll need to know how those weights are impacting the loss function. So we'll be interested basically in the derivative of the output with respect to some of its leaf nodes, and those leaf nodes will be the weights of the neural net. And the other leaf nodes of course will be the data itself, but usually we will not want or use the derivative of the loss function with respect to data because the data is fixed, but the weights will be iterated on using the gradient information.
So next we are going to create a variable inside the Value class that maintains the derivative of $L$ with respect to that value and we will call this variable `grad`. So there's a `data` and there's a `self.grad`. And initially it will be zero. And remember that zero basically means no effect. So at initialization we're assuming that every value does not impact, does not affect the output. Right, because if the gradient is zero that means that changing this variable is not changing the loss function. So by default we assume that the gradient is zero.
And then now that we have `grad` and it's 0.0, we are going to be able to visualize it here after data. So here grad is formatted with `%.4f` and this will be in that graph. And now we are going to be showing both the data and the grad initialized at zero.
And we are just about getting ready to calculate the backpropagation. And of course this `grad` again, as I mentioned, is representing the derivative of the output in this case $L$ with respect to this value. So this is the derivative of $L$ with respect to $f$, with respect to $d$, and so on. So let's now fill in those gradients and actually do backpropagation manually.
So let's start filling in these gradients and start all the way at the end as I mentioned here. First we are interested to fill in this gradient here. So what is the derivative of $L$ with respect to $L$? In other words if I change $L$ by a tiny amount of $h$, how much does $L$ change? It changes by $h$, so it's proportional and therefore derivative will be one.
We can of course estimate these gradients numerically just like we've seen before. So if I take this expression and I create a `def lol()` function here and put this here... now the reason I'm creating a gating function `lol` here is because I don't want to pollute or mess up the global scope here. This is just kind of like a little staging area and as you know in Python all of these will be local variables to this function so I'm not changing any of the global scope here.
So here `L1` will be `L`. And then copy pasting this expression, we're going to add a small amount $h$ to, for example, $a$. Right, and this would be measuring the derivative of $L$ with respect to $a$. So here this will be `L2`. And then we want to print this derivative. So `print((L2 - L1)/h)`, which is how much $L$ changed, normalized by $h$, so this is the rise over run. And we have to be careful because $L$ is a `Value` node so we actually want its data so that these are floats dividing by $h$. And this should print the derivative of $L$ with respect to $a$ because $a$ is the one that we bumped a little bit by $h$. So what is the derivative of $L$ with respect to $a$? It's six.
Okay, and obviously if we bump $L$ itself by $h$, which looks really awkward here, you see the derivative is 1. That's kind of like the base case of what we are doing here. So basically we can now come up here and manually set `L.grad` to one. This is our manual backpropagation. `L.grad` is one, and let's redraw, and we'll see that we filled in grad as 1 for $L$.
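A hedged sketch of that staging function follows. It keeps the lecture's name `lol` but, as a deviation for brevity, rebuilds the expression through a nested hypothetical helper `forward` instead of copy-pasting it twice, and returns the slope so it can be inspected. The step size is an assumption.

```python
class Value:
    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self.grad = 0.0  # derivative of L w.r.t. this value; 0.0 means "no effect" until filled in
        self._prev = set(_children)
        self._op = _op
        self.label = label

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), '+')

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), '*')

def lol(h=0.0001):
    """Estimate dL/da numerically; everything here is local, so no globals change."""

    def forward(a_data):
        # Rebuild the whole expression L = (a*b + c) * f from scratch.
        a = Value(a_data, label='a')
        b = Value(-3.0, label='b')
        c = Value(10.0, label='c')
        f = Value(-2.0, label='f')
        L = (a * b + c) * f
        return L.data  # a plain float, so the subtraction below is on floats

    L1 = forward(2.0)
    L2 = forward(2.0 + h)  # bump a by h
    return (L2 - L1) / h

print(lol())  # close to 6
```

Moving the `+ h` to a different input inside `forward` would estimate the derivative of $L$ with respect to that input instead.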
We're now going to continue the backpropagation. So let's here look at the derivatives of $L$ with respect to $d$ and $f$. Let's do $d$ first. So what we are interested in—if I create a markdown node here—is we'd like to know basically we have that $L = d \times f$ and we'd like to know what is $dL/dd$. What is that? And if you know your calculus, $L$ is $d \times f$, so what is $dL/dd$? It would be $f$.
And if you don't believe me we can also just derive it because the proof would be fairly straightforward. We go to the definition of the derivative which is $(f(x+h) - f(x)) / h$ as a limit of $h$ goes to zero of this kind of expression. So when we have $L = d \times f$, then increasing $d$ by $h$ would give us the output of $(d+h) \times f$. That's basically $f(x+h)$ right. Minus $d \times f$. And then divide $h$.
And symbolically expanding out here we would have basically $d \times f + h \times f - d \times f$ divide $h$. And then you see how the $df - df$ cancels so you're left with $h \times f / h$, which is $f$. So in the limit as $h$ goes to zero of derivative definition we just get $f$ in the case of $d \times f$.
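Written out in one line, the derivation above is:

$$\frac{dL}{dd} = \lim_{h \to 0} \frac{(d+h)\,f - d\,f}{h} = \lim_{h \to 0} \frac{d\,f + h\,f - d\,f}{h} = \lim_{h \to 0} \frac{h\,f}{h} = f$$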
So symmetrically $dL/df$ will just be $d$. So what we have is that `f.grad`, we see now is just the value of `d`, which is 4. And we see that `d.grad` is just the value of `f`, and so the value of `f` is negative two. So we'll set those manually. Let me erase this markdown node and then let's redraw what we have.
Okay. And let's just make sure that these were correct. So we seem to think that $dL/dd$ is negative two so let's double check. Um, let me erase this `+ h` from before and now we want the derivative with respect to $f$. So let's just come here when I create $f$ and let's do a `+ h` here and this should print the derivative of $L$ with respect to $f$, so we expect to see four. Yeah and this is four up to floating point funkiness.
And then $dL/dd$ should be $f$ which is negative two. `grad` is negative two. So if we again come here and we change $d$, `d.data += h` right here, so we expect... so we've added a little $h$ and then we see how $L$ changed and we expect to print negative two. There we go. So we've numerically verified it. What we're doing here is kind of like an inline gradient check. A gradient check is when we derive the gradients with backpropagation, getting the derivative with respect to all the intermediate results, and then compare them against the numerical gradient, which is just, you know, estimating it using a small step size.
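A minimal sketch of this inline gradient check, again with plain floats. The helper `forward` and its `d_bump` parameter are hypothetical names introduced here; `d_bump` plays the role of `d.data += h` on the intermediate node.

```python
def forward(a=2.0, b=-3.0, c=10.0, f=-2.0, d_bump=0.0):
    e = a * b
    d = e + c + d_bump  # d_bump mimics nudging the intermediate node d directly
    return d * f

h = 0.001  # assumed small step size
L1 = forward()

dL_df = (forward(f=-2.0 + h) - L1) / h   # claimed analytically: d = 4
dL_dd = (forward(d_bump=h) - L1) / h     # claimed analytically: f = -2

print(dL_df, dL_dd)
```

Both numbers should agree with the analytic claims up to floating point error.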
Now we're getting to the crux of backpropagation. So this will be the most important node to understand because if you understand the gradient for this node you understand all of backpropagation and all of training of neural nets basically. So we need to derive $dL/dc$. In other words the derivative of $L$ with respect to $c$ because we've computed all these other gradients already. Now we're coming here and we're continuing the backpropagation manually.
So we want $dL/dc$ and then we'll also derive $dL/de$. Now here's the problem. How do we derive $dL/dc$? We actually know the derivative of $L$ with respect to $d$ so we know how $L$ is sensitive to $d$. But how is $L$ sensitive to $c$? So if we wiggle $c$ how does that impact $L$ through $d$?
So we know $dL/dd$... and we also here know how $c$ impacts $d$. And so just very intuitively if you know the impact that $c$ is having on $d$ and the impact that $d$ is having on $L$, then you should be able to somehow put that information together to figure out how $c$ impacts $L$. And indeed this is what we can actually do.
So in particular we know just concentrating on $d$ first let's look at what is the derivative basically of $d$ with respect to $c$. So in other words what is $dd/dc$? So here we know that $d$ is $c + e$. That's what we know and now we're interested in $dd/dc$. If you just know your calculus again and you remember that differentiating $c + e$ with respect to $c$ you know that that gives you 1.0.
And we can also go back to the basics and derive this because again we can go to our $(f(x+h) - f(x)) / h$. That's the definition of a derivative as $h$ goes to zero. And so here focusing on $c$ and its effect on $d$, we can basically do the $f(x+h)$ will be: $c$ is incremented by $h$ plus $e$. That's the first evaluation of our function minus $c + e$. And then divide $h$.
And so what is this? Just expanding this out this will be $c + h + e - c - e$ divide $h$. And then you see here how $c - c$ cancels, $e - e$ cancels, we're left with $h / h$ which is 1.0. And so by symmetry also $dd/de$ will be 1.0 as well.
So basically the derivative of a sum expression is very simple. And this is the local derivative. So I call this the local derivative because we have the final output value all the way at the end of this graph and we're now like a small node here. And this is a little plus node. And the little plus node doesn't know anything about the rest of the graph that it's embedded in. All it knows is that it did a plus; it took a $c$ and an $e$, added them and created $d$.
And this plus node also knows the local influence of $c$ on $d$ or rather the derivative of $d$ with respect to $c$, and it also knows the derivative of $d$ with respect to $e$. But that's not what we want. That's just a local derivative. What we actually want is $dL/dc$. And $L$ could... $L$ is here just one step away but in a general case this little plus node could be embedded in like a massive graph. So again we know how $d$ impacts $L$ and now we know how $c$ and $e$ impact $d$. How do we put that information together to write $dL/dc$? And the answer of course is the Chain Rule in calculus.
And so I pulled up a chain rule here from Wikipedia. And I'm going to go through this very briefly. So Chain Rule... Wikipedia sometimes can be very confusing and calculus can be very confusing. Like this is the way I learned Chain Rule and it was very confusing like what is happening, it's just complicated. So I like this expression much better. If a variable $z$ depends on a variable $y$, which itself depends on the variable $x$, then $z$ depends on $x$ as well obviously through the intermediate variable $y$. In this case the chain rule is expressed as: if you want $dz/dx$ then you take the $dz/dy$ and you multiply it by $dy/dx$.
So the chain rule fundamentally is telling you how we chain these derivatives together correctly. So to differentiate through a function composition we have to apply a multiplication of those derivatives. So that's really what chain rule is telling us. And there's a nice little intuitive explanation here which I also think is kind of cute. The chain rule says that knowing the instantaneous rate of change of $z$ with respect to $y$ and $y$ relative to $x$ allows one to calculate the instantaneous rate of change of $z$ relative to $x$ as a product of those two rates of change. Simply the product of those two.
So here's a good one. If a car travels twice as fast as a bicycle and the bicycle is four times as fast as a walking man, then the car travels two times four, eight times as fast as the man. And so this makes it very clear that the correct thing to do sort of is to multiply. So the car is twice as fast as the bicycle and the bicycle is four times as fast as the man, so the car will be eight times as fast as the man. And so we can take these intermediate rates of change if you will and multiply them together and that justifies the chain rule intuitively.
So have a look at chain rule but here really what it means for us is there's a very simple recipe for deriving what we want, which is $dL/dc$. And what we have so far is we know what we want and we know what is the impact of $d$ on $L$. So we know $dL/dd$, the derivative of $L$ with respect to $d$, we know that that's negative two.
And now because of this local reasoning that we've done here we know $dd/dc$. So how does $c$ impact $d$? And in particular this is a plus node so the local derivative is simply 1.0, it's very simple. And so the chain rule tells us that $dL/dc$, going through this intermediate variable, will just be simply $(dL/dd) \times (dd/dc)$. That's chain rule.
So this is identical to what's happening here except $z$ is our $L$, $y$ is our $d$ and $x$ is our $c$. So we literally just have to multiply these. And because these local derivatives like $dd/dc$ are just one, we basically just copy over $dL/dd$ because this is just times one.
So what does it do? So because $dL/dd$ is negative two, what is $dL/dc$? Well it's the local gradient 1.0 times $dL/dd$ which is negative two. So literally what a plus node does, you can look at it that way, is it literally just routes the gradient because the plus node's local derivatives are just one. And so in the chain rule, one times $dL/dd$ is just $dL/dd$ and so that derivative just gets routed to both $c$ and to $e$ in this case.
So basically we have that `c.grad`—or let's start with `c` since that's the one we looked at—is negative two times one, negative two. And in the same way by symmetry `e.grad` will be negative two, that's the claim. So we can set those, we can redraw, and you see how we just assign negative two, negative two.
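Spelled out with the numbers from this example, the chain rule through the plus node reads:

$$\frac{dL}{dc} = \frac{dL}{dd}\cdot\frac{dd}{dc} = (-2)(1.0) = -2, \qquad \frac{dL}{de} = \frac{dL}{dd}\cdot\frac{dd}{de} = (-2)(1.0) = -2$$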
So this backpropagating signal which is carrying the information of like what is the derivative of $L$ with respect to all the intermediate nodes, we can imagine it almost like flowing backwards through the graph and a plus node will simply distribute the derivative to all the leaf nodes—sorry to all the children nodes of it. So this is the claim and now let's verify it. So let me remove the `+ h` from before.
And now instead what we're going to do is we're going to increment `c`, so `c.data` will be incremented by $h$. And when I run this we expect to see negative 2. Negative 2. And then of course for `e`. So `e.data += h` and we expect to see negative 2. Simple.
So those are the derivatives of these internal nodes and now we're going to recurse our way backwards again. And we're again going to apply the chain rule. So here we go our second application of chain rule and we will apply it all the way through the graph. We just happen to only have one more node remaining. We have that $dL/de$ as we have just calculated is negative two. So we know that. So we know the derivative of $L$ with respect to $e$. And now we want $dL/da$. Right.
And the chain rule is telling us that that's just $dL/de$ (negative 2) times the local gradient. So what is the local gradient? Basically $de/da$. We have to look at that. So I'm a little times node inside a massive graph and I only know that I did $a \times b$ and I produced an $e$. So now what is $de/da$ and $de/db$? That's the only thing that I sort of know about, that's my local gradient.
So because we have that $e$ is $a \times b$, we're asking what is $de/da$. And of course we just did that here, we had $a$ times... so I'm not going to rederive it, but if you want to differentiate this with respect to $a$ you'll just get $b$, right, the value of $b$, which in this case is negative 3.0.
So basically we have that $dL/da$... well let me just do it right here. We have that `a.grad` and we are applying chain rule here is $dL/de$ which we see here is negative two times what is $de/da$? It's the value of $b$ which is negative 3. That's it.
And then we have `b.grad` is again $dL/de$ which is negative 2 just the same way, times what is $de/db$? It's the value of $a$, which is 2.0. So these are our claimed derivatives. Let's redraw. And we see here that `a.grad` turns out to be 6 because that is negative 2 times negative 3. And `b.grad` is negative 4 times... sorry is negative 2 times 2 which is negative 4.
So those are our claims. Let's delete this and let's verify them. We have `a` here `a.data += h`. So the claim is that `a.grad` is six. Let's verify. Six. And we have `b.data += h`. So nudging `b` by $h$ and looking at what happens, we claim it's negative four. And indeed it's negative four plus minus again float oddness. And that's it.
That was the manual backpropagation all the way from here to all the leaf nodes and we've done it piece by piece. And really all we've done is as you saw we iterated through all the nodes one by one and locally applied the chain rule. We always know what is the derivative of $L$ with respect to this little output and then we look at how this output was produced. This output was produced through some operation and we have the pointers to the children nodes of this operation. And so in this little operation we know what the local derivatives are and we just multiply them onto the derivative always. So we just go through and recursively multiply on the local derivatives and that's what backpropagation is: is just a recursive application of chain rule backwards through the computation graph.
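The whole manual pass can be condensed into a few lines. This is a sketch with plain floats rather than the lecture's `Value` objects; the `*_grad` variable names are made up for illustration, and the input values are inferred from the running example.

```python
# Forward pass
a, b, c, f = 2.0, -3.0, 10.0, -2.0
e = a * b      # -6.0
d = e + c      #  4.0
L = d * f      # -8.0

# Backward pass: start at the output and apply the chain rule node by node
L_grad = 1.0             # dL/dL, the base case
d_grad = f * L_grad      # times node: the local derivative is the other factor
f_grad = d * L_grad
c_grad = 1.0 * d_grad    # plus node: local derivative is 1, so the gradient is routed through
e_grad = 1.0 * d_grad
a_grad = b * e_grad      # times node again
b_grad = a * e_grad

print(a_grad, b_grad, c_grad, f_grad)  # 6.0 -4.0 -2.0 4.0
```

Each line of the backward pass is the local derivative multiplied by the gradient already known for the node's output.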
Let's see this power in action just very briefly. What we're going to do is we're going to nudge our inputs to try to make $L$ go up. So in particular what we're doing is we want `a.data`... we're going to change it. And if we want $L$ to go up that means we just have to go in the direction of the gradient. So $a$ should increase in the direction of gradient by like some small step amount—this is the step size.
And we don't just want this for $a$ but also for $b$, also for $c$, also for $f$. Those are leaf nodes which we usually have control over. And if we nudge in direction of the gradient we expect a positive influence on $L$. So we expect $L$ to go up positively. So it should become less negative, it should go up to say negative you know six or something like that. It's hard to tell exactly and we'd have to rewrite the forward pass. So let me just do that here.
This would be the forward pass, $f$ would be unchanged. This is effectively the forward pass and now if we print `L.data` we expect, because we nudged all the values, all the inputs in the direction of the gradient, a less negative $L$. We expect it to go up. So maybe it's negative six or so, let's see what happens. Okay negative seven. And this is basically one step of an optimization that we'll end up running. And really this gradient just gives us some power because we know how to influence the final outcome and this will be extremely useful for training neural networks as well as you'll see.
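Here is a sketch of that single optimization step with plain floats. The step size of `0.01` is an assumption (the transcript only says "some small step amount"), and the gradients are the ones derived by hand above.

```python
# Leaf values and their hand-derived gradients
a, b, c, f = 2.0, -3.0, 10.0, -2.0
a_grad, b_grad, c_grad, f_grad = 6.0, -4.0, -2.0, 4.0

step = 0.01  # assumed step size

# Nudge every leaf in the direction of its gradient
a += step * a_grad
b += step * b_grad
c += step * c_grad
f += step * f_grad

# Rerun the forward pass with the nudged leaves
e = a * b
d = e + c
L = d * f

print(L)  # roughly -7.29, less negative than the original -8.0
```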
So now I would like to do one more example of manual backpropagation using a bit more complex and useful example. We are going to backpropagate through a neuron. So we want to eventually build up neural networks and in the simplest case these are multi-layer perceptrons as they're called. So this is a two-layer neural net and it's got these hidden layers made up of neurons and these neurons are fully connected to each other.
Now biologically neurons are very complicated devices but we have very simple mathematical models of them. And so this is a very simple mathematical model of a neuron. You have some inputs $x$'s and then you have these synapses that have weights on them. So the $w$'s are weights. And then the synapse interacts with the input to this neuron multiplicatively. So what flows to the cell body of this neuron is $w \times x$. But there's multiple inputs so there's many $w \times x$'s flowing into the cell body.
The cell body then has also like some bias. So this is kind of like the inert, innate sort of trigger happiness of this neuron. So this bias can make it a bit more trigger happy or a bit less trigger happy regardless of the input. But basically we're taking all the $w \times x$ of all the inputs, adding the bias, and then we take it through an activation function. And this activation function is usually some kind of a squashing function like a sigmoid or tanh or something like that. So as an example we're going to use the tanh in this example.
NumPy has a `np.tanh`. So we can call it on a range and we can plot it. This is the tanh function and you see that the inputs as they come in get squashed on the y coordinate here. So right at zero we're going to get exactly zero and then as you go more positive in the input, then you'll see that the function will only go up to one and then plateau out. And so if you pass in very positive inputs we're gonna cap it smoothly at one and on the negative side we're gonna cap it smoothly to negative one. So that's tanh. And that's the squashing function or an activation function.
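As a text stand-in for the plot, the squashing can be seen by printing a few values. This sketch uses the standard library's `math.tanh` instead of `np.tanh` purely to avoid a dependency; the sample points are arbitrary.

```python
import math

# A handful of sample inputs, from very negative to very positive
samples = {x: math.tanh(x) for x in (-5.0, -1.0, 0.0, 1.0, 5.0)}

for x, y in samples.items():
    print(f"tanh({x:+.1f}) = {y:+.4f}")
```

The output is exactly zero at zero and approaches, but never exceeds, plus or minus one at the extremes.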
And what comes out of this neuron is just the activation function applied to the dot product of the weights and the inputs. So let's write one out. I'm going to copy paste because I don't want to type too much, but okay. So here we have the inputs $x_1, x_2$. So this is a two-dimensional neuron, so two inputs are going to come in. These are thought of as the weights of this neuron, weights $w_1, w_2$. And these weights again are the synaptic strengths for each input. And this is the bias of the neuron, $b$.
And now what we want to do is, according to this model, we need to multiply $x_1 \times w_1$ and $x_2 \times w_2$ and then we need to add bias on top of it. And it gets a little messy here but all we are trying to do is $x_1 w_1 + x_2 w_2 + b$. And these are multiplied here except I'm doing it in small steps so that we actually have pointers to all these intermediate nodes. So we have `x1w1` variable, `x2w2` variable, and I'm also labeling them.
So `n` is now the cell body raw activation without the activation function for now. And this should be enough to basically plot it. So `draw_dot` of `n` gives us $x_1 \times w_1$, $x_2 \times w_2$ being added, then the bias gets added on top of this and this `n` is this sum.
So we're now going to take it through an activation function. And let's say we use the tanh so that we produce the output. So what we'd like to do here is we'd like to do the output, and I'll call it `o`, is `n.tanh()`. Okay but we haven't yet written the tanh. Now the reason that we need to implement another tanh function here is that tanh is a hyperbolic function and we've only so far implemented a plus and the times and you can't make a tanh out of just pluses and times. You also need exponentiation.
So tanh is this kind of a formula here. You can use either one of these and you see that there's exponentiation involved which we have not implemented yet for our little `Value` node here so we're not going to be able to produce tanh yet and we have to go back up and implement something like it.
Now one option here is we could actually implement exponentiation, right, and we could return the exp of a value instead of a tanh of a value. Because if we had exp then we have everything else that we need. So because we know how to add and we know how to multiply, so we'd be able to create tanh if we knew how to exp. But for the purposes of this example I specifically wanted to show you that we don't necessarily need to have the most atomic pieces in this Value object.
We can actually like create functions at arbitrary points of abstraction. They can be complicated functions but they can be also very very simple functions like a plus and it's totally up to us. The only thing that matters is that we know how to differentiate through any one function. So we take some inputs and we make an output; the only thing that matters, it can be arbitrarily complex function as long as you know how to create the local derivative. If you know the local derivative of how the inputs impact the output then that's all you need.
So we're going to cluster up all of this expression and we're not going to break it down to its atomic pieces. We're just going to directly implement tanh. So let's do that. `def tanh`: and then out will be a `Value` of... and we need this expression here. So let me actually copy paste. Let's grab `n` which is `self.data`. And then this I believe is the tanh... `math.exp(2*n)`... Maybe I can call this $x$ just so that it matches exactly.
Okay and now this will be `t` and children of this node there's just one child and I'm wrapping it in a tuple, so this is a tuple of one object just `self`. And here the name of this operation will be 'tanh' and we're going to return that.
Okay. So now `Value` should be implementing tanh and now we can scroll all the way down here and we can actually do `n.tanh()` and that's going to return the tanh output of `n`. And now we should be able to draw it out of `o`, not of `n`. So let's see how that worked. There we go. `n` went through tanh to produce this output. So now tanh is sort of our little micrograd supported node here as an operation. And as long as we know the derivative of tanh, then we'll be able to backpropagate through it.
Now let's see this tanh in action. Currently it's not squashing too much because the input to it is pretty low. So if the bias was increased to say eight, then we'll see that what's flowing into the tanh now is two and tanh is squashing it to 0.96. So we're already hitting the tail of this tanh and it will sort of smoothly go up to 1 and then plateau out over there.
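A minimal, forward-only sketch of a `Value` node with `tanh` implemented directly, applied to the neuron with the bias set to eight. The constructor signature and the `_prev` and `_op` attribute names are assumptions about the class built earlier in the lecture; the input and weight values are the ones implied by the gradients discussed later.

```python
import math

class Value:
    """Forward-only sketch of a Value node (no gradients yet)."""

    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self._prev = set(_children)  # the nodes this value was computed from
        self._op = _op               # name of the operation that produced it
        self.label = label

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), '+')

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), '*')

    def tanh(self):
        x = self.data
        t = (math.exp(2 * x) - 1) / (math.exp(2 * x) + 1)
        return Value(t, (self,), 'tanh')  # a tuple of one child: self

# Inputs, weights, and a bias of 8
x1, x2 = Value(2.0, label='x1'), Value(0.0, label='x2')
w1, w2 = Value(-3.0, label='w1'), Value(1.0, label='w2')
b = Value(8.0, label='b')

x1w1 = x1 * w1
x2w2 = x2 * w2
n = (x1w1 + x2w2) + b   # raw cell-body activation
o = n.tanh()            # squashed output

print(n.data, o.data)   # 2.0 and about 0.964
```

Treating `tanh` as a single node works because all that is needed later is its local derivative, not a decomposition into more atomic operations.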
Okay so now I'm going to do something slightly strange. I'm going to change this bias from 8 to this number 6.88 etc. And I'm going to do this for specific reasons because we're about to start backpropagation and I want to make sure that our numbers come out nice. They're not like very crazy numbers, they're nice numbers that we can sort of understand in our head. Let me also add a label... `o` is short for output here. So that's `o`.
Okay so 0.88 flows into tanh, comes out 0.7 and so on. So now we're going to do backpropagation and we're going to fill in all the gradients. So what is the derivative of `o` with respect to all the inputs here? And of course in the typical neural network setting what we really care about the most is the derivative of these neurons on the weights, specifically the `w2` and `w1`, because those are the weights that we're going to be changing as part of the optimization.
And the other thing that we have to remember is here we have only a single neuron but in the neural net you usually have many neurons and they're connected. So this is only like one small neuron, a piece of a much bigger puzzle. And eventually there's a loss function that sort of measures the accuracy of the neural net and we're backpropagating...
So let's start backpropagation here in the end. What is the derivative of `o` with respect to `o`? The base case, sort of, we know always is that the gradient is just 1.0. So let me fill it in. And then let me split out the drawing function here, and then here call `draw_dot`... clear this output here. Okay. So now when we draw `o`, we'll see that `o.grad` is one.
So now we're going to backpropagate through the tanh. So to backpropagate through tanh, we need to know the local derivative of tanh. So if we have that $o = \tanh(n)$, then what is $do/dn$? Now what you could do is you could come here and you could take this expression and you could do your calculus derivative taking, and that would work. But we can also just scroll down Wikipedia here into a section that hopefully tells us that derivative. $d/dx$ of $\tanh(x)$ is... any of these? I like this one: $1 - \tanh^2(x)$.
So basically what this is saying is that $do/dn$ is $1 - \tanh(n)^2$. And we already have $\tanh(n)$; that's just $o$. So it's $1 - o^2$. So $o$ is the output here; so the output is this number, data. And then what this is saying is that $do/dn$ is $1 - \text{data}^2$. So one minus that data squared is 0.5 conveniently. So the local derivative of this tanh operation here is 0.5. And so that would be $do/dn$. So we can fill in that `n.grad` is 0.5.
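The identity $do/dn = 1 - o^2$ can be checked numerically. In this sketch `n` is set to `atanh(sqrt(0.5))`, about 0.8814, which is an assumed exact form of the "0.88" input chosen so that the local derivative comes out to 0.5; the step size `h` is also an assumption.

```python
import math

n = math.atanh(math.sqrt(0.5))  # about 0.8814, so that tanh(n)**2 is 0.5
o = math.tanh(n)                # about 0.7071

# Analytic local derivative, expressed through the output itself
analytic = 1 - o**2

# Numerical estimate with a small step
h = 1e-6
numeric = (math.tanh(n + h) - math.tanh(n)) / h

print(o, analytic, numeric)
```

The analytic value and the finite-difference estimate should both be close to 0.5.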
So now we're going to continue the backpropagation. This is 0.5 and this is a plus node. So what is backprop going to do here? And if you remember our previous example, a plus is just a distributor of gradient. So this gradient will simply flow to both of these equally, and that's because the local derivative of this operation is one for every one of its nodes. So 1 times 0.5 is 0.5. So therefore, we know that this node here, the sum of `x1w1` and `x2w2`, its grad is just 0.5. And we know that `b.grad` is also 0.5. So let's set those and let's draw.
Continuing, we have another plus. 0.5 again, we'll just distribute it. So 0.5 will flow to both of these. So we can set `x1w1.grad` as well as `x2w2.grad` to 0.5. And let's redraw. Pluses are my favorite operations to backpropagate through because it's very simple.
So now it's flowing into these expressions as 0.5. And so really, again, keep in mind what the derivative is telling us at every point in time along here. This is saying that if we want the output of this neuron to increase, then the influence on these expressions is positive on the output; both of them are positive contribution to the output.
So now backpropagating to `x2` and `w2`. First, this is a times node, so we know that the local derivative is the other term. So if we want to calculate `x2.grad`, then... can you think through what it's going to be? So `x2.grad` will be `w2.data` times `x2w2.grad`, right? And `w2.grad` will be `x2.data` times `x2w2.grad`. Right? So that's the local piece of chain rule.
Let's set them and let's redraw. So here we see that the gradient on our weight 2 is 0 because `x2.data` was 0. Right? But `x2` will have the gradient 0.5 because data here was 1. And so what's interesting here is because the input `x2` was 0, then because of the way the times works, of course this gradient will be zero. And think about intuitively why that is. Derivative always tells us the influence of this on the final output. If I wiggle `w2`, how is the output changing? It's not changing because we're multiplying by zero. So because it's not changing, there's no derivative and zero is the correct answer, because we're squashing it at zero.
And let's do it here. 0.5 should come here and flow through this times. And so we'll have that `x1.grad` is... can you think through a little bit what this should be? The local derivative of times with respect to `x1` is going to be `w1`. So `w1.data` times `x1w1.grad`. And `w1.grad` will be `x1.data` times `x1w1.grad`.
Let's see what those came out to be. So this is 0.5, so this would be negative 1.5, and this would be 1. And we've backpropagated through this expression. These are the actual final derivatives. So if we want this neuron's output to increase, we know that what's necessary is that `w2`... we have no gradient, `w2` doesn't actually matter to this neuron right now. But this neuron, this weight `w1`, should go up. So if this weight goes up, then this neuron's output would have gone up, and proportionally because the gradient is one.
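The manual pass through the neuron, sketched with plain floats. The bias is written as `6.0 + atanh(sqrt(0.5))`, about 6.8814, which is an assumed exact form of the "6.88 etc." value; the name `s` for the intermediate sum and the `*_grad` names are made up for this sketch.

```python
import math

# Forward pass
x1, x2 = 2.0, 0.0
w1, w2 = -3.0, 1.0
b = 6.0 + math.atanh(math.sqrt(0.5))  # assumed form of the "6.88..." bias

x1w1 = x1 * w1
x2w2 = x2 * w2
s = x1w1 + x2w2
n = s + b
o = math.tanh(n)

# Backward pass, one node at a time
o_grad = 1.0                      # base case
n_grad = (1 - o**2) * o_grad      # tanh: local derivative is 1 - o^2
s_grad = 1.0 * n_grad             # plus node routes the gradient
b_grad = 1.0 * n_grad
x1w1_grad = 1.0 * s_grad          # another plus node
x2w2_grad = 1.0 * s_grad
x1_grad = w1 * x1w1_grad          # times node: the other factor
w1_grad = x1 * x1w1_grad
x2_grad = w2 * x2w2_grad
w2_grad = x2 * x2w2_grad          # zero, because x2 is zero

print(x1_grad, w1_grad, x2_grad, w2_grad)  # about -1.5, 1.0, 0.5, and exactly 0.0
```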
Okay, so doing the backpropagation manually is obviously ridiculous. So we are now going to put an end to this suffering and we're going to see how we can implement the backward pass a bit more automatically. We're not going to be doing all of it manually out here. It's now pretty obvious to us by example how these pluses and times are backpropagating gradients. So let's go up to the `Value` object and we're going to start codifying what we've seen in the examples below.
So we're going to do this by storing a special `self._backward`. And `_backward` will be a function which is going to do that little piece of chain rule at each little node that took inputs and produced output. We're going to store how we are going to chain the output's gradient into the input's gradients. So by default, this will be a function that doesn't do anything. And that would be sort of the case, for example, for a leaf node. For a leaf node, there's nothing to do.
But now, when we're creating these `out` values... these `out` values are an addition of self and other. And so we will want to set `out._backward` to be the function that propagates the gradient. So let's define what should happen, and we're going to store it in a closure. Let's define what should happen when we call `out._backward` for an addition. Our job is to take `out.grad` and propagate it into `self.grad` and `other.grad`. So basically we want to set `self.grad` to something and we want to set `other.grad` to something.
And the way we saw below how chain rule works: we want to take the local derivative times the sort of global derivative, I should call it, which is the derivative of the final output of the expression with respect to `out`. So the local derivative of `self` in an addition is 1.0. So it's just `1.0 * out.grad`. That's the chain rule. And `other.grad` will be `1.0 * out.grad`. And basically what you're seeing here is that `out.grad` will simply be copied onto `self.grad` and `other.grad` as we saw happens for an addition operation. So we're going to later call this function to propagate the gradient.
Having done an addition, let's now do multiplication. We're going to also define `_backward`, and we're going to set `out._backward` to be `_backward`. And we want to chain `out.grad` into `self.grad` and `other.grad`. And this will be a little piece of chain rule for multiplication. So what should this be? Can you think through? So what is the local derivative here? The local derivative was `other.data`. And then times `out.grad`—that's chain rule. And here we have `self.data * out.grad`. That's what we've been doing.
And finally here for tanh. `def _backward()`. And then we want to set `out._backward` to be just `_backward`. And here we need to backpropagate. We have `out.grad` and we want to chain it into `self.grad`. And `self.grad` will be the local derivative of this operation that we've done here, which is tanh. And so we saw that the local gradient is $1 - \tanh(x)^2$, and $\tanh(x)$ here is `t`. That's the local derivative because `t` is the output of this tanh, so $1 - t^2$ is the local derivative. And then gradient has to be multiplied because of the chain rule. So `out.grad` is chained through the local gradient into `self.grad`.
And that should be basically it. So we're going to redefine our Value node. We're going to swing all the way down here and we're going to redefine our expression. Make sure that all the grads are zero. Okay. But now we don't have to do this manually anymore. We are going to basically be calling `_backward` in the right order.
So first we want to call `o._backward`. So `o` was the outcome of tanh, right? So calling `o._backward` will be this function. Now we have to be careful because there's a times `out.grad`, and `out.grad` remember is initialized to zero. So here we see grad zero. So as a base case, we need to set `o.grad` to 1.0 to initialize this with 1.
And then once this is 1, we can call `o._backward`. And what that should do is it should propagate this grad through tanh. So the local derivative times the global derivative which is initialized at one. So this should... Uh-oh.
So I thought about redoing it, but I figured I should just leave the error in here because it's pretty funny. Why is NoneType object not callable? It's because I screwed up. We're trying to save these functions. So this is correct. This here... we don't want to call the function because that returns None. These functions return None. We just want to store the function. So let me redefine the Value object. And then we're going to come back in, redefine the expression, draw dot... everything is great. `o.grad` is one.
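The mistake is easy to reproduce in isolation. In this sketch `Node`, `make_out_buggy`, and `make_out_fixed` are hypothetical stand-ins, not the lecture's code; the only point is the difference between storing a function and storing the result of calling it.

```python
class Node:
    """Hypothetical stand-in for Value, just enough to show the bug."""
    def __init__(self):
        self._backward = lambda: None  # default: a function that does nothing

def make_out_buggy():
    out = Node()
    def _backward():
        pass  # a real node would update gradients here; it returns None
    out._backward = _backward()  # BUG: calls the function and stores None
    return out

def make_out_fixed():
    out = Node()
    def _backward():
        pass
    out._backward = _backward  # store the function itself, without calling it
    return out

try:
    make_out_buggy()._backward()
    error = None
except TypeError as exc:
    error = str(exc)

print(error)                   # 'NoneType' object is not callable
make_out_fixed()._backward()   # fine: the stored attribute is a function
```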
And now this should work, of course. Okay, so `o._backward`... this grad should now be 0.5 if we redraw. And if everything went correctly... 0.5! Yay. Okay, so now we need to call `n._backward`. So that seems to have worked. So `n._backward` routed the gradient to both of these, so this is looking great.
Now we could of course call `b._backward`. What's gonna happen? Well, `b` doesn't have a `_backward`. `b._backward` is, by initialization, the empty function. So nothing would happen, but we can call it on it. But when we call this one's `_backward`, then we expect this 0.5 to get further routed. Right? So there we go: 0.5, 0.5.
And then finally we want to call it here on `x2w2` and on `x1w1`. Do both of those. And there we go. So we get 0, 0.5, negative 1.5, and 1. Exactly as we did before, but now we've done it through calling that `_backward` sort of manually.
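Putting the pieces together, here is a sketch of a `Value` class with `_backward` closures and the manual sequence of calls described above. Two things are assumptions of this sketch: the closures accumulate with `+=` rather than plain assignment (so a node used more than once would still sum its contributions), and the bias is written as `6.0 + atanh(sqrt(0.5))` as a stand-in for the "6.88 etc." value. The attribute names `_prev` and `_op` and the variable `s` are also assumed.

```python
import math

class Value:
    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None  # leaf nodes have nothing to propagate
        self._prev = set(_children)
        self._op = _op
        self.label = label

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            # local derivative of a sum is 1.0 for both inputs
            self.grad += 1.0 * out.grad   # += is a choice of this sketch
            other.grad += 1.0 * out.grad
        out._backward = _backward  # store the function, do not call it
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            # local derivative of a product is the other factor
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def tanh(self):
        x = self.data
        t = (math.exp(2 * x) - 1) / (math.exp(2 * x) + 1)
        out = Value(t, (self,), 'tanh')
        def _backward():
            self.grad += (1 - t**2) * out.grad  # local derivative is 1 - t^2
        out._backward = _backward
        return out

x1, x2 = Value(2.0, label='x1'), Value(0.0, label='x2')
w1, w2 = Value(-3.0, label='w1'), Value(1.0, label='w2')
b = Value(6.0 + math.atanh(math.sqrt(0.5)), label='b')  # assumed bias form

x1w1 = x1 * w1
x2w2 = x2 * w2
s = x1w1 + x2w2
n = s + b
o = n.tanh()

o.grad = 1.0        # base case; grads start at zero
o._backward()       # through tanh
n._backward()       # plus node: routes to s and b
b._backward()       # leaf: does nothing
s._backward()       # plus node: routes to x1w1 and x2w2
x1w1._backward()
x2w2._backward()

print(x1.grad, w1.grad, x2.grad, w2.grad)  # about -1.5, 1.0, 0.5, and 0.0
```

The calls are made output-first, so each node's `grad` is complete before its own `_backward` runs.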
So we have one last piece to get rid of, which is us calling `_backward` manually. So let's think through what we are actually doing. We've laid out a mathematical expression and now we're trying to go backwards through that expression. So going backwards through the expression just means that we never want to call `_backward` for any node before we've done sort of everything after it. So we have to do everything after it before we're ever going to call `_backward` on any one node. We have to get all of its full dependencies—everything that it depends on has to propagate to it—before we can continue backpropagation. So this ordering of graphs can be achieved using something called Topological Sort.
So topological sort is basically a laying out of a graph such that all the edges go only from left to right, basically. So here we have a graph; it's a directed acyclic graph, a DAG. And this is two different topological orders of it, I believe, where basically you'll see that it's laying out of the nodes such that all the edges go only one way, from left to right.
And implementing topological sort (you can look in Wikipedia and so on, I'm not going to go through it in detail), but basically this is what builds a topological order. We maintain a set of visited nodes and then we are going through starting at some root node, which for us is `o`; that's where we want to start the topological sort. And starting at `o`, we go through all of its children and we need to lay them out from left to right. And basically, this starts at `o`. If it's not visited, then it marks it as visited and then it iterates through all of its children and calls `build_topo` on them. And then after it's gone through all the children, it adds itself.
So basically, this node that we're going to call it on, like say `o`, is only going to add itself to the topo list after all of the children have been processed. And that's how this function is guaranteeing that you're only going to be in the list once all your children are in the list. And that's the invariant that is being maintained. So if we `build_topo` on `o` and then inspect this list, we're going to see that it ordered our value objects. And the last one is the value of 0.707 which is the output. So this is `o`, and then this is `n`, and then all the other nodes get laid out before it.
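The procedure just described can be sketched on its own, away from the `Value` class. The `Node` class and the `topological_order` wrapper below are hypothetical helpers for this illustration; only the inner `build_topo` follows the description above.

```python
class Node:
    """Hypothetical minimal node: a label plus the set of its children."""
    def __init__(self, label, children=()):
        self.label = label
        self._prev = set(children)

def topological_order(root):
    topo = []
    visited = set()

    def build_topo(v):
        if v not in visited:
            visited.add(v)
            for child in v._prev:
                build_topo(child)
            topo.append(v)  # added only after all of its children are in the list

    build_topo(root)
    return topo

# Same shape as the neuron expression: o depends on n, n on (x1w1 + x2w2) and b
x1w1 = Node("x1w1")
x2w2 = Node("x2w2")
s = Node("x1w1x2w2", (x1w1, x2w2))
b = Node("b")
n = Node("n", (s, b))
o = Node("o", (n,))

print([v.label for v in topological_order(o)])
# 'o' is always last and 'n' second to last; the order of the leaves may vary
```

Because the children are kept in a set, the exact order of the earlier nodes can differ between runs, but every node still comes after all of its children.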
So that builds the topological graph. And really what we're doing now is we're just calling `_backward` on all of the nodes in a topological order. So if we just reset the gradients—they're all zero. What did we do? We started by setting `o.grad` to be 1. That's the base case. Then we built the topological order. And then we went: `for node in reversed(topo)`. Now, in the reverse order because this list goes from... you know, we need to go through it in reversed order. So starting at `o`: `node._backward()`. And this should be it. There we go. Those are the correct derivatives.
Finally, we are going to hide this functionality. So I'm going to copy this and we're going to hide it inside the Value class because we don't want to have all that code lying around. So instead of an `_backward`, we're now going to define an actual `backward`—so that's backward without the underscore—and that's going to do all the stuff that we just derived.
So let me just clean this up a little bit. So we're first going to build a topological graph starting at self. So `build_topo(self)` will populate the topological order into the `topo` list, which is a local variable. Then we set `self.grad` to be one. And then for each node in the reversed list—so starting at us and going to all the children—`_backward()`. And that should be it.
So save. Come down here. Redefine. Okay, all the grads are zero. And now what we can do is `o.backward()`—without the underscore—and there we go. And that's backpropagation for one neuron.
Now we shouldn't be too happy with ourselves actually because we have a bad bug. And we have not surfaced the bug because of some specific conditions that we have to think about right now. So here's the simplest case that shows the bug. Say I create a single node `a`. And then I create a `b` that is `a + a`. And then I call backward. So what's going to happen is `a` is 3, and then `b` is `a + a`, so there's two arrows on top of each other here. Then we can see that `b` is, of course—the forward pass works—`b` is just `a + a` which is six. But the gradient here is not actually correct that we calculated automatically.
And that's because, of course, just doing calculus in your head, the derivative of $b$ with respect to $a$ should be two—one plus one. It's not one. Intuitively what's happening here, right, so `b` is the result of `a + a` and then we call backward on it. So let's go up and see what that does. `b` is a result of addition. So `out` is `b`. And then when we called backward, what happened is `self.grad` was set to one and then `other.grad` was set to one. But because we're doing `a + a`, self and other are actually the exact same object. So we are overwriting the gradient; we are setting it to one and then we are setting it again to one, and that's why it stays at one. So that's a problem.
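To see the overwrite in isolation, here is a stripped-down sketch. `BuggyValue` is a hypothetical add-only class written for this illustration; it sets gradients with `=` the way the code does at this point in the lecture, and the backward pass is done by hand since `b` is the only non-leaf node.

```python
class BuggyValue:
    """Add-only sketch of Value that sets gradients with '=' (the bug)."""
    def __init__(self, data):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None

    def __add__(self, other):
        out = BuggyValue(self.data + other.data)

        def _backward():
            self.grad = 1.0 * out.grad   # set to 1
            other.grad = 1.0 * out.grad  # same object when other is self: set to 1 again
        out._backward = _backward
        return out

a = BuggyValue(3.0)
b = a + a
b.grad = 1.0   # base case
b._backward()  # b is the only non-leaf node, so this is the whole backward pass

print(b.data)  # 6.0, the forward pass is fine
print(a.grad)  # 1.0, but db/da should be 2.0
```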
There's another way to see this in a little bit more complicated expression. So here we have `a` and `b`. And then `d` will be the multiplication of the two, and `e` will be the addition of the two. And then we multiply `e` times `d` to get `f`, and then we called `f.backward()`. And these gradients, if you check, will be incorrect.
So fundamentally what's happening here, again, is basically we're going to see an issue anytime we use a variable more than once. Until now, in these expressions above, every variable is used exactly once so we didn't see the issue. But here if a variable is used more than once, what's going to happen during backward pass? We're backpropagating from `f` to `e` to `d`. So far so good. But now `e` calls its backward and it deposits its gradients to `a` and `b`. But then we come back to `d` and call backward, and it overwrites those gradients at `a` and `b`. So that's obviously a problem.
And the solution here—if you look at the multivariate case of the chain rule and its generalization there—the solution there is basically that we have to accumulate these gradients. These gradients add. And so instead of setting those gradients, we can simply do `+=`. We need to accumulate those gradients. `+=`, `+=`, `+=`, `+=`.
And this will be okay, remember, because we are initializing them at zero. So they start at zero and then any contribution that flows backwards will simply add. So now if we redefine this one... because of the `+=`, this now works. Because `a.grad` started at zero and we called `b.backward`, we deposit one and then we deposit one again, and now this is two, which is correct. And here this will also work and we'll get correct gradients. Because when we call `e.backward`, we will deposit the gradients from this branch, and then we get to back into `d.backward`, it will deposit its own gradients. And then those gradients simply add on top of each other. And so we just accumulate those gradients and that fixes the issue.
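Here is a sketch of the fix, with `+=` in every `_backward` and the topological-sort `backward` from earlier. It only implements `+` and `*` to stay short. The values -2.0 and 3.0 in the second example are illustrative; the expected gradients in the comments follow from the chain rule for those particular inputs.

```python
class Value:
    """Sketch of Value with only + and *, accumulating gradients with '+='."""
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += 1.0 * out.grad
            other.grad += 1.0 * out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        topo, visited = [], set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1.0  # base case
        for node in reversed(topo):
            node._backward()

# Case 1: the same variable used twice in one operation
a1 = Value(3.0)
b1 = a1 + a1
b1.backward()
print(a1.grad)  # 2.0

# Case 2: variables reused across two branches
a = Value(-2.0)
b = Value(3.0)
d = a * b
e = a + b
f = d * e
f.backward()
print(a.grad, b.grad)  # -3.0 -8.0
```

For the second case, $f = (ab)(a+b)$, so $\partial f/\partial a = b(a+b) + ab = 3 - 6 = -3$ and $\partial f/\partial b = a(a+b) + ab = -2 - 6 = -8$, which is what the accumulated gradients give.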
Okay, now before we move on, let me actually do a bit of cleanup here and delete some of this intermediate work. So we're not gonna need any of this now that we've derived all of it. We are going to keep this because I want to come back to it. Delete the tanh. Delete our morning example. Delete this step. Delete this. Keep the code that draws. And then delete this example. And leave behind only the definition of Value.
And now let's come back to this non-linearity here that we implemented: the tanh. Now I told you that we could have broken down tanh into its explicit atoms in terms of other expressions if we had the exp function. So if you remember, tanh is defined like this, and we chose to develop tanh as a single function because we know its derivative and we can backpropagate through it. But we can also break down tanh into... and express it as a function of $x$. And I would like to do that now because I want to prove to you that you get all the same results and all those gradients, but also because it forces us to implement a few more expressions. It forces us to do exponentiation, addition, subtraction, division, and things like that. And I think it's a good exercise to go through a few more of these.
Okay, so let's scroll up to the definition of Value. And here one thing that we currently can't do is we can do like a `Value` of say 2.0, but we can't do, you know, here for example we want to add constant one. We can't do something like this. And we can't do it because it says "object has no attribute data." That's because `a + 1` comes right here to `__add__` and then `other` is the integer one. And then here Python is trying to access `one.data` and that's not a thing, and that's because basically one is not a `Value` object and we only have addition for `Value` objects.
So as a matter of convenience, so that we can create expressions like this and make them make sense, we can simply do something like this. Basically, we leave `other` alone if `other` is an instance of `Value`. But if it's not an instance of `Value`, we're going to assume that it's a number like an integer or float and we're going to simply wrap it in `Value`. And then `other` will just become `Value(other)` and then `other` will have a data attribute. And this should work. So if I just redefine `Value`, then this should work. There we go.
Okay, now let's do the exact same thing for multiply because we can't do something like this, again for the exact same reason. So we just have to go to `__mul__` and if other is not a Value, then let's wrap it in Value. Let's redefine Value and now this works.
Now here's a kind of unfortunate and not obvious part. `a * 2` works, we saw that. But `2 * a`... is that gonna work? You'd expect it to, right? But actually it will not. And the reason it won't is because Python doesn't know... like when you do `a * 2`, basically Python will go and it will basically do something like `a.__mul__(2)`. That's basically what it will call. But to it, `2 * a` is the same as `(2).__mul__(a)`. And 2 can't multiply `Value`, and so it's really confused about that.
So instead what happens is in Python, the way this works is you are free to define something called the `__rmul__`. And `__rmul__` is kind of like a fallback. So if Python can't do `2 * a`, it will check if by any chance `a` knows how to multiply 2 and that will be called into `__rmul__`. So because Python can't do `2 * a`, it will check: is there an `__rmul__` in `Value`? And because there is, it will now call that. And what we'll do here is we will swap the order of the operands. So basically `2 * a` will redirect to `__rmul__`, and `__rmul__` will basically call `a * 2`. And that's how that will work. So redefining now with `__rmul__`, `2 * a` becomes four.
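Putting the last few steps together, here is a forward-only sketch of the operand handling. It leaves out gradients and the graph bookkeeping on purpose, so this `Value` is a cut-down illustration and not the full class from the lecture.

```python
class Value:
    """Forward-only sketch: just enough of Value to show operand handling."""
    def __init__(self, data):
        self.data = data

    def __repr__(self):
        return f"Value(data={self.data})"

    def __add__(self, other):
        # wrap plain numbers so that other.data exists
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data)

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data)

    def __rmul__(self, other):
        # fallback for other * self when other does not know how to multiply a Value
        return self * other

a = Value(2.0)
print(a + 1)  # Value(data=3.0)
print(a * 2)  # Value(data=4.0)
print(2 * a)  # Value(data=4.0)
```

Without `__rmul__`, the last line raises a `TypeError`, because `int` does not know how to multiply a `Value` and Python has no fallback to try.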
Okay, now looking at the other elements that we still need: we need to know how to exponentiate and how to divide. So let's first do the exponentiation part. We're going to introduce a single function `exp` here. And `exp` is going to mirror tanh in the sense that it's a simple single function that transforms a single scalar value and outputs a single scalar value. So we pop out the Python number, we use `math.exp` to exponentiate it, create a new `Value` object... everything that we've seen before. The tricky part of course is how do you propagate through $e^x$?
So here you can potentially pause the video and think about what should go here. Okay, so basically we need to know what is the local derivative of $e^x$. So $d/dx$ of $e^x$ is famously just $e^x$. And we've already just calculated $e^x$ and it's inside `out.data`. So we can do `out.data * out.grad`. That's the chain rule. So we're just chaining on to the current running grad. And this is what the expression looks like. It looks a little confusing but this is what it is, and that's the exponentiation. So redefining, we should now be able to call `a.exp()`. And hopefully the backward pass works as well.
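As a quick sanity check on that local derivative, here is a sketch that does the forward and backward step of `exp` on plain floats and compares it to a numerical slope. The function `exp_forward_backward` is a hypothetical helper for this check; in the lecture the same two lines live inside a method on `Value`.

```python
import math

def exp_forward_backward(x, out_grad=1.0):
    """Forward: out = e^x. Backward: x_grad = out * out_grad (chain rule)."""
    out = math.exp(x)
    x_grad = out * out_grad  # the local derivative of e^x is e^x, already in out
    return out, x_grad

x = 2.0
out, x_grad = exp_forward_backward(x)

# numerical estimate of d/dx e^x at x, using a central difference
h = 1e-6
numeric = (math.exp(x + h) - math.exp(x - h)) / (2 * h)

print(out, x_grad, numeric)  # all three are about 7.389
```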
Okay, and the last thing we'd like to do of course is we'd like to be able to divide. Now I actually will implement something slightly more powerful than division because division is just a special case of something a bit more powerful. So in particular, if we have some kind of `a = Value(2.0)` and `b = Value(4.0)` here, we'd like to basically be able to do `a / b` and we'd like this to give us 0.5.
Now division actually can be reshuffled as follows: if we have $a / b$, that's actually the same as $a * (1/b)$. And that's the same as $a * b^{-1}$. And so what I'd like to do instead is I basically like to implement the operation of $x^k$ for some constant $k$. So it's an integer or a float. And we would like to be able to differentiate this, and then as a special case, negative one will be division. And so I'm doing that just because it's more general and, um, yeah you might as well do it that way.
So basically what I'm saying is that we can redefine division, which we will put here somewhere. So `self / other` can actually be rewritten as `self * other**-1`. And a `Value` raised to the power of negative one is something we now have to define. So we need to implement the `__pow__` function. Where am I going to put the power function? Maybe here somewhere. This is the skeleton for it.
So this function will be called when we try to raise a value to some power, and `other` will be that power. Now I'd like to make sure that `other` is only an int or a float. Usually `other` is some kind of a different `Value` object, but here `other` will be forced to be an int or a float, otherwise the math won't work for what we're trying to achieve in this specific case. It would be a different derivative expression if we wanted `other` to be a `Value`.
So here we create the output value which is just, you know, this data raised to the power of `other`. And `other` here could be, for example, negative one; that's what we are hoping to achieve. And then this is the backwards stub. And this is the fun part, which is: what is the chain rule expression here for backpropagating through the power function where the power is to the power of some kind of a constant? So this is the exercise, and maybe pause the video here and see if you can figure it out yourself as to what we should put here.
Okay, so you can actually go here and look at derivative rules as an example and we see lots of derivatives that you can hopefully know from calculus. In particular, what we're looking for is the power rule. Because that's telling us that if we're trying to take $d/dx$ of $x^n$—which is what we're doing here—then that is just $n * x^{n-1}$. Right. Okay.
So that's telling us about the local derivative of this power operation. So all we want here... basically $n$ is now `other`. And `self.data` is $x$. And so this now becomes `other` (which is $n$) times `self.data` (which is now a Python int or a float, it's not a `Value` object, we're accessing the data attribute) raised to the power of `other - 1`, or $n - 1$. I can put brackets around this but this doesn't matter because power takes precedence over multiply in Python so that would have been okay. And that's the local derivative only, but now we have to chain it. And we chain just simply by multiplying by `out.grad`. That's chain rule.
And this should technically work. And we're going to find out soon. But now if we do this, this should now work. And we get 0.5. So the forward pass works, but does the backward pass work? And I realize that we actually also have to know how to subtract. So right now `a - b` will not work. To make it work, we need one more piece of code here. And basically, this is the subtraction. And the way we're going to implement subtraction is we're going to implement it by addition of a negation. And then to implement negation, we're gonna multiply by negative one. So just again using the stuff we've already built and just expressing it in terms of what we have. And `a - b` is now working.
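Here is a sketch that pulls the power, division, negation and subtraction pieces together so the backward pass can be checked as well. It is a trimmed `Value` with only the operations needed for this check. The inputs 2.0 and 4.0 are picked to match the 0.5 result above.

```python
class Value:
    """Sketch of Value with +, *, ** (constant power), negation, - and /."""
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += 1.0 * out.grad
            other.grad += 1.0 * out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def __pow__(self, other):
        assert isinstance(other, (int, float)), "only int/float powers supported"
        out = Value(self.data ** other, (self,))
        def _backward():
            # power rule: d/dx x**n = n * x**(n-1), chained with out.grad
            self.grad += other * self.data ** (other - 1) * out.grad
        out._backward = _backward
        return out

    def __neg__(self):             # -self
        return self * -1

    def __sub__(self, other):      # self - other
        return self + (-other)

    def __truediv__(self, other):  # self / other
        return self * other ** -1

    def backward(self):
        topo, visited = [], set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1.0
        for node in reversed(topo):
            node._backward()

a = Value(2.0)
b = Value(4.0)
c = a / b
c.backward()
print(c.data)          # 0.5
print(a.grad, b.grad)  # 0.25 -0.125
print((a - b).data)    # -2.0
```

The gradients agree with calculus: $\partial(a/b)/\partial a = 1/b = 0.25$ and $\partial(a/b)/\partial b = -a/b^2 = -0.125$.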
Okay, so now let's scroll again to this expression here for this neuron, and let's just compute the backward pass here once we've defined `o`. And let's draw it. So here's the gradients for all these leaf nodes for this two-dimensional neuron that has a tanh that we've seen before. So now what I'd like to do is I'd like to break up this tanh into this expression here. So let me copy paste this here. And now instead of... we'll preserve the label and we will change how we define `o`.
So in particular we're going to implement this formula here. So we need $\frac{e^{2x} - 1}{e^{2x} + 1}$. So $e^{2x}$: we need to take `2 * n` and we need to exponentiate it. That's $e^{2x}$. And then because we're using it twice, let's create an intermediate variable `e`. And then define `o` as `(e - 1) / (e + 1)`. And that should be it. And then we should be able to `draw_dot` of `o`.
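Before looking at the graph, the formula itself can be checked on plain floats. This sketch assumes the neuron inputs used earlier in the lecture (`x1 = 2.0`, `x2 = 0.0`, `w1 = -3.0`, `w2 = 1.0`, `b = 6.8813735870195432`); those values do not appear in this part of the transcript, so treat them as an assumption.

```python
import math

# assumed inputs for the two-dimensional neuron
x1, x2 = 2.0, 0.0
w1, w2 = -3.0, 1.0
b = 6.8813735870195432
n = x1 * w1 + x2 * w2 + b

# tanh written out in terms of exp, reusing e = exp(2n) twice
e = math.exp(2 * n)
o_decomposed = (e - 1) / (e + 1)

# tanh as a single function
o_direct = math.tanh(n)

print(o_decomposed, o_direct)  # both about 0.7071
print(1 - o_direct ** 2)       # about 0.5, the gradient that lands on n
```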
So now before I run this, what do we expect to see? Number one, we're expecting to see a much longer graph here because we've broken up tanh into a bunch of other operations. But those operations are mathematically equivalent. And so what we're expecting to see is: number one, the same result here, so the forward pass works; and number two, because of that mathematical equivalence, we expect to see the same backward pass and the same gradients on these leaf nodes. So these gradients should be identical.
So let's run this. So number one, let's verify that instead of a single tanh node we have now exp and we have plus, we have times negative one... this is the division. And we end up with the same forward pass here. And then the gradients, we have to be careful because they're in slightly different order potentially. The gradients for `w2`, `x2` should be 0 and 0.5. `w2` and `x2` are 0 and 0.5. And `w1`, `x1` are 1 and negative 1.5. 1 and negative 1.5.
So that means that both our forward passes and backward passes were correct because this turned out to be equivalent to tanh before. And so the reason I wanted to go through this exercise is: number one, we got to practice a few more operations and writing more backwards passes; and number two, I wanted to illustrate the point that the level at which you implement your operations is totally up to you. You can implement backward passes for tiny expressions like a single individual plus or a single times. Or you can implement them for, say, tanh, which is a kind of a potentially... you can see it as a composite operation because it's made up of all these more atomic operations. But really all of this is kind of like a fake concept. All that matters is we have some kind of inputs and some kind of an output, and this output is a function of the inputs in some way. And as long as you can do forward pass and the backward pass of that little operation, it doesn't matter what that operation is and how composite it is. If you can write the local gradients, you can chain the gradient and you can continue backpropagation. So the design of what those functions are is completely up to you.
So now I would like to show you how you can do the exact same thing by using a modern deep neural network library like, for example, PyTorch, which I've roughly modeled micrograd after. And so PyTorch is something you would use in production, and I'll show you how you can do the exact same thing but in PyTorch API. So I'm just going to copy paste it in and walk you through it a little bit. This is what it looks like.
So we're going to import PyTorch and then we need to define these value objects like we have here. Now micrograd is a scalar valued engine, so we only have scalar values like 2.0. But in PyTorch, everything is based around tensors. And like I mentioned, tensors are just n-dimensional arrays of scalars. So that's why things get a little bit more complicated here; I just need a scalar value tensor—a tensor with just a single element.
But by default, when you work with PyTorch, you would use more complicated tensors like this. So if I import PyTorch, then I can create tensors like this, and this tensor for example is a two by three array of scalars in a single compact representation. So we can check its shape, we see that it's a two by three array and so on. So this is usually what you would work with in the actual libraries. So here I'm creating a tensor that has only a single element: 2.0. And then I'm casting it to be double because Python is by default using double precision for its floating point numbers, so I'd like everything to be identical. By default, the data type of these tensors will be float32—so it's only using a single precision float—so I'm casting to double so that we have float64 just like in Python. So I'm casting to double and then we get something similar to `Value(2)`.
The next thing I have to do is, because these are leaf nodes, by default PyTorch assumes that they do not require gradients. So I need to explicitly say that all of these nodes require gradients. Okay, so this is going to construct scalar valued one-element tensors, make sure that PyTorch knows that they require gradients. Now by default these are set to False, by the way, because of efficiency reasons, because usually you would not want gradients for leaf nodes like the inputs to the network and this is just trying to be efficient in the most common cases.
So once we've defined all of our values in PyTorch, we can perform arithmetic just like we can here in micrograd land. So this will just work. And then there's a `torch.tanh` also. And what we get back is a tensor again. And just like in micrograd, it's got a `.data` attribute and it's got a `.grad` attribute. So these tensor objects, just like in micrograd, have a `.data` and a `.grad`. And the only difference here is that we need to call `.item()`, because `.item()` basically takes a tensor of a single element and just returns that element, stripping out the tensor.
So let me just run this and hopefully we are going to get... this is going to print the forward pass which is 0.707. And this will be the gradients which hopefully are 0.5, 0, negative 1.5, and 1. So if we just run this... there we go. 0.7, so the forward pass agrees. And then 0.5, 0, negative 1.5, and 1. So PyTorch agrees with us.
And just to show you here, basically `o` here is a tensor with a single element and it's a double. And we can call `.item()` on it to just get the single number out. So that's what `item` does. And `o` is a tensor object like I mentioned, and it's got a backward function just like we've implemented. And then all of these also have a `.grad`. So `x2`, for example, has a `.grad`, and it's a tensor, and we can pop out the individual number with `.item()`.
So basically, PyTorch can do what we did in micrograd as a special case when your tensors are all single element tensors. But the big deal with PyTorch is that everything is significantly more efficient because we are working with these tensor objects and we can do lots of operations in parallel on all of these tensors. But otherwise what we've built very much agrees with the API of PyTorch.
Okay, so now that we have some machinery to build out pretty complicated mathematical expressions, we can also start building out neural nets. And as I mentioned, neural nets are just a specific class of mathematical expressions. So we're going to start building out a neural net piece by piece and eventually we'll build out a two-layer Multi-Layer Perceptron as it's called, and I'll show you exactly what that means.
Let's start with a single individual neuron. We've implemented one here, but here I'm going to implement one that also subscribes to the PyTorch API in how it designs its neural network modules. So just like we saw that we can match the API of PyTorch on the autograd side, we're going to try to do that on the neural network modules.
So here's class `Neuron`. And just for the sake of efficiency, I'm going to copy paste some sections that are relatively straightforward. So the constructor will take the number of inputs to this neuron, which is how many inputs come to a neuron; so this one for example has three inputs. And then it's going to create a weight that is some random number between negative one and one for every one of those inputs. And a bias that controls the overall trigger happiness of this neuron.
And then we're going to implement a `def __call__(self, x)`. And really what we want to do here is $w \cdot x + b$. Where $w \cdot x$ here is a dot product specifically. Now if you haven't seen `__call__`... let me just return 0.0 here for now. The way this works now is we can have an `x` which is say like `[2.0, 3.0]`. Then we can initialize a neuron that is two-dimensional because these are two numbers. And then we can feed those two numbers into that neuron to get an output. And so when you use this notation `n(x)`, Python will use `__call__`. So currently `__call__` just returns 0.0.
Now we'd like to actually do the forward pass of this neuron instead. So what we're going to do here first is we need to basically multiply all of the elements of $w$ with all of the elements of $x$; pairwise we need to multiply them. So the first thing we're going to do is we're going to zip up `self.w` and `x`. And in Python, zip takes two iterators and it creates a new iterator that iterates over the tuples of the corresponding entries. So for example, just to show you, we can print this list and still return 0.0 here. Sorry. So we see that these w's are paired up with the x's. w with x.
And now what we want to do is... `for wi, xi in`... we want to multiply `wi * xi`. And then we want to sum all of that together to come up with an activation. And add also `self.b` on top. So that's the raw activation. And then of course we need to pass that through a non-linearity. So what we're going to be returning is `act.tanh()`. And here's out.
So now we see that we are getting some outputs. And we get a different output from a neuron each time because we are initializing different weights and biases. And then to be a bit more efficient here... actually sum by the way takes a second optional parameter which is the start. And by default the start is zero. So these elements of this sum will be added on top of zero to begin with, but actually we can just start with `self.b`. And then we just have an expression like this. And then the generator expression here must be parenthesized in Python. There we go. Yep, so now we can forward a single neuron.
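Here is a forward-only sketch of the neuron as described. To keep the block self-contained, plain Python floats and `math.tanh` stand in for `Value` objects and `act.tanh()`, so there are no gradients here; the structure of `__init__` and `__call__` is otherwise what the text describes.

```python
import math
import random

class Neuron:
    """Forward-only sketch: plain floats stand in for Value objects."""
    def __init__(self, nin):
        # one random weight in [-1, 1] per input, plus a bias
        self.w = [random.uniform(-1, 1) for _ in range(nin)]
        self.b = random.uniform(-1, 1)

    def __call__(self, x):
        # w . x + b: pair weights with inputs, sum the products, starting the sum at b
        act = sum((wi * xi for wi, xi in zip(self.w, x)), self.b)
        return math.tanh(act)

x = [2.0, 3.0]
n = Neuron(2)
print(n(x))  # a number between -1 and 1; it changes with the random weights
```

Passing `self.b` as the second argument to `sum` saves a separate addition, and the generator expression needs its own parentheses because it is no longer the only argument.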
Next up we're going to define a layer of neurons. So here we have a schematic for an MLP. So we see that these MLPs, each layer—this is one layer—has actually a number of neurons and they're not connected to each other but all of them are fully connected to the input. So what is a layer of neurons? It's just a set of neurons evaluated independently. So in the interest of time, I'm going to do something fairly straightforward here. It's literally a `Layer` is just a list of `Neuron`s. And then how many neurons do we have? We take that as an input argument here: how many neurons do you want in your layer, number of outputs in this layer. And so we just initialize completely independent neurons with this given dimensionality. And when we call on it, we just independently evaluate them.
So now instead of a neuron, we can make a layer of neurons. They are two-dimensional neurons and let's have three of them. And now we see that we have three independent evaluations of three different neurons. Right?
Okay, finally let's complete this picture and define an entire Multi-Layer Perceptron or MLP. And as we can see here in an MLP, these layers just feed into each other sequentially. So let's come here and I'm just going to copy the code here in interest of time. So an MLP is very similar. We're taking the number of inputs as before, but now instead of taking a single `nout` (which is number of neurons in a single layer), we're going to take a list of `nout`s. And this list defines the sizes of all the layers that we want in our MLP. So here we just put them all together and then iterate over consecutive pairs of these sizes and create layer objects for them. And then in the `__call__` function, we are just calling them sequentially. So that's an MLP really.
And let's actually re-implement this picture. So we want three input neurons and then two layers of four and an output unit. So we want a three-dimensional input, say this is an example input. We want three inputs into two layers of four and one output. And this of course is an MLP. And there we go, that's a forward pass of an MLP.
To make this a little bit nicer: you see how we have just a single element but it's wrapped in a list, because `Layer` always returns lists. So for convenience: `return outs[0] if len(outs) == 1 else outs`. And this will allow us to just get a single value out at the last layer that only has a single neuron. And finally, we should be able to `draw_dot` of `n(x)`. And as you might imagine, these expressions are now getting relatively involved. So this is an entire MLP that we're defining now, all the way until a single output.
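The layer and MLP pieces can be sketched the same way. As before, this is forward-only with plain floats in place of `Value` objects, so `draw_dot` and backpropagation are out of scope here. The example input `[2.0, 3.0, -1.0]` is arbitrary.

```python
import math
import random

class Neuron:
    """Forward-only neuron on plain floats (stand-in for the Value version)."""
    def __init__(self, nin):
        self.w = [random.uniform(-1, 1) for _ in range(nin)]
        self.b = random.uniform(-1, 1)

    def __call__(self, x):
        act = sum((wi * xi for wi, xi in zip(self.w, x)), self.b)
        return math.tanh(act)

class Layer:
    """A list of independent neurons that all see the same input."""
    def __init__(self, nin, nout):
        self.neurons = [Neuron(nin) for _ in range(nout)]

    def __call__(self, x):
        outs = [n(x) for n in self.neurons]
        # unwrap the list when the layer has a single neuron
        return outs[0] if len(outs) == 1 else outs

class MLP:
    """Layers applied one after another."""
    def __init__(self, nin, nouts):
        sz = [nin] + nouts
        # consecutive pairs of sizes: (3, 4), (4, 4), (4, 1) for MLP(3, [4, 4, 1])
        self.layers = [Layer(sz[i], sz[i + 1]) for i in range(len(nouts))]

    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

x = [2.0, 3.0, -1.0]
n = MLP(3, [4, 4, 1])
print(n(x))  # a single number between -1 and 1
```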
Okay. And so obviously you would never differentiate on pen and paper these expressions. But with micrograd, we will be able to backpropagate all the way through this and backpropagate into these weights of all these neurons. So let's see how that works.
Okay, so let's create ourselves a very simple example dataset here. So this dataset has four examples. And so we have four possible inputs into the neural net. And we have four desired targets. So we'd like the neural net to output 1.0 when it's fed this example, negative one when it's fed these examples, and one when it's fed this example. So it's a very simple binary classifier neural net, basically, that we would like here.
Now let's think what the neural net currently thinks about these four examples. We can just get their predictions. Basically we can just call `n(x)` for `x` in `xs`. And then we can print. So these are the outputs of the neural net on those four examples. So the first one is 0.91, but we'd like it to be one. So we should push this one higher. This one we want to be higher. This one says 0.88 and we want this to be negative one. This is 0.8, we want it to be negative one. And this one is 0.8, we want it to be one.
So how do we make the neural net, how do we tune the weights, to better predict the desired targets? And the trick used in deep learning to achieve this is to calculate a single number that somehow measures the total performance of your neural net. And we call this single number the loss. So the loss is a single number that we're going to define that basically measures how well the neural net is performing. Right now we have the intuitive sense that it's not performing very well because we're not very close to the targets. So the loss will be high and we'll want to minimize the loss.
So in particular in this case, what we're going to do is we're going to implement the Mean Squared Error loss. So what this is doing is: we're going to basically iterate for `ygt, yout in zip(ys, ypred)`. So we're going to pair up the ground truths with the predictions. And this zip iterates over tuples of them. And for each `ygt` and `yout`, we're going to subtract them and square them.
So let's first see what these losses are. These are individual loss components. And so basically for each one of the four, we are taking the prediction and the ground truth, we are subtracting them, and squaring them. So because this one is so close to its target—0.91 is almost one—subtracting them gives a very small number. So here we would get like a negative point one, and then squaring it just makes sure that regardless of whether we are more negative or more positive, we always get a positive number. Instead of squaring, we could also take, for example, the absolute value; we need to discard the sign.
And so you see that the expression is arranged so that you only get zero exactly when `yout` is equal to `ygt`. When those two are equal, so your prediction is exactly the target, you are going to get zero. And if your prediction is not the target, you are going to get some other number. So here for example we are way off and so that's why the loss is quite high. And the more off we are, the greater the loss will be.
So we don't want high loss, we want low loss. And so the final loss here will be just the sum of all of these numbers. So you see that this should be zero roughly, plus zero roughly... but plus seven. So loss should be about seven here. And now we want to minimize the loss; we want the loss to be low. Because if loss is low, then every one of the predictions is close to its target. So the lowest the loss can be is zero, and the greater it is, the worse the neural net is predicting.
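As a rough sketch of the loss described above, here it is with plain Python floats instead of micrograd values. The targets `ys` and the predictions `ypred` are illustrative assumptions, not the exact numbers from the notebook. Note that, following the description, this sums the squared errors rather than averaging them.

```python
# Illustrative targets and predictions (assumed values, not the notebook's).
ys = [1.0, -1.0, -1.0, 1.0]
ypred = [0.91, 0.88, 0.88, 0.88]

# Pair each ground truth with its prediction, subtract, and square.
losses = [(yout - ygt) ** 2 for ygt, yout in zip(ys, ypred)]

# The total loss is the sum of the individual components.
loss = sum(losses)

print(losses)
print(loss)
```

The two predictions that sit near 0.88 while the target is negative one dominate the total, which is why the loss comes out at roughly seven.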
So now of course if we do `loss.backward()`... something magical happened when I hit enter. And the magical thing of course that happened is that we can look at `n.layers[0].neurons[0]`. Because remember that MLP has the layers which is a list, and each layer has neurons which is a list, and that gives us an individual neuron. And then it's got some weights. And so we can for example look at the weights at zero. Um, oops, it's not called weights, it's called `w`. And that's a value. But now this value also has a grad because of the backward pass.
And so we see that because this gradient here on this particular weight of this particular neuron of this particular layer is negative, we see that its influence on the loss is also negative. So slightly increasing this particular weight of this neuron of this layer would make the loss go down. And we actually have this information for every single one of our neurons and all their parameters.
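To make the sign argument concrete, here is a small sketch that does not use micrograd at all. It assumes a hypothetical one-weight tanh neuron (`loss_fn`, with made-up input, bias, and target) and estimates the gradient numerically, then checks that a negative gradient really does mean that nudging the weight up lowers the loss.

```python
import math

def loss_fn(w, x=2.0, b=0.0, y=1.0):
    """Squared error of a hypothetical single tanh neuron with one weight."""
    out = math.tanh(w * x + b)
    return (out - y) ** 2

w = 0.85
h = 1e-5

# Central finite difference: a numerical estimate of d(loss)/dw.
grad = (loss_fn(w + h) - loss_fn(w - h)) / (2 * h)

loss_now = loss_fn(w)
loss_up = loss_fn(w + 0.01)  # loss after slightly increasing the weight

print(grad, loss_now, loss_up)
```

With these assumed numbers the gradient is negative, and the loss after increasing `w` is lower than before, which matches the reasoning in the text.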
Actually it's worth looking at also the `draw_dot(loss)` by the way. So previously we looked at the `draw_dot` of a single neuron forward pass, and that was already a large expression. But what is this expression? We actually forwarded every one of those four examples and then we have the loss on top of them with the mean squared error. And so this is a really massive graph. Because this graph that we've built up now, oh my gosh, is kind of excessive. It's excessive because it has four forward passes of a neural net for every one of the examples and then it has the loss on top. And it ends with the value of the loss which was 7.12. And this loss will now backpropagate through all the four forward passes all the way through just every single intermediate value of the neural net. All the way back to, of course, the parameters, the weights, which are the inputs.
So these weight parameters here are inputs to this neural net. And these numbers here, these scalars, are inputs to the neural net. So if we went around here, we'll probably find some of these examples (this 1.0 potentially, maybe this 1.0, or you know some of the others) and you'll see that they all have gradients as well. The thing is, these gradients on the input data are not that useful to us. And that's because the input data seems to be not changeable; it's a given to the problem and so it's a fixed input. We're not going to be changing it or messing with it, even though we do have gradients for it. But some of these gradients here will be for the neural network parameters, the ws and the bs. And those, of course, we want to change.
Okay, so now we're going to want some convenience code to gather up all of the parameters of the neural net so that we can operate on all of them simultaneously. And every one of them, we will nudge a tiny amount based on the gradient information. So let's collect the parameters of the neural net all in one array.
So let's create a `parameters` of self that just returns `self.w`, which is a list, concatenated with a list of `self.b`. So this will just return a list. List plus list just, you know, gives you a list. So that's parameters of neuron. And I'm calling it this way because also PyTorch has a `parameters()` on every single `nn.Module`. And it does exactly what we're doing here; it just returns the parameter tensors. For us, it's the parameter scalars.
Now `Layer` is also a module so it will have `parameters(self)`. And basically what we want to do here is something like this: `params = []`, and then for neuron in `self.neurons`, we want to get `p = neuron.parameters()`. Right, so these are the parameters of this neuron, and then we want to put them on top of params. So `params.extend(p)`. And then we want to return params.
So this is way too much code, so actually there's a way to simplify this which is: `return [p for neuron in self.neurons for p in neuron.parameters()]`. So it's a single list comprehension in Python. You can sort of nest them like this and you can then create the desired array. So these are identical. We can take this out. And then let's do the same here. `def parameters(self): return [p for layer in self.layers for p in layer.parameters()]`. And that should be good.
Now let me pop out this so we don't re-initialize our network... Okay, so unfortunately we will probably have to re-initialize the network because we just added functionality to this class. Of course I want to get all the `n.parameters()`, but that's not going to work because this is the old class. Okay. So unfortunately we do have to reinitialize the network which will change some of the numbers. But let me do that so that we pick up the new API. We can now do `n.parameters()`. And these are all the weights and biases inside the entire neural net. So in total, this MLP has 41 parameters. And now we'll be able to change them.
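Putting the three `parameters()` methods together, a minimal sketch might look like the following. The `Value` class here is a hypothetical stand-in that only holds `data` and `grad` (no autograd), the forward pass is omitted, and the `MLP(3, [4, 4, 1])` shape is an assumption chosen to be consistent with the 41 parameters mentioned above.

```python
import random

class Value:
    """Hypothetical minimal stand-in for micrograd's Value (no autograd here)."""
    def __init__(self, data):
        self.data = data
        self.grad = 0.0

class Neuron:
    def __init__(self, nin):
        self.w = [Value(random.uniform(-1, 1)) for _ in range(nin)]
        self.b = Value(random.uniform(-1, 1))

    def parameters(self):
        # list + list gives one flat list: all weights, then the bias
        return self.w + [self.b]

class Layer:
    def __init__(self, nin, nout):
        self.neurons = [Neuron(nin) for _ in range(nout)]

    def parameters(self):
        # nested comprehension: flatten the parameters of every neuron
        return [p for neuron in self.neurons for p in neuron.parameters()]

class MLP:
    def __init__(self, nin, nouts):
        sizes = [nin] + nouts
        self.layers = [Layer(sizes[i], sizes[i + 1]) for i in range(len(nouts))]

    def parameters(self):
        # same pattern one level up: flatten the parameters of every layer
        return [p for layer in self.layers for p in layer.parameters()]

n = MLP(3, [4, 4, 1])  # assumed architecture
num_params = len(n.parameters())
print(num_params)
```

The count works out as 4 neurons with 3 weights and a bias (16), plus 4 neurons with 4 weights and a bias (20), plus 1 neuron with 4 weights and a bias (5).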
If we recalculate the loss here, we see that unfortunately we have slightly different predictions and slightly different loss, but that's okay. Okay, so we see that this neuron's gradient is slightly negative. We can also look at its data right now which is 0.85. So this is the current value of this neuron and this is its gradient on the loss.
So what we want to do now is we want to iterate for every `p` in `n.parameters()`, so for all the 41 parameters in this neural net, and we actually want to change `p.data` slightly according to the gradient information. Okay, so dot dot dot, to do here. But this will be basically a tiny update in this gradient descent scheme. In gradient descent, we are thinking of the gradient as a vector pointing in the direction of increased loss. And so in gradient descent, we are modifying `p.data` by a small step size in the direction of the gradient. So the step size as an example could be like a very small number like `0.01`. So `0.01 * p.grad`.
Right. But we have to think through some of the signs here. So in particular, working with this specific example here, we see that if we just left it like this, then this neuron's value would be currently increased by a tiny amount of the gradient. The gradient is negative, so this value of this neuron would go slightly down. It would become like 0.84 or something like that. But if this neuron's value goes lower, that would actually increase the loss. That's because the derivative of this neuron is negative, so increasing this makes the loss go down. So increasing it is what we want to do instead of decreasing it.
So basically what we're missing here is a negative sign. And that's because we want to minimize the loss; we don't want to maximize the loss, we want to decrease it. And the other interpretation, as I mentioned, is you can think of the gradient vector (so basically just the vector of all the gradients) as pointing in the direction of increasing the loss. But then we want to decrease it, so we actually want to go in the opposite direction. And so you can convince yourself that this does the right thing here with the negative, because we want to minimize the loss.
So if we nudge all the parameters by a tiny amount, then we'll see that this data will have changed a little bit. So now this neuron's value is a tiny amount greater: 0.854 went to 0.857. And that's a good thing because slightly increasing this neuron's data makes the loss go down according to the gradient. And so the correct thing has happened sign-wise.
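Here is a small self-contained sketch of that update step with the negative sign in place. It does not use micrograd: `Param` is a hypothetical stand-in holding `data` and `grad`, the model is an assumed toy `w*x + b`, and the gradients are derived by hand rather than by `backward()`. The data and starting values are made up for illustration.

```python
class Param:
    """Hypothetical stand-in for a micrograd Value: a scalar with a grad slot."""
    def __init__(self, data):
        self.data = data
        self.grad = 0.0

def loss_and_grads(w, b, xs, ys):
    """Sum of squared errors for the toy model w*x + b, with hand-derived gradients."""
    w.grad, b.grad = 0.0, 0.0
    loss = 0.0
    for x, y in zip(xs, ys):
        err = (w.data * x + b.data) - y
        loss += err ** 2
        w.grad += 2 * err * x  # d(err^2)/dw
        b.grad += 2 * err      # d(err^2)/db
    return loss

xs = [1.0, 2.0, 3.0]  # made-up inputs
ys = [2.0, 4.0, 6.0]  # made-up targets
w, b = Param(0.85), Param(0.0)

loss_before = loss_and_grads(w, b, xs, ys)
w_grad_before, w_data_before = w.grad, w.data

# Gradient descent: step against the gradient, hence the negative sign.
for p in [w, b]:
    p.data += -0.01 * p.grad

loss_after = loss_and_grads(w, b, xs, ys)
print(loss_before, loss_after)
```

In this toy setup the gradient on `w` starts out negative, so the update increases `w.data`, and the recomputed loss is lower than before. Dropping the minus sign would move the parameters the other way and raise the loss.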