The spelled-out intro to language modeling: building makemore

Andrej Karpathy
We implement a bigram character-level language model, which we will further complexify in followup videos into a modern Transformer language model, like GPT. In this video, the focus is on (1) introducing torch.Tensor and its subtleties and use in efficiently evaluating neural networks and (2) the overall framework of language modeling that includes model training, sampling, and the evaluation of a loss (e.g. the negative log likelihood for classification).

Links:
- makemore on github: https://github.com/karpathy/makemore
- jupyter notebook I built in this video: https://github.com/karpathy/nn-zero-to-hero/blob/master/lectures/makemore/makemore_part1_bigrams.ipynb
- my website: https://karpathy.ai
- my twitter: https://twitter.com/karpathy
- (new) Neural Networks: Zero to Hero series Discord channel: https://discord.gg/3zy8kqD9Cp , for people who'd like to chat more and go beyond youtube comments

Useful links for practice:
- Python + Numpy tutorial from CS231n https://cs231n.github.io/python-numpy-tutorial/ . We use torch.tensor instead of numpy.array in this video. Their design (e.g. broadcasting, data types, etc.) is so similar that practicing one is basically practicing the other, just be careful with some of the APIs - how various functions are named, what arguments they take, etc. - these details can vary.
- PyTorch tutorial on Tensor https://pytorch.org/tutorials/beginner/basics/tensorqs_tutorial.html
- Another PyTorch intro to Tensor https://pytorch.org/tutorials/beginner/nlp/pytorch_tutorial.html

Exercises:
E01: train a trigram language model, i.e. take two characters as an input to predict the 3rd one. Feel free to use either counting or a neural net. Evaluate the loss; Did it improve over a bigram model?
E02: split up the dataset randomly into 80% train set, 10% dev set, 10% test set. Train the bigram and trigram models only on the training set. Evaluate them on dev and test splits. What can you see?
E03: use the dev set to tune the strength of smoothing (or regularization) for the trigram model - i.e. try many possibilities and see which one works best based on the dev set loss. What patterns can you see in the train and dev set loss as you tune this strength? Take the best setting of the smoothing and evaluate on the test set once and at the end. How good of a loss do you achieve?
E04: we saw that our 1-hot vectors merely select a row of W, so producing these vectors explicitly feels wasteful. Can you delete our use of F.one_hot in favor of simply indexing into rows of W?
E05: look up and use F.cross_entropy instead. You should achieve the same result. Can you think of why we'd prefer to use F.cross_entropy instead?
E06: meta-exercise! Think of a fun/interesting exercise and complete it.

Chapters:
00:00:00 intro
00:03:03 reading and exploring the dataset
00:06:24 exploring the bigrams in the dataset
00:09:24 counting bigrams in a python dictionary
00:12:45 counting bigrams in a 2D torch tensor ("training the model")
00:18:19 visualizing the bigram tensor
00:20:54 deleting spurious (S) and (E) tokens in favor of a single . token
00:24:02 sampling from the model
00:36:17 efficiency! vectorized normalization of the rows, tensor broadcasting
00:50:14 loss function (the negative log likelihood of the data under our model)
01:00:50 model smoothing with fake counts
01:02:57 PART 2: the neural network approach: intro
01:05:26 creating the bigram dataset for the neural net
01:10:01 feeding integers into neural nets? one-hot encodings
01:13:53 the "neural net": one linear layer of neurons implemented with matrix multiplication
01:18:46 transforming neural net outputs into probabilities: the softmax
01:26:17 summary, preview to next steps, reference to micrograd
01:35:49 vectorized loss
01:38:36 backward and update, in PyTorch
01:42:55 putting everything together
01:47:49 note 1: one-hot encoding really just selects a row of the next Linear layer's weight matrix
01:50:18 note 2: model smoothing as regularization loss
01:54:31 sampling from the neural net
01:56:16 conclusion
Hosts: Andrej Karpathy
📅September 07, 2022
⏱️01:57:45
🌐English

Disclaimer: The transcript on this page is for the YouTube video titled "The spelled-out intro to language modeling: building makemore" from "Andrej Karpathy". All rights to the original content belong to their respective owners. This transcript is provided for educational, research, and informational purposes only. This website is not affiliated with or endorsed by the original content creators or platforms.

Watch the original video here: https://www.youtube.com/watch?v=PaCmpygFfXo&list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ&index=2

00:00:00Andrej Karpathy

Hi everyone, hope you're well. Next up, what I'd like to do is I'd like to build out `makemore`. Like `micrograd` before it, `makemore` is a repository that I have on my GitHub webpage. You can look at it. But just like with `micrograd`, I'm going to build it out step by step and I'm going to spell everything out, so we're going to build it out slowly and together.

00:00:20Andrej Karpathy

Now what is `makemore`? `makemore`, as the name suggests, makes more of things that you give it. So here's an example: `names.txt` is an example dataset to `makemore`. And when you look at `names.txt`, you'll find that it's a very large dataset of names. So here's lots of different types of names; in fact, I believe there are 32,000 names that I've sort of found randomly on the government website. And if you train `makemore` on this dataset, it will learn to make more of things like this.

00:00:55Andrej Karpathy

And in particular in this case, that will mean more things that sound name-like but are actually unique names. And maybe if you have a baby and you're trying to assign a name, maybe you're looking for a cool new sounding unique name, `makemore` might help you. So here are some example generations from the neural network once we train it on our dataset. So here's some example unique names that it will generate: Dontel, Irot, Zhendi, and so on. And so all these sound name-like, but they're not, of course, names.

00:01:30Andrej Karpathy

So under the hood, `makemore` is a character-level language model. So what that means is that it is treating every single line here as an example, and within each example, it's treating them all as sequences of individual characters. So `reese` is this example, and that's the sequence of characters. And that's the level on which we are building out `makemore`. And what it means to be a character-level language model then is that it's just sort of modeling those sequences of characters, and it knows how to predict the next character in the sequence.

00:02:03Andrej Karpathy

Now, we're actually going to implement a large number of character-level language models in terms of the neural networks that are involved in predicting the next character in a sequence. So very simple bigram and bag-of-words models, Multilayer Perceptrons, Recurrent Neural Networks, all the way to modern Transformers. In fact, the Transformer that we will build will be basically the equivalent Transformer to GPT-2, if you have heard of GPT. So that's kind of a big deal; it's a modern network. And by the end of the series, you will actually understand how that works on the level of characters.

00:02:36Andrej Karpathy

Now to give you a sense of the extensions here, after characters, we will probably spend some time on the word level so that we can generate documents of words, not just little, you know, segments of characters, but we can generate entire large, much larger documents. And then we're probably going to go into images and image-text networks such as DALL-E, Stable Diffusion, and so on. But for now, we have to start here: character-level language modeling. Let's go.

00:03:03Andrej Karpathy

So like before, we are starting with a completely blank Jupyter notebook page. The first thing is I would like to basically load up the dataset `names.txt`. So we're going to open up `names.txt` for reading, and we're going to read in everything into a massive string. And then because it's a massive string, we'd only like the individual words and put them in the list. So let's call `splitlines` on that string to get all of our words as a Python list of strings. So basically we can look at, for example, the first 10 words, and we have that it's a list of Emma, Olivia, Ava, and so on. And if we look at the top of the page here, that is indeed what we see.

00:03:48Andrej Karpathy

So that's good. This list actually makes me feel that this is probably sorted by frequency. But okay, so these are the words. Now we'd like to actually learn a little bit more about this dataset. Let's look at the total number of words; we expect this to be roughly 32,000. And then what is the, for example, shortest word? So `min(len(w) for w in words)`. So the shortest word will be length two. And `max(len(w) for w in words)`, so the longest word will be 15 characters.
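A minimal sketch of the loading and inspection steps described so far. The inline string is a made-up stand-in for the contents of `names.txt` so the snippet runs on its own; with the real file, the counts and lengths would match the numbers mentioned in the transcript rather than the toy values here.

```python
# Stand-in for the contents of names.txt: one name per line (assumption for this sketch).
text = "emma\nolivia\nava\nisabella\nsophia\n"

# With the real file this would be:
#   words = open('names.txt', 'r').read().splitlines()
words = text.splitlines()

print(words[:3])                   # the first few names
print(len(words))                  # total number of names
print(min(len(w) for w in words))  # length of the shortest name
print(max(len(w) for w in words))  # length of the longest name
```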

00:04:24Andrej Karpathy

So let's now think through our very first language model. As I mentioned, a character-level language model is predicting the next character in a sequence given already some concrete sequence of characters before it. Now what we have to realize here is that every single word here, like "Isabella", is actually quite a few examples packed into that single word. Because what is an existence of a word like "Isabella" in the dataset telling us? Really, it's saying that the character `i` is a very likely character to come first in the sequence of a name. The character `s` is likely to come after `i`. The character `a` is likely to come after `is`. The character `b` is very likely to come after `isa`, and so on, all the way to `a` following `isabell`.

00:05:14Andrej Karpathy

And then there's one more example actually packed in here, and that is that after there's "Isabella", the word is very likely to end. So that's one more sort of explicit piece of information that we have here that we have to be careful with. And so there's a lot packed into a single individual word in terms of the statistical structure of what's likely to follow in these character sequences. And then of course we don't have just an individual word; we actually have 32,000 of these, and so there's a lot of structure here to model.

00:05:44Andrej Karpathy

Now in the beginning, what I'd like to start with is I'd like to start with building a bigram language model. Now in the bigram language model, we're always working with just two characters at a time. So we're only looking at one character that we are given and we're trying to predict the next character in the sequence. So what characters are likely to follow `r`? What characters are likely to follow `a`? And so on. And we're just modeling that kind of a little local structure. And we're forgetting the fact that we may have a lot more information; we're always just looking at the previous character to predict the next one. So it's a very simple and weak language model, but I think it's a great place to start.

00:06:24Andrej Karpathy

So now let's begin by looking at these bigrams in our dataset and what they look like. And these bigrams again are just two characters in a row. So `for w in words`, each `w` here is an individual word, a string. We want to iterate... we're going to iterate this word with consecutive characters, so two characters at a time, sliding it through the word. Now an interesting, nice way, cute way to do this in Python by the way is doing something like this: `for ch1, ch2 in zip(w, w[1:])`. Print `ch1, ch2`.

00:07:04Andrej Karpathy

And let's not do all the words, let's just do the first three words and I'm going to show you in a second how this works. But for now, basically as an example, let's just do the very first word alone: `emma`. You see how we have `emma` and this will just print `e m`, `m m`, `m a`. And the reason this works is because `w` is the string `emma`, `w[1:]` is the string `mma`. And `zip` takes two iterators and it pairs them up and then creates an iterator over the tuples of their consecutive entries. And if any one of these lists is shorter than the other, then it will just halt and return. So basically that's why we return `e m`, `m m`, `m a`. But then because this iterator, the second one here, runs out of elements, `zip` just ends and that's why we only get these tuples. So pretty cute.
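The `zip` trick described above can be sketched on the single word `emma`. The `pairs` variable is only there so the result can be inspected; the transcript prints the pairs directly.

```python
w = "emma"

# zip stops as soon as the shorter iterable (w[1:] == "mma") runs out,
# so we get exactly the consecutive character pairs.
pairs = list(zip(w, w[1:]))

for ch1, ch2 in pairs:
    print(ch1, ch2)
```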

00:08:01Andrej Karpathy

So these are the consecutive elements in the first word. Now we have to be careful because we actually have more information here than just these three examples. As I mentioned, we know that `e` is very likely to come first and we know that `a` in this case is coming last. So one way to do this is basically we're going to create a special array here, `all_chars`, and we're going to hallucinate a special start token here. I'm going to call it like special start `<S>`. So this is a list of one element plus `w`, and then plus a special end character `<E>`. And the reason I'm wrapping the list of `w` here is because `w` is a string `emma`; list of `w` will just have the individual characters in the list.

00:08:51Andrej Karpathy

And then doing this again now, but not iterating over `w`'s but over the characters, will give us something like this. So `<S> e` is likely—so this is a bigram of the start character and `e`—and this is a bigram of the `a` and the special end character `<E>`. And now we can look at, for example, what this looks like for Olivia or Ava. And indeed we can actually potentially do this for the entire dataset but we won't print that; that's going to be too much. But these are the individual character bigrams and we can print them.
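A small sketch of wrapping a word in the special start and end tokens before pairing up characters. The variable name `chs` is chosen here for illustration.

```python
w = "emma"

# list(w) splits the string into characters; the special tokens are
# single list elements, not individual characters.
chs = ["<S>"] + list(w) + ["<E>"]

bigrams = list(zip(chs, chs[1:]))
for ch1, ch2 in bigrams:
    print(ch1, ch2)
```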

00:09:24Andrej Karpathy

Now in order to learn the statistics about which characters are likely to follow other characters, the simplest way in the bigram language models is to simply do it by counting. So we're basically just going to count how often any one of these combinations occurs in the training set, in these words. So we're going to need some kind of a dictionary that's going to maintain some counts for every one of these bigrams. So let's use a dictionary `b`. And this will map these bigrams—so bigram is a tuple of character one, character two. And then `b[bigram]` will be `b.get(bigram, 0) + 1`. So this will basically add up all the bigrams and count how often they occur.
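The counting step might look roughly like this. The three-word list is a stand-in for the first three names of the dataset, and `chs` is an illustrative variable name.

```python
words = ["emma", "olivia", "ava"]  # stand-in for words[:3]

b = {}  # maps a (ch1, ch2) tuple to the number of times it occurs
for w in words:
    chs = ["<S>"] + list(w) + ["<E>"]
    for ch1, ch2 in zip(chs, chs[1:]):
        bigram = (ch1, ch2)
        # .get returns 0 for a bigram we have not seen yet
        b[bigram] = b.get(bigram, 0) + 1

print(b)
```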

00:10:18Andrej Karpathy

Let's keep the printing and let's just inspect what `b` is in this case. And we see that many bigrams occur just a single time. This one allegedly occurred three times. So `a` was an ending character three times, and that's true for all of these words—all of Emma, Olivia, and Ava end with `a`. So that's why this occurred three times.

00:10:58Andrej Karpathy

Let's just run it. And now `b` will have the statistics of the entire dataset. So these are the counts across all the words of the individual bigrams. And we could for example look at some of the most common ones and least common ones. This is kind of gross in Python, but the simplest way I like is to just use `b.items()`. `b.items()` returns (key, value) tuples; in this case, the keys are the character bigrams and the values are the counts. And so then what we want to do is we want to do `sorted` of this. By default, `sorted` sorts on the first item of a tuple, but we want to sort by the values, which are the second element of each key-value tuple. So we want to use `key = lambda kv: kv[1]`. So we want to sort by the count of these elements.

00:12:10Andrej Karpathy

And actually we wanted to go backwards. So here we have: the bigram `q` and `r` occurs only a single time, `dz` occurred only a single time. And when we sort this the other way around, we're going to see the most likely bigrams. So we see that `n` was very often an ending character, many, many times. And apparently `n` almost always follows an `a`. And that's a very likely combination as well. So this is kind of the individual counts that we achieve over the entire dataset.
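Sorting the bigram counts in both directions can be sketched as follows. The small dictionary holds made-up counts for illustration, and `reverse=True` is one way to flip the order; negating the key would work as well.

```python
# Toy counts, not the real dataset statistics.
b = {("a", "<E>"): 3, ("<S>", "e"): 1, ("m", "m"): 2}

# Sort (bigram, count) tuples by the count, i.e. the second element.
ascending = sorted(b.items(), key=lambda kv: kv[1])
descending = sorted(b.items(), key=lambda kv: kv[1], reverse=True)

print(ascending[0])   # least common bigram
print(descending[0])  # most common bigram
```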

00:12:45Andrej Karpathy

Now it's actually going to be significantly more convenient for us to keep this information in a two-dimensional array instead of a Python dictionary. So we're going to store this information in a 2D array. And the rows are going to be the first character of the bigram and the columns are going to be the second character. And each entry in this two-dimensional array will tell us how often that second character follows the first character in the dataset.

00:13:12Andrej Karpathy

So in particular, the array representation that we're going to use, or the library, is that of PyTorch. And PyTorch is a deep learning neural network framework, but part of it is also this `torch.tensor` which allows us to create multi-dimensional arrays and manipulate them very efficiently. So let's import PyTorch, which you can do by `import torch`. And then we can create arrays. So let's create an array of zeros and we give it a size of this array. Let's create a three by five array as an example. This is a three by five array of zeros. And by default, you'll notice `a.dtype`, which is short for data type, is `float32`. So these are single precision floating point numbers. Because we are going to represent counts, let's actually use `dtype` as `torch.int32`. So these are 32-bit integers. So now you see that we have integer data inside this tensor.
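A short sketch of creating the example tensor with the default and with an integer data type. The names `a` and `a_int` are chosen here for illustration.

```python
import torch

a = torch.zeros((3, 5))  # default dtype is float32
print(a.dtype)

# For counts, an integer dtype is a more natural fit.
a_int = torch.zeros((3, 5), dtype=torch.int32)
print(a_int)
```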

00:14:14Andrej Karpathy

Now tensors allow us to really manipulate all the individual entries and do it very efficiently. So for example, if we want to change this bit, we have to index into the tensor. And in particular here, this is the first row—because it's zero-indexed—so this is row index 1 and column index 0, 1, 2, 3. So `a[1, 3]`, we can set that to 1. And then `a` will have a 1 over there. We can of course also do things like this, so now `a` will be 2 over there, or 3. And also we can for example say `a[0, 0]` is 5, and then `a` will have a 5 over here.
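Indexing and updating individual entries, sketched on the same 3 by 5 example tensor:

```python
import torch

a = torch.zeros((3, 5), dtype=torch.int32)

a[1, 3] = 1    # row index 1, column index 3
a[1, 3] += 1   # entries can be incremented in place, so this is now 2
a[0, 0] = 5

print(a)
```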

00:15:00Andrej Karpathy

So that's how we can index into the arrays. Now of course the array that we are interested in is much, much bigger. So for our purposes, we have 26 letters of the alphabet and then we have two special characters `<S>` and `<E>`. So we want 26 plus 2, or 28 by 28 array. And let's call it the capital `N` because it's going to represent sort of the counts. Let me erase this stuff. So that's the array that starts at zeros, 28 by 28.

00:15:30Andrej Karpathy

And now let's copy paste this here. But instead of having a dictionary `b`, which we're going to erase, we now have an `N`. Now the problem here is that we have these characters which are strings, but we have to now basically index into an array and we have to index using integers. So we need some kind of a lookup table from characters to integers. So let's construct such a character array. And the way we're going to do this is we're going to take all the words which is a list of strings, we're going to concatenate all of it into a massive string—so this is just simply the entire dataset as a single string. We're going to pass this to the `set` constructor which takes this massive string and throws out duplicates because sets do not allow duplicates. So `set` of this will just be the set of all the lowercase characters, and there should be a total of 26 of them.

00:16:28Andrej Karpathy

And now we actually don't want a set, we want a list. But we don't want a list sorted in some weird arbitrary way, we want it to be sorted from `a` to `z`. So sorted list. So those are our characters. Now what we want is this lookup table as I mentioned. So let's create a special `stoi`, I will call it. `s` is string or character, and this will be an `stoi` mapping: `{s:i for i,s in enumerate(chars)}`. So `enumerate` basically gives us this iterator over the integer index and the actual element of the list, and then we are mapping the character to the integer. So `stoi` is a mapping from `a` to 0, `b` to 1, etc., all the way to `z` mapping to 25.
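Building the character list and the lookup table can be sketched like this. The three-word list is a toy stand-in, so only 7 distinct letters appear here; on the full dataset the same code should yield all 26.

```python
words = ["emma", "olivia", "ava"]  # toy stand-in for the full list of names

# Join everything into one string, deduplicate with set, then sort.
chars = sorted(list(set("".join(words))))

# Map each character to its integer index.
stoi = {s: i for i, s in enumerate(chars)}

print(chars)
print(stoi)
```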

00:17:24Andrej Karpathy

And that's going to be useful here, but we actually also have to specifically set that `stoi['<S>']` will be 26 and `stoi['<E>']` will be 27, right, because `z` was 25. So those are the lookups. And now we can come here and we can map both character 1 and character 2 to their integers. So this will be `ix1 = stoi[ch1]` and `ix2 = stoi[ch2]`. And now we should be able to do this line but using our array. So `N[ix1, ix2]` (this is the two-dimensional array indexing I've shown you before), and honestly just `+= 1` because everything starts at zero. So this should work and give us a large 28 by 28 array of all these counts.
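Putting the lookup table and the 28 by 28 count array together might look roughly like this. The word list is a toy stand-in, and `chars` is written out as the 26 lowercase letters directly (an assumption that matches what the transcript says the full dataset contains).

```python
import torch

words = ["emma", "olivia", "ava"]  # toy stand-in for the full list of names

chars = "abcdefghijklmnopqrstuvwxyz"  # assumed character set: 26 lowercase letters
stoi = {s: i for i, s in enumerate(chars)}
stoi["<S>"] = 26
stoi["<E>"] = 27

N = torch.zeros((28, 28), dtype=torch.int32)
for w in words:
    chs = ["<S>"] + list(w) + ["<E>"]
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]  # row: first character of the bigram
        ix2 = stoi[ch2]  # column: second character of the bigram
        N[ix1, ix2] += 1

print(N[stoi["a"], stoi["<E>"]])
```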

00:18:15Andrej Karpathy

So if we print `N`, this is the array. But of course, it looks ugly. So let's erase this ugly mess and let's try to visualize it a bit more nicely. So for that, we're going to use a library called `matplotlib`. So `matplotlib` allows us to create figures. So we can do things like `plt.imshow(N)`. So this is the 28x28 array. And there is some structure here, but even this I would say is still pretty ugly. So we're going to try to create a much nicer visualization of it, and I wrote a bunch of code for that.

00:18:49Andrej Karpathy

The first thing we're going to need is we're going to need to invert this array here, this dictionary. So `stoi` is mapping from `s` to `i`, and in `itos` we're going to reverse this dictionary. So iterator of all the items and just reverse that array. So `itos` maps inversely from 0 to `a`, 1 to `b`, etc. So we'll need that.
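Reversing the lookup table is a one-line dictionary comprehension. The three-entry `stoi` below is a shortened stand-in for the real mapping.

```python
stoi = {"a": 0, "b": 1, "c": 2}  # shortened stand-in for the full mapping

# Swap keys and values: integer index -> character.
itos = {i: s for s, i in stoi.items()}

print(itos)
```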

00:19:14Andrej Karpathy

And then here's the code that I came up with to try to make this a little bit nicer. We create a figure, we plot `N`, and then we visualize a bunch of things. Let me just run it so you get a sense of what this is. Okay. So you see here that we have the array spaced out and every one of these is basically like: the bigram `bg` occurs zero times; `bh` occurs 41 times; `aj` occurs 175 times. And so what you can see that I'm doing here is first I show that entire array, and then I iterate over all the individual little cells here and I create a character string here which is the inverse mapping `itos` of the integer `i` and the integer `j`. So those are the bigrams in a character representation. And then I plot just the bigram text and then I plot the number of times that this bigram occurs.

00:20:16Andrej Karpathy

Now the reason that there's a `.item()` here is because when you index into these arrays, these are torch tensors. You see that we still get a tensor back. So the type of this thing, you'd think it would be just an integer 149, but it's actually a `torch.tensor`. And so if you do `.item()`, then it will pop out that individual integer. So it will just be 149. So that's what's happening there. And these are just some options to make it look nice.
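A quick illustration of why `.item()` is needed. The small tensor and the position of the 149 are made up for this sketch.

```python
import torch

N = torch.zeros((3, 3), dtype=torch.int32)
N[1, 2] = 149

x = N[1, 2]
print(type(x))         # indexing a single entry still returns a tensor
print(x.item())        # .item() extracts the plain Python number
print(type(x.item()))
```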

00:20:45Andrej Karpathy

So what is the structure of this array? We have all these counts and we see that some of them occur often and some of them do not occur often. Now if you scrutinize this carefully, you will notice that we're not actually being very clever. That's because when you come over here, you'll notice that for example, we have an entire row of completely zeros. And that's because the end character `<E>` is never possibly going to be the first character of a bigram because we're always placing these end tokens all at the end of the bigram. Similarly, we have an entire column of zeros here because the `<S>` character will never possibly be the second element of a bigram because we always start with `<S>` and we end with `<E>` and we only have the words in between. So we have an entire column of zeros, an entire row of zeros. And in this little two by two matrix here as well, the only one that can possibly happen is if `<E>` directly follows `<S>`. That can be non-zero if we have a word that has no letters. So in that case there's no letters in the word, it's an empty word and we just have `<S>` followed by `<E>`. But the other ones are just not possible.

00:21:50Andrej Karpathy

And so we're basically wasting space. And not only that, but the `<S>` and the `<E>` are getting very crowded here. I was using these brackets because there's convention in Natural Language Processing to use these kinds of brackets to denote special tokens. But we're going to use something else. So let's fix all this and make it prettier. We're not actually going to have two special tokens; we're only going to have one special token. So we're going to have an `N` by `N` array of 27 by 27 instead. Instead of having two, we will just have one and I will call it a dot (`.`).

00:22:27Andrej Karpathy

Okay. Let me swing this over here. Now one more thing that I would like to do is I would actually like to make this special character have position zero, and I would like to offset all the other letters. I find that a little bit more pleasing. So we need a plus one here so that the first character, which is `a`, will start at one. So `stoi` will now be: `a` starts at one and dot is 0. And `itos` of course we're not changing this because `itos` just creates a reverse mapping and this will work fine. So 1 is `a`, 2 is `b`, 0 is dot. So we've reversed that here. We have a dot and a dot. This should work fine. Make sure I start at zeros, count, and then here we don't go up to 28, we go up to 27. And this should just work.
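The revised setup with a single `.` token at index 0 can be sketched as follows. The word list is a toy stand-in and `chars` is written out as the 26 lowercase letters (an assumption consistent with the transcript).

```python
import torch

words = ["emma", "olivia", "ava"]  # toy stand-in for the full list of names

chars = "abcdefghijklmnopqrstuvwxyz"  # assumed character set
stoi = {s: i + 1 for i, s in enumerate(chars)}  # letters are offset by one
stoi["."] = 0                                   # the single special token
itos = {i: s for s, i in stoi.items()}

N = torch.zeros((27, 27), dtype=torch.int32)
for w in words:
    chs = ["."] + list(w) + ["."]  # the same token marks start and end
    for ch1, ch2 in zip(chs, chs[1:]):
        N[stoi[ch1], stoi[ch2]] += 1

print(N[0])  # row 0: counts of each letter starting a word
```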

00:23:30Andrej Karpathy

Okay. So we see that the bigram `..` never happened; it's at zero because we don't have empty words. Then this row here now is just very simply the counts for all the first letters. So `j` starts a word, `h` starts a word, `i` starts a word, etc. And then these are all the ending characters. And in between we have the structure of what characters follow each other. So this is the counts array of our entire dataset.

00:24:00Andrej Karpathy

So this array actually has all the information necessary for us to actually sample from this bigram character-level language model. And roughly speaking, what we're going to do is we're just going to start following these probabilities and these counts and we're going to start sampling from the model. So in the beginning of course, we start with the dot, the start token `.`. So to sample the first character of a name, we're looking at this row here. So we see that we have the counts, and those counts are telling us how often any one of these characters is to start a word. So if we take this `N` and we grab the first row, we can do that by using just indexing as zero, and then using this notation `:` for the rest of that row. So `N[0, :]` is indexing into the zeroth row and then it's grabbing all the columns. And so this will give us a one-dimensional array of the first row. The shape of this is 27; it's just the row of 27. And the other way that you can do this also is you don't need to actually give this; you just grab the zeroth row like this `N[0]`. This is equivalent.

00:25:28Andrej Karpathy

Now these are the counts. And now what we'd like to do is we'd like to basically sample from this. Since these are the raw counts, we actually have to convert this to probabilities. So we create a probability vector. So we'll take `N[0]` and we'll actually convert this to float first. Okay, so these integers are converted to floating point numbers. And the reason we're creating floats is because we're about to normalize these counts. So to create a probability distribution here, we want to divide. We basically want to do `p = p / p.sum()`. And now we get a vector of smaller numbers and these are now probabilities. So of course because we divided by the sum, the sum of `p` now is 1. So this is a nice proper probability distribution; it sums to 1 and this is giving us the probability for any single character to be the first character of a word.
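Turning a row of counts into a probability vector, sketched on a made-up four-entry row standing in for `N[0]`:

```python
import torch

# Toy row of counts standing in for N[0].
counts = torch.tensor([0, 4, 1, 3], dtype=torch.int32)

p = counts.float()  # convert to floats before dividing
p = p / p.sum()     # normalize so the entries sum to 1

print(p)
print(p.sum())
```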

00:26:27Andrej Karpathy

So now we can try to sample from this distribution. To sample from these distributions, we're going to use `torch.multinomial` which I've pulled up here. So `torch.multinomial` returns samples from the multinomial probability distribution, which is a complicated way of saying you give me probabilities and I will give you integers which are sampled according to the probability distribution. So this is the signature of the method. And to make everything deterministic, we're going to use a generator object in PyTorch. So this makes everything deterministic so when you run this on your computer, you're going to get the exact same results that I'm getting here on my computer.

00:27:12Andrej Karpathy

So let me show you how this works. Here's the deterministic way of creating a torch generator object, seeding it with some number that we can agree on. So that seeds a generator, gives us an object `g`. And then we can pass that `g` to a function that creates random numbers. `torch.rand` creates random numbers, three of them, and it's using this generator object as a source of randomness. So without normalizing it, I can just print. This is sort of like numbers between 0 and 1 that are random according to this thing. And whenever I run it again, I'm always going to get the same result because I keep using the same generator object which I'm seeding here. And then if I divide to normalize, I'm going to get a nice probability distribution of just three elements.
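A sketch of the seeded generator described above. The seed value is an arbitrary choice for this example; any fixed integer gives reproducible results on a given PyTorch build.

```python
import torch

seed = 2147483647  # arbitrary fixed seed for this sketch

g = torch.Generator().manual_seed(seed)
p = torch.rand(3, generator=g)  # three random numbers in [0, 1)
p = p / p.sum()                 # normalize into a probability distribution
print(p)

# Re-seeding a fresh generator reproduces exactly the same numbers.
g2 = torch.Generator().manual_seed(seed)
p2 = torch.rand(3, generator=g2)
p2 = p2 / p2.sum()
```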

00:28:07Andrej Karpathy

And then we can use `torch.multinomial` to draw samples from it. So this is what that looks like. `torch.multinomial` will take the torch tensor of probability distributions. Then we can ask for a number of samples, let's say 20. `replacement=True` means that when we draw an element, we can draw it and then we can put it back into the list of eligible indices to draw again. And we have to specify `replacement` as `True` because by default, for some reason, it's `False`. And I think, you know, it's just something to be careful with. And the generator is passed in here so we're going to always get deterministic results, the same results.

00:28:51Andrej Karpathy

So if I run these two, we're going to get a bunch of samples from this distribution. Now you'll notice here that the probability for the first element in this tensor is 60%. So in these 20 samples, we'd expect 60% of them to be zero. We'd expect 30% of them to be one. And because the element index two has only 10% probability, very few of these samples should be two. And indeed we only have a small number of twos. And we can sample as many as we'd like. And the more we sample, the more these numbers should roughly have the distribution here. So we should have lots of zeros, half as many ones, and we should have three times as few twos. So you see that we have very few twos, we have some ones, and most of them are zero. So that's what `torch.multinomial` is doing for us here.
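Drawing samples with `torch.multinomial` can be sketched as below. The probabilities are hard-coded to the rounded 60/30/10 split mentioned in the transcript, and the seed is an arbitrary choice.

```python
import torch

g = torch.Generator().manual_seed(2147483647)  # arbitrary fixed seed

# Rounded stand-in for the three-element distribution discussed above.
p = torch.tensor([0.6, 0.3, 0.1])

# 20 draws with replacement; each entry is an index 0, 1 or 2.
samples = torch.multinomial(p, num_samples=20, replacement=True, generator=g)
print(samples)

# With many more draws, the empirical frequencies approach p.
many = torch.multinomial(p, num_samples=10000, replacement=True, generator=g)
frac_zero = (many == 0).float().mean().item()
print(frac_zero)
```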

00:30:01Andrej Karpathy

We are interested in this row. We've created this `p` here and now we can sample from it. So if we use the same seed and then we sample from this distribution, let's just get one sample. Then we see that the sample is say 13. So this will be the index. And let's—you see how it's a tensor that wraps 13—we again have to use `.item()` to pop out that integer. And now `index` would be just the number 13. And of course we can do `itos[ix]` to figure out exactly which character we're sampling here. We're sampling `m`. So we're saying that the first character is `m` in our generation. And just looking at the row here, `m` was drawn and we can see that `m` actually starts a large number of words. `m` started 2,500 words out of 32,000 words, so a bit less than 10% of the words start with `m`. So this was actually a fairly likely character to draw.

00:31:15Andrej Karpathy

So that would be the first character of our word. And now we can continue to sample more characters because now we know that `m` started, `m` is already sampled. So now to draw the next character, we will come back here and we will look for the row that starts with `m`. So you see `m`, and we have a row here. So we see that `m.` is 516, `ma` is this many, `mb` is this many, etc. So these are the counts for the next character, and that's the row we are now going to sample from.

00:31:48Andrej Karpathy

So I think we are ready to actually just write out the loop because I think you're starting to get a sense of how this is going to go. We always begin at index 0 because that's the start token. And then `while True`: we're going to grab the row corresponding to the index that we're currently on. So that's `p`. So that's `N[ix]` converted to float is our `p`. Then we normalize this `p` to sum to one. Then we need this generator object. Now we're going to initialize up here. And we're going to draw a single sample from this distribution. And then this is going to tell us what index is going to be next. If the index sampled is 0, then that's now the end token, so we will break. Otherwise, we are going to print `itos[ix]`. And that's pretty much it.

00:33:10Andrej Karpathy

This should work. Okay, `mor.`. So that's the name that we've sampled. We started with `m`, the next step was `o`, then `r`, then dot. And this dot, we hit here as well. So let's now do this a few times. So let's actually create an `out` list here, and instead of printing we're going to append. So `out.append(this_character)`. And then here let's just print it at the end, so let's just join up all the `out`s. And we're just going to print `mor`. Okay, now we're always getting the same result because of the generator. So if we want to do this a few times, we can go `for i in range(10)` and sample 10 names.
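
The sampling loop can be sketched end to end as follows. This assumes a tiny made-up word list in place of the real names dataset, and rebuilds `stoi`, `itos`, and the counts matrix `N` from it so the block runs on its own; the seed is arbitrary.

```python
import torch

words = ["emma", "olivia", "ava", "mia", "mora"]   # toy stand-in for the names dataset
chars = sorted(set("".join(words)))
stoi = {ch: i + 1 for i, ch in enumerate(chars)}
stoi["."] = 0                                      # '.' marks both start and end
itos = {i: ch for ch, i in stoi.items()}

# Count how often each character follows each other character.
V = len(stoi)
N = torch.zeros((V, V), dtype=torch.int32)
for w in words:
    chs = ["."] + list(w) + ["."]
    for ch1, ch2 in zip(chs, chs[1:]):
        N[stoi[ch1], stoi[ch2]] += 1

g = torch.Generator().manual_seed(2147483647)      # arbitrary seed
names = []
for _ in range(10):
    out = []
    ix = 0                                         # always begin at the start token
    while True:
        p = N[ix].float()
        p = p / p.sum()                            # row of counts -> probabilities
        ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
        if ix == 0:                                # sampled the end token
            break
        out.append(itos[ix])
    names.append("".join(out))

print(names)
```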

00:34:05Andrej Karpathy

And these are the names that we're getting out. Let's do 20. I'll be honest with you, this doesn't look right. So I stared a few minutes to convince myself that it actually is right. The reason these samples are so terrible is that the bigram language model is actually, look, just like really terrible. We can generate a few more here. And you can see that they're kind of name-like a little bit, like "Yanu", "O'Reilly", etc., but they're just like totally messed up. And I mean the reason that this is so bad (like we're generating `h` as a name) is that you have to think through it from the model's eyes. It doesn't know that this `h` is the very first `h`. All it knows is that `h` was the previous character, and now how likely is it that `h` is the last character? Well, it's somewhat likely. And so it just makes it the last character. It doesn't know that there were other things before it or there were not other things before it. And so that's why it's generating all these nonsense names.

00:35:08Andrej Karpathy

Another way to do this, to convince yourself that this is actually doing something reasonable even though it's so terrible, is: these little `p`s here are 27, right? So how about if we did something like this: instead of `p` having any structure whatsoever, how about if `p` was just `torch.ones(27) / 27`? By default this is a `float32` so this is fine. So what I'm doing here is this is the uniform distribution which will make everything equally likely. And we can sample from that. So let's see if that does any better.

00:35:54Andrej Karpathy

Okay, so this is what you have from a model that is completely untrained where everything is equally likely. So it's obviously garbage. And then if we have a trained model which is trained on just bigrams, this is what we get. So you can see that it is more name-like, it is actually working, it's just bigram is so terrible and we have to do better.

00:36:15Andrej Karpathy

Now next I would like to fix an inefficiency that we have going on here. Because what we're doing here is we're always fetching a row of `N` from the counts matrix up ahead, and then we're always doing the same things: we're converting to float and we're dividing and we're doing this every single iteration of this loop. And we just keep renormalizing these rows over and over again and it's extremely inefficient and wasteful. So what I'd like to do is I'd like to actually prepare a matrix capital `P` that will just have the probabilities in it. So in other words, it's going to be the same as the capital `N` matrix here of counts, but every single row will have the row of probabilities that is normalized to 1, indicating the probability distribution for the next character given the character before it, as defined by which row we're in.

00:37:01Andrej Karpathy

So basically what we'd like to do is we'd like to just do it up front here, and then we would like to just use that row here. So here we would like to just do `p = P[ix]` instead. Okay. The other reason I want to do this is not just for efficiency, but also I would like us to practice these n-dimensional tensors and I'd like us to practice their manipulation and especially something that's called broadcasting that we'll go into in a second. We're actually going to have to become very good at these tensor manipulations because if we're going to build out all the way to Transformers, we're going to be doing some pretty complicated array operations for efficiency and we need to really understand that and be very good at it.

00:37:42Andrej Karpathy

So intuitively what we want to do is we first want to grab the floating point copy of `N`. And I'm mimicking the line here basically. And then we want to divide all the rows so that they sum to 1. So we'd like to do something like this: `P / P.sum()`. But now we have to be careful because `P.sum()` actually produces a sum—sums up all of the counts of this entire matrix `N` and gives us a single number of just the summation of everything. So that's not the way we want to divide. We want to simultaneously and in parallel divide all the rows by their respective sums.

00:38:30Andrej Karpathy

So what we have to do now is we have to go into documentation for `torch.sum` and we can scroll down here to a definition that is relevant to us, which is where we don't only provide an input array that we want to sum, but we also provide the dimension along which we want to sum. And in particular, we want to sum up over rows. Right. Now one more argument that I want you to pay attention to here is the `keepdim` is `False`. If `keepdim` is `True`, then the output tensor is of the same size as input except of course the dimension along which is summed, which will become just one. But if you pass in `keepdim` as `False`, then this dimension is squeezed out. And so `torch.sum` not only does the sum and collapses dimension to be of size one, but in addition it does what's called a squeeze where it squeezes out that dimension.

00:39:25Andrej Karpathy

So basically what we want here is we instead want to do `P.sum(axis)`. And in particular notice that `P.shape` is 27 by 27. So when we sum up across axis zero, then we would be taking the zeroth dimension and we would be summing across it. So when `keepdim` is `True`, then this thing will not only give us the counts across... along the columns, but notice that basically the shape of this is 1 by 27; we just get a row vector. And the reason we get a row vector here again is because we passed in zero dimension, so this zero dimension becomes one and we've done a sum. And we get a row. And so basically we've done the sum this way, vertically, and arrived at just a single 1 by 27 vector of counts.

00:40:15Andrej Karpathy

What happens when you take out `keepdim` is that we just get 27. So it squeezes out that dimension and we just get a one-dimensional vector of size 27. Now we don't actually want 1 by 27 row vector because that gives us the counts or the sums across the columns. We actually want to sum the other way along dimension one, and you'll see that the shape of this is 27 by one. So it's a column vector; it's a 27 by one vector of counts. Okay. And that's because what's happened here is that we're going horizontally and this 27 by 27 matrix becomes a 27 by 1 array.

00:41:03Andrej Karpathy

Now you'll notice by the way that the actual numbers of these counts are identical. And that's because this special array of counts here comes from bigram statistics and actually it just so happens by chance—or because of the way this array is constructed—that the sums along the columns or along the rows, horizontally or vertically, is identical. But actually what we want to do in this case is we want to sum across the rows horizontally. So what we want here is `P.sum(1, keepdim=True)`. 27 by 1 column vector.
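
The effect of the dimension argument and `keepdim` on shapes can be checked directly. This sketch uses a small 3 by 4 stand-in instead of the 27 by 27 matrix, so that rows and columns are easy to tell apart.

```python
import torch

# Small non-square stand-in for P, so the two axes are distinguishable.
P = torch.arange(12, dtype=torch.float32).reshape(3, 4)

col_sums_kept = P.sum(0, keepdim=True)   # shape (1, 4): sums down each column, as a row vector
col_sums = P.sum(0)                      # shape (4,): same numbers, dimension squeezed out
row_sums_kept = P.sum(1, keepdim=True)   # shape (3, 1): sums across each row, as a column vector
row_sums = P.sum(1)                      # shape (3,): same numbers, dimension squeezed out

print(col_sums_kept.shape, col_sums.shape, row_sums_kept.shape, row_sums.shape)
```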

00:41:39Andrej Karpathy

And now what we want to do is we want to divide by that. Now we have to be careful here again. Is it possible to take... what's `P.shape`? You see here 27 by 27. Is it possible to take a 27 by 27 array and divide it by what is a 27 by 1 array? Is that an operation that you can do? And whether or not you can perform this operation is determined by what's called broadcasting rules. So if you just search "broadcasting semantics in torch", you'll notice that there's a special definition for whether or not these two arrays can be combined in a binary operation like division.

00:42:23Andrej Karpathy

So the first condition is: each tensor has at least one dimension, which is the case for us. And then when iterating over the dimension sizes starting at the trailing dimension, the dimension sizes must either be equal, one of them is one, or one of them does not exist. Okay. So let's do that. We need to align the two arrays and their shapes, which is very easy because both of these shapes have two elements so they're aligned. Then we iterate from the right and going to the left. Each dimension must be either equal, one of them is a one, or one of them does not exist. So in this case they're not equal, but one of them is a one, so this is fine. And then this dimension they're both equal. So this is fine. So all the dimensions are fine and therefore this operation is broadcastable. So that means that this operation is allowed.
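
The right-to-left check walked through above can be written as a few lines of plain Python. `broadcastable` is a hypothetical helper made up for this sketch, not a PyTorch function, and it only covers the per-dimension rule.

```python
def broadcastable(shape_a, shape_b):
    """Hypothetical helper: True if the two shapes pass the per-dimension rule."""
    # Align the shapes on the right and walk from the trailing dimension to the left.
    for a, b in zip(reversed(shape_a), reversed(shape_b)):
        # Each pair must be equal, or one of the two must be 1.
        if a != b and a != 1 and b != 1:
            return False
    # Dimensions that exist in only one shape ("does not exist") are always fine.
    return True


print(broadcastable((27, 27), (27, 1)))   # the division in the lecture
print(broadcastable((27, 27), (27,)))     # also allowed, the missing dimension is treated as 1
print(broadcastable((27, 27), (26,)))     # trailing sizes 27 and 26 conflict
```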

00:43:14Andrej Karpathy

And what is it that these arrays do when you divide a 27 by 27 array by a 27 by 1 array? What it does is that it takes this dimension of size one and it stretches it out. It copies it to match 27 here in this case. So in our case, it takes this column vector which is 27 by 1 and it copies it 27 times to make these both be 27 by 27 internally, you can think of it that way. And so it copies those counts and then it does an element-wise division. Which is what we want, because we want to divide by these counts on every single one of the columns in this matrix. So this actually we expect will normalize every single row.

00:44:00Andrej Karpathy

And we can check that this is true by taking the first row, for example, and taking its sum. We expect this to be 1, because it's now normalized. And then, if we actually correctly normalized all the rows, we expect to get the exact same result here as before. So let's run this. It's the exact same result. This is correct.

00:44:23Andrej Karpathy

So now I would like to scare you a little bit. You actually have to—like I basically encourage you very strongly to read through broadcasting semantics and I encourage you to treat this with respect. And it's not something to play fast and loose with. It's something to really respect, really understand, and look up maybe some tutorials for broadcasting and practice it and be careful with it because you can very quickly run into bugs. Let me show you what I mean.

00:44:47Andrej Karpathy

You see how here we have `P.sum(1, keepdim=True)`. The shape of this is 27 by 1. Let me take out this line just so we have the `N` and then we can see the counts. We can see that this is all the counts across all the rows and it's a 27 by 1 column vector. Right. Now suppose that I tried to do the following but I erase `keepdim=True` here. What does that do? If `keepdim` is not `True`, it's `False`. Then remember according to documentation it gets rid of this dimension one; it squeezes it out. So basically we just get all the same counts, the same result, except the shape of it is not 27 by 1, it is just 27. The one disappears. But all the counts are the same.

00:45:34Andrej Karpathy

So you'd think that this divide would work. First of all, can we even write this and will it... is it even expected to run? Is it broadcastable? Let's determine if this result is broadcastable. `P.sum(1)` is shape 27. This is 27 by 27. So 27 by 27 broadcasting into 27. So now rules of broadcasting number one: align all the dimensions on the right. Done. Now iteration over all the dimensions starting from the right going to the left. All the dimensions must either be equal, one of them must be one, or one of them does not exist. So here they are all equal. Here the dimension does not exist. So internally what broadcasting will do is it will create a one here. And then we see that one of them is a one and this will get copied and this will run. This will broadcast.

00:46:32Andrej Karpathy

Okay so you'd expect this to work because we are... this broadcasts and we can divide this. Now if I run this, you'd expect it to work, but it doesn't. You actually get garbage; you get a wrong result because this is actually a bug. This `keepdim=True` makes it work. This is a bug. In both cases, we are doing the correct counts, we are summing up across the rows. But `keepdim` is saving us and making it work. So in this case, I'd like to encourage you to potentially like pause this video at this point and try to think about why this is buggy and why the `keepdim` was necessary here.

00:47:22Andrej Karpathy

Okay. So the reason for this... I was trying to hint at it here when I was sort of giving you a bit of a hint on how this works. This 27 vector, internally inside the broadcasting, becomes a 1 by 27. And 1 by 27 is a row vector, right? And now we are dividing a 27 by 27 array by a 1 by 27 array. And torch will replicate this dimension. So basically, it will take this row vector and it will copy it vertically now 27 times, so that both are 27 by 27 and line up exactly, and then it element-wise divides. And so basically what's happening here is we're actually normalizing the columns instead of normalizing the rows.

00:48:10Andrej Karpathy

So you can check that what's happening here is that `P[0]`, which is the first row of `P`, `.sum()` is not one, it's seven. It is the first column as an example that sums to one. So to summarize, where does the issue come from? The issue comes from the silent adding of a dimension here because in broadcasting rules you align on the right and go from right to left and if a dimension doesn't exist you create it. So that's where the problem happens. We still did the counts correctly, we did the counts across the rows and we got the counts on the right here as a column vector. But because `keepdim` was `False`, this dimension was discarded and now we just have a vector of 27. And because of broadcasting the way it works, this vector of 27 suddenly becomes a row vector. And then this row vector gets replicated vertically and at every single point we are dividing by the count in the opposite direction.

00:49:08Andrej Karpathy

So this thing just doesn't work. This needs to be `keepdim=True` in this case. So then we have that `P[0]` is normalized. And conversely the first column you'd expect to potentially not be normalized. And this is what makes it work. So pretty subtle and hopefully this helps to scare you that you should have a respect for broadcasting. Be careful, check your work, and understand how it works under the hood and make sure that it's broadcasting in the direction that you like. Otherwise you're going to introduce very subtle bugs, very hard to find bugs, and just be careful.
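
The bug is easy to reproduce on a tiny matrix. The 2 by 2 counts below are made up for this sketch, and are chosen to be symmetric so that row sums equal column sums, mirroring the property of the bigram counts mentioned earlier.

```python
import torch

# Toy symmetric counts, so row sums equal column sums (as with the lecture's N).
N = torch.tensor([[1.0, 3.0],
                  [3.0, 4.0]])

# (2, 2) / (2, 1): each row is divided by its own sum.
good = N / N.sum(1, keepdim=True)

# (2, 2) / (2,): the vector is treated as (1, 2), so column j is divided by row sum j.
bad = N / N.sum(1)

print(good.sum(1))   # rows of `good` sum to 1
print(bad.sum(1))    # rows of `bad` do not sum to 1
print(bad.sum(0))    # columns of `bad` sum to 1 instead
```

Both expressions run without any error or warning, which is what makes this kind of bug hard to spot; checking a row sum after normalizing is a cheap safeguard.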

00:49:46Andrej Karpathy

One more note on efficiency: we don't want to be doing this here because this creates a completely new tensor that we store into `P`. We prefer to use in-place operations if possible. So this `P /= ...` would be an in-place operation. It has the potential to be faster, it doesn't create new memory under the hood. And then let's erase this, we don't need it. And let's also just do fewer just so I'm not wasting space.
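
A small sketch of the in-place variant, on made-up counts. Comparing `data_ptr()` before and after is just one way to confirm that the tensor's memory was reused.

```python
import torch

P = torch.tensor([[1.0, 3.0],
                  [2.0, 2.0]])          # toy counts, already floating point

ptr_before = P.data_ptr()               # address of the underlying storage
P /= P.sum(1, keepdim=True)             # in-place: writes the result back into P
same_storage = P.data_ptr() == ptr_before

print(P, same_storage)
```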

00:50:14Andrej Karpathy

Okay, so we're actually in a pretty good spot now. We trained a bigram language model and we trained it really just by counting how frequently any pairing occurs and then normalizing so that we get a nice probability distribution. So really these elements of this array `P` are really the parameters of our bigram language model giving us and summarizing the statistics of these bigrams. So we train the model and then we know how to sample from a model; we just iteratively sample the next character and feed it in each time and get a next character.

00:50:46Andrej Karpathy

Now what I'd like to do is I'd like to somehow evaluate the quality of this model. We'd like to somehow summarize the quality of this model into a single number: how good is it at predicting the training set? And as an example, so in the training set we can evaluate now the training loss. And this training loss is telling us about sort of the quality of this model in a single number just like we saw in `micrograd`. So let's try to think through the quality of the model and how we would evaluate it.

00:51:16Andrej Karpathy

Basically what we're going to do is we're going to copy paste this code that we previously used for counting. And let me just print these bigrams first. We're going to use f-strings and I'm going to print character one followed by character two. These are the bigrams. And then I don't want to do it for all the words, just do the first three words. So here we have Emma, Olivia, and Ava bigrams. Now what we'd like to do is we'd like to basically look at the probability that the model assigns to every one of these bigrams. So in other words, we can look at the probability, which is summarized in the matrix `P`, of `P[ix1, ix2]`. And then we can print it here as probability. And because these probabilities print with way too many digits, let me use `:.4f` to truncate them a bit.

00:52:09Andrej Karpathy

So what do we have here? Right. We're looking at the probabilities that the model assigns to every one of these bigrams in the dataset. And so we can see some of them are 4%, 3%, etc. Just to have a measuring stick in our mind by the way, we have 27 possible characters or tokens. And if everything was equally likely, then you'd expect all these probabilities to be roughly 4%. So anything above 4% means that we've learned something useful from these bigram statistics. And you see that roughly some of these are 4%, but some of them are as high as 40%, 35%, and so on. So you see that the model actually assigned a pretty high probability to whatever's in the training set and so that's a good thing. Basically if you have a very good model, you'd expect that these probabilities should be near one because that means that your model is correctly predicting what's going to come next, especially on the training set where you trained your model.

00:53:03Andrej Karpathy

So now we'd like to think about how can we summarize these probabilities into a single number that measures the quality of this model. Now when you look at the literature into Maximum Likelihood Estimation and statistical modeling and so on, you'll see that what's typically used here is something called the likelihood. And the likelihood is the product of all of these probabilities. And so the product of all these probabilities is the likelihood and it's really telling us about the probability of the entire dataset assigned by the model that we've trained. And that is a measure of quality. So the product of these should be as high as possible when you are training the model. And when you have a good model, your product of these probabilities should be very high.

00:53:50Andrej Karpathy

Now because the product of these probabilities is an unwieldy thing to work with—you can see that all of them are between zero and one, so your product of these probabilities will be a very tiny number—so for convenience, what people work with usually is not the likelihood but they work with what's called the log likelihood. So the product of these is the likelihood; to get the log likelihood we just have to take the log of the probability. And so the log of the probability here, I have the log of $x$ from zero to one. The log is a, you see here, monotonic transformation of the probability where if you pass in one, you get zero. So probability one gets your log probability of zero. And then as you go lower and lower probability, the log will grow more and more negative until all the way to negative infinity at zero.

00:54:41Andrej Karpathy

So here we have a `logprob` which is really just a `torch.log` of probability. Let's print it out to get a sense of what that looks like. `logprob`, also `.4f`. Okay. So as you can see, when we plug in numbers that are very close to one, some of our higher numbers, we get closer and closer to zero. And then if we plug in very bad probabilities, we get a more and more negative number. That's bad.

00:55:10Andrej Karpathy

And the reason we work with this is to a large extent convenience, right? Because we have mathematically that if you have some product $a \times b \times c$ of all these probabilities, right, the likelihood is the product of all these probabilities. Then the log of these is just $\log(a) + \log(b) + \log(c)$, if you remember your logs from your high school or undergrad and so on. So we have that basically the likelihood is the product of probabilities; the log likelihood is just the sum of the logs of the individual probabilities.
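
The identity can be checked numerically with the standard library alone. The three probabilities are made-up values in the range mentioned earlier, not figures from the notebook.

```python
import math

probs = [0.04, 0.4, 0.35]                      # illustrative bigram probabilities

likelihood = math.prod(probs)                  # product of the probabilities
log_likelihood = sum(math.log(p) for p in probs)   # sum of the log probabilities

# The log of the product equals the sum of the logs (up to floating point error).
print(math.log(likelihood), log_likelihood)
```

With thousands of bigrams the product shrinks toward zero very quickly, while the sum of logs stays in a comfortable numeric range.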

00:55:48Andrej Karpathy

So `log_likelihood` starts at zero. And then `log_likelihood` here we can just accumulate simply. And in the end we can print this. Print the log likelihood. f-strings, maybe you're familiar with this. So log likelihood is -38. Okay. Now we actually want... so how high can log likelihood get? It can go to zero. So when all the probabilities are one, log likelihood will be zero. And then when all the probabilities are lower, this will grow more and more negative.

00:56:39Andrej Karpathy

Now we don't actually like this because what we'd like is a loss function, and a loss function has the semantics that low is good because we're trying to minimize the loss. So we actually need to invert this and that's what gives us something called the negative log likelihood. Negative log likelihood is just negative of the log likelihood. These are f-strings by the way if you'd like to look this up. `nll = ...`. So negative log likelihood now is just negative of it. And so the negative log likelihood is a very nice loss function because the lowest it can get is zero, and the higher it is, the worse off the predictions are that you're making.

00:57:26Andrej Karpathy

And then one more modification to this that sometimes people do is that for convenience, they actually like to normalize by... they like to make it an average instead of a sum. And so here let's just keep some counts as well. So `n += 1`, starts at zero. And then here we can have sort of like a normalized log likelihood. If we just normalize it by the count, then we will sort of get the average log likelihood. So this would be usually our loss function here is what we would use. So our loss function for the training set assigned by the model is 2.4. That's the quality of this model. And the lower it is the better off we are, and the higher it is the worse off we are.
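
Putting the pieces together, the average negative log likelihood can be computed as in the sketch below. It assumes a three-word toy dataset and rebuilds `N` and `P` from it, so the loss it prints will not match the lecture's 2.4.

```python
import torch

words = ["emma", "olivia", "ava"]             # toy stand-in for the training set
chars = sorted(set("".join(words)))
stoi = {ch: i + 1 for i, ch in enumerate(chars)}
stoi["."] = 0

# Bigram counts, then row-normalized probabilities.
V = len(stoi)
N = torch.zeros((V, V), dtype=torch.int32)
for w in words:
    chs = ["."] + list(w) + ["."]
    for ch1, ch2 in zip(chs, chs[1:]):
        N[stoi[ch1], stoi[ch2]] += 1
P = N.float()
P /= P.sum(1, keepdim=True)

# Accumulate log probabilities over every bigram in the data.
log_likelihood = 0.0
n = 0
for w in words:
    chs = ["."] + list(w) + ["."]
    for ch1, ch2 in zip(chs, chs[1:]):
        prob = P[stoi[ch1], stoi[ch2]]
        log_likelihood += torch.log(prob)
        n += 1

nll = -log_likelihood                          # negative log likelihood (lower is better)
loss = (nll / n).item()                        # average over the number of bigrams
print(f"bigrams: {n}, average nll: {loss:.4f}")
```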

00:58:14Andrej Karpathy

And the job of our, you know, training is to find the parameters that minimize the negative log likelihood loss. And that would be like a high quality model. Okay, so to summarize I actually wrote it out here. So our goal is to maximize likelihood which is the product of all the probabilities assigned by the model. And we want to maximize this likelihood with respect to the model parameters. And in our case the model parameters here are defined in the table; these numbers, the probabilities, are the model parameters sort of in our bigram language models so far. But you have to keep in mind that here we are storing everything in a table format—the probabilities—but what's coming up as a brief preview is that these numbers will not be kept explicitly but these numbers will be calculated by a neural network. So that's coming up. And we want to change and tune the parameters of these neural networks; we want to change these parameters to maximize the likelihood, the product of the probabilities.

00:59:13Andrej Karpathy

Now maximizing the likelihood is equivalent to maximizing the log likelihood because log is a monotonic function. Here's the graph of log. And basically all it is doing is it's just scaling your... you can look at it as just a scaling of the loss function. And so the optimization problem here and here are actually equivalent because this is just scaling, you can look at it that way. And so these are two identical optimization problems. Maximizing the log-likelihood is equivalent to minimizing the negative log likelihood. And then in practice, people actually minimize the average negative log likelihood to get numbers like 2.4. And then this summarizes the quality of your model and we'd like to minimize it and make it as small as possible. And the lowest it can get is zero. And the lower it is, the better off your model is because it's assigning high probabilities to your data.
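
Written out, the chain of equivalent problems looks like this. The notation is an assumption of this sketch: $\theta$ stands for the model parameters, $p_\theta(x_i)$ for the probability the model assigns to the $i$-th bigram, and $n$ for the number of bigrams.

$$
\arg\max_\theta \prod_{i=1}^{n} p_\theta(x_i)
\;=\; \arg\max_\theta \sum_{i=1}^{n} \log p_\theta(x_i)
\;=\; \arg\min_\theta \left( -\sum_{i=1}^{n} \log p_\theta(x_i) \right)
\;=\; \arg\min_\theta \left( -\frac{1}{n} \sum_{i=1}^{n} \log p_\theta(x_i) \right)
$$

The first step holds because log is monotonic, the second because negating flips max to min, and the third because dividing by the positive constant $n$ does not move the minimizer.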

01:00:09Andrej Karpathy

Now let's estimate the probability over the entire training set just to make sure that we get something around 2.4. Let's run this over the entire... oops, let's take out the print statement as well. Okay, 2.45 over the entire training set. Now what I'd like to show you is that you can actually evaluate the probability for any word that you want. Like for example, if we just test a single word "andrej" and bring back the print statement, then you see that "andrej" is actually kind of like an unlikely word. Like on average its negative log probability is about three, and roughly that's because `ej` apparently is very uncommon as an example.

01:00:51Andrej Karpathy

Now think through this. When I take "andrej" and I append `q`, and I test the probability of "andrejq", we actually get infinity. And that's because `jq` has a zero percent probability according to our model. So the log of zero will be negative infinity; we get infinite loss. So this is kind of undesirable, right? Because we plugged in a string that could be like a somewhat reasonable name, but basically what this is saying is that this model is exactly zero percent likely to predict this name. And our loss is infinity on this example. And really the reason for that is that `j` is followed by `q` zero times. Where's `q`? `jq` is zero and so `jq` is zero percent likely.

01:01:42Andrej Karpathy

So it's actually kind of gross and people don't like this too much. To fix this there's a very simple fix that people like to do to sort of like smooth out your model a little bit and it's called model smoothing.

01:01:50Andrej Karpathy

Roughly what's happening is that we will add some fake counts. Imagine adding a count of one to everything. So we add a count of one like this, and then we recalculate the probabilities. That's model smoothing. You can add as much as you like; you can add five, and it will give you a smoother model. The more you add here, the more uniform model you're going to have. The less you add, the more peaked model you are going to have.

01:02:22Andrej Karpathy

So one is a pretty decent count to add, and that will ensure that there will be no zeros in our probability matrix $P$. This will of course change the generations a little bit—in this case it didn't, but in principle it could. But what that's going to do now is that nothing will be infinitely unlikely. So now our model will predict some other probability, and we see that 'jq' now has a very small probability. The model still finds it very surprising that this was a word or a bigram, but we don't get negative infinity. So it's kind of a nice fix that people like to apply sometimes.
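
A minimal sketch of the smoothing fix, on a made-up 2 by 2 counts matrix with one bigram that never occurs.

```python
import torch

N = torch.tensor([[0, 3],
                  [1, 1]], dtype=torch.int32)   # toy counts; entry [0, 0] never occurs

# Without smoothing: the unseen bigram gets probability 0, so its log is -inf.
P_raw = N.float()
P_raw /= P_raw.sum(1, keepdim=True)

# With add-one smoothing: every count is bumped by 1 before normalizing.
P_smooth = (N + 1).float()
P_smooth /= P_smooth.sum(1, keepdim=True)

print(torch.log(P_raw[0, 0]))      # negative infinity
print(torch.log(P_smooth[0, 0]))   # finite, though still quite negative
```

Replacing the `1` with a larger fake count pulls every row closer to the uniform distribution.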

01:02:56Andrej Karpathy

Okay, so we've now trained a respectable bigram character-level language model. We saw that we trained the model by looking at the counts of all the bigrams and normalizing the rows to get probability distributions. We saw that we can also use those parameters of this model to perform sampling of new words—so we sample new names according to those distributions. And we also saw that we can evaluate the quality of this model. The quality of this model is summarized in a single number, which is the negative log likelihood. The lower this number is, the better the model is, because it is giving high probabilities to the actual next characters in all the bigrams in our training set.

01:03:40Andrej Karpathy

So that's all well and good, but we've arrived at this model explicitly by doing something that felt sensible: we were just performing counts and then normalizing those counts. Now what I would like to do is take an alternative approach. We will end up in a very, very similar position, but the approach will look very different because I would like to cast the problem of bigram character-level language modeling into the neural network framework.

01:04:04Andrej Karpathy

In the neural network framework, we're going to approach things slightly differently, but again end up in a very similar spot—I'll go into that later. Now, our neural network is still going to be a bigram character-level language model. So it receives a single character as an input, then there's a neural network with some weights or parameters $W$, and it's going to output the probability distribution over the next character in a sequence. It's going to make guesses as to what is likely to follow this character that was input to the model.

01:04:35Andrej Karpathy

Then, in addition to that, we're going to be able to evaluate any setting of the parameters of the neural net because we have the loss function—the negative log likelihood. So we're going to take a look at its probability distributions and we're going to use the labels, which are basically just the identity of the next character in that bigram—the second character. Knowing what second character actually comes next in the bigram allows us to look at how high of a probability the model assigns to that character. We of course want the probability to be very high, and that is another way of saying that the loss is low.

01:05:10Andrej Karpathy

So we're going to use gradient-based optimization to tune the parameters of this network because we have the loss function and we're going to minimize it. We're going to tune the weights so that the neural net is correctly predicting the probabilities for the next character.

01:05:24Andrej Karpathy

So let's get started. The first thing I want to do is compile the training set of this neural network. Create the training set of all the bigrams. I'm going to copy-paste this code because this code iterates over all the bigrams. So here we start with the words, we iterate over all the bigrams, and previously, as you recall, we did the counts, but now we're not going to do counts. We're just creating a training set.

01:05:56Andrej Karpathy

Now this training set will be made up of two lists: we have the inputs (`xs`) and the targets or labels (`ys`). These bigrams will denote `x` and `y`; those are the characters. So we're given the first character of the bigram, and then we're trying to predict the next one. Both of these are going to be integers. So here we'll take `xs.append(ix1)` and `ys.append(ix2)`. And then here we actually don't want lists of integers; we will create tensors out of these. So:

```python
xs = torch.tensor(xs)
ys = torch.tensor(ys)
```

01:06:42Andrej Karpathy

We don't actually want to take all the words just yet because I want everything to be manageable. So let's just do the first word, which is "emma". And then it's clear what these `xs` and `ys` would be. Here, let me print character 1 and character 2 just so you see what's going on here.

01:07:01Andrej Karpathy

So the bigrams of these characters are: `.` `e`, `e` `m`, `m` `m`, `m` `a`, `a` `.`. This single word, as I mentioned, has one, two, three, four, five examples for our neural network. There are five separate examples in "emma," and those examples are summarized here.

01:07:19Andrej Karpathy

When the input to the neural network is integer 0, the desired label is integer 5, which corresponds to `e`. When the input to the neural network is 5, we want its weights to be arranged so that 13 gets a very high probability. When 13 is put in, we want 13 to have a high probability. When 13 is put in, we also want 1 to have a high probability. When 1 is input, we want 0 to have a very high probability. So there are five separate input examples to a neural net in this dataset.
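Here is a self-contained sketch of building that training set for "emma". It assumes the same character-to-index mapping used earlier in the video (`.` is 0, then `a` to `z` are 1 to 26); the mapping is rebuilt here so the snippet runs on its own:

```python
import string
import torch

# Assumed vocabulary: '.' is index 0, then 'a'..'z' are 1..26.
stoi = {ch: i + 1 for i, ch in enumerate(string.ascii_lowercase)}
stoi['.'] = 0

words = ['emma']  # only the first word for now
xs, ys = [], []
for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        xs.append(stoi[ch1])  # input: first character of the bigram
        ys.append(stoi[ch2])  # label: second character of the bigram

xs = torch.tensor(xs)
ys = torch.tensor(ys)
print(xs.tolist())  # [0, 5, 13, 13, 1]
print(ys.tolist())  # [5, 13, 13, 1, 0]
```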

01:07:54Andrej Karpathy

I want to add a tangent and a note of caution: be careful with a lot of the APIs of some of these frameworks. You saw me silently use `torch.tensor` with a lowercase 't', and the output looked right. But you should be aware that there are actually two ways of constructing a tensor: there's `torch.tensor` (lowercase) and there's also a `torch.Tensor` (capital) class which you can also construct. So you can actually call both. You can also do `torch.Tensor` and you get `xs` and `ys` as well. That's not confusing at all.

01:08:27Andrej Karpathy

There are threads on what is the difference between these two, and unfortunately the docs are just not clear on the difference. When you look at the docs of lowercase tensor, it says "construct tensor with no autograd history by copying data"—it doesn't make sense. The actual difference, as far as I can tell, is explained eventually in this random thread that you can Google. Really it comes down to, I believe, that `torch.tensor` infers dtype (the data type) automatically, while `torch.Tensor` just returns a float tensor.

01:09:04Andrej Karpathy

I would recommend sticking to `torch.tensor` (lowercase). Indeed, we see that when I construct this with a capital 'T', the data type here of `xs` is `float32`. But with `torch.tensor` (lowercase), you see `xs.dtype` is now `int64`. It's advised that you use lowercase 't', and you can read more about it if you like in some of these threads. But basically, I'm pointing out some of these things because I want to caution you and I want you to get used to reading a lot of documentation and reading through a lot of Q&As and threads like this. Some of this stuff is unfortunately not easy and not very well documented, and you have to be careful out there. What we want here is integers because that's what makes sense. So lowercase `tensor` is what we are using.
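A quick check of the difference, as a small sketch (the list of integers is just the "emma" inputs typed in by hand):

```python
import torch

data = [0, 5, 13, 13, 1]

a = torch.tensor(data)  # factory function: infers the dtype from the data
b = torch.Tensor(data)  # class constructor: uses the default float dtype

print(a.dtype)  # torch.int64
print(b.dtype)  # torch.float32, unless the default dtype has been changed
```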

01:10:01Andrej Karpathy

Okay, now we want to think through how we're going to feed in these examples into a neural network. It's not quite as straightforward as plugging it in, because these examples right now are integers. So there's like a 0, 5, or 13; it gives us the index of the character, and you can't just plug an integer index into a neural net. These neural nets are sort of made up of these neurons, and these neurons have weights. As you saw in Micrograd, these weights act multiplicatively on the inputs ($Wx + b$), there's `tanh`'s and so on. So it doesn't really make sense to make an input neuron take on integer values that you feed in and then multiply on with weights.

01:10:41Andrej Karpathy

So instead, a common way of encoding integers is what's called **one-hot encoding**. In one-hot encoding, we take an integer like 13 and we create a vector that is all zeros except for the 13th dimension, which we turn to a one. And then that vector can feed into a neural net. Now conveniently, PyTorch actually has something called the `one_hot` function inside `torch.nn.functional`. It takes a tensor made up of integers (long is an integer) and it also takes a number of classes, which is how large you want your vector to be.

01:11:27Andrej Karpathy

So here, let's import `torch.nn.functional as F`, which is a common way of importing it. And then let's do `F.one_hot` and we feed in the integers that we want to encode, so we can actually feed in the entire array of `xs`. And we can tell it that `num_classes` is 27, so it doesn't have to try to guess it; left to infer the size from the data, it would only go as far as the largest index it sees (13 here) and give us vectors that are too short. So this is the one-hot. Let's call this `xenc` for x encoded.

01:12:02Andrej Karpathy

We see that `xenc.shape` is 5 by 27. We can also visualize it: `plt.imshow(xenc)` to make it a little bit more clear because this is a little messy. So we see that we've encoded all the five examples into vectors. We have five examples, so we have five rows, and each row here is now an example input into a neural net. We see that the appropriate bit is turned on as a one and everything else is zero. For example, the 0th bit is turned on, the 5th bit is turned on, the 13th bits are turned on for both of these examples, and then the 1st bit here is turned on. That's how we can encode integers into vectors, and then these vectors can feed into neural nets.
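As a small sketch of the encoding step (with the "emma" inputs typed in by hand so the snippet stands alone), including what happens when `num_classes` is left out:

```python
import torch
import torch.nn.functional as F

xs = torch.tensor([0, 5, 13, 13, 1])

xenc = F.one_hot(xs, num_classes=27)
print(xenc.shape)                   # torch.Size([5, 27])
print(xenc.argmax(dim=1).tolist())  # [0, 5, 13, 13, 1]: position of the 1 in each row

# Without num_classes, the width is inferred as (largest index + 1).
print(F.one_hot(xs).shape)          # torch.Size([5, 14])
```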

01:12:52Andrej Karpathy

One more issue to be careful with here, by the way: let's look at the data type of the encoding. We always want to be careful with data types. What would you expect `xenc`'s data type to be? When we're plugging numbers into neural nets, we don't want them to be integers; we want them to be floating point numbers that can take on various values. But the dtype here is actually 64-bit integer. The reason for that, I suspect, is that `one_hot` received a 64-bit integer here and it returned the same data type. When you look at the signature of `one_hot`, it doesn't even take a dtype (a desired data type of the output tensor). So, unlike a lot of functions in torch, we can't pass something like `dtype=torch.float32`; `one_hot` does not support that.

01:13:37Andrej Karpathy

So instead, we're going to want to cast this to float by calling `.float()` on the result. Everything looks the same, but the dtype is now `float32`, and floats can feed into neural nets.
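A short sketch of the cast, with the intermediate name `xenc_int` introduced only for illustration:

```python
import torch
import torch.nn.functional as F

xs = torch.tensor([0, 5, 13, 13, 1])

xenc_int = F.one_hot(xs, num_classes=27)  # integer output, same dtype as xs
xenc = xenc_int.float()                   # cast so it can be multiplied by float weights

print(xenc_int.dtype)  # torch.int64
print(xenc.dtype)      # torch.float32
```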

01:13:54Andrej Karpathy

So now let's construct our first neuron. This neuron will look at these input vectors. And as you remember from Micrograd, these neurons basically perform a very simple function $Wx + b$, where $Wx$ is a dot product. Let's first define the weights of this neuron—basically, what are the initial weights at initialization for this neuron? Let's initialize them with `torch.randn`.

01:14:21Andrej Karpathy

`torch.randn` fills a tensor with random numbers drawn from a normal distribution. A normal distribution has a probability density function like this, and so most of the numbers drawn from this distribution will be around 0, but some of them will be as high as almost 3 and so on. Very few numbers will be above 3 in magnitude. So we need to take a size as an input here, and I'm going to use size to be 27 by 1. Let's visualize $W$; so $W$ is a column vector of 27 numbers.

01:15:03Andrej Karpathy

These weights are then multiplied by the inputs. So now to perform this multiplication, we can take `xenc` and we can multiply it with $W$. This is a matrix multiplication operator in PyTorch (`@`). The output of this operation is 5 by 1. The reason is the following: we took `xenc` which is 5 by 27, and we multiplied it by 27 by 1. In matrix multiplication, you see that the output will become 5 by 1 because these 27 will multiply and add.
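A minimal sketch of the single neuron. The seed is arbitrary and only there for repeatability, so the actual weight values will not match the video:

```python
import torch
import torch.nn.functional as F

xs = torch.tensor([0, 5, 13, 13, 1])
xenc = F.one_hot(xs, num_classes=27).float()  # shape (5, 27)

g = torch.Generator().manual_seed(0)   # arbitrary seed (assumption)
W = torch.randn((27, 1), generator=g)  # one neuron with 27 input weights

out = xenc @ W                         # (5, 27) @ (27, 1) -> (5, 1)
print(out.shape)                       # torch.Size([5, 1])

# Because each input row is one-hot, the product just selects one weight per example.
print(torch.allclose(out, W[xs]))      # True
```

The last line is worth noticing: multiplying a one-hot vector by $W$ picks out a single row of $W$.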

01:15:45Andrej Karpathy

Basically, what we're seeing here out of this operation is we are seeing the five activations of this neuron on these five inputs. And we've evaluated all of them in parallel. We didn't feed in just a single input to the single neuron; we fed in simultaneously all the five inputs into the same neuron, and in parallel PyTorch has evaluated the $Wx + b$ (here just $Wx$, there's no bias). It has evaluated $W \cdot x$ for all of them independently.

01:16:21Andrej Karpathy

Now, instead of a single neuron though, I would like to have 27 neurons, and I'll show you in a second why I want 27 neurons. So instead of having just a 1 here, which is indicating the presence of one single neuron, we can use 27. Then when $W$ is 27 by 27, this will in parallel evaluate all the 27 neurons on all the 5 inputs, giving us a much bigger result. So now what we've done is 5 by 27 multiplied by 27 by 27, and the output of this is now 5 by 27.

01:17:03Andrej Karpathy

So what is every element here telling us? It's telling us, for every one of 27 neurons that we created, what is the firing rate of those neurons on every one of those five examples? So the element for example `(3, 13)` is giving us the firing rate of the 13th neuron looking at the 3rd input. And the way this was achieved is by a dot product between the 3rd input and the 13th column of this $W$ matrix here.

01:17:46Andrej Karpathy

Okay, so using matrix multiplication, we can very efficiently evaluate the dot product between lots of input examples in a batch and lots of neurons, where all those neurons have weights in the columns of $W$. In matrix multiplication, we're just doing those dot products in parallel. Just to show you that this is the case, we can take `xenc` and take the row at index 3, and we can take $W$ and take its column at index 13, and then we can do an element-wise multiply and sum that up. That's $Wx + b$ (well, there's no plus $b$, just the $Wx$ dot product) and that's this number. So you see that this is just being done efficiently by the matrix multiplication operation for all the input examples and for all the output neurons of this first layer.
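The check described above can be sketched like this (arbitrary seed, so the values differ from the video; only the equality matters):

```python
import torch
import torch.nn.functional as F

xs = torch.tensor([0, 5, 13, 13, 1])
xenc = F.one_hot(xs, num_classes=27).float()

g = torch.Generator().manual_seed(0)    # arbitrary seed (assumption)
W = torch.randn((27, 27), generator=g)  # 27 neurons, one per column

out = xenc @ W                          # (5, 27) @ (27, 27) -> (5, 27)
print(out.shape)                        # torch.Size([5, 27])

# Element (3, 13): dot product of input row 3 with weight column 13.
manual = (xenc[3] * W[:, 13]).sum()
print(torch.allclose(out[3, 13], manual))  # True
```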

01:18:46Andrej Karpathy

Okay, so we fed our 27-dimensional inputs into a first layer of a neural net that has 27 neurons. So we have 27 inputs and now we have 27 neurons. These neurons perform $W \cdot x$; they don't have a bias and they don't have a non-linearity like `tanh`. We're going to leave them to be a linear layer. In addition to that, we're not going to have any other layers; this is going to be it. It's just going to be the dumbest, smallest, simplest neural net, which is just a single linear layer. And now I'd like to explain what I want those 27 outputs to be.

01:19:21Andrej Karpathy

Intuitively, what we're trying to produce here for every single input example is we're trying to produce some kind of a probability distribution for the next character in a sequence. There's 27 of them, but we have to come up with precise semantics for exactly how we're going to interpret these 27 numbers that these neurons take on.

01:19:39Andrej Karpathy

Now intuitively, you see here that these numbers are negative and some of them are positive, etc., and that's because these are coming out of a neural net layer initialized with these normal distribution parameters. But what we want is we want something like we had here—like each row here told us the counts, and then we normalized the counts to get probabilities. We want something similar to come out of the neural net, but what we have right now is just some negative and positive numbers.

01:20:10Andrej Karpathy

Now we want those numbers to somehow represent the probabilities for the next character. But you see that probabilities have a special structure: they're positive numbers and they sum to one. That doesn't just come out of a neural net. And they can't be counts because counts are positive integers. So counts are also not really a good thing to output from a neural net.

01:20:36Andrej Karpathy

So instead, what the neural net is going to output, and how we are going to interpret the 27 numbers, is that these 27 numbers are giving us **log counts**. Basically, instead of giving us counts directly like in this table, they're giving us log counts. To get the counts, we're going to take the log counts and we're going to exponentiate them.

01:21:01Andrej Karpathy

Now, exponentiation takes the following form: it takes numbers that are negative or positive (the entire real line). If you plug in negative numbers, you get $e^x$ values that are always between zero and one. If you plug in numbers greater than zero, you get numbers greater than one, growing all the way to infinity. So basically, we're going to take these numbers here, and instead of them being positive and negative and all over the place, we're going to interpret them as log counts, and then we're going to element-wise exponentiate these numbers.

01:21:52Andrej Karpathy

Exponentiating them now gives us something like this. You see that these numbers now, because they went through an exponent, all the negative numbers turned into numbers below 1 (like 0.338) and all the positive numbers originally turned into even more positive numbers, greater than one. So exponentiated outputs here basically give us something that we can use and interpret as the equivalent of counts originally. The neural net is kind of now predicting counts, and these counts are positive numbers; they can never be below zero, so that makes sense. And they can now take on various values depending on the settings of $W$.

01:22:54Andrej Karpathy

So let me break this down. We're going to interpret these to be the log counts. Another word for this that is often used is so-called **logits**. These are logits (log counts). Then these will be sort of the counts (logits exponentiated). This is equivalent to the $N$ array that we used previously—remember this was the array of counts. So those are the counts, and now the probabilities are just the counts normalized.

01:23:41Andrej Karpathy

I'm not going to scroll all over the place; we've already done this. We want the counts summed along dimension 1, with `keepdim` set to `True`. We went over this, and this is how we normalize the rows of our counts matrix to get our probabilities (`probs`). So now these are the probabilities. And when I show the probabilities, you see that every row here of course will sum to 1 because they're normalized, and the shape of this is 5 by 27.

01:24:27Andrej Karpathy

Really what we've achieved is for every one of our five examples, we now have a row that came out of a neural net. And because of the transformations here, we made sure that this output of this neural net now are probabilities, or we can interpret them to be probabilities. So our $Wx$ here gave us logits, and then we interpret those to be log counts. We exponentiate to get something that looks like counts, and then we normalize those counts to get a probability distribution. And all of these are differentiable operations.
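The chain from logits to probabilities can be sketched as follows (arbitrary seed; the inputs are the five "emma" examples typed in by hand):

```python
import torch
import torch.nn.functional as F

xs = torch.tensor([0, 5, 13, 13, 1])
xenc = F.one_hot(xs, num_classes=27).float()

g = torch.Generator().manual_seed(0)    # arbitrary seed (assumption)
W = torch.randn((27, 27), generator=g)

logits = xenc @ W                             # interpreted as log counts
counts = logits.exp()                         # strictly positive, like the count matrix N
probs = counts / counts.sum(1, keepdim=True)  # each row normalized to sum to 1

print(probs.shape)   # torch.Size([5, 27])
print(probs.sum(1))  # each entry is 1 up to floating point rounding
```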

01:25:00Andrej Karpathy

So what we've done now is we're taking inputs, we have differentiable operations that we can backpropagate through, and we're getting out probability distributions. So, for example, for the zeroth example that fed in, it corresponded to feeding in this example here. We're feeding in a dot into a neural net. The way we fed the dot into a neural net is that we first got its index, then we one-hot encoded it, then it went into the neural net, and out came this distribution of probabilities. Its shape is 27 numbers, and we're going to interpret this as the neural net's assignment for how likely every one of these 27 characters are to come next.

01:26:01Andrej Karpathy

As we tune the weights $W$, we're going to be of course getting different probabilities out for any character that you input. And so now the question is just: can we optimize and find a good $W$ such that the probabilities coming out are pretty good? And the way we measure "pretty good" is by the loss function.

01:26:17Andrej Karpathy

Okay, so I organized everything into a single summary so that hopefully it's a bit more clear. So it starts here with an input dataset. We have some inputs to the neural net and we have some labels for the correct next character in a sequence; these are integers. Here I'm using torch generators now so that you see the same numbers that I see, and I'm generating the weights of 27 neurons, where each neuron receives 27 inputs.

01:26:48Andrej Karpathy

Then here we're going to plug in all the input examples `xs` into a neural net. So here, this is a **forward pass**. First we have to encode all of the inputs into one-hot representations. So we have 27 classes, we pass in these integers, and `xenc` becomes an array that is 5 by 27 zeros except for a few ones. We then multiply this in the first layer of a neural net to get logits. Exponentiate the logits to get fake counts, sort of. And normalize these counts to get probabilities.

01:27:26Andrej Karpathy

These last two lines, by the way, are called the **Softmax**, which I pulled up here. Softmax is a very often used layer in a neural net that takes these $z$'s (which are logits), exponentiates them, and divides by their sum to normalize. It's a way of taking the outputs of a neural net layer (which can be positive or negative) and turning them into a probability distribution: positive numbers that always sum to one. It's kind of like a normalization function if you want to think of it that way. You can put it on top of any other linear layer inside a neural net, and it basically makes a neural net output probabilities. That's very often used, and we used it as well here.
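Written out, the softmax of a logit vector $z$ is $\text{softmax}(z)_i = e^{z_i} / \sum_j e^{z_j}$. As a sanity check, here is a sketch comparing the two manual lines against PyTorch's built-in `torch.softmax`; the random logits are just stand-ins for the output of the linear layer:

```python
import torch

g = torch.Generator().manual_seed(0)      # arbitrary seed (assumption)
logits = torch.randn((5, 27), generator=g)

# Manual softmax: exponentiate, then normalize each row.
counts = logits.exp()
probs_manual = counts / counts.sum(1, keepdim=True)

# Built-in softmax over the same dimension.
probs_builtin = torch.softmax(logits, dim=1)

print(torch.allclose(probs_manual, probs_builtin))  # True
```

The built-in version is generally the safer choice for large logits, since exponentiating them directly can overflow.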

01:28:13Andrej Karpathy

So this is the forward pass, and that's how we made a neural net output probabilities. Now you'll notice that this entire forward pass is made up of differentiable layers. Everything here we can backpropagate through. We saw some of the backpropagation in Micrograd. This is just multiplication and addition; all that's happening here is just multiply and then add, and we know how to backpropagate through them. Exponentiation we know how to backpropagate through, and then here we are summing, and sum is easily backpropagable as well. And division as well. So everything here is a differentiable operation that we can backpropagate through.

01:28:57Andrej Karpathy

Now we achieve these probabilities which are 5 by 27. For every single example, we have a vector of probabilities that sums to one. And then here I wrote a bunch of stuff to sort of break down the examples. So we have five examples making up "emma," and there are five bigrams inside "emma." Bigram example 1 is that `e` is the beginning character right after `.`. The indexes for these are 0 and 5.

01:29:31Andrej Karpathy

So then we feed in a 0—that's the input of the neural net. We get probabilities from the neural net that are 27 numbers. And then the label is 5 because `e` actually comes after `.`. So that's the label. And then we use this label 5 to index into the probability distribution here. So this index 5 here is 0, 1, 2, 3, 4, 5—it's this number here. That's basically the probability assigned by the neural net to the actual correct character.

01:30:10Andrej Karpathy

You see that the network currently thinks that this next character, that `e` following `.`, is only one percent likely, which is of course not very good, right? Because this actually is a training example and the network thinks this is currently very, very unlikely. But that's just because we didn't get very lucky in generating a good setting of $W$. So right now this network thinks it is unlikely, and 0.01 is not a good outcome. So the log likelihood then is very negative, and the negative log likelihood is very positive. So 4 is a very high negative log likelihood, and that means we're going to have a high loss, because what is the loss? The loss is just the average negative log likelihood.

01:30:51Andrej Karpathy

The second character is `m`. And you see here that also the network thought that `m` following `e` is very unlikely—one percent. For `m` following `m`, it thought it was two percent. And for `a` following `m`, it actually thought it was seven percent likely. So just by chance this one actually has a pretty good probability and therefore pretty low negative log likelihood. Finally here it thought this was one percent likely. So overall our average negative log likelihood—which is the loss, the total loss that summarizes how well this network currently works (at least on this one word, not on the full data)—is 3.76. Which is actually a fairly high loss; this is not a very good setting of $W$'s.
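A sketch of this per-example breakdown is below. The seed value is an assumption, and any fixed seed works; the printed probabilities depend on it, so they may not match the numbers quoted above:

```python
import torch
import torch.nn.functional as F

itos = {0: '.', 1: 'a', 5: 'e', 13: 'm'}  # only the characters that occur in "emma"
xs = torch.tensor([0, 5, 13, 13, 1])
ys = torch.tensor([5, 13, 13, 1, 0])

g = torch.Generator().manual_seed(2147483647)  # assumed seed
W = torch.randn((27, 27), generator=g)
probs = torch.softmax(F.one_hot(xs, num_classes=27).float() @ W, dim=1)

nlls = torch.zeros(5)
for i in range(5):
    x, y = xs[i].item(), ys[i].item()
    p = probs[i, y]           # probability the net assigns to the true next character
    nlls[i] = -torch.log(p)   # negative log likelihood for this bigram
    print(f"{itos[x]}{itos[y]}: p={p.item():.4f} nll={nlls[i].item():.4f}")

print("average nll (the loss):", nlls.mean().item())
```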

01:31:36Andrej Karpathy

Now here's what we can do. We're currently getting 3.76. We can actually come here and we can change our $W$, we can resample it. So let me just add one to have a different seed, and then we get a different $W$. Then we can rerun this. And with this different seed, with this different setting of $W$'s, we now get 3.37. So this is a much better $W$, right? It's better because the probabilities just happen to come out higher for the characters that actually are next.

01:32:08Andrej Karpathy

And so you can imagine actually just resampling this. We can try two... okay, this was not very good. Let's try one more... we can try three... okay, this was a terrible setting because we have a very high loss. So anyway, I'm going to erase this. What I'm doing here, which is just guess and check of randomly assigning parameters and seeing if the network is good: that is amateur hour. That's not how you optimize a neural net.

01:32:38Andrej Karpathy

The way you optimize your neural net is you start with some random guess, and we're going to commit to this one even though it's not very good. But now the big deal is we have a loss function. So this loss is made up only of differentiable operations, and we can minimize the loss by tuning $W$'s—by computing the gradients of the loss with respect to these $W$ matrices. And so then we can tune $W$ to minimize the loss and find a good setting of $W$ using gradient-based optimization.

01:33:11Andrej Karpathy

So let's see how that will work. Now things are actually going to look almost identical to what we had with Micrograd. So here I pulled up the lecture from Micrograd, the notebook—it's from this repository. When I scroll all the way to the end where we left off with Micrograd, we had something very, very similar. We had a number of input examples (in this case we had four input examples inside `xs`) and we had their targets. Just like here we have our `xs` now, but we have five of them and they're now integers instead of vectors. But we're going to convert our integers to vectors except our vectors will be 27 large instead of three large.

01:33:51Andrej Karpathy

Then here what we did is first we did a forward pass where we ran a neural net on all of the inputs to get predictions. Our neural net at the time, this `n(x)`, was a multi-layer perceptron. Our neural net is going to look different because our neural net is just a single layer—single linear layer followed by a Softmax. So that's our neural net. And the loss here was the mean squared error. So we simply subtracted the prediction from the ground truth and squared it and summed it all up, and that was the loss. Loss was the single number that summarized the quality of the neural net. When loss is low, like almost zero, that means the neural net is predicting correctly.

01:34:36Andrej Karpathy

We had a single number that summarized the performance of the neural net and everything here was differentiable and was stored in a massive compute graph. And then we iterated over all the parameters, we made sure that the gradients are set to zero, and we called `loss.backward()`. `loss.backward()` initiated backpropagation at the final output node of loss. Remember these expressions? We had loss all the way at the end, we start backpropagation, and we went all the way back and we made sure that we populated all the `parameters.grad`. So that `grad` started at zero but backpropagation filled it in.

01:35:14Andrej Karpathy

And then in the update, we iterated over all the parameters and we simply did a parameter update where every single element of our parameters was nudged in the opposite direction of the gradient. So we're going to do the exact same thing here. So I'm going to pull this up on the side here so that we have it available, and we're actually going to do the exact same thing.

01:35:43Andrej Karpathy

So this was the forward pass where we did this and `probs` is our `ypred`. So now we have to evaluate the loss. But we're not using the mean squared error; we're using the negative log likelihood because we are doing classification, we're not doing regression as it's called. So here we want to calculate loss. Now the way we calculate it is it's just this average negative log likelihood.

01:36:07Andrej Karpathy

Now this `probs` here has a shape of 5 by 27. We basically want to pluck out the probabilities at the correct indices here. In particular, because the labels are stored here in the array `ys`, what we're after is: for the first example (row index 0), the probability at index 5; for the second example (row index 1), the probability assigned to index 13; for the third example (row index 2), also 13; at row index 3 we want 1; and at the last row, which is index 4, we want 0. So these are the probabilities we're interested in, right? And you can see that they're not amazing, as we saw above.

01:37:00Andrej Karpathy

So these are the probabilities we want, but we want a more efficient way to access these probabilities, not just listing them out in a tuple like this. So it turns out that the way to do this in PyTorch—one of the ways at least—is we can basically pass in all of these integers in the vectors. So these ones, you see how they're just 0, 1, 2, 3, 4? We can actually create that using `torch.arange(5)`: 0, 1, 2, 3, 4. So we can index here with `torch.arange(5)` and here we index with `ys`.

01:37:41Andrej Karpathy

And you see that that gives us exactly these numbers. So that plucks out the probabilities that the neural network assigns to the correct next character. Now we take those probabilities and we actually look at the log probability, so we want `.log()`. And then we want to just average that up, so take the mean of all of that. And then it's the negative average log likelihood that is the loss.

01:38:14Andrej Karpathy

So the loss here is 3.7 something. And you see that this loss, 3.76, is exactly as we've obtained before, but this is a vectorized form of that expression. So we get the same loss, and the same loss we can consider as part of this forward pass.
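A sketch showing that the vectorized indexing and an explicit loop give the same loss. The seed is arbitrary, so the loss value itself will generally differ from the one in the video:

```python
import torch
import torch.nn.functional as F

xs = torch.tensor([0, 5, 13, 13, 1])
ys = torch.tensor([5, 13, 13, 1, 0])

g = torch.Generator().manual_seed(0)    # arbitrary seed (assumption)
W = torch.randn((27, 27), generator=g)
probs = torch.softmax(F.one_hot(xs, num_classes=27).float() @ W, dim=1)

# Vectorized: row indices 0..4 paired with the label for each row.
picked = probs[torch.arange(5), ys]
loss = -picked.log().mean()

# The same quantity computed one example at a time.
loss_loop = -sum(torch.log(probs[i, ys[i]]) for i in range(5)) / 5

print(loss.item(), loss_loop.item())
```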

01:38:36Andrej Karpathy

Okay, so we made our way all the way to loss. We've defined the forward pass, we forwarded the network and the loss. Now we're ready to do the **backward pass**. So, backward pass. We want to first make sure that all the gradients are reset so they're at zero. Now in PyTorch, you can set the gradients to be zero, but you can also just set it to `None`. Setting it to `None` is more efficient, and PyTorch will interpret `None` as a lack of a gradient, which it treats the same as zeros. So this is a way to zero the gradient: `W.grad = None`.

01:39:10Andrej Karpathy

And now we do `loss.backward()`. Before we do `loss.backward()`, we need one more thing. If you remember from Micrograd, PyTorch actually requires that we pass in `requires_grad=True` when we create $W$, so that we tell PyTorch that we are interested in calculating gradients for this leaf tensor. By default, this is false. So let me recalculate with that, and then set the gradient to `None` and call `loss.backward()`.

01:39:40Andrej Karpathy

Now something magical happened when `loss.backward()` was run. Because PyTorch, just like Micrograd, when we did the forward pass here, it keeps track of all the operations. Under the hood, it builds a full computational graph, just like the graphs we've produced in Micrograd. Those graphs exist inside PyTorch, and so it knows all the dependencies and all the mathematical operations of everything. And when you then calculate the loss, we can call a `.backward()` on it, and that backward then fills in the gradients of all the intermediates all the way back to $W$'s, which are the parameters of our neural net.

01:40:20Andrej Karpathy

So now we can look at `W.grad` and we see that it has structure; there's stuff inside it. The shape of $W$ is 27 by 27, and the shape of `W.grad` is the same 27 by 27. And every element of `W.grad` is telling us the influence of that weight on the loss function. So for example, take this number all the way here, the `(0,0)` element of `W.grad`. Because this gradient is positive, it's telling us that $W_{0,0}$ has a positive influence on the loss: slightly nudging it, taking $W_{0,0}$ and adding a small $h$ to it, would increase the loss mildly. Some of these gradients are also negative.
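
One way to check this reading of the gradient is a finite-difference test: nudge a single weight by a small $h$ and compare the change in loss with `W.grad`. This is a sketch on the "emma" example; the helper names `nll` and `numeric_grad` and the use of `float64` are my own choices, not from the lecture.

```python
import torch
import torch.nn.functional as F

xs = torch.tensor([0, 5, 13, 13, 1])   # . e m m a
ys = torch.tensor([5, 13, 13, 1, 0])   # e m m a .
xenc = F.one_hot(xs, num_classes=27).double()

def nll(W):
    """Average negative log likelihood of the five ".emma." bigrams."""
    counts = (xenc @ W).exp()
    probs = counts / counts.sum(1, keepdim=True)
    return -probs[torch.arange(len(xs)), ys].log().mean()

g = torch.Generator().manual_seed(2147483647)
# float64 keeps the finite-difference estimate accurate (hypothetical choice).
W = torch.randn((27, 27), generator=g, dtype=torch.float64, requires_grad=True)

loss = nll(W)
loss.backward()

def numeric_grad(i, j, h=1e-6):
    """Estimate d(loss)/dW[i, j] by adding h to that single weight."""
    with torch.no_grad():
        W_nudged = W.clone()
        W_nudged[i, j] += h
        return ((nll(W_nudged) - loss) / h).item()

# (0, 0): '.' followed by '.', never the target here, so the gradient is positive.
# (0, 5): '.' followed by 'e', the correct target, so the gradient is negative.
for i, j in [(0, 0), (0, 5)]:
    print((i, j), W.grad[i, j].item(), numeric_grad(i, j))
```

The analytic and numeric values should match closely, and the sign tells you which direction of nudge raises the loss.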

01:41:20Andrej Karpathy

So that's telling us about the gradient information, and we can use this gradient information to update the weights of this neural network. So let's now do the update. It's going to be very similar to what we had in Micrograd. We need no loop over all the parameters because we only have one parameter tensor and that is $W$. So we simply do:

```python
W.data += -0.1 * W.grad
```

And that would be the update to the tensor. So that updates the tensor. And because the tensor is updated, we would expect that now the loss should decrease. So here if I print `loss.item()`, it was 3.76, right? So we've updated the $W$ here, so if I recalculate the forward pass, loss now should be slightly lower. So 3.76 goes to 3.74.

01:42:25Andrej Karpathy

And then we can again set grad to `None` and backward, update. And now the parameters changed again, so if we recalculate the forward pass, we expect a lower loss again: 3.72. Okay, and this is again doing the... we're now doing gradient descent. And when we achieve a low loss, that will mean that the network is assigning high probabilities to the correct next characters.

01:42:56Andrej Karpathy

Okay, so I rearranged everything and I put it all together from scratch. So here is where we construct our data set of bigrams. You see that we are still iterating only on the first word "emma". I'm going to change that in a second. I added a number that counts the number of elements in `xs` so that we explicitly see that number of examples is five because currently we're just working with "emma" and there's five bigrams there. And here I added a loop of exactly what we had before. So we had 10 iterations of gradient descent of forward pass, backward pass, and an update.
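
The rearranged cells are not reproduced in this transcript, so the following is a reconstruction under assumptions: the same "emma" bigrams, the same seed, a step size of 0.1, and 10 iterations, with a `losses` list added so the trend is easy to inspect.

```python
import torch
import torch.nn.functional as F

# Dataset: the five bigrams of ".emma." ('.'=0, 'a'=1, ..., 'z'=26).
xs = torch.tensor([0, 5, 13, 13, 1])
ys = torch.tensor([5, 13, 13, 1, 0])
num = xs.nelement()
print("number of examples:", num)

# Initialization: 27 neurons, each receiving 27 inputs.
g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27, 27), generator=g, requires_grad=True)

# Gradient descent
losses = []
for k in range(10):
    # forward pass
    xenc = F.one_hot(xs, num_classes=27).float()
    logits = xenc @ W                              # log-counts
    counts = logits.exp()
    probs = counts / counts.sum(1, keepdim=True)   # softmax
    loss = -probs[torch.arange(num), ys].log().mean()
    losses.append(loss.item())

    # backward pass
    W.grad = None
    loss.backward()

    # update
    W.data += -0.1 * W.grad

print(losses[0], losses[-1])
```

Nothing in the loop depends on `num` being five, which is why the same code runs unchanged on the full set of bigrams.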

01:43:30Andrej Karpathy

So running these two cells, initialization and gradient descent, gives us some improvement on the loss function. But now I want to use all the words. And there's not 5, but 228,000 bigrams. Now however, this should require no modification whatsoever; everything should just run because all the code we wrote doesn't care if there's five bigrams or 228,000 bigrams, and everything should just work.

01:43:58Andrej Karpathy

So you see that this will just run, but now we are optimizing over the entire training set of all the bigrams. And you see now that we are decreasing very slightly, so actually we can probably afford a larger learning rate. And probably go for even larger learning rate... even 50 seems to work on this very very simple example. Right, so let me re-initialize and let's run 100 iterations. See what happens.

01:44:36Andrej Karpathy

Okay, we seem to be coming up to some pretty good losses here, 2.47. Let me run 100 more. What is the number that we expect by the way in the loss? We expect to get something around what we had originally actually. So all the way back, if you remember in the beginning of this video when we optimized just by counting, our loss was roughly 2.47 after we had added smoothing (before smoothing we had roughly 2.45 likelihood—sorry, loss). And so that's actually roughly the vicinity of what we expect to achieve. But before we achieved it by counting, and here we are achieving roughly the same result but with gradient-based optimization.

01:45:20Andrej Karpathy

So we come to about 2.46, 2.45, etc. And that makes sense because fundamentally we're not taking any additional information; we're still just taking in the previous character and trying to predict the next one. But instead of doing it explicitly by counting and normalizing, we are doing it with gradient-based learning. And it just so happens that the explicit approach happens to very well optimize the loss function without any need for a gradient-based optimization, because the setup for bigram language models is so straightforward, so simple, we can just afford to estimate those probabilities directly and maintain them in a table.
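
As a rough illustration of the two routes landing in the same place, here is a sketch that fits both models on a tiny stand-in word list (the real dataset is not bundled here, so the five words, the step size, and the iteration count are all assumptions). The counting model is left unsmoothed, so it is the exact minimizer of the loss and the gradient-based loss can only approach it from above.

```python
import torch
import torch.nn.functional as F

words = ["emma", "olivia", "ava", "isabella", "sophia"]  # toy stand-in list
chars = ".abcdefghijklmnopqrstuvwxyz"
stoi = {ch: i for i, ch in enumerate(chars)}

# Build the bigram dataset.
xs, ys = [], []
for w in words:
    chs = ["."] + list(w) + ["."]
    for ch1, ch2 in zip(chs, chs[1:]):
        xs.append(stoi[ch1])
        ys.append(stoi[ch2])
xs, ys = torch.tensor(xs), torch.tensor(ys)
num = xs.nelement()

# Approach 1: count bigrams and normalize each row (no smoothing).
N = torch.zeros((27, 27))
for x, y in zip(xs.tolist(), ys.tolist()):
    N[x, y] += 1
P = N / N.sum(1, keepdim=True).clamp(min=1)  # clamp avoids 0/0 in unseen rows
loss_count = -P[xs, ys].log().mean().item()

# Approach 2: gradient descent on W with the same loss.
g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27, 27), generator=g, requires_grad=True)
xenc = F.one_hot(xs, num_classes=27).float()
losses = []
for k in range(500):
    counts = (xenc @ W).exp()
    probs = counts / counts.sum(1, keepdim=True)
    loss = -probs[torch.arange(num), ys].log().mean()
    losses.append(loss.item())
    W.grad = None
    loss.backward()
    W.data += -5.0 * W.grad  # step size picked for this toy set

print(f"counting: {loss_count:.4f}  gradient descent: {losses[-1]:.4f}")
```

On the full dataset the lecture reports both approaches ending up around 2.45 to 2.47; on a toy list like this the absolute numbers will differ, but the ordering (counting at or below gradient descent) should hold.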

01:45:58Andrej Karpathy

But the gradient-based approach is significantly more flexible. So we've actually gained a lot because what we can do now is we can expand this approach and complexify the neural net. So currently we're just taking a single character and feeding into a neural net and the neural net is extremely simple, but we're about to iterate on this substantially. We're going to be taking multiple previous characters and we're going to be feeding them into increasingly more complex neural nets. But fundamentally, the output of the neural net will always just be logits, and those logits will go through the exact same transformation: we are going to take them through a Softmax, calculate the loss function (the negative log likelihood), and do gradient-based optimization.

01:46:44Andrej Karpathy

Actually, as we complexify the neural nets and work all the way up to Transformers, none of this will really fundamentally change. The only thing that will change is the way we do the forward pass, where we take in some previous characters and calculate logits for the next character in the sequence; that will become more complex. But we'll use the same machinery to optimize it.

01:47:10Andrej Karpathy

It's not obvious how we would have extended this bigram approach into the case where there are many more characters at the input because eventually these tables would get way too large because there's way too many combinations of what previous characters could be. If you only have one previous character, we can just keep everything in a table that counts. But if you have the last 10 characters that are input, we can't actually keep everything in the table anymore. So this is fundamentally an unscalable approach, and the neural network approach is significantly more scalable and it's something that actually we can improve on over time. So that's where we will be digging next.
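
A quick back-of-the-envelope sketch of why the table stops scaling: with a 27-character vocabulary, a count table needs one row per possible context and 27 columns. The helper `table_entries` is a hypothetical name used only for this illustration.

```python
def table_entries(context_len, vocab_size=27):
    """Number of cells in a count table with one row per possible context."""
    rows = vocab_size ** context_len      # every combination of previous characters
    return rows * vocab_size              # one column per possible next character

for context_len in (1, 2, 3, 10):
    print(context_len, table_entries(context_len))
```

The bigram case (`context_len=1`) gives the familiar 27 by 27 table, while each extra character of context multiplies the table size by 27.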

01:47:48Andrej Karpathy

I wanted to point out two more things. Number one: I want you to notice that this `xenc` here, this is made up of one-hot vectors, and then those one-hot vectors are multiplied by this $W$ matrix. We think of this as multiple neurons being forwarded in a fully connected manner, but actually what's happening here is that, for example, if you have a one-hot vector here that has a one at say the fifth dimension, then because of the way the matrix multiplication works, multiplying that one-hot vector with $W$ actually ends up plucking out the fifth row of $W$. `logits` would become just the fifth row of $W$. And that's because of the way the matrix multiplication works.
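
This row-plucking behaviour is easy to verify directly. A small sketch, with a random `W` and the index 5 chosen arbitrarily:

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27, 27), generator=g)

ix = 5
xenc = F.one_hot(torch.tensor([ix]), num_classes=27).float()  # shape (1, 27)
logits = xenc @ W                                              # shape (1, 27)

# Multiplying by a one-hot vector selects a single row of W.
print(torch.allclose(logits[0], W[ix]))
```

So the matrix multiply here is a lookup in disguise: `xenc @ W` and `W[ix]` carry the same numbers.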

01:48:42Andrej Karpathy

But that's actually exactly what happened before. Because remember, all the way up here we had a bigram, we took the first character, and then that first character indexed into a row of this array here, and that row gave us the probability distribution for the next character. So the first character was used as a lookup into a matrix here to get the probability distribution. Well, that's actually exactly what's happening here because we're taking the index, we're encoding it as one-hot, and multiplying it by $W$. So logits literally becomes the appropriate row of $W$. And that gets, just as before, exponentiated to create the counts and then normalized and becomes probability.

01:49:27Andrej Karpathy

So this $W$ here is literally the same as this array here, but $W$ remember is the log counts, not the counts. So it's more precise to say that $W$ exponentiated is this array. But this array was filled in by counting and by basically populating the counts of bigrams, whereas in the gradient-based framework we initialize it randomly and then we let the loss guide us to arrive at the exact same array. So this array exactly here is basically the array $W$ at the end of optimization, except we arrived at it piece by piece by following the loss. And that's why we also obtain the same loss function at the end.

01:50:20Andrej Karpathy

And the second note is: remember the smoothing where we added fake counts to our counts in order to smooth out and make more uniform the distributions of these probabilities? And that prevented us from assigning zero probability to any one bigram. Now if I increase the count here, what's happening to the probability? As I increase the count, probability becomes more and more uniform. Right, because these counts go only up to like 900 or whatever, so if I'm adding plus a million to every single number here, you can see how the row and its probability when we divide is just going to become more and more close to exactly even probability—uniform distribution.
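
To see the effect numerically, here is a sketch with a made-up row of counts (values 0 to 780, not taken from the real table) and increasing fake counts. The helper `smoothed_probs` is a hypothetical name.

```python
import torch

# Made-up counts for one row of the bigram table: 0, 30, 60, ..., 780.
row = torch.arange(27, dtype=torch.float32) * 30

def smoothed_probs(counts, fake_count):
    """Add the same fake count to every entry, then normalize the row."""
    smoothed = counts + fake_count
    return smoothed / smoothed.sum()

uniform = 1.0 / 27
for fake_count in (1, 1000, 1_000_000):
    p = smoothed_probs(row, fake_count)
    # Largest gap between any probability in the row and the uniform value.
    print(fake_count, (p - uniform).abs().max().item())
```

The printed gap shrinks as the fake count grows: with a small fake count the data dominates, and with a huge one every entry is pulled toward 1/27.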

01:51:05Andrej Karpathy

It turns out that the gradient-based framework has an equivalent to smoothing. In particular, think through these $W$'s here which we initialized randomly. We could also think about initializing $W$'s to be zero. If all the entries of $W$ are zero, then you'll see that logits will become all zero, and then exponentiating those logits becomes all one, and then the probabilities turn out to be exactly uniform. So basically when $W$'s are all equal to each other, or say especially zero, then the probabilities come out completely uniform.
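
A short sketch confirming this: with `W` all zeros (or any constant), every row of probabilities comes out uniform. The three input indices are arbitrary.

```python
import torch
import torch.nn.functional as F

def next_char_probs(W, xs):
    """Forward pass of the bigram net: one-hot -> logits -> counts -> probs."""
    xenc = F.one_hot(xs, num_classes=27).float()
    counts = (xenc @ W).exp()
    return counts / counts.sum(1, keepdim=True)

xs = torch.tensor([0, 5, 13])

probs_zero = next_char_probs(torch.zeros((27, 27)), xs)       # all logits are 0
probs_const = next_char_probs(torch.full((27, 27), 3.0), xs)  # all logits equal

print(probs_zero[0, :3], probs_const[0, :3])  # each entry is 1/27
```

Only differences between logits in a row matter after normalization, which is why any constant `W` behaves like zero.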

01:51:45Andrej Karpathy

So trying to incentivize $W$ to be near zero is basically equivalent to the count smoothing from before, and the more you incentivize that in the loss function, the more smooth distribution you're going to achieve. So this brings us to something that's called **regularization**, where we can actually augment the loss function to have a small component that we call a regularization loss.

01:52:08Andrej Karpathy

In particular, what we're going to do is we can take $W$ and we can for example square all of its entries, and then we can sum them. Because we're squaring, there will be no signs anymore—negatives and positives all get squashed to be positive numbers. And then the way this works is you achieve zero loss if $W$ is exactly zero, but if $W$ has non-zero numbers, you accumulate loss.

01:52:41Andrej Karpathy

So we can actually take this and we can add it on here. So we can do something like `loss + (W**2).mean()`. Let's actually take a mean instead of a sum, because otherwise the sum gets too large. So mean is like a little bit more manageable. And then we scale this regularization loss by, say, `0.01` or something like that, giving `loss + 0.01 * (W**2).mean()`; you can choose the regularization strength.
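
Written out on the "emma" example, the regularized loss might look like the sketch below. Note the parentheses around `W**2`: without them Python would try to call `.mean()` on the literal `2`. The variable names `data_loss`, `reg_loss`, and `reg_strength` are mine.

```python
import torch
import torch.nn.functional as F

xs = torch.tensor([0, 5, 13, 13, 1])   # . e m m a
ys = torch.tensor([5, 13, 13, 1, 0])   # e m m a .
num = xs.nelement()

g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27, 27), generator=g, requires_grad=True)

xenc = F.one_hot(xs, num_classes=27).float()
counts = (xenc @ W).exp()
probs = counts / counts.sum(1, keepdim=True)

reg_strength = 0.01                                      # tunable
data_loss = -probs[torch.arange(num), ys].log().mean()   # negative log likelihood
reg_loss = reg_strength * (W**2).mean()                  # zero only when W is all zeros
loss = data_loss + reg_loss

print(data_loss.item(), reg_loss.item(), loss.item())
```

Calling `loss.backward()` on this total sends gradient into `W` from both terms at once.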

01:53:11Andrej Karpathy

And then we can just optimize this. And now this optimization actually has two components: not only is it trying to make all the probabilities work out, but in addition to that, there's an additional component that simultaneously tries to make all $W$'s be zero, because if $W$'s are non-zero you feel a loss. And so minimizing this, the only way to achieve that is for $W$ to be zero. So you can think of this as adding like a spring force or like a gravity force that pushes $W$ to be zero. So $W$ wants to be zero and the probabilities want to be uniform, but they also simultaneously want to match up your probabilities as indicated by the data.

01:53:47Andrej Karpathy

And so the strength of this regularization is exactly controlling the amount of counts that you add here. Adding a lot more counts here corresponds to increasing this number, because the more you increase it, the more this part of the loss function dominates this part, and the more these weights will be unable to grow, because as they grow they accumulate way too much loss. And so if this is strong enough, then we are not able to overcome the force of this loss and basically everything will be uniform predictions. So I thought that's kind of cool.
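
To watch the regularization strength act like the fake count, here is a sketch that trains the same net twice on the "emma" bigrams, once with a weak and once with a very strong penalty. The strengths (0.01 and 100), the step size, and the helper names `train` and `max_gap_from_uniform` are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

xs = torch.tensor([0, 5, 13, 13, 1])   # . e m m a
ys = torch.tensor([5, 13, 13, 1, 0])   # e m m a .
num = xs.nelement()
xenc = F.one_hot(xs, num_classes=27).float()

def train(reg_strength, steps=300, lr=1.0):
    """Gradient descent on NLL + reg_strength * mean(W^2), from a fixed init."""
    g = torch.Generator().manual_seed(2147483647)
    W = torch.randn((27, 27), generator=g, requires_grad=True)
    for _ in range(steps):
        counts = (xenc @ W).exp()
        probs = counts / counts.sum(1, keepdim=True)
        loss = -probs[torch.arange(num), ys].log().mean() + reg_strength * (W**2).mean()
        W.grad = None
        loss.backward()
        W.data += -lr * W.grad
    return W.detach()

def max_gap_from_uniform(W):
    """Largest distance of any predicted probability from 1/27."""
    probs = torch.softmax(W, dim=1)   # same as exponentiate, then normalize rows
    return (probs - 1.0 / 27).abs().max().item()

W_weak = train(0.01)
W_strong = train(100.0)

print("mean |W|:", W_weak.abs().mean().item(), W_strong.abs().mean().item())
print("gap from uniform:", max_gap_from_uniform(W_weak), max_gap_from_uniform(W_strong))
```

With the strong penalty the weights stay close to zero and the predictions stay close to uniform, mirroring what a very large fake count does to the table.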

01:54:30Andrej Karpathy

Okay, and lastly before we wrap up, I wanted to show you how you would sample from this neural net model. And I copy-pasted the sampling code from before, where remember that we sampled five times, and all we did, we started at zero, we grabbed the current `ix` row of $P$ and that was our probability row from which we sampled the next index and just accumulated that and break when zero. Running this gave us these results. Still have the $P$ in memory, so this is fine.

01:55:09Andrej Karpathy

Now the `p` doesn't come from the row of $P$; instead it comes from this neural net. First we take `ix` and we encode it into a one-hot row of `xenc`. This `xenc` multiplies our $W$, which really just plucks out the row of $W$ corresponding to `ix`, really that's what's happening. And that gets our logits. And then we exponentiate those logits to get counts, and then normalize to get the distribution, and then we can sample from the distribution.
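
The sampling loop described here can be sketched as follows. The `W` below is randomly initialized rather than trained, so the output is gibberish; the `max_len` guard and the helper name `sample_name` are my additions, not part of the lecture's code.

```python
import torch
import torch.nn.functional as F

chars = ".abcdefghijklmnopqrstuvwxyz"
itos = {i: ch for i, ch in enumerate(chars)}

g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27, 27), generator=g)  # stand-in for a trained W

def sample_name(W, g, max_len=50):
    """Sample characters one at a time until the end token '.' (index 0)."""
    out = []
    ix = 0  # start token
    while len(out) < max_len:
        # The distribution p comes from the neural net, not from a table row.
        xenc = F.one_hot(torch.tensor([ix]), num_classes=27).float()
        logits = xenc @ W                         # plucks out row ix of W
        counts = logits.exp()
        p = counts / counts.sum(1, keepdim=True)
        ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
        if ix == 0:
            break
        out.append(itos[ix])
    return "".join(out)

names = [sample_name(W, g) for _ in range(5)]
print(names)
```

Swapping in a trained `W` is the only change needed to get name-like samples, since the loop never looks at how `W` was obtained.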

01:55:40Andrej Karpathy

So if I run this... kind of anticlimactic, or climactic depending on how you look at it, but we get the exact same result. And that's because this is an identical model. Not only does it achieve the same loss, but as I mentioned, these are identical models, and this $W$ is the log counts of what we've estimated before, but we came to this answer in a very different way. And it's got a very different interpretation, but fundamentally this is basically the same model and gives the same samples here. And so that's kind of cool.

01:56:16Andrej Karpathy

Okay, so we've actually covered a lot of ground. We introduced the bigram character-level language model. We saw how we can train the model, how we can sample from the model, and how we can evaluate the quality of the model using the negative log likelihood loss. And then we actually trained the model in two completely different ways that actually get the same result and the same model.

01:56:36Andrej Karpathy

In the first way, we just counted up the frequency of all the bigrams and normalized. In the second way, we used the negative log likelihood loss as a guide to optimizing the counts matrix (or the counts array) so that the loss is minimized in a gradient-based framework. And we saw that both of them give the same result. And that's it.

01:57:01Andrej Karpathy

Now the second one of these, the gradient-based framework, is much more flexible. And right now our neural network is super simple; we're taking a single previous character and we're taking it through a single linear layer to calculate the logits. This is about to complexify. So in the follow-up videos, we're going to be taking more and more of these characters and we're going to be feeding them into a neural net, but this neural net will still output the exact same thing: the neural net will output logits, and these logits will still be normalized in the exact same way, and all the loss and everything else and the gradient-based framework, everything stays identical. It's just that this neural net will now complexify all the way to Transformers. So that's gonna be pretty awesome and I'm looking forward to it. For now, bye.
