Stanford CS336 Language Modeling from Scratch | Spring 2025 | Lecture 1: Overview and Tokenization
Watch the original video here: https://www.youtube.com/watch?v=SQ3fZ1sAqXI
Welcome everyone. Um, this is CS 336, Language Models from Scratch. And this is our core staff. So I'm Percy, one of your instructors. Um, I'm really excited about this class because it really allows you to see the whole language model building pipeline end-to-end, including data, systems, and modeling. Um, Tatsu and I will be co-instructing. So, I'll let everyone introduce themselves.
Hi everyone. I'm Tatsu. I'm one of the co-instructors. I'll be giving lectures in, you know, a week or two, probably a few weeks. Um, I'm really excited about this class. Percy and I spent a while being a little disgruntled, thinking, what's the really deep technical stuff that we can teach our students today? And I think one of the answers is that you've got to build it from scratch to understand it. So I'm hoping that's the ethos that you take away from this class.
Uh, hey everyone. I'm Roit. Um, I actually failed this class when I took it. But now I'm your CA. So when they say anything is possible...
Hey everyone, I'm Neil. I'm a third-year PhD student in the CS department. I work with you. Um, yeah, mostly interested in my research on synthetic data, language models, reasoning, all that stuff. So yeah, should be a fun quarter.
Uh, hey guys, I'm Marcel. I'm a second-year PhD. I work with... These days I work on health...
And he topped many of the leaderboards last year, so he's the number to beat. Okay. All right. Well, thanks everyone. Um, so let's continue.
As Tatsu mentioned, this is the second time we're teaching the class. We've grown the class by around 50%. We have three TAs instead of two. And one big thing is we're putting all the lectures on YouTube so that the world can learn how to build language models from scratch. Okay.
So why did we decide to make this course and endure all the pain? Um, so let's ask GPT-4. If you ask it why teach a course on building language models from scratch, the reply is: teaching a course provides foundational understanding of techniques, fosters innovation, the typical kind of generic blather. Okay. So here's the real reason. We're in a bit of a crisis, I would say. Researchers are becoming more and more disconnected from the underlying technology. Eight years ago, researchers in AI would implement and train their own models. Even six years ago, you'd at least take models like BERT, download them, and fine-tune them. And now many people can just get away with prompting a proprietary model.
So this is not necessarily bad, right? Because as you introduce layers of abstraction, we can all do more. And a lot of research has been unlocked by the simplicity of being able to prompt a language model. And I do my fair share of prompting, so there's nothing wrong with that. But remember that these abstractions are leaky. In contrast to programming languages or operating systems, you don't really understand what the abstraction is. It's a string in and a string out, I guess. And I would say that there's still a lot of fundamental research to be done that requires tearing up the stack and co-designing different aspects of the data and the systems and the model. And I think that full understanding of this technology is necessary for fundamental research. So that's why this class exists. We want to enable the fundamental research to continue. And our philosophy is: to understand it, you have to build it.
So there's one small problem here, and this is the industrialization of language models. GPT-4 is rumored to be 1.8 trillion parameters and to have cost 100 million dollars to train. You have xAI building clusters with 200,000 H100s, if you can imagine that. There's an investment of over 500 billion dollars, supposedly, over four years. So these are pretty large numbers, right? And furthermore, there are no public details on how these models are being built. Here from GPT-4, and this is even two years ago, they very honestly say that due to the competitive landscape and the safety implications, they're not going to disclose any details. Okay, so this is the state of the world right now. And so in some sense, frontier models are out of reach for us. So if you came into this class thinking you're each going to train your own GPT-4, sorry. We're going to build small language models, but the problem is that these might not be representative.
And here are two examples to illustrate why. Here's a simple one. If you look at the fraction of flops spent in the attention layers of a transformer versus the MLP, this changes quite a bit with scale. This is a tweet from Steven Fuller from quite a few years ago, but it's still true. If you look at small models, the number of flops in the attention versus the MLP layers is roughly comparable. But if you go up to 175 billion parameters, then the MLPs really dominate, right? So why does this matter? Well, if you spend a lot of time at small scale optimizing the attention, you might be optimizing the wrong thing, because at larger scale it just gets washed out. This is kind of a simple example because you can literally make this plot without any compute. You can just do the napkin math.
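To make that napkin math concrete, here is a minimal sketch (my own illustration, not the exact figures from the tweet) using the standard per-token, per-layer FLOP approximations for a vanilla transformer with hidden size d, context length T, and a 4x MLP expansion:

```python
# Rough napkin math: per-token, per-layer FLOPs for a vanilla transformer.
def attention_flops(d, T):
    qkvo_proj = 8 * d * d   # Q, K, V, output projections: 4 matmuls, ~2*d*d FLOPs each
    scores = 4 * T * d      # QK^T scores plus the attention-weighted sum over V
    return qkvo_proj + scores

def mlp_flops(d):
    return 16 * d * d       # two matmuls, d -> 4d and 4d -> d, ~2*d*4d FLOPs each

T = 2048
for name, d in [("small (d=768)", 768), ("GPT-3 scale (d=12288)", 12288)]:
    att, mlp = attention_flops(d, T), mlp_flops(d)
    print(f"{name}: attention fraction = {att / (att + mlp):.2f}")
```

At a small hidden size the two terms come out roughly comparable, while at a GPT-3-scale hidden size the MLP term clearly dominates, which is the point of the plot.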
Here's something that's a little bit harder to grapple with: emergent behavior. This is a paper from Jason Wei from 2022. This plot shows that as you increase the amount of training flops and look at accuracy on a bunch of tasks, for a while it looks like nothing is happening with accuracy, and then all of a sudden you get this emergence of various phenomena like in-context learning. So if you were hanging around at the smaller scale, you would have concluded that these language models really don't work, when in fact you had to scale up to get that behavior.
So don't despair, we can still learn something in this class, but we have to be very precise about what we're learning. There are three types of knowledge. First, there's the mechanics of how things work. This we can teach you. We can teach you what a transformer is; you'll implement a transformer. We can teach you how model parallelism leverages GPUs efficiently. These are just the raw ingredients, the mechanics. So that's fine. We can also teach you mindset. This is something a bit more subtle and seems a little fuzzy, but it is in some ways more important, I would say, because the mindset we're going to take is that we want to squeeze as much out of the hardware as possible and take scaling seriously. Because in some sense, the mechanics, all these ingredients, have been around for a while, as we'll see later, but it was really the scaling mindset that OpenAI pioneered that led to this next generation of AI models.
So mindset, hopefully we can get you to think in a certain way. And then thirdly there are intuitions. This is about which data and modeling decisions lead to good models. This, unfortunately, we can only partially teach you, because the architectures and data sets that work at low scales might not be the same ones that work at large scales. But hopefully you get two and a half out of three, so that's pretty good bang for your buck.
Um, okay, speaking of intuitions, there's this sort of, I guess, sad reality that you can tell a lot of stories about why certain things in the transformer are the way they are, but sometimes you just do the experiments and the experiments speak. For example, there's the PaLM paper that introduced SwiGLU, a type of nonlinearity that we'll see a bit more of in this class. The results are quite good and this got adopted. But in the conclusion, there's this honest statement that they offer no explanation except divine benevolence. So there you go. This is the extent of our understanding.
Okay. So now let's talk about this bitter lesson that I'm sure people have heard about. I think there's a misconception that the bitter lesson means that scale is all that matters, algorithms don't matter, and all you do is pump more capital into building the model and you're good to go. I think this couldn't be further from the truth. The right interpretation is that algorithms at scale are what matters. Because at the end of the day, the accuracy of your model is really a product of your efficiency and the resources you put in. And efficiency, if you think about it, is way more important at larger scale, because if you're spending hundreds of millions of dollars, you cannot afford to be wasteful, in the same way that you might be with a job on your local cluster, where you run it, it fails, and you debug it and run it again. And if you look at actual utilization, I'm sure OpenAI is way more efficient than any of us right now. So efficiency really is important.
And furthermore, this point is maybe not as well appreciated in the scaling rhetoric, so to speak: efficiency is a combination of hardware and algorithms, but if you just look at algorithmic efficiency, there's a nice OpenAI paper from 2020 that showed that over the period of 2012 to 2019 there was a 44x algorithmic efficiency improvement in the compute it took to train an ImageNet model to a certain level of accuracy. So this is huge. And I don't know if you can see the abstract here, but this is faster than Moore's law, right? So algorithms do matter. If you didn't have this efficiency, you would be paying 44 times more. This is for image models, but there are similar results for language as well.
Okay. So with all that, I think the right framing or mindset to have is: what is the best model one can build given a certain compute and data budget? And this question makes sense no matter what scale you're at, because you're essentially asking about accuracy per resource. And of course, if you can raise the capital and get more resources, you'll get better models. But as researchers, our goal is to improve the efficiency of the algorithms. Okay. So maximize efficiency. We're going to hear a lot of that.
Okay. So now let me talk a little bit about the current landscape, and a little bit of, I guess, obligatory history. So language models have been around for a while now, going back to Shannon, who looked at language models as a way to estimate the entropy of English. In AI, they really became prominent in NLP, where they were a component of larger systems like machine translation and speech recognition. And one thing that's maybe not as appreciated these days is that back in 2007, Google was training fairly large n-gram models: 5-gram models over two trillion tokens, which is a lot more tokens than GPT-3 was trained on. It was only in the last few years that we've gotten back to that token count. But they were n-gram models, so they didn't really exhibit any of the interesting phenomena that we see in language models today.
Okay. So in the 2010s, a lot of the deep learning revolution happened and a lot of the ingredients fell into place. There was the first neural language model from Yoshua Bengio's group back in 2003. There were seq-to-seq models from Ilya and the Google folks, which were a big deal for how you model sequences. There's the Adam optimizer, which is still used by the majority of people, dating back over a decade. There's the attention mechanism, which was developed in the context of machine translation and led up to the famous "Attention Is All You Need", aka the transformer paper, in 2017. People were looking at how to scale mixture of experts. There was a lot of work around the late 2010s on how to do model parallelism, and people were figuring out how you could train 100 billion parameter models. They didn't train them for very long because these were more systems papers, but all the ingredients were in place by the time 2020 came around.
So I think one other trend, which started in NLP, was the idea of these foundation models that could be trained on a lot of text and adapted to a wide range of downstream tasks. ELMo, BERT, T5: these were models that were, for their time, very exciting. We maybe forget how excited people were about things like BERT, but it was a big deal.
And then, I mean, this is an abbreviated history, but I think one critical piece of the puzzle is OpenAI taking these ingredients, applying very nice engineering, and really pushing on the scaling laws and embracing them, which is the mindset piece, and that led to GPT-2 and GPT-3. Google was obviously in the game and trying to compete as well. But that paved the way, I think, to another line of work. These were all closed models, models that weren't released and that you could only access via API. But there were also open models, starting with early work by Eleuther right after GPT-3 came out, Meta's early attempt, which maybe didn't work quite as well, BLOOM, and then Meta, Alibaba, DeepSeek, AI2, and a few others I have listed, which have been creating these open models where the weights are released.
Um, one other tidbit about openness that I think is important is that there are many levels of openness. There are closed models like GPT-4. There are open-weight models where the weights are available, and there's actually a very nice paper with lots of architectural details but no details about the data set. And then there are open-source models where all the weights and data are available, along with a paper where they honestly try to explain as much as they can. But of course, you can't really capture everything in a paper, and there's no substitute for learning how to build it except doing it yourself.
Okay. So that leads to the present day, where there's a whole host of frontier models from OpenAI, Anthropic, xAI, Google, Meta, DeepSeek, Alibaba, Tencent, and probably a few others that dominate the current landscape. So we're in an interesting time where, just to reflect, a lot of the ingredients, like I said, have already been developed, which is good because we're going to revisit some of those ingredients and trace how these techniques work. And then we're going to try to move as close as we can to best practices on frontier models, using information from the open community and reading between the lines of what we know about the closed models.
Okay. So just as an interlude, what are you looking at here? This is an executable lecture. It's a program where I'm stepping through and it delivers the content of the lecture. One thing that I think is interesting here is that you can embed code. You can just step through code, and I think this is a smaller screen than I'm used to, but you can look at the environment variables as you're stepping through. That's useful later when we start actually trying to drill down and give code examples. You can see the hierarchical structure of the lecture, like we're in this module and you can see where it was called from main. And you can jump to definitions like supervised fine-tuning, which we'll talk about later. Okay. And if you think this looks like a Python program, well, it is a Python program, but I've processed it for your viewing pleasure.
Okay. So, let's move on to the course logistics now.
Um, actually maybe I'll pause for questions. Any questions about, um, you know, what we're learning in this class?
Yeah.
Would you expect from this class to be able to lead a team to build a frontier model?
So the question is, would I expect a graduate from this class to be able to lead a team and build a frontier model? Of course, with, you know, a billion dollars of capital. Yeah, of course. Um, I would say that it's a good step, but there are definitely many pieces that are missing. And we thought about whether we should really teach a series of classes that eventually leads up as close as we can get. But I think this is maybe the first piece of the puzzle, and there are a lot of other things; happy to talk offline about that. But I like the ambition. Yeah, that's what you should be doing, taking the class so you can go lead teams and build frontier models.
Okay.
Um, okay, let's talk a little bit about the course. So here's the website. Everything's online. This is a five-unit class. But I think that maybe doesn't convey the level of work here as well as this quote that I pulled out from a course evaluation: "The entire assignment was approximately the same amount of work as all five assignments from CS224N plus the final project." And that's the first homework assignment. So, not to scare you all off, but just giving some data here.
Um, so why should you endure that? Why should you do it? I think this class is really for people who have this obsessive need to understand how things work all the way down to the atoms, so to speak. And when you get through this class, I think you will have really leveled up in terms of your research engineering, and the level of comfort you'll have in building ML systems at scale will really be something.
There are also a bunch of reasons that you shouldn't take the class. For example, if you want to get any research done this quarter, maybe this class isn't for you. If you're interested in learning just about the hottest new techniques, there are many other classes that can probably deliver on that better; here, for example, you'll be spending a lot of time debugging BPE. This is really a class about the primitives and learning things bottom-up, as opposed to the latest techniques. And also, if you're mainly interested in using language models for some application, this is probably not the first class you would take. Practically speaking, as much as I kind of made fun of prompting, prompting is great, fine-tuning is great. If you can do that and it works, then that is something you should absolutely start with. So I don't want people taking this class and thinking, "Great, for any problem, the first step is to train a language model from scratch." That is not the right way of thinking about it.
Um, okay. And I know that many of you wanted to enroll, but we did have a cap, so we weren't able to enroll everyone. Also, for the people online, you can follow along at home. All the lecture materials and assignments are online, so you can look at them. The lectures are also recorded and will be put on YouTube, although there will be a lag of some number of weeks. And we'll offer this class next year, so if you were not able to take it this year, don't fret, there will be a next time.
Okay. So, the class has five assignments. And for each of the assignments, we don't provide scaffolding code, in the sense that you're literally given a blank file and you're supposed to build things up, in the spirit of learning by building from scratch. But we're not that mean. We do provide unit tests and some adapter interfaces that allow you to check the correctness of different pieces. And the assignment write-up, if you walk through it, does a gentle job of guiding you. But you're on your own for making good software design decisions and figuring out how you name your functions and organize your code, which is a useful skill, I think.
Um, so one strategy for all assignments is that there is a piece of the assignment which is just: implement the thing and make sure it's correct. That you can mostly do locally on your laptop; you shouldn't need compute for that. And then we have a cluster that you can use for benchmarking both accuracy and speed. So I want everyone to embrace this idea of using as small a data set and as few resources as possible to prototype before running large jobs. You shouldn't be debugging with one billion parameter models on the cluster if you can help it.
Okay. Um, some assignments will have a leaderboard, which usually is of the form: do things to make perplexity go down given a particular training budget. Last year, I think it was pretty exciting for people to try different things that they either learned from the class or read about online.
Um, and then finally, there's the question of AI tools. This was less of a problem last year because Copilot wasn't as good, but Cursor is pretty good now. Our general stance is that AI tools can take away from learning, because there are cases where they can just solve the thing you want to do. But you can obviously use them judiciously. So use them at your own risk; you're responsible for your own learning experience here.
Okay. So we do have a cluster. Thank you to Together AI for providing a bunch of H100s for us. There's a guide; please read it carefully to learn how to use the cluster. And start your assignments early, because the cluster will fill up toward the deadline as everyone tries to get their large runs in.
Okay. Um, any questions about that? You mentioned it was five units. Are you able to sign up for it for, like...
Right. So, the question is, can you sign up for less than five units? I think administratively, uh, if you have to sign up for less, that is possible, but it's the same class and the same workload.
Yeah. Any other questions?
Okay. So in this part, I'm going to go through all the different components of the course and give a broad overview, a preview of what you're going to experience. So remember, it's all about efficiency given hardware and data: how do you train the best model given your resources? For example, if I give you a Common Crawl dump, a web dump, and 32 H100s for two weeks, what should you do? There are a lot of different design decisions: questions about the tokenizer, the architecture, systems optimizations you can do, data things you can do. And we've organized the class into these five units or pillars. So I'm going to go through each of them in turn, talk about what we'll cover and what the assignment will involve, and then I'll wrap up.
Okay. So the goal of the basics unit is to just get a basic version of the full pipeline working. Here you implement a tokenizer, a model architecture, and training. So just to say a bit more about what these components are: a tokenizer is something that converts between strings and sequences of integers. Intuitively, you can think of the integers as corresponding to breaking the string into segments and mapping each segment to an integer. And the idea is that your sequence of integers is what goes into the actual model, which has a fixed vocabulary size. In this course, we'll talk about the Byte Pair Encoding (BPE) tokenizer, which is relatively simple and still widely used. There is, I guess, a promising set of tokenizer-free approaches: methods that just start with the raw bytes, don't do tokenization, and develop a particular architecture that takes the raw bytes directly. This work is promising, but so far I haven't seen it scaled to the frontier yet. So we'll go with BPE for now.
Okay, so once you've tokenized your strings into sequences of integers, we define a model architecture over these sequences. The starting point here is the original transformer; that's the backbone of basically all frontier models. Here's an architectural diagram. We won't go into details here, but there's an attention piece and then an MLP layer with some normalization.
Um, so a lot has actually happened since 2017. I think there's a sense in which, oh, the transformer was invented and everyone's just using the transformer, and to a first approximation that's true. We're still using the same recipe, but there have been a bunch of smaller improvements that make a substantial difference when you add them all up. For example, there's the nonlinear activation function, SwiGLU, which we saw a little bit before. For positional embeddings, there are the rotary positional embeddings, which we'll talk about. For normalization, instead of LayerNorm we're going to look at something called RMSNorm, which is similar but simpler, and there's the question of where you place the normalization, which has changed from the original transformer. For the MLP, the canonical version is a dense MLP, and you can replace that with a mixture of experts. Attention is something that has actually been getting a lot of attention, I guess: there's full attention, and then there's sliding window attention and linear attention, all trying to prevent the quadratic blowup; there are also lower-dimensional versions like GQA and MQA, which we'll get to, not in a second, but in a future lecture. And then maybe the most radical thing is alternatives to the transformer, like state-space models such as Hyena, where they're not doing attention but some other sort of operation. And sometimes you get the best of both worlds by making a hybrid model that mixes these in with transformers.
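As a preview, here is a minimal sketch of two of the components just mentioned, RMSNorm and SwiGLU, written directly from their published formulas; the versions you build in the assignment may differ in details like epsilon placement or initialization.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, d_model: int, eps: float = 1e-5):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))
        self.eps = eps

    def forward(self, x):
        # Normalize by the root-mean-square of the activations:
        # no mean subtraction and no bias, so it's simpler than LayerNorm.
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x / rms

class SwiGLU(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)   # gate projection
        self.w3 = nn.Linear(d_model, d_ff, bias=False)   # value projection
        self.w2 = nn.Linear(d_ff, d_model, bias=False)   # down projection

    def forward(self, x):
        # SwiGLU(x) = W2( SiLU(W1 x) * (W3 x) )
        return self.w2(torch.nn.functional.silu(self.w1(x)) * self.w3(x))
```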
Um, okay, so once you define your architecture, you need to train it. The design decisions include the optimizer. AdamW, which is basically Adam fixed up, is still very prominent, so we'll mostly work with that, but it is worth mentioning that there are more recent optimizers like Lion and Sophia that have shown promise. Then there's the learning rate schedule, batch size, whether you do regularization or not, hyperparameters; there are a lot of details here. And this is a class where the details do matter, because you can easily have an order of magnitude difference between a well-tuned setup and a vanilla transformer.
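For instance, a bare-bones training loop with AdamW and a warmup-plus-cosine learning rate schedule might look like the sketch below; the model, learning rates, and schedule constants here are made-up placeholders, not the assignment's tuned values.

```python
import math
import torch

model = torch.nn.Linear(512, 512)   # stand-in for a real transformer
opt = torch.optim.AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1)

def cosine_lr(step, max_lr=3e-4, min_lr=3e-5, warmup=100, total=1000):
    # Linear warmup followed by cosine decay, a very common LM schedule.
    if step < warmup:
        return max_lr * step / warmup
    progress = (step - warmup) / (total - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

for step in range(1000):
    for group in opt.param_groups:
        group["lr"] = cosine_lr(step)
    loss = model(torch.randn(8, 512)).pow(2).mean()   # dummy loss just to exercise the loop
    loss.backward()
    opt.step()
    opt.zero_grad()
```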
So in assignment one, you'll implement the BPE tokenizer. I'll warn you that this is actually the part that seems to have been a surprisingly large amount of work for people, so consider yourself warned. You'll also implement the transformer, cross-entropy loss, the AdamW optimizer, and the training loop. So again, the whole stack. We're not making you implement PyTorch from scratch, so you can use PyTorch, but you can't use, say, the transformer implementation from PyTorch. There's a small list of functions that you can use, and you can only use those.
Okay, so we're going to have the TinyStories and OpenWebText datasets for you to train on. And then there will be a leaderboard to minimize OpenWebText perplexity. We'll give you 90 minutes on an H100 and see what you can do. This is last year's leaderboard, so you can see the top entry; that's the number to beat for this year.
Okay. All right. So that's the basics. Now after basics, in some sense, you're done, right? You have the ability to train a transformer; what else do you need? So the systems unit really goes into how you can optimize this further: how do you get the most out of the hardware? For this, we need to take a closer look at the hardware and how we can leverage it. Kernels, parallelism, and inference are the three components of this unit.
Okay, so to first talk about kernels, let's talk a little bit about what a GPU looks like. A GPU, which we'll get much more into, is basically a huge array of little units that do floating point operations. And maybe the one thing to note is that this is the GPU chip, and here is the memory, which is actually off-chip; then there's some other memory, like L2 and L1 caches, on-chip. So the basic idea is that compute has to happen here, your data might be somewhere else, and the question is how you organize your compute so that you can be most efficient.
So one quick analogy: imagine that your memory, where you store your data and the model parameters, is like a warehouse, and your compute is like the factory. What ends up being a big bottleneck is just data movement cost. So the thing we have to figure out is how to organize the compute, even for something like a matrix multiplication, to maximize the utilization of the GPUs by minimizing the data movement. And there are a bunch of techniques like fusion and tiling that allow you to do that. So we'll get into the details of that, and to implement and leverage kernels, we're going to look at Triton. There are other options at various levels of sophistication, but we're going to use Triton, which was developed by OpenAI and is a popular way to build kernels. Okay, so we're going to write some kernels. That's for one GPU.
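Just to give a flavor of what Triton code looks like, here is a minimal elementwise-add kernel (it needs a CUDA GPU to run); the kernels you'll actually write, like a fused softmax or RMSNorm, follow the same load-a-tile, compute, store-a-tile pattern.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements          # guard the last, partially-full block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)       # one program instance per block of 1024 elements
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

x = torch.randn(1 << 20, device="cuda")
y = torch.randn(1 << 20, device="cuda")
assert torch.allclose(add(x, y), x + y)
```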
So now, in general, these big runs take thousands if not tens of thousands of GPUs. But even at 8 GPUs it starts becoming interesting: you have a bunch of GPUs, they're connected to some CPU nodes, and they're also directly connected via NVSwitch and NVLink. And it's the same idea, except that data movement between GPUs is even slower. So we need to figure out how to place the model parameters, activations, and gradients on the GPUs, do the computation, and minimize the amount of movement. And we're going to explore different types of techniques like data parallelism, tensor parallelism, and so on.
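As a rough sketch of the simplest of these, data parallelism, here is a toy script (launched with torchrun) where every rank computes gradients on its own shard of the batch and the gradients are averaged with an all-reduce before the optimizer step; real implementations like DDP overlap this communication with the backward pass.

```python
# Launch with e.g.: torchrun --nproc_per_node=2 data_parallel_sketch.py
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="gloo")   # use "nccl" on GPUs
    torch.manual_seed(0)                      # identical initialization on every rank
    model = torch.nn.Linear(16, 1)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    # Each rank sees a different shard of the data.
    x = torch.randn(8, 16) + dist.get_rank()
    loss = model(x).pow(2).mean()
    loss.backward()

    # Average gradients across ranks so every model copy takes the same step.
    for p in model.parameters():
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
        p.grad /= dist.get_world_size()
    opt.step()

if __name__ == "__main__":
    main()
```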
So that's all I'll say about that. And finally, inference is something that we didn't actually cover last year in the class, although we had a guest lecture. But this is important because inference is how you actually use a model: it's the task of generating tokens given a prompt and a trained model. And it also turns out to be really useful for a bunch of other things besides just chatting with your favorite model. You need it for reinforcement learning, for test-time compute, which has been very popular lately, and even for evaluating models. So we're going to spend some time talking about inference.
Um, actually, if you think about it, globally the cost spent on inference is going to eclipse the cost used to train models, because training, despite being very intensive, is ultimately a one-time cost, while inference cost scales with every use. The more people use your model, the more you'll need inference to be efficient.
Okay. So in inference, there are two phases: prefill and decode. Prefill is where you take the prompt and run it through the model to get the activations. Then decode is where you go autoregressively, one by one, and generate tokens. In prefill, all the tokens are given, so you can process everything at once; this is exactly what you see at training time, and it's generally a good setting to be in because it's naturally parallel and you're mostly compute-bound. What makes inference special and difficult is the autoregressive decoding: you need to generate one token at a time, it's hard to saturate your GPUs, and it becomes memory-bound because you're constantly moving data around. And we'll talk about a few ways to speed inference up. You can use a cheaper model. You can use this really cool technique called speculative decoding, where you use a cheaper model to scout ahead and generate multiple tokens, and then if these tokens happen to be good, for some definition of good, you can have the full model score them and accept them all in parallel. And then there are a bunch of systems optimizations that you can do as well.
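Here is a sketch of those two phases with a hypothetical decoder-only model `lm` that maps token ids to logits; a real implementation would add a KV cache so the decode steps don't redo the prefill work.

```python
import torch

@torch.no_grad()
def generate(lm, prompt_ids: torch.Tensor, max_new_tokens: int) -> torch.Tensor:
    ids = prompt_ids                     # shape (1, prompt_len)
    # Prefill: one parallel forward pass over all prompt tokens (compute-bound).
    logits = lm(ids)                     # shape (1, prompt_len, vocab)
    for _ in range(max_new_tokens):
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # greedy decoding
        ids = torch.cat([ids, next_id], dim=1)
        # Decode: one token per step; without a KV cache this recomputes everything,
        # which is exactly the inefficiency caching and batching address.
        logits = lm(ids)
    return ids

# Toy usage with a stand-in "model" that returns random logits:
vocab = 100
lm = lambda ids: torch.randn(ids.size(0), ids.size(1), vocab)
out = generate(lm, torch.randint(0, vocab, (1, 10)), max_new_tokens=5)
print(out.shape)   # (1, 15)
```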
Okay, so after the systems, oh, okay, assignment two. You're going to implement a kernel and some parallelism. Data parallelism is very natural, so we'll do that. Some of the model parallelism, like FSDP, turns out to be a bit complicated to do from scratch, so we'll do a baby version of that, but I encourage you to learn about the full version; we'll go over it in class, but implementing it from scratch might be a bit too much. And then I think an important thing is getting in the habit of always benchmarking and profiling. That's probably the most important thing: you can implement things, but unless you have feedback on how well your implementation is doing and where the bottlenecks are, you're just going to be flying blind.
Okay, so unit three is scaling laws. Here the goal is to do experiments at small scale, figure things out, and then predict the hyperparameters and loss at large scale. So here's a fundamental question: if I give you a flops budget, what model size should you use? If you use a larger model, that means you can train on less data, and if you use a smaller model, you can train on more data. So what's the right balance? This has been studied quite extensively and figured out by a series of papers from OpenAI and DeepMind. If you hear the term "Chinchilla optimal," this is what it's referring to. The basic idea is that for every compute budget, a number of flops, you can vary the number of parameters of your model and measure how good that model is. So for every level of compute, you can get the optimal parameter count. Then you fit a curve to extrapolate and see, if you had, let's say, 1e22 flops, what the parameter count should be. It turns out these minima, when you plot them, are remarkably linear, which leads to a very simple but useful rule of thumb: if you have a model of size N, multiply by 20 and that's the number of tokens you should train on. So that means a 1.4 billion parameter model should be trained on 28 billion tokens.
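As a quick sanity check of that rule of thumb, combined with the standard approximation that training costs about 6 FLOPs per parameter per token:

```python
# Chinchilla rule of thumb: optimal tokens ~= 20 * parameters.
def chinchilla_tokens(n_params: float) -> float:
    return 20 * n_params

# Standard training-cost approximation: ~6 * N * D FLOPs (forward + backward).
def train_flops(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens

n = 1.4e9                        # the 1.4B-parameter example from above
d = chinchilla_tokens(n)         # -> 2.8e10, i.e. 28 billion tokens
print(f"{d:.1e} tokens, ~{train_flops(n, d):.1e} FLOPs")   # ~2.4e20 FLOPs
```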
Okay, but this doesn't take into account inference cost. This is literally about how to train the best model regardless of how big that model is. So there are some limitations, but it's nonetheless been extremely useful for model development.
So this assignment is kind of fun because we define a quote-unquote "training API" which you can query with a particular set of hyperparameters. You specify the architecture, batch size, and so on, and we return the loss that your decisions will get you. Your job is: you have a flops budget, you're going to train a bunch of models and gather the data, fit a scaling law to the gathered data, and then submit your prediction of what hyperparameters and model size you would choose at a larger scale. So we want to put you in a position where there are some stakes. I mean, this is not burning real compute, but once you run out of your flops budget, that's it. So you have to be very careful about how you prioritize which experiments to run, which is something the frontier labs have to do all the time. And there will be a leaderboard for this, which is to minimize loss given your flops budget.
Question: these are from '24. If we're working ahead, should we expect assignments to change over time, or are these going to be the final assignments? So the question is that these links are from 2024. The rough structure will be the same for 2025. There will be some modifications, but if you look at these, you should have a pretty good idea of what to expect.
Okay, so let's go into data now. Up until now, you have scaling laws, you have systems, you have your transformer implementation; you're really kind of good to go. But data, I would say, is the key ingredient that really differentiates models, in some sense. And the question to ask here is: what do I want this model to do? Because what the model does is mostly determined by the data. If I train on multilingual data, it will have multilingual capabilities. If I train on code, it'll have code capabilities. It's very natural. And usually data sets are a conglomeration of a lot of different pieces. This is from The Pile, which is four years ago, but the same idea holds: you have data from the web (this is Common Crawl), and then maybe Stack Exchange, Wikipedia, GitHub, and other curated sources.
And so in the data section, we're going to start by talking about evaluation: given a model, how do you evaluate whether it's any good? We're going to talk about perplexity, and about standardized testing like MMLU. If you have models that generate utterances for instruction following, how do you evaluate that? There are also decisions about whether you can ensemble or do chain of thought at test time, and how that affects your evaluation. And then you can talk about evaluation of an entire system, not just a language model, because these days language models often get plugged into some agentic system.
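For reference, perplexity is just the exponentiated average per-token negative log-likelihood; here is a minimal sketch given model logits and target token ids.

```python
import math
import torch
import torch.nn.functional as F

def perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
    # logits: (seq_len, vocab_size), targets: (seq_len,)
    nll = F.cross_entropy(logits, targets, reduction="mean")  # mean negative log-likelihood
    return math.exp(nll.item())

logits = torch.randn(128, 50257)           # a random "model" gives perplexity on the
targets = torch.randint(0, 50257, (128,))  # order of the vocabulary size
print(perplexity(logits, targets))
```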
Um, okay, so now after establishing evaluation, let's look at data curation. This is, I think, an important point that people don't realize. I often hear people say, "Oh, we're training the model on the internet." This just doesn't make sense, right? Data doesn't just fall from the sky, and there isn't "the internet" that you can simply pipe into your model. Data always has to be actively acquired somehow.
Just as an example, I always tell people: look at the data. So let's look at some data. This is some Common Crawl data; I'm going to take 10 documents, and hopefully this works. Okay, I think the rendering is off, but you can kind of see this is a sort of random sample of Common Crawl. And you can see that this is maybe not exactly the data you'd want. Oh, here's some actual real text, that's cool. But if you look at most of Common Crawl, aside from documents in other languages, you can also see a lot of very spammy sites, and you'll quickly realize that a lot of the web is just, you know, trash. Okay, maybe that's not that surprising, but it's more trash than you would actually expect, I promise.
Um, so what I'm saying is that there's a lot of work that needs to happen on data. You can crawl the internet, you can take books, archives, papers, GitHub. And there's actually a lot of processing that needs to happen. There are also legal questions about what data you can train on, which we'll touch on. Nowadays, a lot of frontier labs actually have to buy data, because the publicly accessible data on the internet turns out to be a bit limited for really frontier performance.
And also, I think it's important to remember that this scraped data is not actually text. It's HTML, or PDFs, or in the case of code, it's just directories. So there has to be an explicit process that takes this data and turns it into text. We're going to talk about the transformation from HTML to text, which is going to be a lossy process; the trick is how to preserve the content and some of the structure without basically just keeping the HTML. Filtering, as you might surmise, is going to be very important, both for getting high-quality data and for removing harmful content; generally people train classifiers to do this. Deduplication is also an important step, which we'll talk about.
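As a tiny illustration, here is the simplest form of deduplication, exact-duplicate removal by hashing normalized text; the assignment also deals with near-duplicates, which need fuzzier methods like MinHash.

```python
import hashlib

def dedupe(documents):
    # Keep only the first occurrence of each (normalized) document.
    seen, kept = set(), []
    for doc in documents:
        key = hashlib.md5(doc.strip().lower().encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

print(dedupe(["Hello world", "hello world  ", "Something else"]))  # 2 documents survive
```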
Okay. So assignment four is all about data. We're going to give you the raw Common Crawl, you know, dump so you can see just how bad it is. And you're going to train classifiers, dedupe, and then there's going to be a leaderboard where you're going to try to um minimize perplexity given your token budget.
So now you have the data, you've built all your fancy kernels, and you can really train models. But at this point, what you'll get is a model that can complete the next token. This is called a base model, and I think of it as a model that has a lot of raw potential but needs to be aligned or modified in some way. Alignment is the process of making it useful. Alignment captures a lot of different things, but three in particular. First, you want the language model to follow instructions; completing the next token is not necessarily following the instruction, it'll just complete the instruction or whatever it thinks will follow it. Second, you specify the style of the generation: whether you want it long or short, whether you want bullets, whether you want it to be witty or have sass or not. When you play with ChatGPT versus Grok, you'll see that different alignment has happened. And third, safety: one important thing is for these models to be able to refuse requests that could be harmful. That's where alignment also kicks in.
So there are generally two phases of alignment. First, there's supervised fine-tuning. The goal here is very simple: you gather a set of user-assistant pairs, that is, prompt-response pairs, and then you do supervised learning. The idea is that the base model already has the raw potential, so fine-tuning it on a few examples is sufficient. Of course, the more examples you have, the better the results. But there are papers like this one showing that even a thousand examples suffice to give you instruction-following capabilities from a good base model.
Okay, so this part is actually very simple, and it's not that different from pre-training, because you're just given text and you maximize the probability of that text.
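A minimal sketch of that objective, assuming a hypothetical decoder-only model `lm` that maps token ids to logits; in the common variant shown here, the loss is only computed on the response tokens, with the prompt positions masked out.

```python
import torch
import torch.nn.functional as F

def sft_loss(lm, prompt_ids: torch.Tensor, response_ids: torch.Tensor) -> torch.Tensor:
    ids = torch.cat([prompt_ids, response_ids], dim=1)    # (1, T)
    logits = lm(ids[:, :-1])                              # predict token t from tokens < t
    targets = ids[:, 1:].clone()
    targets[:, : prompt_ids.size(1) - 1] = -100           # ignore loss on prompt positions
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1), ignore_index=-100)

# Toy usage with a stand-in model returning random logits over a 1000-token vocab:
lm = lambda ids: torch.randn(ids.size(0), ids.size(1), 1000)
print(sft_loss(lm, torch.randint(0, 1000, (1, 6)), torch.randint(0, 1000, (1, 4))))
```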
Um, so the second part is a bit more interesting from an algorithmic perspective. The idea is that even after the SFT phase, you'll have a decent model. Now how do you improve it? You can get more SFT data, but that can be very expensive because you have to have someone sit down and annotate data. So the goal of learning from feedback is to leverage lighter forms of annotation and have the algorithms do a bit more work. One type of data you can learn from is preference data. This is where you generate multiple responses from a model to a given prompt, say A and B, and the user rates whether A or B is better. The data might look like: for the prompt "What's the best way to train a language model?", response A is "Use a large data set" and response B is "Use a small data set", and of course the answer should be A. That is one unit of expressing preferences.
Another type of supervision you could use is verifiers. For some domains, you're lucky enough to have a formal verifier, like for math or code. Or you can use learned verifiers, where you train an actual language model to rate the response; of course, this relates to evaluation. Then, on the algorithms side, we're in the realm of reinforcement learning. One of the earliest algorithms applied to instruction-tuning models was PPO, Proximal Policy Optimization. It turns out that if you just have preference data, there's a much simpler algorithm called DPO that works really well. But in general, if you want to learn from verifier data, it's not preference data, so you have to embrace RL fully. And there's a method we'll do in this class called GRPO, Group Relative Policy Optimization, which simplifies PPO and makes it more efficient by removing the value function. It was developed by DeepSeek and seems to work pretty well.
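For the preference-data case, here is a sketch of the DPO loss, taking as inputs the summed log-probabilities of the chosen and rejected responses under the policy being trained and under a frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    # Implicit rewards are log-probability ratios against the reference model.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Maximize the margin between chosen and rejected responses.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy usage with made-up log-probabilities:
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss)
```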
Okay, so assignment five implements supervised fine-tuning, DPO, and GRPO, and of course evaluation.
Question about assignment one. Do people have similar things to say about assignments two or...
Yeah, the question is: assignment one seems a bit daunting, what about the other ones? I would say that assignments one and two are definitely the heaviest and hardest. Assignment three is a bit more of a breather, and assignments four and five, at least last year, were a notch below one and two. Although I don't know; we haven't fully worked out the details for this year. Yeah, it does get better.
Okay, so just to recap the different pieces here. Remember, efficiency is the driving principle, and there are a bunch of different design decisions. I think if you view everything through the lens of efficiency, a lot of things make sense. And importantly, it's worth pointing out that we are currently in a compute-constrained regime, at least in this class, and that holds for most people who are somewhat GPU-poor: we have a lot of data, but we don't have that much compute. So these design decisions reflect squeezing the most out of the hardware. For example, in data processing, we're filtering fairly aggressively because we don't want to waste precious compute on bad or irrelevant data. Tokenization: it would be nice to have a model over raw bytes, which is very elegant, but it's very compute-inefficient with today's model architectures, so we do tokenization as an efficiency gain. Model architecture: a lot of the design decisions there are essentially motivated by efficiency. Training: the fact that most of what we're going to do is just a single epoch says we're clearly in a hurry; we just need to see more data rather than spend a lot of time on any given data point. Scaling laws are completely about efficiency: we use less compute to figure out the hyperparameters. And alignment is maybe a little different, but the connection to efficiency is that if you put resources into alignment, then you can get away with smaller base models. Okay.
So there are sort of two paths. If your use case is fairly narrow, you can probably use a smaller model, align it or fine-tune it, and do well. But if your use cases are very broad, then there might not be a substitute for training a big model.
So that's today. Increasingly now, at least the frontier labs are becoming data-constrained, which is interesting because the design decisions will presumably change. Well, compute will always be important, but I think the design decisions will change. For example, taking just one epoch over your data doesn't really make sense if you have plenty of compute; why wouldn't you take more epochs, at least, or do something smarter? Or maybe there will be different architectures, because the transformer was really motivated by compute efficiency. So that's something to ponder. It's still about efficiency, but the design decisions reflect the regime you're in.
Okay, so now I'm going to dive into the first unit. Um, you know, before... any questions?
Do you have a Slack or...
Uh, the question is, do we have a Slack? We will have a Slack. We'll send out details after this class.
Yeah. Will students auditing the course also have access to the same materials?
Uh, the question is whether students auditing the class will have access to the same materials. Yes: all the online materials and assignments, and we'll give you access to Canvas so you can watch the lecture videos.
Yeah. What's the grading of the assignments?
What's the grading of the assignments? Um, good question. So there will be a set of unit tests that you will have to pass. So part of the grading is just, did you implement this correctly? Um, there will be also parts of the grade which will be, did you implement a model that achieved a certain level of loss or is efficient enough? Um, in the assignment, every problem part has a number of points associated with it. And so that gives you a fairly granular level of what grading looks like.
Okay, let's jump into tokenization. Okay, so Andrej Karpathy has this really nice video on tokenization. And in general, he makes a lot of these videos that actually inspired a lot of this class, how you can build things from scratch. Um, so you should go check out some of his videos.
Um, so tokenization, as we talked about, is the process of taking raw text, which is generally represented as Unicode strings, and turning it into a sequence of integers, where each integer represents a token. So we need a procedure that encodes strings to tokens and decodes tokens back into strings. And the vocabulary size is just the number of possible values a token can take on, the range of the integers. Okay, so just to give you an example of how tokenizers work, let's play around with this really nice website which allows you to look at different tokenizers, and just type in something like "hello" or whatever. Maybe I'll do this.
Um, and one thing it does is show you the list of integers; this is the output of the tokenizer. It also nicely maps out the decomposition of the original string into a bunch of segments. A few things to note. First of all, the space is part of a token. So unlike classical NLP, where the space just kind of disappears, everything is accounted for; tokenization is meant to be a reversible operation. And by convention, for whatever reason, the space usually precedes the word in the token. Also notice that "hello" is a completely different token than " hello" with a leading space, which might make you a little bit squeamish, and it can cause problems, but that's just how it is.
Question I was going to ask, is the space being leading instead of trailing intentional or is it just an artifact of the BPE process?
Um, so the question is whether the space coming before is intentional. So in the BPE process, which I'll talk about, you actually pre-tokenize and then tokenize each part, and the pre-tokenizer does put the space in the front. So it is built into the algorithm. You could put it at the end, but I think it probably makes more sense to put it at the beginning. Although, actually, I guess it could go either way, is my sense.
Um, okay, so then if you look at numbers, um, you see that the numbers are chopped down into different, you know, pieces. Um, it's a little bit kind of interesting that it's left to right. So it's definitely not grouping by thousands or anything semantic. Um, but anyway, I encourage you to kind of play with it and get a sense of what these existing tokenizers look like. Um, so this is a tokenizer for GPT-4 for example.
Um, so there's some observations that we made. Um, so if you look at the GPT-2 tokenizer, which we'll use as a reference, okay, let me see if I can, um, okay, hopefully this is... Let me know if this is getting too small in the back. Um, you could take a string, if you apply the GPT-2 tokenizer, you get your indices. So it maps strings to indices and then you can decode to get back the string. And this is just a sanity check to make sure that it actually round-trips. Um, another thing that's I guess interesting to look at is this compression ratio, which is if you look at the number of bytes divided by the number of tokens. So how many bytes are represented by a token? And the answer here is 1.6. Okay, so every token represents 1.6 bytes of data. Okay, so that's just a GPT-2 tokenizer that OpenAI trained.
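If you want to reproduce that round-trip and compression-ratio check yourself, here is a small sketch using OpenAI's tiktoken package (assuming it is installed) for the GPT-2 tokenizer.

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")
text = "Hello, world! This is a test of the GPT-2 tokenizer."
ids = enc.encode(text)
assert enc.decode(ids) == text                  # sanity check: it round-trips
ratio = len(text.encode("utf-8")) / len(ids)    # bytes per token
print(ids[:8], f"compression ratio = {ratio:.2f}")
```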
Um, to motivate BPE, I want to go through a sequence of attempts. Suppose you wanted to do tokenization; what would be the simplest thing? The simplest thing is probably character-based tokenization. A Unicode string is a sequence of Unicode characters, and each character can be converted into an integer called a code point. So "a" maps to 97, and the world emoji maps to 127,757. And you can see that it converts back. So you can define a tokenizer which simply maps each character to its code point. Okay. So what's one problem with this?
Compression ratio is one.
The compression ratio is one. Well, actually the compression ratio is not quite one, because a character is not a byte, but it's maybe not as good as you want. One problem is that if you look at some code points, they're actually really large. You're basically allocating one slot in your vocabulary for every character uniformly, and some characters appear way more frequently than others, so this is not a very effective use of your budget. So the vocabulary size is huge. I mean, the vocabulary size being around 127,000 is already a big deal, but the bigger problem is that some characters are rare, and this is an inefficient use of the vocab. And the compression ratio is about 1.5 in this case, because it's the number of bytes per token and a character can be multiple bytes.
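Concretely, a minimal sketch of that character-based tokenizer: each character maps to its code point via ord(), and back via chr().

```python
def char_encode(s: str) -> list[int]:
    return [ord(c) for c in s]          # each character -> its Unicode code point

def char_decode(ids: list[int]) -> str:
    return "".join(chr(i) for i in ids)

ids = char_encode("a🌍")
print(ids)                              # [97, 127757]
assert char_decode(ids) == "a🌍"        # it round-trips
```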
Okay, so that was a fairly naive approach. On the other hand, you can do byte-based tokenization. Unicode strings can be represented as a sequence of bytes; every string can be converted into bytes. Some characters, like "a", are just one byte, but some characters take up as many as four bytes. This is using the UTF-8 encoding of Unicode; there are other encodings, but this is the most common one, and it's variable-length. So let's just convert everything into bytes and see what happens. If you do that, all the indices are between 0 and 255, because there are only 256 possible values for a byte by definition. So your vocabulary is very small, and while not all bytes are equally used, you don't have that many sparsity problems.
Um, but what's the problem with byte-based encoding?
Long sequences.
Yeah, long sequences. In some ways I really wish byte encoding would work — it's the most elegant thing — but you get long sequences. Your compression ratio is one: one byte per token. And that's terrible, because your sequences will be really long, and attention is naively quadratic in the sequence length. You're just going to have a bad time in terms of efficiency. Okay, so that wasn't really good either.
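For completeness, a minimal sketch of the byte-level tokenizer (again, illustrative helper names):

```python
# Byte-level tokenizer: UTF-8 encode the string and treat each byte (0..255)
# as a token. Tiny vocabulary, but the compression ratio is exactly 1.
def byte_encode(string: str) -> list[int]:
    return list(string.encode("utf-8"))

def byte_decode(indices: list[int]) -> str:
    return bytes(indices).decode("utf-8")

s = "hello 🌍"
indices = byte_encode(s)
assert byte_decode(indices) == s
assert all(0 <= i < 256 for i in indices)
print(len(s.encode("utf-8")) / len(indices))  # 1.0: one byte per token
```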
So now you might think: maybe we have to be adaptive here. We can't allocate one character or one byte per token, but maybe some tokens can represent lots of bytes and some tokens can represent few bytes. One way to do this is word-based tokenization, which was very classic in NLP. Here's a string, and you can split it into a sequence of segments and call each of these a token — you just use a regular expression. Here's a different regular expression that GPT-2 uses to pre-tokenize; it also just splits your string into a sequence of strings. Then you assign each distinct segment an integer, and you're done. Okay, so what's the problem with this?
Yeah.
So the problem is that your vocabulary size is essentially unbounded — well, maybe not quite unbounded, but you don't know how big it is, because on a new input you might get a segment you've never seen before. And that's actually a big problem. Word-based tokenization is a real pain, because some real words are rare, and new words have to be mapped to this UNK token. And if you're not careful about how you compute perplexity with UNKs, you're just going to mess up. So word-based captures the right intuition of adaptivity, but it's not exactly what we want here.
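A sketch of regex-based word segmentation is below. The GPT-2 pre-tokenization pattern is copied from OpenAI's released encoder code; treat the exact pattern (and the use of the third-party regex module, needed for \p{L} and \p{N}) as an assumption for illustration:

```python
import regex  # third-party module; the stdlib `re` doesn't support \p{...} classes

# GPT-2's pre-tokenization pattern (from the released encoder code).
GPT2_PATTERN = r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

text = "the cat in the hat"
segments = regex.findall(GPT2_PATTERN, text)  # ['the', ' cat', ' in', ' the', ' hat']

# A word-based tokenizer then assigns each distinct segment an integer id.
# Any segment not seen at training time has no id -- hence the UNK problem.
vocab = {seg: i for i, seg in enumerate(dict.fromkeys(segments))}
indices = [vocab[seg] for seg in segments]
```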
So here we're finally going to talk about BPE, or Byte Pair Encoding. This is actually a very old algorithm, developed by Philip Gage in '94 for data compression, and it was first introduced into NLP for neural machine translation. Before that, basically all of NLP — machine translation included — used word-based tokenization, and again, word-based was a pain. So that paper pioneered the idea that we can use this nice algorithm from '94 to make tokenization round-trip, and we don't have to deal with UNKs or any of that. And then it entered the language modeling era through GPT-2, which was trained using a BPE tokenizer.
Okay, so the basic idea is that instead of defining some preconceived notion of how to split up text, we're going to train the tokenizer on raw text — that's the basic insight, if you will. Organically, common sequences that span multiple characters get represented as a single token, and rare sequences get represented by multiple tokens. There's one slight detail: for efficiency, the GPT-2 paper uses a word-based tokenizer as preprocessing to break the text into segments, and then runs BPE within each segment — which is what you're going to do in this class as well.
The BPE algorithm itself is actually very simple. We first convert the string into a sequence of bytes, which we already did when we talked about byte-based tokenization. And then we successively merge the most common pair of adjacent tokens, over and over again. The intuition is that if a pair of tokens shows up a lot, we compress it into one token — we dedicate vocabulary space to it. Okay, so let's walk through what this algorithm looks like. We'll use "the cat in the hat" as an example and convert it into a sequence of integers — these are the bytes. We'll also keep track of what we've merged: merges is a map from a pair of integers (which can be bytes or previously created tokens) to a new token, and vocab is just a handy map from each token index to the bytes it represents.
Okay, so the BPE algorithm — it's very simple, so I'm just going to run through the code. You do this a number of times; the number is three in this case. First we count up the occurrences of each pair of adjacent tokens. Hopefully this doesn't become too small. We step through the sequence: what's (116, 104)? Increment that count. (104, 101)? Increment that count. We go through the whole sequence and count up the pairs of bytes. Once we have these counts, we find the pair that occurs the most. There are multiple pairs tied here, but we break ties and take 116 and 104, which occurred twice. So now we merge that pair: we create a new slot in our vocab with index 256 — so far the vocab was 0 through 255, and now we're adding token 256 — and every time we see 116 followed by 104, we replace it with 256. Okay?
Then we apply that merge to our training sequence. After we do that, each (116, 104) became 256 — and this 256, remember, occurred twice. Now we just loop through the algorithm one more time. The second time, it decides to merge 256 and 101, and we replace that in the indices. And notice that the indices shrink, right? Our compression ratio is getting better as we add vocabulary items, because we have a larger vocabulary to represent everything. Okay, let me do this one more time: the next merge is 257 and 32, and the sequence shrinks one more time. And then we're done.
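Here's a minimal sketch of that training loop on raw bytes, matching the walkthrough above (function and variable names are illustrative, not the assignment interface):

```python
from collections import Counter

def train_bpe(string: str, num_merges: int):
    indices = list(string.encode("utf-8"))          # start from raw bytes
    merges: dict[tuple[int, int], int] = {}         # (token, token) -> new token
    vocab = {i: bytes([i]) for i in range(256)}     # token index -> bytes

    for _ in range(num_merges):
        # Count every pair of adjacent tokens in the current sequence.
        counts = Counter(zip(indices, indices[1:]))
        if not counts:
            break
        pair = max(counts, key=counts.get)          # most frequent pair (ties: first seen)
        new_index = 256 + len(merges)               # next free slot in the vocab
        merges[pair] = new_index
        vocab[new_index] = vocab[pair[0]] + vocab[pair[1]]

        # Replace every occurrence of the pair with the new token.
        merged, i = [], 0
        while i < len(indices):
            if i + 1 < len(indices) and (indices[i], indices[i + 1]) == pair:
                merged.append(new_index)
                i += 2
            else:
                merged.append(indices[i])
                i += 1
        indices = merged

    return merges, vocab

merges, vocab = train_bpe("the cat in the hat", num_merges=3)
```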
Okay, so let's try out this tokenizer. We have the string "the quick brown fox". We encode it into a sequence of indices, and then we use our BPE tokenizer to decode. Let's step through what that looks like — actually, decoding isn't that interesting, so let me go back to encode. For encode, you take a string, convert it to bytes, and you just replay the merges — importantly, in the order that they occurred during training. So I replay these merges and then I get my indices. And then we verify that this round-trips.
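A minimal sketch of that encode/decode, reusing the merges and vocab from the training sketch above (again, just illustrative names):

```python
def bpe_encode(string: str, merges: dict[tuple[int, int], int]) -> list[int]:
    indices = list(string.encode("utf-8"))
    for pair, new_index in merges.items():       # replay merges in creation order
        merged, i = [], 0
        while i < len(indices):
            if i + 1 < len(indices) and (indices[i], indices[i + 1]) == pair:
                merged.append(new_index)
                i += 2
            else:
                merged.append(indices[i])
                i += 1
        indices = merged
    return indices

def bpe_decode(indices: list[int], vocab: dict[int, bytes]) -> str:
    return b"".join(vocab[i] for i in indices).decode("utf-8")

indices = bpe_encode("the quick brown fox", merges)
assert bpe_decode(indices, vocab) == "the quick brown fox"
```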
Okay, so that was pretty simple. And because it's simple, it's also very inefficient — for example, encode loops over all the merges, when you should only loop over the merges that matter. There are also some other bells and whistles, like special tokens and pre-tokenization. So in your assignment, you're going to essentially take this as a starting point — or rather, you should implement your own from scratch — and your goal is to make the implementation fast. You can parallelize it if you want. Go have fun.
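One common way to loop only over the merges that matter (roughly the approach in OpenAI's GPT-2 encoder) is to repeatedly merge whichever pair currently present in the sequence was learned earliest. A hedged sketch, assuming `merges` maps pairs to new token ids in creation order:

```python
def bpe_encode_fast(string: str, merges: dict[tuple[int, int], int]) -> list[int]:
    # Earlier merge = lower rank; relies on dicts preserving insertion order.
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    indices = list(string.encode("utf-8"))
    while len(indices) >= 2:
        pairs = set(zip(indices, indices[1:]))
        # Pick the pair actually present in the sequence with the lowest rank.
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break                                  # nothing left to merge
        new_index = merges[best]
        merged, i = [], 0
        while i < len(indices):
            if i + 1 < len(indices) and (indices[i], indices[i + 1]) == best:
                merged.append(new_index)
                i += 2
            else:
                merged.append(indices[i])
                i += 1
        indices = merged
    return indices
```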
Okay, so to summarize tokenization: a tokenizer maps between strings and sequences of integers. We looked at character-based, byte-based, and word-based tokenization, which are all highly suboptimal for various reasons. BPE is a very old algorithm from '94 that still proves to be an effective heuristic, and the important thing is that it looks at your corpus statistics to make sensible decisions about how to adaptively allocate vocabulary to represent sequences of characters. I hope that one day I won't have to give this lecture because we'll just have architectures that map directly from bytes, but until then, we'll have to deal with tokenization. Okay.
So that's it for today. Next time we're going to dive into the details of PyTorch and give you the building blocks and pay attention to resource accounting. All of you have presumably implemented, you know, PyTorch programs, but we're going to really look at where all the flops are going. Okay, see you next time.