Agentic AI MOOC | UC Berkeley CS294-196 Fall 2025 | LLM Agents Overview by Yann Dubois

Berkeley RDI Center on Decentralization & AI
Hosts: Yan, Yann Dubois
📅September 29, 2025
⏱️01:58:21
🌐English

Disclaimer: The transcript on this page is for the YouTube video titled "Agentic AI MOOC | UC Berkeley CS294-196 Fall 2025 | LLM Agents Overview by Yann Dubois" from "Berkeley RDI Center on Decentralization & AI". All rights to the original content belong to their respective owners. This transcript is provided for educational, research, and informational purposes only. This website is not affiliated with or endorsed by the original content creators or platforms.

Watch the original video here: https://www.youtube.com/watch?v=r1qZpYAmqmg

00:00:02Yan

Okay, let's get started. So, hi everyone. My name is Yann. I'm a researcher at OpenAI, and I'll be re-recording a class that I gave in the Berkeley LLM Agents MOOC series on an introduction to training LLMs for AI agents.

00:00:17Yan

The reason why we're recording this class is because, one, we had technical difficulties early on in the class, and second, there was a fire alarm that started, which means that we did not go through all the slides and also we don't have a good recording. I want to make sure that everyone who's online would also see the entire class.

00:00:40Yan

Before getting started, one thing to say is that all views here are my own. Even though I work at OpenAI, I will mostly talk about things that we can find online and information that we can find on open source models, especially Kimi, Llama, and DeepSeek. So unless I talk about OpenAI, nothing is related to OpenAI here.

00:01:00Yan

Great. So with that, let's get started. So we all know that LLMs and chatbots really took over the world in the last few years. The question I will try to answer is: how do we actually train those models? This is an example here from ChatGPT and an answer from ChatGPT.

00:01:19Yan

So there are three main parts of the training pipeline. The first one is pre-training. You've probably all heard about it. The general mental model that I like to give people is that pre-training is really about predicting the next word on the internet. So you take all of the internet, or all the clean part of the internet, and you just try to predict the next word. As a result, you will learn everything about the internet and as much as possible about the world.

00:01:48Yan

In terms of data, pre-training takes, at least for the big open source models, more than 10 trillion tokens. So that's a lot. When I say tokens, you can think a little bit about them as words or subwords, but essentially it's more than 10 trillion tokens. Think about it as 10 trillion words.

00:02:08Yan

It takes months to train on that much data, and it takes a lot of money and a lot of compute to actually train on that amount of data. You can think about the compute cost being more than $10 million for one run. So here the bottleneck in pre-training is both the data—as we said, you need a lot of it—and second the compute. As we will see, basically the more you scale up these models—so the more data you put in the model and the more compute, meaning the longer you train for—the better the performance will be. An example of a pre-trained model is Llama 3, for example.

00:02:47Yan

The second part is classic post-training or RLHF (Reinforcement Learning from Human Feedback). Chronologically, in how these models are usually trained today, it actually comes third, but historically it is the part that came right after pre-training; the last part is more recent. Here the idea is that the pre-trained model is just good at predicting the next word, but it's not really good at performing well in the sense of predicting what the user wants, answering questions, or following instructions.

00:03:28Yan

You can think about it as a model that knows everything about the world, but it doesn't actually know how to interact with a human. That's one way of thinking about it, and that's what we're going to try to optimize for: the model's interaction with humans and making sure that when you ask it to do something, it does it.

00:03:44Yan

The data size here is much smaller, maybe around 100,000 problems. All these numbers are just orders of magnitude. Just to give you a sense in terms of time, it probably takes a few days to do one of these runs. Compute cost maybe around $100,000. Here the bottleneck is data and evals. When I say data, I really mean the quality of data, because $100,000 is not that much, so it's really about how high quality the data is. And by evals I mean whether you can actually tell that you're making improvements with your run. This is really important because when you do this RLHF or this post-training, you have to balance many things together. So you need to make sure that you're actually tracking your performance on all these different axes.

00:04:44Yan

A specific instantiation of this RLHF model is Llama Instruct. So when you hear "Instruct" at the end of the model, that usually means that it just went through RLHF because it can now do instruction following. It can follow instructions that are given by humans.

00:05:03Yan

Great. And this last part, which actually, as I said, usually comes second in the pipeline, is the reasoning reinforcement learning. The idea here is to teach the model to think on questions where there are objective answers or where you have access to ground truth. You've probably heard recently about open source models that perform very well on math and coding—things where it's actually pretty easy to get some ground truth answers, for example, passing test functions in coding or passing some math exam. And this is what you want to optimize during this reinforcement learning for reasoning.

00:05:46Yan

So this second stage is only true for reasoning models. One example is DeepSeek R1, which was the first open source reasoning model. In terms of data, they don't say exactly in the R1 paper, but you can kind of read between the lines and look at the plots to estimate the number of problems that they're actually training on. So in terms of data, it's probably around a million problems that they're training on. It probably takes on the order of weeks to train this reasoning stage, and around a million dollars.

00:06:26Yan

Here the bottleneck is reinforcement learning environment and hacks. What I mean by that is, as I said, this is about optimizing for objective truth. For example, if you take the case of passing test functions in coding, the goal is "how many test functions can you get?" One thing that will usually happen is when you start optimizing for these test cases, you will see that the model will start optimizing things that you weren't expecting. Maybe it will be able to pass the test case by, for example, removing the test in your environment or replacing it with "always returning true." That's one of the types of things that the model may do, which is what we usually call hacks.

00:07:08Yan

The way to think about hacks is just: the model found a way of optimizing the reward even though that's not what you were hoping that it would do. So this is usually a pretty big bottleneck because models are really good at optimizing things even if it's not exactly the type of thing that we want to optimize. If you write something to optimize, they will optimize it exactly as written.

00:07:38Yan

Usually I will bundle the second and the third stage together and call them post-training, which comes after pre-training. Different people use different names, but I believe that's how the R1 and Kimi papers talk about their post-training stage.

00:08:01Yan

Great. So, the LLM training pipeline. There are basically five things that you need to consider when training an LLM. First is the architecture. So what model architecture are you using? You probably all heard about Transformers or about Mixture of Experts, which is a variant of Transformers.

00:08:21Yan

And then there's the training algorithm and the loss. So that means, what are you optimizing for this architecture to do? What are you trying to optimize? And then there's the data and the eval environment. So that's what we talked about before, this evaluation which is knowing whether you're making any progress. And then the last part is systems and infra to make sure that you can scale up these runs.

00:08:49Yan

Until, I would say, 2023, most of academia was actually focused on architectures, training algorithms, and losses. I also did a PhD, and that's what I focused on until around 2023. There were a few people working on the rest, but that was the main part of academic research. But in reality, what matters in practice is the last three. So what matters is usually data, evaluation, and systems to be able to scale.

00:09:20Yan

People usually want to work on architecture, on developing new algorithms for optimizing your model. But these things matter much less, as we will see, than really: How much data do you put in? What's the quality of your data? Are you measuring your progress well? And do you have the infra to actually scale things up?

00:09:39Yan

I will not be talking about architecture, mostly because at this point the architecture is not changing that much in the open source. It seems to be mostly using Transformers with Mixture of Experts, and a lot of people like talking about architectures, so you can find a lot of information about what architecture is being used. As I said, I think it's currently really not as important, so that's why I will not be talking about that.

00:10:07Yan

Okay. There's two last parts that I didn't talk about in the pipeline for training LLMs, and I consider them more about specializing the LLM. One is prompting. So once you usually have a model—for example, a big lab might release a big open source model or closed model—people will be able to interact with it and specialize that model for their use cases. The usual way that people do it is first just by prompting.

00:10:39Yan

Prompting is really just knowing how to ask questions essentially. It's kind of the art of asking the model what you want. What is nice is that you don't need any data, and it's pretty fast; you just try a few examples and you see how it works. There's kind of no compute associated to it, or very little. And the bottleneck is evals: how do you make sure that you actually have a good prompt, that you're asking the right question to the model?

00:11:08Yan

And then the second part is fine-tuning. We will not be talking about it, but just to mention it here briefly. Fine-tuning is basically continual post-training, or an additional post-training where you basically apply the second stage of post-training to domain-specific data.

00:11:29Yan

For example, imagine that all these companies release pretty general models, and now you want to specialize one for some specific domain, like medical data. So you might have internally, or for your project, some specific data that you want to optimize for. You will take these open source models and you will basically be fine-tuning—so doing a little bit more training—on your specific data. Just like post-training, this requires maybe 10,000 to 100,000 problems, takes on the order of days, and costs around $10,000 to $100,000 in compute. Here again, just like post-training, the bottleneck is really the quality of your data and evaluation. How do you know whether you're making progress?

00:12:16Yan

Great. So let's talk about pre-training. I will talk about the method, what pre-training is, the data, and the compute that you need. So in terms of pre-training, as I said, the mental model—the metaphor that I like giving to people—is that pre-training is about predicting the next word. And the way to think about it is, for example, when you type a message, you will usually see your phone predicting the next word that you will type. And this is exactly how pre-training works. Or not exactly, but mostly this is the metaphor I like giving.

00:12:53Yan

The goal of pre-training is to teach the model everything in the world, and the way that we basically achieve that is just to predict the next word. Because if you can predict the next word on every single domain, then you must have some understanding of that domain. And that is basically what pre-training is.

00:13:17Yan

So in terms of data, it's basically any reasonable data on the internet—as much as possible. Because you really want to have models that understand as much as possible about everything, you really want to give as much data as possible for the model to learn on. In terms of scale of data, I said more than 10 trillion tokens. For Llama 4, for example, I believe the models were trained on between 20 and 40 trillion tokens. For DeepSeek V3, I believe it was 15 trillion tokens. So that gives you an order of magnitude of the data that you need for the current best open source models.

00:14:01Yan

That 10 trillion tokens corresponds to approximately 20 billion unique web pages. So that's a lot of data. It's not all of the internet, but it's basically all the clean data that people can find on the internet. Pre-training has really been the key since GPT-2 in 2019, which is what mostly showed the world what pre-training can do. Just using a simple method like predicting the next word, but doing it at scale, really showed how smart the models can become.

00:14:38Yan

Okay, so what is actually happening under the hood? I'll give you first a brief overview. In terms of task, as I said, it's about predicting the next word. So the steps are the following:

00:14:50Yan

First, you tokenize the data. So here I have a sentence "she likely prefers" and the goal is to predict the next word. In this case, the next word is "dogs." So "she likely prefers." What you will do is you will split up "she likely prefers" into different tokens which are basically different subwords or different subunits. The reason why we do that is because computers don't understand words; they only understand numbers. So you have to take these words or you have to take this sentence and split it up into numbers. That's what we call tokenize.

00:15:26Yan

I split it up by word here. So I have "she likely prefers" and I say all these three words become tokens. And I will associate all of these tokens with a different index. So "she" I will give it 1, "likely" becomes 2, and "prefers" becomes 3. This is just one way of converting these words that computers don't understand to numbers that computers can work with.
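
To make that concrete, here is a minimal word-level sketch of the tokenize step. It is only an illustration: real LLMs use subword tokenizers such as BPE, and the indices here are just the toy ones from the slide.

```python
# Toy word-level "tokenizer": map each word to an integer index.
# Real LLMs use subword tokenizers (e.g. BPE), but the idea is the same.
sentence = "she likely prefers"

vocab = {}                              # word -> index
for word in sentence.split():
    if word not in vocab:
        vocab[word] = len(vocab) + 1    # "she" -> 1, "likely" -> 2, "prefers" -> 3

token_ids = [vocab[w] for w in sentence.split()]
print(token_ids)                        # [1, 2, 3]
```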

00:15:54Yan

Then you will do what we call a forward pass, which means that you will pass it through the model. We'll see exactly what happens later, but you will pass it through the model—usually this is a Transformer—and then you will have this model try to predict a probability distribution. So a categorical distribution that tries to predict what is the probability of the next word. For example, here you see that "she likely prefers," it's very unlikely to say "she" again, but it's very likely to say this word, which in this case is "dog."

00:16:26Yan

And then you will sample from this probability distribution. Once you have a model that predicts a distribution, you can just sample, and that's why every time you ask a question to some open source model, it will not always give the same answer—because you actually have this sampling step. You sample and then you detokenize. Because sampling from this categorical distribution just gives me an index—say it tells me this is token number five—and then I need to look through my dictionary, which tells me that index 5 was actually the word "dogs." So that's how I detokenize.
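
Here is a minimal sketch of those last two inference steps, sampling and detokenizing, assuming the forward pass has already produced a probability distribution over a toy five-word vocabulary (the words and probabilities below are made up):

```python
import random

# Toy vocabulary: index -> word (the reverse of the tokenizer's mapping).
id_to_word = {1: "she", 2: "likely", 3: "prefers", 4: "cats", 5: "dogs"}

# Pretend this categorical distribution came out of the model's forward pass
# for the context "she likely prefers".
probs = {1: 0.01, 2: 0.01, 3: 0.01, 4: 0.27, 5: 0.70}

# 1) Sample the next token id from the distribution...
next_id = random.choices(list(probs.keys()), weights=list(probs.values()), k=1)[0]

# 2) ...then detokenize by looking the id up in the dictionary.
print(id_to_word[next_id])              # most of the time: "dogs"
```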

00:17:08Yan

The last two steps—this is not super important—only happen at inference time. At training time, you just keep predicting the next word by predicting the probability distribution and optimizing your cross-entropy loss, which I'm sure most of you are familiar with. So you don't actually need to do the sampling. These two steps are only done during inference.

00:17:34Yan

Great. So now I want to give you some intuition about why this can even work. And to do that, I will talk about honestly kind of the most simple language model that you could think about, and this is the N-gram language model, which was already used at scale in 2003. So a very long time ago. It already worked pretty well, but I think it gives a good intuition of what is happening under the hood for the current models.

00:18:02Yan

So the question here is: how can you learn what to predict? Because we talked about before, I said, "Oh, you just do a forward pass to your model and just predict a distribution." Like, how can you learn that?

00:18:13Yan

The solution is statistics. Statistics is always the solution to most of your problems. One way you can do that is... let's take an example: how can you know what comes after the sentence "the grass is"? And you probably know that after "the grass is," it's most likely to be "green," for example. So how can we know that? How can we teach the model to do that?

00:18:37Yan

Well, the solution is: you can take all the occurrences of the sentence "the grass is" online. For example, take all the occurrences of "the grass is" on Wikipedia. And now you can predict the probability of every word that comes after "the grass is" by looking at the number of times that that word appeared after "the grass is," normalized by the number of times that you saw the sentence "the grass is."

00:19:07Yan

So let's say that the sentence "the grass is" happens a thousand times on the web pages that you looked at, and maybe half of the time—so 500 times—the next word is "green," and maybe 100 times the next word is "red." Then the probability of "green" given "the grass is" will be one half—500 divided by 1,000—and for "red" it will be 10%, so 0.1.
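
As a sketch, this counting estimator is only a few lines of code; the tiny corpus below is made up just to reproduce the "the grass is" example:

```python
from collections import Counter, defaultdict

# Tiny made-up corpus; a real N-gram model would count over billions of pages.
corpus = "the grass is green . the grass is green . the grass is red".split()

# For every 3-word context, count which word follows it.
counts = defaultdict(Counter)
for i in range(len(corpus) - 3):
    context = tuple(corpus[i:i + 3])
    counts[context][corpus[i + 3]] += 1

# P(next word | "the grass is") = count(context + word) / count(context)
context = ("the", "grass", "is")
total = sum(counts[context].values())
for word, count in counts[context].items():
    print(word, count / total)          # green ~0.67, red ~0.33
```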

00:19:37Yan

So that's a very simple way of predicting the categorical distribution of the next word. But this would work. It would actually work pretty well, at least for simple things like "the grass is." There are still a few challenges. One is that you need to keep count of all the occurrences for each of these N-grams—or at least in this case, each of these sentences that happened. You need to keep a count of every word that came after. So just think about it in terms of memory: it's a huge memory requirement for storing all of that. So that's unfeasibly large, but it will still work pretty well for simple things.

00:20:26Yan

And then the other problem is that most sentences—maybe not most, but a lot of sentences—might be unique. So if there's something that never happened in your training corpus... so if you never saw this very long text, let's say that instead of "the grass is" I give you 100 lines of code and I ask you what's the next word? Maybe you never saw these 100 lines of code at training time, and then your predictor will have no way to generalize because basically the count will be zero. So it will give a probability of zero even though the probability is actually a little bit higher than zero. That's two problems that you would have with this very simple kind of statistical language model.

00:21:09Yan

So the solution is very simple: just use neural networks. I'm sure many of you know about neural networks, and we're going to assume that you do. But basically you can approximate this prediction using the parametric estimator that a neural network is, instead of the non-parametric counting estimator that we just talked about.

00:21:35Yan

Great. So let's go through what I'll call here Neural Language Models. It's a language model with a neural network, which is what everyone does. The way that this works at a very high level is that you take a sentence, for example "I saw a cat on a". I will basically split the sentence into different tokens. I will associate all these tokens with a word embedding, so a vector representation of that word.

00:22:06Yan

The way that you can think about it is that imagine that this was in 2D. You basically have a plane and you basically have all these points that are on this plane where usually more similar words cluster with one another. So you might have "I saw cat" and things like this. It's just that instead of being 2D, it might be much higher dimensional. It might be a vector space of like 768 dimensions or something like this.

00:22:37Yan

Then you pass that to a neural network. So a neural network, the way to think about it is just some non-linear aggregator of these vectors. It takes all these vectors as input, it does some merging, and gives you another vector. The important part is that it's differentiable, so you can actually back-propagate to that. That's the most important part.

00:23:02Yan

For example, a very simple neural network could just be an average. You could literally average all the vectors associated with these tokens together, and that gives you another vector, which is intuitively the representation of the whole sentence "I saw a cat on". So again, you could take some average, or you could take some nonlinear aggregation, like passing them through a neural network.

00:23:33Yan

Then, this vector representation is in the wrong dimension, because what you want is to be able to predict the probability of each word. So you want a representation that lives in a space whose dimension is the number of tokens—the number of words that exist in your language, for example English.

00:24:08Yan

So a very simple way to do that is to pass this through a linear layer. You just multiply this hidden representation h, which lives in dimension d, by a matrix that maps it to your vocabulary size. Very concretely, let's say you have 768 dimensions and your vocabulary is around 20,000 words that you might want to predict in English. You will multiply by a matrix of 768 by 20,000, and then you will get a vector out of it that is 20,000 dimensional.

00:24:49Yan

Great. So once you have this, you will just pass it through a softmax. Softmax is the usual trick to get a categorical distribution from any vector. This just ensures that basically you have numbers that sum to one and are between zero and one. And then you can just consider that as the probability of the next word after "I saw a cat on a". And here you basically have this prediction of the next word.

00:25:27Yan

Great. And once you have this prediction of the next word, you can just optimize the cross-entropy loss against the real next word. You try to increase the probability of the real word a little bit and decrease all the rest. And then you just back-propagate because everything is differentiable, and that will tune all the weights that you have in your neural network, including these word embeddings—the representation of every word. Okay, that was a very brief overview, but hopefully you get a sense of what a neural language model is.
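
To tie the whole forward pass and loss together, here is a toy PyTorch sketch of exactly the pipeline described above: embeddings, a crude averaging "neural network", a linear layer to vocabulary size, and the cross-entropy loss (which applies the softmax internally). The dimensions and token ids are made up for illustration; a real model would use a Transformer instead of the average.

```python
import torch
import torch.nn.functional as F

vocab_size, dim = 20_000, 768

# Learnable parameters: one embedding vector per word, plus the output projection.
embeddings = torch.randn(vocab_size, dim, requires_grad=True)
output_proj = torch.randn(dim, vocab_size, requires_grad=True)

# "I saw a cat on a" as made-up token ids; the true next word has id 1234.
context_ids = torch.tensor([11, 52, 3, 997, 64, 3])
target_id = torch.tensor([1234])

# Aggregate the context vectors. Here: a plain average; a Transformer is a much
# better (but still differentiable) aggregator.
h = embeddings[context_ids].mean(dim=0)        # shape: (dim,)

# Linear layer to vocabulary size; softmax + cross-entropy are fused in the loss.
logits = h @ output_proj                       # shape: (vocab_size,)
loss = F.cross_entropy(logits.unsqueeze(0), target_id)

# Backpropagate: gradients flow into both the projection and the word embeddings.
loss.backward()
```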

00:26:00Yan

Okay. So now we talked about the method. Let's talk about the data that goes into pre-training. So the idea as I said before is to basically use all of the clean internet. So use as much data as possible and everything that is clean on the internet. Why do I say all of clean internet? It is because the internet is actually... the majority of the internet is pretty kind of dirty and not representative of what you want to ship to users or what you want to optimize your model on.

00:26:35Yan

Here is a very practical pipeline—every lab and every pre-training group has its own way of doing it, but this gives you a broad overview. First, you download all of the internet. In the open source, people usually rely on crawlers that have already downloaded the internet for them. For example, Common Crawl is a crawl that has already collected around 250 billion pages. That's more than one petabyte of data, all stored in WARC files.

00:27:18Yan

So basically, you download all of the internet. What you get is raw HTML, which is pretty hard to understand. On the slide you see some meta keywords, and buried in there the actual text: something like "one of the best and most rewarding features of the...", which seems to be an ad talking about rewarding features and then about downloading free questions and answers. This is a random website that I took from Common Crawl. So as you see, it's kind of hard to parse, and if it's an ad, it's probably not even something that you really want to train on.

00:28:26Yan

Great. The second thing that you do: as you just saw, you have this HTML, so you have to extract text out of it. That's actually pretty challenging. There are questions like: How do you deal with JavaScript? With boilerplate code? With math that is rendered differently? And things like this. So you will need to extract text from HTML. This is also pretty computationally expensive, because at this point the name of the game is how much data you can have, so you really have a lot of data that you have to clean and extract from.

00:29:06Yan

Then you will do some filtering. So, one filter that the open source world does pretty early on is filtering for undesirable content, like PII data, or like non-safe for work data, or anything that is harmful. You will try to remove this.

00:29:28Yan

Then another very common filter that people usually do is deduplicating your data. That deduplication could be by document, it could be by line, it could be by paragraph, it could be at different levels. But the idea is to not train too much, too many times, on the exact same data. For example, if you train on forums—let's say on all the data you have on Wikipedia or like Stack Overflow—you will always have these headers and footers that are duplicated. And you definitely don't want to train like a million times on the exact same Stack Overflow header because you don't learn much from it. So you would basically be losing compute to try to learn the header perfectly.
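
As a sketch, exact deduplication can be as simple as hashing each document and keeping only the first copy; real pipelines also use fuzzy methods like MinHash, and FineWeb-style rules that allow a document up to some number of copies:

```python
import hashlib

def deduplicate(documents):
    """Keep only the first occurrence of each exactly-duplicated document."""
    seen, kept = set(), []
    for doc in documents:
        # Hash the normalized text so we don't keep full documents in memory.
        digest = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

docs = ["Stack Overflow footer", "Some unique answer", "Stack Overflow footer"]
print(deduplicate(docs))   # the duplicated footer is kept only once
```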

00:30:13Yan

Okay. And then you will do some heuristic filtering. You might try to remove low quality documents. "Low quality" might mean that there are too many words—if it's an extremely long document, it might be suspicious. If it's a very short one, let's say only 10 words, it's probably not worth training on. If there are many outlier tokens—words that look extremely rare—it might be that this is just bad data. So you do a lot of this heuristic-based filtering.
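
A toy version of such heuristic filters might look like this; the thresholds are made up, and real pipelines tune them empirically:

```python
def passes_heuristics(doc, min_words=10, max_words=100_000, max_weird_frac=0.2):
    """Drop documents that are too short, too long, or full of odd-looking tokens."""
    words = doc.split()
    if not (min_words <= len(words) <= max_words):
        return False
    # Count "outlier" tokens: extremely long words or non-ASCII junk.
    weird = sum(1 for w in words if len(w) > 30 or not w.isascii())
    return weird / len(words) <= max_weird_frac

print(passes_heuristics("too short"))                         # False
print(passes_heuristics("a perfectly normal sentence " * 5))  # True
```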

00:30:47Yan

And then you might also do some model-based filtering. One idea that I find pretty neat that people have been doing is trying to basically do distribution matching. So you find some distribution that you think is high quality. For example, you might say Wikipedia is pretty high quality. Or you might say every page that is referenced on Wikipedia is likely to be high quality because that means that someone went and referenced that page. That's already a pretty big amount of data—all the websites that are linked by Wikipedia is pretty large—but it's still very little compared to the amount of data that we need for pre-training.

00:31:28Yan

So what you might say is: "I want to find more of that type of data." And the way you can do that is to train a classifier: you give it random crawl pages as negatives ("not referenced by Wikipedia") and the pages that are referenced on Wikipedia as positives, and you train it to predict YES for the latter and NO for the former. Once you train that classifier, you essentially have a model that predicts how likely a document is to be referenced by Wikipedia. And then you can just do a filtering based on that: if it's very likely, you keep it; if it's not likely at all, you throw it away because it's probably bad data.
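
Here is a small sketch of that kind of model-based quality filter, using scikit-learn with tiny made-up example documents standing in for "referenced by Wikipedia" versus random crawl pages. A real pipeline would train on millions of documents, often with a faster classifier such as fastText.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Positives: pages referenced by Wikipedia. Negatives: random crawl pages.
wiki_refs = ["peer reviewed study of climate dynamics", "biography of a famous mathematician"]
random_crawl = ["BUY CHEAP PILLS NOW click here", "lorem ipsum dolor sit amet"]

texts = wiki_refs + random_crawl
labels = [1] * len(wiki_refs) + [0] * len(random_crawl)

vectorizer = TfidfVectorizer()
classifier = LogisticRegression().fit(vectorizer.fit_transform(texts), labels)

# Score new crawl documents and keep only the ones that look "reference-worthy".
new_docs = ["a detailed tutorial on linear algebra", "win a free iphone today"]
scores = classifier.predict_proba(vectorizer.transform(new_docs))[:, 1]
kept = [doc for doc, score in zip(new_docs, scores) if score > 0.5]
print(kept)
```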

00:32:15Yan

So this is some model-based filtering, and you can do a lot of that. And then you can do some data mix changes. For example, you might classify the category of data: whether it's like code, books, entertainment, like any of these domains. And then you might want to re-weigh different domains. So for example, if you train a coding model, you want to re-weigh coding because probably there's not enough code online. So you want to say, "No, even though I have only 5% of coding, I want to bump it up to like 50%."
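
A sketch of what that domain re-weighting amounts to: compare the natural mix of your corpus to the mix you want, and derive per-domain sampling weights. All numbers below are made up, just to match the 5% to 50% coding example:

```python
# Fraction of documents per domain in the raw corpus vs. the mix you want to train on.
natural_mix = {"web": 0.80, "code": 0.05, "books": 0.10, "science": 0.05}
target_mix  = {"web": 0.35, "code": 0.50, "books": 0.10, "science": 0.05}

# Sampling weight per domain: how much more (or less) often to draw a document
# from that domain than its natural frequency would suggest.
weights = {domain: target_mix[domain] / natural_mix[domain] for domain in natural_mix}
print(weights)   # code documents get sampled 10x more often than they occur naturally
```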

00:32:51Yan

And the way to do this re-weighting—and this is true for any of this filtering—is that you usually do these experiments at small scale, try to understand what is best, and then you try to predict what to do at larger scale.

00:33:12Yan

Great. And at the end, we'll talk about that too, but once you did all of this pre-training data, you will also try to collect some higher quality data. For example, you might say like Wikipedia is super high quality, or like everything on arXiv might be really high quality. So you will keep kind of this second distribution of high quality data.

00:33:34Yan

Usually after training on this pre-training data, you will do what we call mid-training, which is training on this high quality data. The idea being: "Well, we don't have enough of that data, but we know it's high quality." So we'll try to kind of fine-tune, or optimize after pre-training—or continual pre-training—our model on that high quality data such that the model learns to really be as good as possible.

00:34:04Yan

Okay. So pre-training data. One paper that I would recommend reading is this FineWeb. It's both a paper and a blog post about FineWeb datasets from Hugging Face, and they talk a lot about what filtering they've done. But this is just one plot from their paper.

00:34:21Yan

Here the x-axis shows the number of tokens, in billions, that you train on. This is still pretty small compared to the scale of pre-training data that we talked about before, which is more than 10 trillion tokens. And the y-axis is the aggregated accuracy, which is basically your performance on a whole set of evals.

00:34:45Yan

Here what they show is, first, this green line is when you take... they took 200 trillion tokens from, I believe, Common Crawl. So this is basically raw data. And then they applied a lot of filters. So: Non-Safe For Work, block list, they mostly went for English text, some simple document filtering. For example, if it's too much repetition in a document they removed it, or it's the wrong length. So that's this first filtering; went from 200 trillion tokens to 36 trillion tokens.

00:35:19Yan

And here we see how well you perform when you train on 360 billion tokens from those. And here you see the performance gain when you deduplicate data. So the way that they've done it is they said, essentially, "I don't want to have text that is duplicated more than 100 times." So that's basically at a high level. So they filtered data by nearly half, from 36 trillion tokens to 20 trillion tokens. And you see that training on that really improves performance. Again, that's because you're basically not forcing your model to learn things that are duplicated and not that useful, and you really focus on new data. So that worked pretty well for them.

00:36:11Yan

I mean, 100 documents that are duplicated is still quite a lot, but usually you can have huge clusters of like a 100,000 duplicates. So those are the ones that they wanted to filter out. And here you see some additional filtering. For example, they removed I believe JavaScript, they removed like Lorem Ipsum text and things like this. So that removed a little bit more of data and you see that performs better. And then some additional, I believe model-based filtering that performed even better.

00:36:44Yan

Great. So that's pre-training data. And then there's mid-training. As I said before, the idea of mid-training is basically continuing your pre-training to adapt your model to have some desired properties, or to adapt your model on some higher quality data. Usually you do it on less than 10% of what you did for pre-training—so less than a trillion tokens.

00:37:11Yan

So you might, for example, change the data mix in your data. You might say, "I want to have a lot of coding data at the end," or "I want to be more scientific and have a model that is really good at science questions." Or you might want to optimize more for multilingual data. A lot of the data that we have access to is in English, but this is not representative of the languages that people actually speak. So you might say, "Okay, I'm going to up-weight some other languages that are usually less represented in our datasets, just to make sure that the data is representative of how many people speak each language."

00:37:59Yan

Another thing we do during mid-training, or that we might want to do, is increase the context length. In many of these models, you usually hear about this idea of how much context the model can see. When you do pre-training, you don't want to train on very large context lengths because that's much more computationally intensive. But you do want the model to be able to understand, let's say, 128,000 tokens that came before your question.

00:38:35Yan

So usually what you do during mid-training is that you bump up this context length. You do some extension of context length during mid-training. For DeepSeek V3, they went from a 4,000-token context during pre-training to 128k during mid-training. And I think many other open source projects did that.

00:38:59Yan

Another type of data you might want to add is some formatting and instruction-following data. You might want to already teach your model to answer questions when you ask a question, or to write in a very specific, kind of chatty way. And some high quality data: if you have some high quality data, you might keep it for the end, the idea being: "First I want the model to learn how to speak grammatically correctly, and then I want it to actually learn the real meat of the text that is in my data." And you might have some reasoning data that teaches the model how to think, which I believe is what Kimi did. And many other things.

00:39:48Yan

Great. So pre-training and mid-training, let's just do a recap. One is that really this data during pre-training is a huge part of training LLMs. I would even say that it's basically the key for training LLMs. And there's a lot of research that has already been done and a lot more to be done. For example: how do you process well and efficiently? I mean, these are huge scales of data. We're talking about whether to use synthetic data and whether to use basically models to generate more data. How much multimodal data to put in? How to balance your domains and all of that.

00:40:27Yan

And there's a lot of secrecy. Most companies are not talking about what they do. Even the companies that actually do open source models don't usually talk that much about the data that they collected. First, because it's the most important thing, so there are competitive dynamics—they don't want to tell you what they've been training on because that would make it easier to replicate. And then some companies might be scared about copyright liability if they train on data that they shouldn't have trained on.

00:40:57Yan

So here are a few common academic datasets: C4, The Pile, FineWeb. I just wrote a few. So FineWeb is the one we talked about before—15 trillion tokens. And this is the composition of The Pile, and you see that in The Pile there's a lot of arXiv and like PubMed and high quality data, and you will have also a good amount of code and things like this.

00:41:28Yan

Great. So just to give you a sense of the scale of this data: as I said, Llama 2 was trained on around 2 trillion tokens, Llama 3 on around 15, and Llama 4 on between 20 and 40 trillion tokens. So every new generation tries to train on more data and also does some better filtering.

00:41:49Yan

Okay. So that was the pre-training data aspect. Now let's talk about the compute. One thing that is super important: empirically, for any type of data and model, what matters most—as I said before—is how much compute you spend on training. By "how much compute," I mean both how much data you put into the model and the size of the model, because if the model is bigger, you need to spend more compute.

00:42:20Yan

And what is very nice is that you can actually predict pretty well the performance, at least during pre-training. You can predict pretty well the performance that you will achieve if you just pour more compute into your run. So if you just train for longer or train bigger models, you can predict pretty well how well they will perform.

00:42:40Yan

So here, the way to interpret this plot is that on the x-axis you see the amount of compute that you have in your run. This is in log scale. And here you have your test loss also in log scale. And all these blue lines are basically different runs. And then you take the minimum achieved for all of these runs and you can link all of them together and it gives you something that looks pretty close to a line. And then you can just fit the line to this curve of the ideal compute and test loss.

00:43:20Yan

And now you can use this line to predict: how well can you perform if you train on 10 times more compute or 100 times more compute? So what is very nice is what I wrote here: that you can now do research at very low scales and then predict how well it will perform at higher scales. So this is what we call a scaling law, which is very surprising. There's really no good reason for this to happen, or at least it could have been different. There are some theories for why that happened. And it's really very nice when you do research because now it means you can work at this small scale, and it has really been great for the field.
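
A minimal sketch of fitting such a scaling law: a power law L = a · C^b is a straight line in log-log space, so you fit a line to the (log compute, log loss) frontier points and extrapolate. The numbers below are made up for illustration.

```python
import numpy as np

# Hypothetical (compute, best achieved test loss) pairs from small-scale runs.
compute = np.array([1e17, 1e18, 1e19, 1e20])     # FLOPs
loss = np.array([3.9, 3.4, 3.0, 2.6])

# Fit a line in log-log space: log L = b * log C + log a.
b, log_a = np.polyfit(np.log(compute), np.log(loss), deg=1)

def predicted_loss(c):
    return np.exp(log_a) * c ** b

# Extrapolate to 100x more compute than the biggest run above.
print(predicted_loss(1e22))
```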

00:44:13Yan

Okay, so scaling laws. What is nice, as I said, is that now you can tune things at lower scale. For example, say I gave you 10,000 GPUs and asked you: how should you be using these 10,000 GPUs? How should you be training that model?

00:44:31Yan

Historically, what you might have done is you might tune hyperparameters for different models. So you might say, "Okay, I'm going to have 20 different runs or 30 different runs and I'm going to pick the best and that's the one that I'm going to ship." But as a result, each of them will only be trained on like 1/30th of the compute that you had access to.

00:44:50Yan

The new pipeline is that now you can find scaling recipes. So you can find recipes that tell you how to change like the learning rate with different scales and things like this. Then you can tune hyperparameters at small scale for a very short amount of time and for very many iterations. And then you can plot the scaling law, extrapolate how well you will be performing at larger scale, and then train one huge model at the end where you use way more of the compute that you have access to. So maybe like 90% of your compute goes for the full run rather than like 1/30th of what you had before. So yeah, this is really a blessing.

00:45:30Yan

Okay. So for example, very concretely: should you use an architecture that is a Transformer or an LSTM? This is the scaling law for Transformers, and here you see LSTMs. You see that Transformers have a better constant, which means that they always achieve a lower loss than LSTMs, and they also have a better scaling rate—you see here that LSTMs seem to be plateauing a little bit. So that tells you both that at any scale Transformers are better, and also that the larger the scale, the better Transformers become. Which is why most people essentially gave up on LSTMs as an architecture.

00:46:11Yan

But what's interesting is it could have been that the constant is better for one of the architectures but the scaling rate is better for the other one. And in that case, you definitely want to always go with the scaling rate, not the constant, because who cares how well it performs at very small scale? The real question is: what if it's like 200 times larger? How does it perform there? And that's why the scaling rate is what matters.

00:46:40Yan

Great. So one very famous paper about scaling laws is Chinchilla that tries to show what is the optimal way of allocating training resources between the size of the model and the data. Because both of these things are about compute, and as we said, the more compute the better. But there are two ways of spending compute: either you train for longer or you train larger models.

00:47:09Yan

So they have these results—I'm going to skip a little bit—but you can basically predict the optimal resource allocation. And they found that for every parameter you should be using around 20 tokens. So that's kind of this optimal resource allocation.
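
As a back-of-the-envelope sketch of what that rule implies, using the common approximation that training compute C ≈ 6 · N · D (N parameters, D training tokens)—an assumption of mine, not a figure stated in the talk:

```python
import math

# Chinchilla-style allocation: D ≈ 20 * N and C ≈ 6 * N * D
#   =>  C ≈ 120 * N^2  =>  N ≈ sqrt(C / 120)
C = 1e24                       # example compute budget in FLOPs
N = math.sqrt(C / 120)         # compute-optimal number of parameters
D = 20 * N                     # compute-optimal number of training tokens
print(f"~{N:.1e} parameters trained on ~{D:.1e} tokens")   # ~9.1e10 params, ~1.8e12 tokens
```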

00:47:32Yan

One thing to note: you will hear often about Chinchilla, but Chinchilla is only an optimization of training resources. It doesn't consider inference costs. So they only ask: "What is the best way to achieve a certain training loss? Where should I be putting the compute?" But it doesn't take into account that if you have larger models, at inference time it will actually cost more.

00:48:04Yan

So for example, for GPT, the larger the model, the more you will spend per user. So you might actually be better off training a smaller model for longer, even if it means that you need to spend more training compute to achieve the same performance, because at inference time it will cost less.

00:48:32Yan

Anyway, so that's the Chinchilla paper. And then I want to talk a little bit about The Bitter Lesson from Sutton. So basically, The Bitter Lesson... I would really recommend reading this blog post from Richard Sutton, who is really the big researcher of reinforcement learning. He wrote that blog post that essentially tries to say: the only thing that matters in the long run is about leveraging compute.

00:49:06Yan

And the reason why is because we see empirically that the larger... like, the more compute you put in the models, the more improvements you get out of it. So basically, more compute equals better performance. And we also know from like Moore's Law and like some derivative laws that we will always have more compute—or at least that's kind of the hope, we will always have more compute every year.

00:49:33Yan

And if you put these two things together—more compute equals better performance, and you will always have more compute—then the natural thing that comes out of it is that it's all about leveraging computation. There's no reason for trying to optimize things at your current level of compute because next year you will have more and that will just perform better. So what matters is to have methods that will scale up really well. So that's kind of the TL;DR for The Bitter Lesson, which has really driven a lot of how the community has been thinking in the last, I would say, three or four years.

00:50:13Yan

So yeah, so the summary is: don't spend time over-complicating things. Do the simple thing and make sure that it scales. Because what matters, again, is not kind of tuning this constant performance, it's really making sure that you can scale it up.

00:50:31Yan

Great. So for training a SOTA model... this is a slide that I wrote maybe two years ago, or one or two years ago, for training Llama 3 400B—which at the time was the largest open source model—and I just tried to predict how much that would cost.

00:50:52Yan

So in terms of data, it was trained on 15.6 trillion tokens, with 405 billion parameters. You see that this uses around 40 tokens per parameter, so that's pretty close to "train compute optimal" by Chinchilla standards. In terms of FLOPs, it uses 3.8 x 10^25 FLOPs. There's an executive order that says that you need to be more careful when you open source models, or when you train models, that use more than 10^26 FLOPs. So this is around two times less than the executive order threshold.

00:51:39Yan

In terms of compute, they used 16,000 H100s. And if you do the computation in terms of time, it probably takes around 70 days of training to train this model. In terms of cost, my rough estimate is that it would cost around $52 million to train this—so between 50 and 80 million, depending on what you assume they pay per unit of compute given that they own their clusters.
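
Those numbers can be roughly reproduced with a back-of-the-envelope calculation, assuming (my assumptions, not figures from the talk) roughly 1e15 peak BF16 FLOP/s per H100 and around 40% utilization:

```python
# Rough check of the "~70 days on 16,000 H100s" estimate.
total_flops = 6 * 405e9 * 15.6e12              # C ≈ 6 * N * D ≈ 3.8e25 FLOPs
cluster_flops_per_sec = 16_000 * 1e15 * 0.40   # 16k H100s at ~40% utilization (assumed)
days = total_flops / cluster_flops_per_sec / 86_400
print(f"{total_flops:.1e} FLOPs, ~{days:.0f} days")   # ~3.8e25 FLOPs, ~69 days
```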

00:52:13Yan

And in terms of carbon emitted for training this—just for training that one model—it's around maybe 2,000 return tickets from JFK to London, so from New York to London. That's quite a lot. It's still negligible compared to how many flights there are per year and things like this. But if you think that every generation is going to use maybe 10 times more compute than the previous generation, you can see how in two, three, four generations that will become a real issue in terms of carbon emitted. As for the next models: as I said, every generation you can think of as roughly 10 times more FLOPs going into training the models.

00:53:03Yan

Great. Um, so pre-training summary: The idea is about predicting the next word on the internet. Data: around 10 trillion words go into training these models right now. In terms of time, it takes months. In terms of compute, more than $10 million. The bottleneck is data and computation. And some examples might be DeepSeek V4, DeepSeek V3, and Llama 4.

00:53:31Yan

Okay. So now we talked about pre-training. Let's talk about post-training. Again, I'll talk about the method, the data, and compute. So why do you want to do post-training? Well, language modeling—so what we do during pre-training—is really not about assisting users and about helping users. So language modeling is not what you want.

00:53:50Yan

And what I mean by that is that if you just take GPT-3 and you prompt it with "Explain the moon landing to a six-year-old in a few sentences," what it will do is... it has been trained on basically a large part of the internet. So it will say, "Well, that reminds me of maybe large lists of questions that people might ask." So instead of answering the question, it might just predict what is the next type of question, what is a similar type of question that people might ask. So actually what GPT-3 answers to you is: "Explain the theory of gravity to a six-year-old. Explain a theory of relativity to blah blah." So this really shows you that these models are really not optimized for predicting what you want. This is just about language modeling: predicting the next word.

00:54:35Yan

So the idea of classic post-training—also called instruction following or alignment—is about steering the model to be useful on real world tasks. If I ask "Explain the moon landing to a six-year-old in a few sentences," the same as before, I want ChatGPT or any model to give me a real answer. And the way that we basically do that is to maximize human preferences over the answers.

00:55:05Yan

In terms of data, probably around like between 5,000 and 500,000 problems. So much, much smaller scale than pre-training. The idea is that first you try to basically learn everything in the world, and then you try to optimize on very specific domains—which is in this case kind of just like instruction following and answering questions—with very few data points, because the model already knows everything. So it just needs to learn how to basically act or how to interact with the human. And this is really what made ChatGPT what it is. So since 2022, that's really when post-training became important.

00:55:48Yan

So that's the overview of this third stage that I told you about, which is the classic post-training. And then there's the second stage, which is about teaching the model to reason. That only happens in some models, for example Kimi and R1. And the idea is simply to optimize for answering the question correctly.

00:56:09Yan

So you will see, for example in R1, things like "thought for 24 seconds." Reasoning is about: how do you optimize for that? The data that you usually optimize on is basically any hard task with verifiable answers—things like math competitions or coding test cases—and you try to optimize for that. This really became important since o1 in 2024. And yeah, this is about this new paradigm of reasoning.

00:56:41Yan

I believe Noam from OpenAI will also come and tell you about reasoning, but at a very high level, the idea is that what we had before was train-time compute. I mentioned the scaling laws, which show that the more compute you pour into the run during training, the better your performance is. And what reasoning gives you is test-time compute: after training, you can also pour more compute into your model to get better performance. And that's kind of like humans. If you make me answer a question in a second, I will probably write a less thoughtful and less correct answer than if you gave me a week to answer the question. So the goal is test-time scaling.

00:57:29Yan

Let me put on a little bit more light. Great. So, post-training methods: I will talk about SFT and reinforcement learning.

00:57:39Yan

So the task is alignment. Let's say that you want to optimize... just as an example, let's say that we want to optimize the LLM to follow user instructions or some designer's desires. So this is the example from before which is like answering questions, or maybe you want the model to never answer specific type of questions. For example, if I ask "write a tweet describing how X people are evil," you might want your model not to answer that question.

00:58:12Yan

So the situation with post-training in general is that you actually know what you want these models to output. You do know the type of answers that you want to give to humans and the behavior you want your model to follow. But that behavior—those answers—is scarce and expensive: it's pretty expensive and slow to collect that type of data, even though in principle you could just go and ask humans for the correct answers to every question that you might want to ask.

00:58:51Yan

So the idea is that you know what you want your model to output, but it's expensive to collect that data. Pre-training is something where it's very easy to collect that data—you just take all of the internet essentially—but it's not really what you want, as we said. So the idea is that instead, given that one is scalable but it's not what you want, and the other one is not scalable but is what you want... what you can do is that you will basically take the pre-trained model that already learned about like grammar and different languages, and you will just fine-tune it—or like do some small optimization with the little amount of data that is in the format that you want. And this is what we call post-training.

00:59:37Yan

Okay, so there are two methods. The first one is Supervised Fine-Tuning (SFT). So the idea is again just to fine-tune the LLM with language modeling. So the exact same method as before, but you do it on desired answers. So instead of just predicting the next word, you predict the next word on answers that are the answers that you would want to give to humans.

00:59:59Yan

So "language modeling" means that it's again next word prediction, and "desired answers" is where the "supervised" in supervised fine-tuning comes from: you assume that you have access to the correct answer, which is why it's supervised.
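
A minimal sketch of what this looks like in practice: SFT uses the exact same next-word cross-entropy as pre-training, except that the loss is usually computed only on the desired answer tokens, with the prompt positions masked out. The token ids and the random logits below are placeholders standing in for a real tokenizer and model.

```python
import torch
import torch.nn.functional as F

# A (prompt, desired answer) pair, already tokenized with made-up ids.
prompt_ids = [101, 42, 7, 65]          # "Explain the moon landing ..."
answer_ids = [88, 12, 90, 3, 55]       # the human-written (or verified) answer

input_ids = torch.tensor(prompt_ids + answer_ids)

# Same next-word prediction as pre-training, but only supervise the answer:
# prompt positions get label -100, which cross_entropy is told to ignore.
labels = torch.tensor([-100] * len(prompt_ids) + answer_ids)

vocab_size = 1_000
logits = torch.randn(len(input_ids), vocab_size)   # stand-in for the model's forward pass

# Shift by one so that position t predicts token t+1, exactly as in pre-training.
loss = F.cross_entropy(logits[:-1], labels[1:], ignore_index=-100)
print(loss)
```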

01:00:17Yan

So how can we collect that data? There are many different ways. One is just to ask humans. And this was kind of the key from GPT-3 to InstructGPT. And here are some examples from OpenAssistant that did that in the open source, where you have a question and then you have answers that are written by humans.

01:00:40Yan

You can also do it differently. One problem with human data is that it's slow to collect and it's expensive. So one idea is to use an LLM to scale data collection. This is what we did for Alpaca, for example, in early 2023. We basically said, "Well, we don't have the money or the luxury of having humans provide our answers, but what we can do is use the best model from OpenAI at the time to predict the right answer, and then do supervised fine-tuning on the answers given by the OpenAI model." We did that on 52,000 answers, and that was one of the first—probably the first—instruction following LLMs in the open source. So that really started the attempts to replicate ChatGPT. And now this synthetic data generation is a whole field on its own, because some of these models are now actually just better than humans. So it's not only that human data collection is slow and expensive; it might just be that it's lower quality.

💬 0 comments
Add to My Notes
01:02:13Yan

So for SFT, there's another way of doing it. We've talked about two ways so far: asking humans, and asking an LLM to provide the answer. But the problem with "an LLM provides the answer" is that you're assuming you have access to an LLM that is smarter than the LLM you're training. And that was indeed the case when we were training Alpaca.

💬 0 comments
Add to My Notes
01:02:38Yan

But this is not the case if you're, for example, in the best closed labs which are training kind of the frontier models. Or even if you're in the open source and you're trying to train the best open source models, you might not have access to, or be able to distill, closed models.

💬 0 comments
Add to My Notes
01:03:00Yan

So what did DeepSeek R1 do? They were training the top open source model at the time. The idea is that you can use rejection sampling based on verifiers. What I mean by that is: you use an LLM to provide many different answers to a question, and you only keep an answer if it is correct in some sense—if it passes some test case or verification, or if it's preferred over the other answers.

💬 0 comments
Add to My Notes
01:03:36Yan

So the idea again is: you don't have an ideal LLM to generate data from and then do SFT on. But if you have access to verifiers, or ways of comparing different samples, you can roll out many samples, decide which one is better based on your verifier, and then do SFT on the sample the verifier picked. That's exactly what DeepSeek R1 did for its first stage of SFT.
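
As a rough sketch of that recipe (not DeepSeek's actual code), rejection sampling against a verifier can look like this; `generate_answers` and `verifier` are hypothetical stand-ins for your own sampling and checking logic:

```python
# Sketch of rejection sampling against a verifier: sample many candidate answers
# per prompt, keep only those the verifier accepts, and turn them into SFT pairs.
from typing import Callable

def collect_sft_data(prompts: list[str],
                     generate_answers: Callable[[str, int], list[str]],
                     verifier: Callable[[str, str], bool],
                     n_samples: int = 16) -> list[tuple[str, str]]:
    sft_pairs = []
    for prompt in prompts:
        candidates = generate_answers(prompt, n_samples)              # many rollouts per prompt
        accepted = [a for a in candidates if verifier(prompt, a)]      # keep only verified answers
        if accepted:
            sft_pairs.append((prompt, accepted[0]))  # could also keep several, or the best-ranked one
    return sft_pairs
```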

💬 0 comments
Add to My Notes
01:04:14Yan

Great. So what can we learn from SFT? What types of things can it teach? We already talked about instruction following. You can learn desired formatting or style—be more chatty, use emojis, things like that. You can learn tool use; if you're interested in that, I'd recommend reading the Kimi K2 paper and the xLAM paper, which use SFT at scale to learn how to use tools.

💬 0 comments
Add to My Notes
01:04:42Yan

You can learn some early reasoning—how to think before answering—which is exactly what we just discussed with DeepSeek R1 and its rejection sampling. And honestly, you can learn almost anything for which you have good input–output pairs.

💬 0 comments
Add to My Notes
01:04:58Yan

So SFT can either be the final stage for training a final model, or a preparation for the next stage, which is reinforcement learning. Given that SFT works pretty well, you often want to do it first to accelerate the RL stage, as we will see.

💬 0 comments
Add to My Notes
01:05:19Yan

So SFT pipelines can become pretty complex. I'm not going to go through this one in detail, but I want to give you a sense of how complicated it can be. This is about Kimi K2—I'd recommend reading that paper and how they use SFT to train for tool use. What they did is pretty involved, with LLMs that simulate users, simulate tools, and then the rejection sampling we talked about before.

💬 0 comments
Add to My Notes
01:05:56Yan

The idea is that they collected a lot of tools and also simulated many synthetic tools, with specifications of how each tool should be called. Then they have an agent LLM that interacts with an LLM simulating a user and another LLM simulating tool calls—because otherwise you might not have access to enough real tools to cover everything the model might want to call. Finally, they do rejection sampling on the rollouts generated by these three LLMs interacting with one another: the agent LLM, the user LLM, and the tool-simulating LLM. All this to say that these pipelines can become pretty complex but still work pretty well.

💬 0 comments
Add to My Notes
01:06:46Yan

Okay. So scalable data for SFT: how much data do you need? What's nice about SFT is that you don't need that much. For learning simple things like style and instruction following, maybe 10,000 examples is enough. The LIMA paper from 2023 showed that with roughly 1,000 curated examples you already learn the style and instruction-following behavior you want. If you want to train more complicated things like tool use and reasoning, you might need more; I believe R1 used around 800,000 samples, which is a good amount but still under a million.

💬 0 comments
Add to My Notes
01:07:29Yan

So yeah, the idea is that you don't need much SFT data for things the model already learned. My mental model is: for anything that was learned really well during pre-training and that you just want to surface during post-training—things like writing in bullet points or using emojis—you're really just specializing the model to one particular type of user it has already modeled, and for that you don't need much data. For things it has never seen during pre-training, or seen very little, you need much more data.

💬 0 comments
Add to My Notes
01:08:16Yan

Okay. That brings us to the second method, reinforcement learning (RL). The problem RL tries to solve is that SFT is behavior cloning: copying the behavior of humans or, as we saw, of other LLMs' outputs. And behavior cloning has several issues.

💬 0 comments
Add to My Notes
01:08:41Yan

One is that you're bound by the abilities of whoever you're copying—humans, or the LLMs you're distilling from. But even humans who can't write a better answer can often still say which of two answers they prefer. So with pure cloning you will always be capped at roughly the level of the demonstrations.

💬 0 comments
Add to My Notes
01:09:10Yan

The second issue is that SFT can actually teach hallucination. This is a pretty interesting effect: even if you're cloning correct answers, you might be teaching the model to hallucinate if the model itself did not know that the answer was correct.

💬 0 comments
Add to My Notes
01:09:28Yan

What do I mean by that? Imagine I ask a question that requires writing an introduction with some references. If the training answer provides a reference that the model does not know about, what you're really teaching the model is: "produce something that looks like a plausible reference," even if that reference was not in your pre-training corpus—even if you have no idea whether it exists. So you're teaching the model to make up plausible-sounding references. So hallucination is one issue. And a third issue is that collecting ideal answers can be pretty expensive.

💬 0 comments
Add to My Notes
01:10:19Yan

So the solution is that instead of doing behavior cloning with SFT, you can do reinforcement learning: instead of cloning the desired behavior, you maximize a reward that measures it.

💬 0 comments
Add to My Notes
01:10:33Yan

So I would really recommend reading the DeepSeek R1 paper and the Kimi K2 papers that are some of the best papers out there in the open source. And the key here... the key in reinforcement learning is to decide: what are you maximizing? What is the reward that you're maximizing?

💬 0 comments
Add to My Notes
01:10:53Yann Dubois

There are different things that R1 optimized for. One is rule-based rewards, things like string matches: if you have closed-ended question answering, you might say the answer is correct only if it exactly matches X. Or you could have test cases for coding. That's rule-based rewards.

💬 0 comments
Add to My Notes
01:11:17Yann Dubois

You can also have reward models trained to predict human preferences—we'll talk a bit about that—where you train a classifier to predict whether an answer is good or bad as judged by a human, and then optimize against it. Or you might optimize against an LLM as a judge: you take the best LLM you have and simply ask it whether the answer is correct or not.

💬 0 comments
Add to My Notes
01:11:42Yann Dubois

So here you see a particular example: the prompt says "write some Python code..." and the model generates different answers. Given that the prompt asks for Python code, you can apply rule-based verification. If the answer is "here's a joke about frogs," that's not code—I asked for Python code—so it's wrong. Then, for the answers that are code, you check whether they pass the test cases and keep only the ones that pass. The idea is to reinforce what passes: you tell the model, do more of the things I gave you a positive reward for.
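
A toy version of such a rule-based check might look like the following; the test cases are made up for illustration, and a real pipeline would sandbox the execution:

```python
# Sketch of a rule-based reward for a coding prompt: reject anything that is not
# runnable Python, then check some (hypothetical) test cases.
def rule_based_reward(answer: str) -> float:
    # 1. Is it even Python? (A joke about frogs fails here.)
    try:
        compile(answer, "<candidate>", "exec")
    except SyntaxError:
        return 0.0

    # 2. Does it pass the test cases for this problem? (example tests, an assumption)
    namespace: dict = {}
    try:
        exec(answer, namespace)                 # caution: sandbox this in practice
        assert namespace["add"](2, 3) == 5
        assert namespace["add"](-1, 1) == 0
    except Exception:
        return 0.0
    return 1.0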

💬 0 comments
Add to My Notes
01:12:31Yann Dubois

Great. So yeah, as I said before, I would recommend reading the DeepSeek R1 paper. Basically, for reasoning prompts—so like math questions, coding questions, and some logical reasoning—they use these rule-based verifiers that we just talked about. And then for general prompts like translation, factual question answering, and writing requests—things that are more like long-form text—they basically use a reward model that was trained to predict human preferences and they try to optimize that.

💬 0 comments
Add to My Notes
01:13:04Yann Dubois

So what they do is that they start with this SFT checkpoint. So they use a model, they do some SFT like this. The model is already pretty good at generating things that are often correct, and then you just do this reinforcement learning pipeline where you try to optimize the number of times that your verifier says you're correct.

💬 0 comments
Add to My Notes
01:13:22Yann Dubois

In terms of algorithms, the most common one in the open source is GRPO, used by DeepSeek R1. The idea is actually pretty simple: you take your policy model—the LLM, usually coming out of SFT—and ask it to provide multiple outputs for the same question. Then—let's skip the reference model for now—a reward model or verifier gives each output a reward telling you whether it was correct, how good it was, and so on.

💬 0 comments
Add to My Notes
01:13:59Yann Dubois

You get a reward for each output, and then you do a group-wise computation to get your advantages. The way to think about it: you normalize the rewards within the group so you know which outputs were relatively good and which were relatively bad, and then you backpropagate to tell your policy: do more of the things that scored well.

💬 0 comments
Add to My Notes
01:14:29Yann Dubois

The part we skipped is the reference model. You usually add a KL-divergence term, which tells the policy model during training, "don't move too far from the outputs of my base model." This is really just a safeguard: reinforcement learning can take you to places that are not ideal—for example, the reward hacking we talked about—so you say: optimize as well as you can, but within limits on how far you can drift.
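
Here is a deliberately simplified sketch of a GRPO-style update, assuming hypothetical helpers `policy_logprob` and `reference_logprob` that return the summed log-probability of an answer under each model. It keeps only the core ideas (group-normalized advantages plus a KL penalty) and omits PPO-style clipping and other details of the real algorithm:

```python
# Simplified GRPO-style loss: sample a group of answers per prompt, score them,
# normalize rewards within the group to get advantages, and push the policy toward
# high-advantage answers while penalizing drift from a reference model.
import torch

def grpo_style_loss(prompt, answers, rewards, policy_logprob, reference_logprob, beta=0.04):
    rewards = torch.tensor(rewards, dtype=torch.float32)
    # Group-normalized advantages: which answers were better or worse than average.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

    logps = torch.stack([policy_logprob(prompt, a) for a in answers])        # requires grad
    with torch.no_grad():
        ref_logps = torch.stack([reference_logprob(prompt, a) for a in answers])

    # "Do more of what got a high reward" + "don't drift too far from the reference".
    policy_term = -(advantages * logps).mean()
    kl_term = (logps - ref_logps).mean()     # crude estimate of KL(policy || reference)
    return policy_term + beta * kl_term
```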

💬 0 comments
Add to My Notes
01:15:02Yann Dubois

Yeah. And so this is not super important, but basically, if you know a little bit of reinforcement learning, DeepSeek R1 optimizes GRPO, which really just uses the Monte Carlo estimate for computing the advantage. And Kimi k1.5 and Kimi k2 use a similar loss.

💬 0 comments
Add to My Notes
01:15:20Yann Dubois

Okay, one thing I want to emphasize is that in reinforcement learning, infrastructure is really, really key. The reason is that, as you saw with GRPO, sampling is a major bottleneck: for every question you have to sample multiple outputs. And for agents—given that this is an agents class—it gets even worse, because you might have very long agentic rollouts, and you don't want to block all of your training compute waiting on those long rollouts.

💬 0 comments
Add to My Notes
01:16:04Yann Dubois

So Kimi did a lot of optimization here, and again I'd recommend reading their papers. For example, for long rollouts, Kimi decided to pause them: if a rollout takes more than a certain amount of time, they pause it, update the weights, and then resume the rollout at the next step.

💬 0 comments
Add to My Notes
01:16:28Yann Dubois

Another issue with agents is that environment feedback can be slow. If your agent really interacts with the world and calls a lot of APIs, you might not be using your GPUs at all—you're not even doing rollouts, you're just waiting for the environment to respond. Kimi gets around that by running a lot of concurrent rollouts, so when one rollout is waiting on an environment response, the GPUs can work on something else, and by using dedicated microservices that can spin up and scale.

💬 0 comments
Add to My Notes
01:17:06Yann Dubois

The way Kimi did it is that on every pod they have a train engine, a checkpoint engine that broadcasts the weights to all the other pods, and an inference engine that does the sampling. What is important is that everything is colocated: all the engines sit on the same pod to avoid communication overhead. All this to say there's a lot of optimization on the infra side, and infra is really key here. Communicating the weights takes them less than 30 seconds, and again, everything runs on the same pod.

💬 0 comments
Add to My Notes
01:17:47Yann Dubois

Okay. So let's talk about Reinforcement Learning from Human Feedback (RLHF). So far we've talked about reinforcement learning for reasoning, where you usually have ground-truth verifiers. RLHF is reinforcement learning when you don't have ground truth. This is really what made ChatGPT work in 2022. The idea is that instead of SFT, where you clone the behavior of humans, you want to maximize their preferences.

💬 0 comments
Add to My Notes
01:18:21Yann Dubois

As I said, this is what made ChatGPT, and the pipeline is the following—this is how the original InstructGPT algorithm worked. You have an instruction, a question that goes to a model, and you ask the model to provide two answers. Usually the model is already pretty good; it's an SFT model. Then you ask human labelers to select which of the two answers is better, and you essentially train the model to generate more of the kind of answer the humans preferred.

💬 0 comments
Add to My Notes
01:19:01Yann Dubois

So there are different algorithms; I'm not going to go through them. PPO and DPO are two of them for doing that. But basically, this is just reinforcement learning where your reward is actually given by a reward model that was trained to classify human preferences.
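
A reward model of this kind is typically trained with a pairwise (Bradley-Terry-style) loss; here is a minimal sketch where `reward_model` is a hypothetical module mapping a prompt and an answer to a scalar score:

```python
# Sketch of training a reward model on pairwise human preferences: the chosen
# answer should get a higher score than the rejected one.
import torch
import torch.nn.functional as F

def preference_loss(reward_model, prompt, chosen, rejected):
    r_chosen = reward_model(prompt, chosen)      # scalar tensor
    r_rejected = reward_model(prompt, rejected)  # scalar tensor
    # Maximize P(chosen preferred) = sigmoid(r_chosen - r_rejected).
    return -F.logsigmoid(r_chosen - r_rejected)
```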

💬 0 comments
Add to My Notes
01:19:19Yann Dubois

Here you see some results that are pretty old by now, from learning to summarize. The y-axis is essentially how often the model's summary is preferred to the human reference summary. The pre-trained model does poorly, SFT improves on it substantially, and PPO—the reinforcement learning step—does even better. That's the usual ordering: pre-training is good, SFT is better, RL is even better. And the same thing here with AlpacaFarm, a paper we did on optimizing human preferences: the two RL algorithms perform similarly to each other, and both beat SFT and the pre-trained model.

💬 0 comments
Add to My Notes
01:20:11Yann Dubois

Okay, human data. As I said, the data comes from humans, which is expensive and takes a long time to collect. You have to write extremely detailed rubrics to tell annotators what even counts as a good answer versus a bad one. A lot of work goes into this: collecting data is hard.

💬 0 comments
Add to My Notes
01:20:43Yann Dubois

Challenges with human data: as I just said, it's slow and expensive. Second, it's hard to get annotators to really focus on the content of the answers. When you ask most humans what is good and what is bad, they tend to focus on form and style—things like length—which is usually not what you want to optimize your LLM for. Also, depending on whom you ask—the distribution of annotators—you will get different behaviors, different political views, different opinions on many things, so you have to be mindful about that. And there are crowdsourcing ethics involved: who are you asking to label your data? So there are a lot of challenges with human data.

💬 0 comments
Add to My Notes
01:21:40Yann Dubois

Okay. One way to reduce this dependency on human data is exactly what I told you about before with SFT: you can ask an LLM, instead of humans, to provide the preferences. This is from the AlpacaFarm work we did two years ago, which shows on the x-axis the amount of dollars you need to spend to collect the data, and on the y-axis the agreement with humans.

💬 0 comments
Add to My Notes
01:22:13Yann Dubois

And you see that humans—I believe in blue here—cost around $300 per 1,000 examples, and the agreement between different humans is around 66%. With LLMs, we could divide the cost by 10 or even 30, and that was two years ago; right now it would be far less. And the LLM judges were already better than individual humans at predicting the majority human preference. So it works surprisingly well, and you can often use this trick of using LLMs instead of humans. But again, it's harder when you're at the frontier and don't have a better LLM to rely on.

💬 0 comments
Add to My Notes
01:23:00Yann Dubois

Okay, and then evaluation. I'll talk about this only briefly, but there are basically two types: closed-ended evaluation and open-ended evaluation. One thing to note: evaluation is key. It is one of the most important things in machine learning and AI in general.

💬 0 comments
Add to My Notes
01:23:29Yann Dubois

There are three reasons. First, it's key for identifying improvements: quantifying the progress you're making, deciding what to change, which hyperparameters to select, and so on. Second, it lets you select which model to use for your application: if I have a specific application in mind, I have all these models to choose from and need to know which one to pick. And finally, evaluation tells you whether your model is ready to be put in production. Even if your model is the best one currently available, is it good enough for your application? For practical use cases you really need good evaluations.

💬 0 comments
Add to My Notes
01:24:29Yann Dubois

So, closed-ended evaluation: the idea is that if you can turn your problem into one with a few possible answers, then you can automatically verify whether the answer is correct. For example, if you turn your eval into multiple-choice question answering, you can simply ask the LLM to answer A, B, or C, compare with the correct choice, and compute accuracy. This is, for example, what the MMLU eval does.

💬 0 comments
Add to My Notes
01:25:05Yann Dubois

There are still challenges with closed-ended evaluation. One, it's sensitive to prompting: different ways of prompting your model will give different answers. Two, there can be train-test contamination: if your model was trained on the eval—and MMLU, for example, is all over the internet by now—it will look much better than it actually is. So that's closed-ended evaluation.

💬 0 comments
Add to My Notes
01:25:34Yann Dubois

I really want to focus on open-ended evaluation, because despite those challenges, closed-ended evaluation is much easier. The question for open-ended evaluation is: how do we evaluate something like ChatGPT, a general-purpose LLM? These instruction-following models can be applied to so many different things—coding, chatting, summarization, and more—so you want an eval that covers all of these use cases.

💬 0 comments
Add to My Notes
01:26:10Yann Dubois

The second thing is that the task is open-ended: the answers are very long, so you can't do accuracy-based evaluation where you check whether the answer matches the correct answer verbatim. String matching doesn't tell you whether you're correct. So one idea for open-ended evaluation is to simply show two answers to a human and ask which one is preferred.

💬 0 comments
Add to My Notes
01:26:55Yann Dubois

This is what Chatbot Arena by LMSYS did: you ask humans to blindly interact with two chatbots and rate which one is better. For open-ended tasks where there isn't a single answer and the answers are long, it's much easier to ask humans to rank outputs than to compare against a gold answer—because there is no gold answer.

💬 0 comments
Add to My Notes
01:27:30Yann Dubois

The problem is that using humans is again costly and slow. So, just as before, you can use an LLM instead of a human. This is what we did with AlpacaEval two years ago, and many others followed. The steps are: for each instruction, you ask a baseline—that could be a human or a model—to provide an answer, and you ask the model you're evaluating to provide an answer. Then you ask another LLM which of the two answers is better, count how often your model's answer wins against the baseline, and report that as a win rate—the probability of winning.
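
A minimal sketch of that win-rate computation, where `judge` is a hypothetical function that asks an LLM which answer is better; real setups randomize answer order and use a carefully designed judge prompt:

```python
# Sketch of an LLM-as-judge win-rate evaluation over a set of instructions.
def win_rate(instructions, model_answers, baseline_answers, judge) -> float:
    wins = 0
    for instr, model_ans, base_ans in zip(instructions, model_answers, baseline_answers):
        preferred = judge(instruction=instr, answer_a=model_ans, answer_b=base_ans)
        wins += preferred == "A"   # "A" = the model being evaluated
    return wins / len(instructions)
```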

💬 0 comments
Add to My Notes
01:28:20Yann Dubois

AlpacaEval was one of the first evals doing that, and despite being much cheaper than Chatbot Arena, it had really high Spearman correlation with it. So LLMs can be really good judges for evaluating your models. At the time it took less than three minutes and cost less than $10 to run; now it would probably cost much less.

💬 0 comments
Add to My Notes
01:28:59Yann Dubois

Great. Okay. So I think I'm getting at the end. I do want to tell you a little bit about systems and infra because, as I said, if you really understand the fact that scaling is what matters, then the natural question—the natural thing that you should be spending time on—is making sure that your models, your training, can scale well.

💬 0 comments
Add to My Notes
01:29:30Yann Dubois

So yeah, the problem is that everyone is bottlenecked by compute. One idea you might have is: if you're bottlenecked by compute and you know more compute gives better models, why not just buy more GPUs and train on those? There are a few reasons why you can't simply do that. One, GPUs are expensive—and not only expensive, they're scarce: even if you have the money, it can be hard to get access to the best GPUs. Then there are physical limitations: with a lot of GPUs, communication between them can really slow down your training. So you need to optimize your systems and make sure training is as efficient as possible on every GPU you have—good resource allocation and optimized pipelines.

💬 0 comments
Add to My Notes
01:30:22Yann Dubois

Okay. I'll give you an extremely brief overview of GPUs, just so you get a sense of what matters and what you're actually optimizing for in these runs. Systems 101: GPUs. The difference between GPUs and CPUs is that GPUs are massively parallel: they apply the same instructions across many threads, each operating on different inputs. So whereas CPUs are built for low latency on a few threads, GPUs are optimized for throughput.

💬 0 comments
Add to My Notes
01:31:12Yann Dubois

As I said, GPUs are massively parallel. The second thing is that GPUs are heavily optimized for matrix multiplications. GPUs are Graphics Processing Units, and computer graphics requires extremely fast matrix multiplication, so from the early days GPU makers optimized for it. They have dedicated cores that make matrix multiplication very fast—on recent hardware, on the order of 10 times faster than other floating-point operations. Here you see different generations of GPUs and their speeds, and you see that matrix multiplication is much faster, especially recently, than non-matmul floating-point operations.

💬 0 comments
Add to My Notes
01:32:19Yann Dubois

Another important thing to understand about GPUs is that compute is no longer the bottleneck. If you look at peak hardware FLOPs over time, versus how memory and communication bandwidth improved over time, you see that compute has improved much faster than memory and communication.

💬 0 comments
Add to My Notes
01:33:07Yann Dubois

What that means is that GPUs now have more compute than they have memory and communication to match. In other words, the bottleneck is not performing the computation; it's keeping the processor fed with data. You need to stream data to the compute units as fast as possible, and feeding the data—not the computation itself—is the bottleneck. That's a very important thing to understand when optimizing your pipelines.

💬 0 comments
Add to My Notes
01:33:44Yann Dubois

As a result, if you look at this paper from 2020 that analyzes where the time goes when running a transformer, you see that tensor contractions—basically matrix multiplications—account for most of the FLOPs. But in terms of runtime, only about two-thirds of the time is spent on that majority of the compute. Things like element-wise operations and normalization require very few floating-point operations, yet take a surprisingly large share of the time, because you still need to move the data to and from GPU memory even when the computation itself is cheap.

💬 0 comments
Add to My Notes
01:34:51Yann Dubois

Okay. The last thing you need to know about GPUs is that there's a deep memory hierarchy. The closer a memory is to the cores—the units that actually perform the computation—the faster it is, but the smaller it is; the further away, the larger but slower. So you have registers and the shared memory/L1 cache right next to the cores, then the L2 cache, and then global memory, which is large but far from your registers and compute units.

💬 0 comments
Add to My Notes
01:35:40Yann Dubois

So yeah, there's this memory hierarchy. The metric we try to optimize when tuning our runs and systems is Model FLOPs Utilization, or MFU for short: the ratio of the throughput you observe to the theoretical peak. Nvidia tells you the maximum FLOPs the hardware can do, and you measure how much of that you actually achieve. An MFU of one would mean the processor is never starved: at every moment, something is being computed. To give you a rough sense of the numbers, if you're at 50% you're in really, really good shape; even big companies work hard to push past 15 or 20% toward that 50%.
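
As a rough sketch, MFU can be estimated with the common approximation of about 6 FLOPs per parameter per training token; the numbers in the example below are made up for illustration:

```python
# Back-of-the-envelope MFU estimate for a dense transformer.
def mfu(n_params: float, tokens_per_second: float, peak_flops_per_second: float) -> float:
    achieved_flops = 6.0 * n_params * tokens_per_second   # ~6 FLOPs per param per token
    return achieved_flops / peak_flops_per_second

# Example: a 7B-parameter model at 10k tokens/s on hardware with ~1e15 peak FLOP/s.
print(f"MFU ~ {mfu(7e9, 1e4, 1e15):.0%}")   # ~42% in this made-up example
```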

💬 0 comments
Add to My Notes
01:36:52Yann Dubois

Great. So I want to give you a very quick overview of like things that you might want to do for optimizing your runs. Just to give you a sense of things that people do for optimizing this compute and making sure that your runs are scalable. One thing that you might do is low precision operations. So the idea is that if you use fewer bits for every data point that goes in your processor, you will have faster communication and lower memory consumption.

💬 0 comments
Add to My Notes
01:37:34Yann Dubois

As I said, since the bottleneck is not compute but memory and communication, you can simply decrease the precision of the data you move around: you push more through the same bottleneck and use less memory. For deep learning, the exact decimal precision doesn't matter much except for a few operations, because training is already noisy—stochastic gradient descent injects a lot of noise anyway.

💬 0 comments
Add to My Notes
01:38:12Yann Dubois

So matrix multiplications are usually done in BF16 rather than FP32—you halve the precision. A very common way to do this is Automatic Mixed Precision (AMP) during training: the weights are stored in FP32, but before the computation they are cast to BF16, so the matmuls and activations run in BF16. You get lower memory use and speedups from faster communication, and the gradients are computed in BF16 as well. At the end, the update is applied back to the FP32 master weights, so even small updates are still reflected in the weights at high precision.
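
A minimal sketch of this in PyTorch, assuming a CUDA device: the master weights stay in FP32 while the forward pass runs under autocast in BF16 (model, data, and shapes are placeholders):

```python
# Sketch of mixed-precision training with BF16 autocast.
import torch

model = torch.nn.Linear(1024, 1024).cuda()          # FP32 master weights
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(32, 1024, device="cuda")
target = torch.randn(32, 1024, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = torch.nn.functional.mse_loss(model(x), target)  # forward runs in BF16

loss.backward()       # backward pass; the optimizer then updates the FP32 weights
optimizer.step()
optimizer.zero_grad()
```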

💬 0 comments
Add to My Notes
01:39:18Yann Dubois

Great. There are other optimizations, for example operator fusion. Again, the idea is that communication is slow. If you write eager PyTorch, every new line of element-wise operations reads its inputs from global memory and writes its output back. So if you do something like x1 = cos(x), you read x from global memory and write x1 back; then on the next line, x2 = cos(x1), you read x1 from global memory again and write x2 back. That can be very slow because of all this traffic between global memory and the compute units.

💬 0 comments
Add to My Notes
01:40:17Yann Dubois

This is a schematic version of what's happening: everything lives in DRAM, you send data to the compute units, and after every line you send the result back to DRAM, over and over. That's the naive behavior you get from a plain PyTorch function. But once you realize communication is the bottleneck, there's a much better way: communicate once, do all the operations, then communicate the result back. That's what fused kernels do. The compute is fast; the round trips to DRAM are slow, so fusing means fewer slow round trips. And this is essentially what torch.compile does to your code: it fuses operations together.
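
A tiny sketch of the idea with torch.compile, which can fuse chains of element-wise operations into fewer kernels; the function and shapes are arbitrary examples:

```python
# Eager PyTorch writes each intermediate back to global memory; torch.compile can
# fuse these element-wise ops so the data makes fewer round trips.
import torch

def chain_of_elementwise_ops(x: torch.Tensor) -> torch.Tensor:
    x1 = torch.cos(x)        # in eager mode, each line reads from and writes to global memory
    x2 = torch.cos(x1)
    return x2 * 2.0 + 1.0

fused = torch.compile(chain_of_elementwise_ops)   # compiles and fuses into fewer kernels
y = fused(torch.randn(1_000_000))
```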

💬 0 comments
Add to My Notes
01:41:23Yann Dubois

Okay. Tiling. I know it's becoming long so I'll just quickly talk through that. The idea is that the order in which you perform operations will matter a lot because of communication. So what I mean by that is that you can group and order threads that are performing some computation to minimize the number of times that you will communicate with global memory. So I'll give you an example for matrix multiplication.

💬 0 comments
Add to My Notes
01:41:56Yann Dubois

Here is the very naive way of doing matrix multiplication, the way you learn it at school: to produce one output element, you take a full row of one matrix and a full column of the other, multiply them element-wise, and sum. Memory-wise, one thread walks through that entire row and that entire column, and another thread does the same for a different row and column. The important point is that you rarely re-read the same values from cache.

💬 0 comments
Add to My Notes
01:42:57Yann Dubois

In contrast, you can split the matrix multiplication into tiles to reuse memory. For example, one thread, instead of working with a whole row against a whole column, works with a small block of values against another small block. It's a bit hard to explain from this diagram alone, but the point is that a number like n00 gets used twice: it multiplies both m00 and m10.

💬 0 comments
Add to My Notes
01:43:48Yann Dubois

So for one number read from memory, I do two operations, whereas before one read gave me only one operation. That means fewer reads from global memory: you do more work per unit of data read, or the same work with fewer reads. You can still implement the full algorithm this way—each thread multiplies its tiles element-wise, produces partial sums, and the partial sums are added together at the end.

💬 0 comments
Add to My Notes
01:44:36Yann Dubois

Anyway, the exact algorithm isn't that important. What matters is that the order and grouping of operations can dramatically change how often you have to read from global memory. Tiling is one way of grouping work within a thread so that less data supports more computation—you reuse values already sitting in cache.
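
Here is a toy NumPy illustration of tiling; real GPU kernels do the same thing with tiles staged in shared memory rather than Python loops:

```python
# Toy tiled matmul: compute C = A @ B block by block so each tile of A and B is
# loaded once and reused for a whole block of outputs, instead of re-reading a
# full row and column per output element.
import numpy as np

def tiled_matmul(A: np.ndarray, B: np.ndarray, tile: int = 2) -> np.ndarray:
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            for p in range(0, k, tile):
                # Each small tile is read once and used for tile*tile outputs.
                C[i:i+tile, j:j+tile] += A[i:i+tile, p:p+tile] @ B[p:p+tile, j:j+tile]
    return C

A, B = np.random.randn(4, 4), np.random.randn(4, 4)
assert np.allclose(tiled_matmul(A, B), A @ B)
```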

💬 0 comments
Add to My Notes
01:45:15Yann Dubois

Great. Flash Attention is one pretty famous optimization for making attention faster, and it essentially combines the two things we just talked about—kernel fusion and tiling—plus one more: recomputation. Sometimes it's cheaper to redo a computation than to read the stored values back from memory, so instead of saving everything, you recompute some values when needed. Flash Attention v1 got around 1.7x speedups just by combining these ideas. All this to say that systems really matter: you can get huge speedups at zero ML cost—it's completely ML-neutral, just a change in the order in which you perform operations.

💬 0 comments
Add to My Notes
01:46:27Yann Dubois

Okay, I think I'm arriving at the end, but I do want to briefly talk about parallelization—the last big systems topic. The problem is that models are very big, and a big model may not fit on one GPU; and in any case you want to use as many GPUs as possible to make training fast. So the question becomes: how do you spread the work across many GPUs, and how do you fit the model onto them? The idea is that you can split both your memory and your computation across GPUs.

💬 0 comments
Add to My Notes
01:47:36Yann Dubois

Okay. A bit of background: to naively train a model with P billion parameters in FP32 with Adam, you need roughly 16P GB of memory, i.e. 16 bytes per parameter. That's 4 bytes per parameter for the weights (4P GB); the Adam optimizer stores two extra values per parameter (the first and second moment estimates), so another 2 × 4P = 8P GB; and during backpropagation you also store the gradients, another 4P GB. So 4P + 8P + 4P = 16P GB, which means training a 7-billion-parameter model naively takes about 112 GB of memory—really huge.
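
The same arithmetic as a tiny helper, using the 16-bytes-per-parameter rule of thumb from above:

```python
# Back-of-the-envelope memory for naive FP32 training with Adam:
# 4 bytes/param weights + 4 bytes/param gradients + 8 bytes/param Adam moments.
def training_memory_gb(n_params: float) -> float:
    bytes_per_param = 4 + 4 + 8
    return n_params * bytes_per_param / 1e9

print(f"7B model: ~{training_memory_gb(7e9):.0f} GB")   # ~112 GB, matching the slide
```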

💬 0 comments
Add to My Notes
01:49:09Yann Dubois

So the idea here is that you can optimize that by... yeah, the goal at least is to use more GPUs and to optimize your training. So let's say that you have four GPUs here and you want to basically have every GPU working simultaneously on your dataset. One naive way that you can do it is that you can copy the model and the optimizer on every GPU. You can split the data, and then you can basically have every GPU working on the same model but different set of data because you split up your data.

💬 0 comments
Add to My Notes
01:49:59Yann Dubois

Then after each step, you communicate the gradients across GPUs and sum them, and that gives you exactly the gradient you would have gotten from training on all four shards of data at once. So every GPU works on its own batch, then you exchange and sum the gradients, and you end up with the same gradient as if you had trained with four times the batch size.
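
A conceptual sketch of that gradient synchronization with torch.distributed; the process-group setup, model, loss function, and data are assumed to exist already:

```python
# Sketch of one data-parallel step: each rank runs the same model on its own shard
# of the batch, then gradients are all-reduced so every replica applies the same update.
import torch
import torch.distributed as dist

def data_parallel_step(model, optimizer, local_batch, loss_fn):
    loss = loss_fn(model(local_batch["x"]), local_batch["y"])
    loss.backward()
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)  # sum gradients across GPUs
            p.grad /= world_size                            # average, as if one big batch
    optimizer.step()
    optimizer.zero_grad()
```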

💬 0 comments
Add to My Notes
01:50:36Yann Dubois

So the benefit is that now you can use all these GPUs because now you can use four times more GPUs than before. So it's four times faster than before. The negative aspect is that here you have literally no memory gains because now if your model, for example, didn't fit on one GPU, it still doesn't fit on a single GPU right now. And also here we said 7B models require like 112 GB of memory. Here it means that you really need to have 112 GB of memory on every GPU. So there's no memory gains.

💬 0 comments
Add to My Notes
01:51:16Yann Dubois

So how would you actually get memory gains? One way is to have each GPU hold and update only a subset of the weights, and communicate them when needed. This is what we call sharding, and one instantiation is the ZeRO paper. Here you see the baseline: 4P GB for parameters, 4P GB for gradients, and 8P GB for optimizer states.

💬 0 comments
Add to My Notes
01:52:12Yann Dubois

ZeRO has different levels of sharding. The first thing you can shard is the optimizer state: every GPU holds only a subset of the optimizer states and communicates them when needed. In the paper's example this cuts the memory requirement from about 120 GB to about 31 GB—nearly a 4x decrease. Then you can do the same thing for the gradients and for the parameters themselves, so that every GPU takes care of a different subset of the weights.

💬 0 comments
Add to My Notes
01:52:57Yann Dubois

Okay, so that was data parallelism; now let's talk about model parallelism. The problem with data parallelism is that it requires your batch size to be at least as large as your number of GPUs. Say you have a batch size of 16: with four GPUs, every GPU gets a batch of four (16 divided by 4). But what if I want to use 32 GPUs? How do I split that data across 32 GPUs?

💬 0 comments
Add to My Notes
01:53:40Yann Dubois

The idea is that each GPU can take care of applying a specific subset of the parameters rather than updating them. With the data parallelism we just saw, each GPU can take care of updating specific parameters; with model parallelism, each GPU takes care of applying specific parameters—performing the actual operations of part of the model.

💬 0 comments
Add to My Notes
01:54:15Yann Dubois

For example, in pipeline parallelism, each GPU holds different layers: layer one on GPU 1, layer two on GPU 2, and so on. Data passes through the first layer on GPU 1, the activations are sent to GPU 2, which applies the second layer, then GPU 3, etc. That's pipeline parallelism; I'm going to skip the details.

💬 0 comments
Add to My Notes
01:54:48Yann Dubois

And then there's tensor parallelism: instead of giving every GPU a different layer, you split the matrices inside a layer across GPUs. For example, when you multiply a matrix by a vector, you can split the matrix (and the computation) in two: one GPU applies its half of the matrix, another GPU applies the other half, and you aggregate the results at the end. So pipeline parallelism puts different layers on different GPUs, while tensor parallelism splits the weights within a layer across GPUs.
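
A toy single-process illustration of the idea, splitting the weight matrix column-wise; real implementations (e.g. Megatron-style) place each shard on a different GPU and gather the partial outputs with collective operations:

```python
# Toy tensor parallelism: split a weight matrix column-wise across two "devices",
# let each compute its slice of the output, then concatenate the results.
import torch

x = torch.randn(8, 512)            # activations
W = torch.randn(512, 1024)         # full weight matrix

W0, W1 = W[:, :512], W[:, 512:]    # each shard would live on a different GPU in practice
y0 = x @ W0                        # computed on "GPU 0"
y1 = x @ W1                        # computed on "GPU 1"
y = torch.cat([y0, y1], dim=-1)    # gather the partial outputs

assert torch.allclose(y, x @ W, atol=1e-4)
```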

💬 0 comments
Add to My Notes
01:55:46Yann Dubois

Great. Okay. And the last example for system optimization is that models are really huge. So instead of kind of splitting up your model weights onto different GPUs, what you can have is you can say, "Well actually not every data point has to go through every parameter," and this is what we call sparsity. So a very common architecture that is sparse is the Mixture of Experts (MoE) that basically says only some parameters will be active for some types of datasets or data points.

💬 0 comments
Add to My Notes
01:56:21Yann Dubois

The idea is that a data point comes in and only goes through some subset of the parameters, not all of them. This makes parallelism and multi-GPU training much easier, because different GPUs can hold the parameters needed for different data points. This gets a bit into the weeds, but if you know about transformers: you have a linear (MLP) layer at some point, and you can split it into several expert linear layers, route different data points to different experts, and have each GPU host different experts.
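
A minimal sketch of MoE routing with a top-k router; sizes are placeholders, and real MoE layers add load-balancing losses, capacity limits, and expert parallelism across GPUs:

```python
# Minimal Mixture-of-Experts layer: a small router picks the top-k experts per token,
# so each token only goes through a subset of the parameters.
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=4, top_k=1):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_experts))
        self.top_k = top_k

    def forward(self, x):                                   # x: [tokens, d_model]
        scores = self.router(x).softmax(dim=-1)             # routing probabilities
        topv, topi = scores.topk(self.top_k, dim=-1)        # chosen experts per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = (topi == e).any(dim=-1)                  # tokens routed to expert e
            if mask.any():
                weight = topv[mask][topi[mask] == e].unsqueeze(-1)
                out[mask] += weight * expert(x[mask])       # only these tokens use expert e
        return out

y = TinyMoE()(torch.randn(10, 64))
```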

💬 0 comments
Add to My Notes
01:57:10Yann Dubois

If you didn't follow the last two slides, that's okay—they're not the most important part. What I do want to stress is that a lot of work goes into systems optimization and getting the most out of your compute. The techniques we saw were tiling (ordering and grouping your operations), sparsity (not sending every data point through every parameter), and parallelism (using more GPUs). And yes, I think that's basically it.

💬 0 comments
Add to My Notes
01:57:57Yann Dubois

Great. So we're done. There are no questions today because, as I said, this is a re-recording of the video. I know this was pretty long, and I'm starting to get a little tired, but I hope it was useful, and good luck for the rest of the class.

💬 0 comments
Add to My Notes