[Full Workshop] Reinforcement Learning, Kernels, Reasoning, Quantization & Agents — Daniel Han
AI EngineerDisclaimer: The transcript on this page is for the YouTube video titled "[Full Workshop] Reinforcement Learning, Kernels, Reasoning, Quantization & Agents — Daniel Han" from "AI Engineer". All rights to the original content belong to their respective owners. This transcript is provided for educational, research, and informational purposes only. This website is not affiliated with or endorsed by the original content creators or platforms.
Watch the original video here: https://www.youtube.com/watch?v=OkEGJ5G3foU
Hello guys. Hello, hello, hello. Yes, sorry for being a bit late—there was a lot of traffic. But welcome to the AI Engineer World's Fair. Thanks for coming to my session. Today, we're going to talk about a deep dive into RL kernels, agents, and quantization.
You might know me, or you might not know me, but I'm Daniel. My brother is somewhere... yeah, somewhere. But thanks for coming again. We also have stickers and other random stuff later after the talk.
On Twitter, we tweet a lot. We did a gradient accumulation bug fix last year. We introduced something called async offloaded gradient checkpointing. We also work with the Hugging Face, Google, Meta, Mistral, and Phi teams to fix bugs in their open-source models like Gemma, Llama, Mistral, Phi, and more.
So if you want to follow the latest updates in AI, you'd better follow us. We tweet about random stuff. You might even find out when next models might be released—we sometimes tell people approximately, so that might be very interesting.
We also do open-source contributions to the entire open-source ecosystem. For example, we contribute sometimes to `llama.cpp`. We work with the Qwen team and Mistral on their releases. We also do, you know, $FP4$ bug fixes, Llama 4 bug fixes, which increase accuracy by a bit. So definitely utilize some of the newer uploads that we do, which fix bugs all the time.
We just surpassed 10 million monthly downloads on Hugging Face. And yeah, we also have a GitHub package with 40,000 GitHub stars. Essentially, we make fine-tuning faster and reduce memory usage. That's the GitHub package; definitely check that out.
There are free Colab notebooks. I'm not sure if many people know, but you have free GPUs that Google offers—you can just use them. Please use them more! And Kaggle, if not many people know, offers 30 hours of free GPUs per week, and there are no restrictions on it. Please utilize the free resources as much as possible; they won't be unhappy. Just use them all.
So yes, we have notebooks. If you scroll down a bit on our GitHub page, there are free notebooks for Colab and Kaggle. We do reasoning, continued pre-training, supervised fine-tuning, and other stuff.
We also upload models to our Hugging Face page. For example, DeepSeek-R1-0528 released a few days ago, and we uploaded $1.58$-bit quants, which are very small and retain most of the accuracy. So these can run on your local device. Even if you have very low VRAM or not a very good GPU, it will still work, and we will constantly upload models.
Sometimes people complain to us, "Please stop uploading fixed models, it's kind of annoying." But too bad—unfortunately, models have bugs, so we do have to fix them immediately. You will see sometimes, for example, the accuracy can increase by 10%.
Sometimes, the large model providers won't tell you that they uploaded a fix. They're not going to tell you, but be sure to download the latest models; you will get all the fixes.
Now today, let's start off with history, right? Does everyone remember Llama? Although finally it got leaked, right? It was just a research paper—Meta saying, "Oh, we trained LLaMA. Where are the weights?" It was only research access. And then suddenly it got leaked, and that kind of spawned the entire open-source movement. Some of the authors are not part of Meta anymore, but LLaMA is extremely important for the entire ecosystem. It was like the beginning of open source, in a way, for large language models.
The most famous plot from the paper is this, right? If you keep training the model more, the loss just keeps going down. Well, the question is: will the loss keep going down? That's the question.
Llama 1 was only trained on $1.4$ trillion tokens, right? Now, $1.4$ trillion tokens is actually very little. Most models are trained 10 times more. And you can also see from the trend that the bigger the model, the lower the loss. So 7B is the blue line, and then 65B was the red line. You can see that, in general, as a model gets bigger, it gets smarter.
The training loss numbers are correct, so you should see generally these numbers from around maybe a bit higher than $1$. If you see training losses when you do fine-tuning of like $8$ or $13$, definitely something's wrong. So you should get at least losses around two-ish or three-ish.
And so now, Google's new Gemma 3 models are trained on $14$ trillion tokens, right? Which is much more. Llama 4 is trained on $30$ trillion tokens, which is literally at least 10 times more. Gemma is 10 times more; Llama is, you know, 30 times more.
Oh, yes, I forgot—you can also access the slides via the QR code if you want. It's also on the docs as well, so if you go to the docs, there will be a link to all the slides. I will probably post the slides anyway on Twitter and elsewhere, so you can access them as well.
Also, if there are questions, please raise your hand and ask. I will essentially do an intermission between some parts, and I will ask people if you have questions. Last time, during the talk, many people asked questions; I will answer every single one, even if it's stupid. I don't care, please ask. Sometimes I get stuff wrong, so, yes, just ask questions. Are there any questions? I'm assuming no. Okay. Okay.
Let's go to... Okay. So, I don't know if people have seen this plot—it's very famous, from Maxim. He shows the open-source versus closed-source performance on popular benchmarks. I think this is MMLU 5-shot. You can see the green line is open-source models, and then the red line or the orange line is, you know, closed-source models.
You can see that, in general, the slope of the open-source models is more dramatic than the closed-source models. And I would say, in general, the open-source models and closed-source models, in terms of MMLU, have kind of reached the same accuracy, right? You can see that—okay, this is already outdated—but in general, you see like Llama 3.1 405B kind of reached GPT-4o level, so open-source models definitely have caught up to closed-source models.
However, there was a "however." Recently, since September 2024, I would call this something called the "open-source drought." No one wants to talk about it, but I will. In September 2024, o1-preview got released. And to be honest, the open-source community was shocked. Suddenly, the capabilities diverged.
There's something called the MMLU plateau, where most open-source models and closed-source models kind of converged. The open-source models were equivalent to the closed-source models. But suddenly, in September 2024, OpenAI released o1-preview, and it kind of shocked the entire community because the capability or intelligence kind of skyrocketed. With reasoning, long reasoning traces, it was just a total change of mindset.
For four months, the open-source community kind of died internally because there was nothing—we couldn't replicate it, we didn't know what to do. "Do we do this? Do we do that? I don't know." But then suddenly, in January 2025, DeepSeek-R1 came along, and they released R1. That's when the entire world kind of changed their view. In fact, you can train open-source models to be as powerful as o1 or o3 or whatever. So that was what I call the open-source drought.
However, there was a previous drought even before that. Remember when ChatGPT got released in December 2022? Before ChatGPT, most models were base models. They were not really instruct fine-tuned that well, and so most large pre-trained models were actually kind of useless, or they were terrible.
But then suddenly ChatGPT came along, and they did better reinforcement learning from human feedback (RLHF), better instruction fine-tuning, better instruction following, and it really changed the world. Large language models were already here before 2022, they were already there, but it was ChatGPT which showed that if you have good data, good instructions, good answers, good supervised fine-tuning, and good reinforcement learning, you can actually make the model very useful. And yes, again, open source had a delay—a very long delay—until Llama 1, I guess. So, I would say open source always tries to catch up to the closed-source models.
The next question is: what is after reasoning? Is there going to be something else? I think that's a very good question. My personal take is it's going to be very hard. I think reasoning was like the last most... The DeepSeek-R1 paper said that most likely the model already has these reasoning capabilities, and we just need to accentuate them.
So, I'm not sure if there's going to be some new step-function where we'll get to the next capability. In my view, I think every single time, the closed-source models will always do a step-function, but who knows? Maybe now it will plateau forever. I don't know. So you have long discussions about whether AGI is going to come or not, but who knows? The talk is not going to be about that.
But yes, next. So I call the first jump the SFT or RLHF jump, right? That's essentially: if you do good supervised fine-tuning, you get this large jump in performance. And then the second jump is called the RL jump, right? This essentially can increase performance dramatically if you employ methodologies like RL. But the question is: what's the next jump? I don't know.
So, I'm not sure if you guys saw this picture before—it's very widely known in the community—by Yann LeCun. He essentially showed this cake: unsupervised learning or just pre-training in general is the cake (not that good); and then supervised fine-tuning is kind of the icing on top of the cake, so it's a bit better; and then reinforcement learning is the cherry, right? I'm not sure if people like the cherry, but some people like the cherry.
And so the goal is: how can we get the cherry? But the problem is, there is so little data about this, right? Reinforcement learning has very, very little data. And so the problem is, most large model labs will train these large pre-trained models, and then they will iteratively refine them to make the model better through supervised learning and through reinforcement learning.
Interestingly enough, this slide was actually shared last year, very popular. But actually, this was from September 2016. I had to dig this up on YouTube. Yann LeCun actually talked about this back in 2016—literally nearly 10 years ago. I was like, "Wait a second, that's 10 years ago. Very long." So this slide was actually very popular on Twitter, I think it was in November last year. People kept tweeting about it. I saw this and I was shocked. But yes, this encapsulates the current AI boom.
So firstly, when we talked about these large models, remember they started from a base model, and so we call these training stages, right? When you have a base model, you then convert it to a chat model, right? For example, ChatGPT is not a base model; it is an instruct fine-tuned model, or some sort of fine-tuned model from a base model.
So actually, OpenAI does have a base model somewhere sitting in their server—they're probably not going to serve it ever—but it is somewhere on their computers, and they essentially fine-tuned it to make GPT-4. Claude 3 or Claude 4 has most likely a base model, and then they fine-tune it to become Opus. Gemini also has a base model, and they convert it to Gemini 1.5 Pro. This phase, when you convert a base model to a chat model, is the fine-tuning phase. And then the question is: what do we do in the RL, right? Is it reinforcement learning? Is it supervised fine-tuning? Is it some other special sauce? I don't know, but we will essentially discuss these topics.
Any questions first?
So, for example, in open-source models, you might have seen Gemma 3 PT, Gemma 3 IT, Llama 4, Llama 4 Instruct, Qwen 3 Base, Qwen 3, Mistral Small Base, Mistral Small Instruct, Llama 2, Llama 2 Chat, right? These terminologies... To be honest, I think the open-source community should standardize the terminologies, like "Instruct" or "Chat" or "IT" and "PT"—maybe they should standardize it a bit.
But in general, if you see "IT", it means Instruct (instruction-tuned); "PT" means Pre-trained; "Instruct" just means instruction fine-tuned. Qwen 3 just removed it entirely—it's just called Qwen 3, and then the base model is called with "Base". And so essentially, these naming methodologies... If you see them on Hugging Face, hopefully, you will now recognize these different types of models.
And so generally what we say for reinforcement learning and fine-tuning is: fine-tuning is everywhere. You start off with pre-training, you then convert it into a supervised fine-tuning model via supervised fine-tuning (SFT). You also might hear "IFT", which is instruction fine-tuning—they're the same thing. And then we have what we call post-training, the post-training phase. But actually, recently it kind of changed.
I don't know if you guys have been keeping up with the latest terminology—I actually don't really like terminology anymore. But we have something called pre-training, which is you take all of Wikipedia, all of the web, everything—all of the data you can ever see—shove it into the model, and predict the next word. That's called the pre-training stage.
We then have something called the mid-training stage, which essentially gives you higher-quality data. For example, you can weight Wikipedia more because it's higher quality. You can essentially do long context extension as well; you shove this in during the mid-training stage. So if your context length of your model is very short and you want to extend it to a very long context, you shove this in during the mid-training stage.
And then the second stage is the supervised fine-tuning stage where you want to convert the model to a chat model. And then we have the post-training phase, which is preference fine-tuning like DPO, RLHF, and stuff like that.
And then we have this new thing called reinforcement fine-tuning, or RLVR. If no one knows what RLVR stands for, it stands for Reinforcement Learning with Verifiable Rewards. This is a new paradigm—not the same as preference fine-tuning or DPO—where we consider reward functions to make models much better. And so this is how I would envision the whole training phases of models.
Another way to put it is: we have some random initialization of the model, like some random weights of the model, right? 7 billion parameters—literally random numbers, right? Like GPT-4—I don't know, 1.8 trillion parameters—just random numbers. And then somehow we move in the space like the black line, right? Pretend this is some high-dimensional, 1.8-trillion-dimensional space, and then we somehow move in this space and we get the final model. That's a green dot. The question is: how do we move in this space to get to the final model? That's the question.
Most people start from a random initialization. You do the pre-training phase, which is very long. You get to this dark blue dot, right? That's called the pre-trained model. And then you do some supervised fine-tuning/instruction fine-tuning to get the blue dot. Notice the line for the light blue line is very short because there is not that much data for supervised fine-tuning.
Then somehow we get the blue dot, and we keep doing more iterations to get to the purple dot, which is through preference fine-tuning. And then finally, we get the green dot, which is reinforcement learning via verifiable rewards, like o1 or o3. And so, the goal is somehow we have to move from the black dot to the green dot. Essentially, all of large language models—all of AI—is just an optimization problem, right? How do we make this easier to get to the green dot?
You could theoretically guess: why don't you just go from the black dot directly to the green dot, skipping all of the dumb phases—just skip it entirely? Yes, you could do that, but it's not going to be very efficient. You're going to be waiting there for, I don't know, millennia. Your loss is not going to go down. So, the tricks that we found in AI are that you have to do these phases to get to your final green dot.
There is a new methodology where you can actually bypass the supervised fine-tuning stage and the preference fine-tuning stage and directly go to the green dot. There is a way, and that's the dark red line. I think DeepSeek-R1-Zero kind of showed that you can use a pre-trained model—a base model—and directly do some reinforcement learning with verifiable rewards and just skip SFT entirely. So that's a new paradigm that people want to focus on.
In my view, I think you should still do the light blue, the purple, and then the green. I don't think you should directly skip over to the green. If you want to waste resources, you can skip to the green, but I don't think large model labs want to waste resources. Hopefully not.
So, I don't know if people have seen this diagram. Agents in the old sense—everyone keeps connecting agents with reinforcement learning. Okay, but why? In general, an agent is: you have some sort of environment, and you have the agent doing something in the environment; you get an action, you do the action, and then you get some sort of reward.
The reward is $R$, and $S$ is the state—what the environment currently looks like. Essentially, RL tries to optimize this loop. You're trying to maximize a reward given some sort of action. And that's why RL and agents are connected.
Assume the agent is the language model, right? Assume the agent is, in fact, the language model. The environment is kind of fishy—it's hard to say what exactly the environment is; it's more like the language model's inference space. That's the environment, kind of. But pretend this was a game, right? Pretend the agent was the computer and the environment is Mario, for example. You're playing the Mario game automatically, and your goal is to win the game. So, the whole goal of RL is to maximize the reward.
Another one is Pac-Man, right? You have the yellow Pac-Man, and you can either go up, down, left, or right. Up, down, left, or right—that is the action $a$. The orange little things are rewards, right? If you eat an orange dot, you will get a positive reward $R^+$. If you eat a very big one, you'll get a very large reward. But if you encounter one of the enemies, you will get a negative reward $R^-$. The question is: how do we maximize the reward based on this environment?
For language models, there is a trick. The trick is this loop kind of changes because we don't actually have a continuous loop. The state does not actually change over time. In a game, if you do an action, the whole state changes, right? The environment totally changes, and so you have to continuously keep a history of the past steps.
But in language models, there is no history, right? If you do a prompt, "What is $2 + 2$?", and then you ask another question, "What is $4 + 4$?", it's totally not relevant to your previous prompt. Okay, fine, it is kind of relevant in a conversation, but it's not directly correlated. And so, you can actually delete one of the lines—that's the next prompt. You can delete it entirely.
So, for example, "What is $2 + 2$?" right? Essentially, you have all of these options: it could be $0$, it could be $1$, it could be $2$, it could be infinity, it could be B, it could be D—it could be anything that you like, a symbol. And so, "What is $2 + 2$?" is the state. That's the question.
The reward, for example, if you choose $4$, your reward is $+1$. If you choose anything else, your reward might be negative infinity, $0$, or whatever number you like. You can come up with any number you like for the reward. It doesn't have to be $+1$. It can be $+10$. It can be $+100$. You can do anything that you like. You can also do distance-based scoring.
For example, is choosing the number $5$ better than choosing $0$? That's a question. What do you guys think? Is choosing the number $5$ better than $0$ or is it worse?
I guess better.
Yes, okay. Better. So what would you do for the reward then? Pretend the model outputs $5$ for "What is $2 + 2$?"
Zero.
Someone said zero. Okay, zero is fine because it's wrong. Yes, the answer $5$ is wrong, so you should probably give a reward of zero. But is there a better answer?
Less than one.
Yes, okay. Less than one, so some sort of, maybe $0.8$. Correct. You could do, like, the distance of the answer to the correct answer. If it's $5$, you could do some reward like that. Pretend the model says "A"—what is the reward?
Minus one.
Okay, or it could be $-10$ because that's very bad, right? You should not output a letter; it should be some sort of number. So that's how you design reward functions. We just design a reward function, take your reward function (it's just `if` statements), shove this into a language model fine-tuning phase, and there we go—you have o3! Okay, well, you won't have o3, but you know what I mean. Essentially, o3 is a collection of all of these reward functions.
Remember, it doesn't have to be "What is $2 + 2$?". It is a general math question: "What is $10 + 20$?", "What is $10 \times 200 / 10$?"—whatever math equation you ever want. This function can take your question and convert it into a number, and o3 is just a collection of all of these reward functions.
And so the goal of RL is to make the good ones more good. You want the good rewards to increase in value. For example, for the $4$, you want the $4$ to appear more, but you want the $3$ to decrease. You don't want the $3$ to keep appearing in your answer, but you want the "D" and the "B" to be very, very, very heavily penalized. That's the goal of RL.
We don't actually have the answer... Okay, this question is very easy: "What is $2 + 2$?" is obviously $4$. Yes, that's very easy. But pretend you have some sort of complicated question like, for example, "How do I win the stock market?" Let's use a dumb example. You don't know what actions you're going to take, but the point is you have the result, like profit or loss. But the question is: we don't know how to get to the good profit. And so, the question is how do we maximize good actions as much as possible and decrease bad actions as much as possible? And that is RL.
And so OpenAI released something called RLHF in ChatGPT, and they showed that you just need some training data. You interact with the agent, which is the language model, you then get some actions (which is your answer to the language model), you feed this into a reward model, and then you get some reward, and you keep iteratively doing this step, and you'll finally get ChatGPT. So remember, you start with the base model, and you convert this into GPT-4 via this method.
To expand on it a bit: if you guys have heard of PPO, what is PPO? Essentially, PPO is just... You expand the box for the agent (the language model is like an agent), you expand it, and there are just three models inside of it. There is a generating policy, the reference policy, and then there's a value model. That's all, nothing that special. We will talk about each of these things separately, but PPO is just an optimization algorithm to make RLHF work better.
GRPO, which is the algorithm behind DeepSeek-R1, smartly deletes one of the things: the value model. It just gets rid of it. Now, why would you delete it? We will talk about this, but the trick is: if you delete the value model, you save parameters, you save compute, and it's much more efficient.
Remember, each of these models is kind of like a large language model, right? Pretend your generating model is already 1.8 trillion parameters. What are you going to do, make another 1.8-trillion-parameter model for the value model? So we just get rid of it—delete it. And that is GRPO. That's the biggest difference. Any questions?
You talked about negative rewards. It's confusing because in pre-training, isn't it about probabilities? The phrase "reward"—when it can be positive or negative—versus comparing it to pre-training where it's always negative...
Do you mean pre-training as in, like, the negative log-likelihood or some probability? During pre-training, the goal is to maximize probability, so you output some number from $0$ to $1$, the probability of the next word, and you want to maximize that. For RL, you want to maximize reward. If it's a negative reward, you still want to maximize that. So if it's $-1$, you just want to make this $-1$ go up and be in the positive range.
But also, rewards can actually just be negative. For example, your reward function can output $-10$ and $-1$. The good one is $-1$, and the bad one is $-10$. Your goal is to move towards $-1$ as much as possible because your goal is to maximize it. So I would say the reward is a misnomer; you could just add $10$ to everything, and then it scales the numbers. Does that kind of make sense, or is it a question of nomenclature?
Most people, to be honest, they don't like negative rewards; actually, in the RL space, people just like to do positive rewards. I don't know, I like negative rewards—I feel like it's more intuitive for me. Yeah. Any other questions?
So you've got your language model, and then you've got the generating policy and reference policy, right? What models are being used there? Is it the same as your language model, or is that another model trick?
Yes, very good question. There are some tricks that you can employ. Most people just make them the same model. The reference model is like the beginning of the model, and the generating model is the model that you're updating. So, essentially, the reference model is like the base model—okay, that's probably not a good... Fine, just keep it as the base model.
The generating policy is actually the model where you update it. So every single time you get a base model, you update 1, update 2, update 3—that is the generating policy. But we will talk about this. So it's essentially the same model, but there are updates to the model.
The reference policy is the model that is not updated; the generating policy is the model that is actually updated, and they're both the same model. One of them is updated, the other one is not updated. But we will talk about that, yes.
The actions—is it typically one token, or is it more tokens, like a full sentence?
That's a good question. In the Pac-Man case, the action will be a string of actions, right? You can go up, then down, then left or right—some sort of long history. In the language model space, this is generally called single-turn and multi-turn. Generally speaking, currently, single-turn is what most people do; it's just one action.
So the action will be... Essentially, you say, "What is $2 + 2$?" and the answer is "4". That is the action. The action is actually the inference space. What is the actual chain of thought—that is the action. And so, it's just one action, but it is the total sum of the chain of thought. So if you have: "What is $2 + 2$? I think the answer is $4$, let me do some working out...", that entire thing is the action. Does that kind of make sense?
There was a Claude conversation around how to finish a poem's next line—they have to think ahead to the last letter or the last word to match the previous one. Do you think a reward-focused next-step allows you to do that?
You could. I think, for pre-training specifically, there are research papers which show that pre-training doesn't just predict the next word; it does try to predict many words ahead. And so, yes, maybe the reward model in reinforcement learning essentially accentuates a pre-training behavior, so maybe this behavior already exists in the model and we just see it more often.
I would say that the model itself already has this capability, we just want to make it more obvious. So maybe the model already knows how to do that—it already knows how to predict 10 words ahead or 20 words ahead. It already knows how to do that, but we just want to make it more obvious. I'm not sure if that answers it.
Would it be safe to say like... so if it's a generation, you want almost the circuit for the last word?
I think for reinforcement learning specifically, yeah, I guess so. Your goal is to maximize a reward, and so, however you try to get there, it's different from general pre-training. Pre-training is just maximizing the probability of the next word, but reinforcement learning is trying to maximize reward.
The question is: how do you actually maximize a reward? Do you do chain of thought? Do you do what you describe—thinking about the future? I don't know. What is the reward function actually doing? What is a language model actually doing? We don't fully know. But yes, the goal is to maximize the reward.
I was curious about when you were talking about arithmetic, whether $5$ is a better answer than, say, "yes" or something. Given that there are these closed circuits between all these different related mathematical functions you can do on numbers in space, whether it is in the literature or in the current state of the art better to train it so that a closer prediction is more accurate, or whether just saying the right answer is right and everything else is wrong—which in some logical sense is true—tends to produce more performance in that space?
Yes, you're correct. You should have data that gives more accurate results. But saying whether $5$ is better than $10$ in general... Is that what people do in production? When there is exactly one correct answer and everything else in a mathematical sense is equally wrong because it's not that answer...
That is a good question, I don't know. Large model labs won't tell you exactly what they do. In our experiments, when you use our notebooks, we actually show that if you do distance-based scoring—meaning the closer your number is to the actual number, the better the results—you will get better results. But generally speaking, it's easier to just say $5$ is wrong, just give it zero reward. Everything is zero, and the good one is $1$. It's actually much easier to do.
For example, if you want to do execution of code, how do you actually do distance-based scoring? If you ask it to create a Flappy Bird game, you just have the final output, but you don't actually know how to verify, like, "Oh, is this Flappy Bird game better than the previous Flappy Bird game?" Only in a mathematical sense can you do distance-based scoring.
I'm assuming large model labs probably do the binary approach—the majority of them just do yes or no, binary. But in our experiments for math specifically, you should do distance-based scoring. It makes the model learn faster.
For a verifiable domain like math, $2 + 2$ makes sense, but it's not going to scale for really large numbers or large multiplications. So are we going to end up using tool use to calculate that, or can the model potentially be trained to do that?
That is a good question. In the olden days, before this paradigm came along, we would think that you can just use a tool like a calculator. You should actually—I would say you should still use a calculator to calculate $2 + 2$, right? You should not use a language model. But with RLVR, the trick is we actually found that if you just do $2 + 2$, or you do another question like $10 \times 10$, or some sort of complicated mathematical expression—like the derivative of $x^2$ or something—it randomly learns to actually solve that equation without actually doing overfitting. And so, I would say that with RL, you can actually make the model learn how to do multiplication, how to do addition. So it's actually in the model.
Would we use this in production, or...
Oh yes, people use that in production. Okay, maybe don't use it in production if you're not sure if the answer is correct—maybe it will say $3 + 3 = 7$, it's possible! But essentially, it's getting better. Maybe in the future, all mathematical equations can just be done by a model.
A few months ago, before o1 got released, people would still say, "Use a calculator, use some sort of tool calling." Yes, you should probably still do it. But imagine as time goes on, as models get better and better in terms of training data just for the math equations, in the limit as we get all of the world's data for just these math questions, it should in theory solve them all—in theory. It's always "in theory." But yes, you don't need tool calling; it's not necessary. Yes?
My question is about the reward model. In practice, are people using large language models as a reward model, or is it...
Good question. I was actually going to go into that in the next slides. We'll be talking about that, yes.
Does this change with multi-turn? I mean, you showed a single turn, but what about multi-turn?
Yes, you could do multi-turn. It's a bit more complicated. There are tricks you can do. You imagine that your current step is good, and then you just continue doing inference. You append your next question. For example, "What is $2 + 2$?" You say, "Okay, let me think about this question... The answer is $4$." And then, what is your next question?
Maybe the user interacts with it and says, "Oh, I don't think your answer is correct." And then the model says, "Oh, okay, let me rethink about this... I still think the answer is $4$." So you could chain this all together and shove it into the whole RL step. You could do that. It's a bit more complicated, and the diagram will be a little bit different.
The follow-up to that would be: do you assume a loop is a single turn, or is a loop a lot of turns and then you only give a reward at the end, or do you give sub-rewards?
Very good question. In the DeepSeek-R1 paper, you could do sub-rewards, or you could just do the reward at the very end. I think sub-rewards might actually do better in general, but the problem is sub-rewards are very hard to calculate. You would rather just wait all the way until the very, very end and just give a reward. That's probably the easiest.
So it is more about efficiency. To be honest, all of AI is about efficiency—what is more efficient. It's all optimization. So the answer is, I would suggest people just shove a reward at the very end.
Once you get your reward signal, is it just the REINFORCE algorithm with the gradient to go back?
We will talk about that, yes. Most... yes, correct. We will talk about REINFORCE, we'll talk about PPO and stuff like that, yes.
What about this one?
The problem is, if you skip from pre-training to the RLVR stage, it's relatively hard because your model doesn't actually know how to do instructions, right? You have this base model, you ask the question to the base model "What is $2 + 2$?", it's not going to say "I think the answer is $4$."
You might be lucky—somewhere in your pre-training data, somewhere on the web, someone asked this question, "What is $2 + 2$?", and the answer was "the answer is $4$". But you have to be lucky. So, the problem with this is that the whole trick of SFT is you want to force the model to answer "What is $2 + 2$?" in an instruction way, right? You want it to say "the answer is $4$". You don't want it to blabber on, get some Wikipedia article, and shove it as the output.
So the whole point of SFT, preference fine-tuning, and stuff like that is to force the model to make it more optimal to output conversation-style. If you want to skip, it's also fine, it's just not efficient. I'm assuming large model labs are trying to do this, so it's not like a "you should or you shouldn't"—they are trying. Does that make sense?
A couple of questions. One is: does the online policy optimizer update the reference model after some steps?
The reference model does not change. The reference model is just a model that you didn't train—it's like the base model or whatever SFT checkpoint you started with. It doesn't change. You could change it, but I think that would be too expensive. If you change it, that'll be more complicated. Remember, all of AI is about optimization and efficiency, so I feel like you don't have to. You could—I don't know if there are papers talking about it, though. Maybe OpenAI does it, I don't know.
The other question is: do we need fewer samples compared to pre-training?
The trick of RL is you just need a reward function. You need to make that, and you don't need data—you don't need the ground truth answer in the data. Oh, actually, you do need the answer, but you don't need the chain of thought. You just need lots of questions like, "What is $2 + 2$?", "What is $4 + 4$?" Remember, you can actually automatically generate this, right?
In terms of the number of samples, do you need fewer samples compared to...
You should do as many as possible. Most large language models, I think for o3 or o1, I don't know what the percentage of compute is—maybe they spend like 5% or less on RL. But the goal is: what happens if you spend double the compute just on RL, right? Previously, if you did 14 trillion tokens on pre-training, make RL 14 trillion tokens. The goal of large labs is to just do that. So currently it's very little, but over time it will increase.
But compared to pre-training, the number of samples will be much less?
Currently, yes, because it's expensive—you need to generate rollouts. It's expensive, but over time, I think maybe by next year or this year, large model labs' goal is to do this phase the most. That's their goal, because remember, you can automatically generate questions now. "What is $2 + 2$? What is $2 \times 2$? What is $10 / 10$?" Generate as many math questions as you like.
But remember, you can also generate coding questions, or you can generate any questions that you like, or you can use the supervised fine-tuning data itself for the RL step. You can do that as well. Does that kind of make sense?
How do you protect the SFT from being screwed up by RL? For example, if you have "What is $2 + 2$?", "I think $2 + 2$ is $4$" is a good answer. But I don't want the model to generate a whole page of equations; I just want the answer. Are there techniques to make sure we're not violating our instructions?
Very good. Yes, we will talk about that—clipping and stuff like that, yes.
Is there any research done on incentivizing specific circuits in the model? For example, there's a circuit that says $2 + 2 = 4$, but can you incentivize the concept of addition in general rather than just weighting specific examples?
That is a very good question. I don't know. I think during the pre-training phase, essentially, somewhere in the internet, someone wrote "What is $2 + 2$?", and somehow maybe someone did a derivation of it. Pretend there is some derivation of some complicated math equation, and so the model somehow learned to predict that entire trace.
If it keeps seeing this, it would accentuate that fact: "Oh, okay, I've seen this before, let's make this even more prevalent in the model." So somewhere in the model, it has learned $2 + 2 = 4$.
I guess what I'm saying is: can we use step-by-step or a super-good reward where you are weighting the concept of addition rather than just the specific equation?
Correct. So there were actually two schools of thought. The first one is the model already has this knowledge—it already knows what $2 + 2$ is—and RL just tries to maximize the occurrence of $2 + 2 = 4$. It tries to weight this circuit in the model more. So the model already knows it, and we are just maximizing that circuit.
The second school of thought is: maybe the model doesn't know, and RL actually learns a new thing. I'm more in the first camp—the model probably already knows it, and we're just making it more accentuated.
The extent of my question is basically: I'm wondering if we can target those circuits directly during training, if that makes sense.
You could. I guess what you could do is get the language model, see which weights are changed during the RL phase, and then analyze them. If you just give it "What is $2 + 2$? What is $2 + 2$?" over and over again, you can see which of the weights are changing, and essentially you can extract this from the model.
I don't know if there is research about this, but I'm just making stuff up on the spot. You could do that. Oh, maybe that's a research question—someone should write a research paper on that. Any other questions?
Does the RL update change all the parameters in the model, or is there a way to target a subset?
Yes, we will also talk about that. Large model labs will most likely change all of the parameters—every single parameter is changed. But there are papers which show that actually, not all of the parameters are changing that much. Some of them are changed by practically zero. The majority of updates to the model are like zero, and only some very small updates to the model are seen. So that kind of aligns with the circuit idea, where the model already knows how to do whatever question you give it, and most of the updates are zero.
You asked: is it better to aim for changing all parameters? You can also do things like parameter-efficient fine-tuning (PEFT), like LoRA, where you don't have to fine-tune every single parameter. But I think the majority of large language model labs just train everything.
Otherwise, this again becomes an optimization problem: which layers do we select and stuff like that, which gets more complicated. But yes, you could do parameter-efficient fine-tuning—actually, we're going to show a notebook for that, so you can actually do it on your own computer. Yes, one more question.
My question is about the KL divergence term. I wanted to understand if we can modify or remove it to help the model learn new capabilities, rather than just forcing it to stay close to the reference model.
Good question about the divergence term. Do you mean, like, removing it entirely? Would that make it better? The whole point of the KL divergence term is to make sure the model does not stray too far away from the supervised model. Maybe I'm not fully up to date with the latest research papers, but your point was that if we remove the KL term, it might be better because it learns new capabilities.
Yes, I wanted to understand the strategies around that.
Do you mean you want to elicit more capabilities from the model? In terms of strategies, it's hard to say. I'm assuming the large model labs probably know strategies. I will show examples of how to make RL better, like how to reach higher rewards faster.
But as for new capabilities, it's actually very, very hard to elicit completely new capabilities in the model. The question is: is it actually new, or was it already part of the model? Most research papers are a bit hand-wavy; they say most updates are sparse, so most likely it's not completely new capabilities.
But what happens if, one year later, all of the model updates are not sparse? Would that be considered a new capability? I don't know, those are open questions. I probably didn't fully answer your question, but maybe the other parts of the talk will answer some of it. Okay, I will keep going and take more questions later.
Okay. So, the reward model, right? It was actually a language model—some sort of neural network, some AI model—that predicts the reward. In RLVR, we delete this entirely and we just call it the reward function, or the ground truth reward. If it's correct, you get $+1$; if it's bad, it's just $0$. So you essentially delete another part. GRPO essentially deletes another part, right? You delete the value model—totally remove it—and then you delete the reward model, and it's just a reward function.
And yes, as a reward model, you could use an LLM as a judge. You could ask a language model itself to say whether the answer is good or bad. You could do a regular expression check, like: is the formatting of the answer good or bad? Is the math equation good or bad? Is the final output good or bad? You can do distance-based scoring and stuff like that.
You can also execute Python code and see if it actually executes, checking for import errors, format errors, or other Python errors, and use that as a reward. So, this blue box—the reward—can be anything that you like. It just needs to output a number: $-1$, $+1$, whatever. It just has to be a number. In fact, you can make a dumb reward that is just random: $+1$ fifty percent of the time, and $-1$ fifty percent of the time. Confusingly, a paper recently showed that actually, random rewards work!
But why? Someone might ask why. I don't actually believe... Actually, there was an update showing that this was wrong—it's because the benchmarks they used were incorrect. When they said they increased accuracy from 20% to 50%, the model itself was actually already at 50% beforehand, they just didn't check the accuracy of the correct baseline model properly. So there was a recent rebuttal to those types of papers. But anyway, interesting results, I guess.
Okay, so remember, in RL, the goal is that you don't know the best action to take in the space, right? When you're playing Pac-Man, I don't know if going left, right, up, or down is the best. But at the very, very end, you will either win (get some reward) or you will die. The goal of RL is to maximize the occurrence of the best action—or rather, the better action—you can take, compared to all of the other bad actions.
In normal pre-training, you already know what the best answer is because you already know what the next word is. If you want to predict, "Hello, my name is Daniel," you already know the next word is going to be "Daniel". But in RL, you don't know in advance what the actual correct action sequence is to get the reward. The only thing you can do in RL is to maximize the probability of the better options.
And so, yes... okay, now more math! The goal is to maximize this equation. That's the goal of RL. What is this equation? $J(\theta)$ is like the objective function we want to maximize. We want to calculate the gradient with respect to the policy (the language model), where the action is given a state and $R$ is the reward. If you want to write this down in English, we want to take the derivative of the log-probability of the action given the state, times the reward: $\log P(a | s) \times R$.
I'll give you an example with the Pac-Man case. Okay, so you are Pac-Man. The red is your enemy—you definitely don't want to go there. But you want to eat the two gray dots. Remember, you can only go up, down, left, or right, so you only have four actions. The action space is just up, down, left, or right.
For the rewards, I just randomly made some numbers up: if you go to the red thing, you get $-10$ reward (or actually, it should be negative infinity since you die, but let's say $-10$); if you eat the gray dots, you get $+1$ or $+1$; and if you go up, it's just $0$ reward because there is nothing there.
Now, when you get this language model, it has to tell you what the next action is. For now, we'll just assign every single action (up, down, left, or right) a $25\%$ probability, so you go up $25\%$ of the time, left $25\%$ of the time, and so on. These are your numbers, and this is the entire state.
The goal of RL is to go towards the right much less. You want to push the probability of that $0.25$ on the right much lower, and you want to go down and left much more. So you want to push those probabilities higher, and the top one is not really that important. So, RL essentially tries to avoid doing the bad thing and do the good thing much more.
If you convert this into a table, you have the probability of the action given the state: $P(a | s)$. Remember, up, down, left, or right, we just assigned $25\%$ chance ($0.25$). The reward $R$, which we can calculate, we just made some numbers up: $0$, $+1$, $+1$, and $-10$.
The probability times the reward, we get some numbers: $0$, $0.25$, $0.25$, and $-2.5$. And then if you take the log of the probability times the reward, you get some numbers: $0$, $-0.6$, $-0.6$, and $6.02$. From this table, does anyone know which row we want to maximize? What is the goal? What do we want to maximize? You want to maximize... wait, what is the reward of the bottom row?
Minus ten.
Correct. We actually want to *minimize* the bottom row. Remember, the reward is $-10$. We do not want to maximize the last row because the last row is the worst. So, we want to decrease that probability dramatically—it's way too large, we want to decrease it. The other rows we want to maximize. And so the goal is... okay, we just take the sum of all of that, and the sum of the log probabilities times the reward is $4.81$.
And so remember... okay, let's try it by hand. By hand, we shall do the bad action even more, right? We actually do the worst thing. What happens? The probability times the reward is now $-4.0$ (it used to be $-2.5$). And so, the sum of the log probabilities times the reward actually decreased—it decreased to $2.58$. Before, it was $4.81$.
Is $2.58$ smaller or bigger than $4.81$? Obviously smaller, so actually, this is worse. You should not do this; this is bad. Remember, the goal is to maximize this equation.
And so, $4.81$ is actually better. The original state is actually better than $2.58$. So the thing we just did is worse—do not do this. However, let's do the right thing less, let's not go to the right, and actually maximize the rest. You shall see that if you do the log probability times reward and sum them all, you will get $8.9$, which is a larger number. So the goal is to maximize this as much as possible.
You could say, "Wait, we know the answer, right? You should just make going right 100% probability, and you'll get maximum reward. Why don't we just do that?" But you should not do that, because if you do this, your model will get stuck and just say, "Okay, let's just keep going right forever," and it becomes very bad for optimization. So definitely don't do that.
Now, as someone mentioned REINFORCE: we don't just multiply by the reward. You shouldn't do that. You actually multiply by something called the *advantage*. And what is the advantage? The advantage is the reward minus the average reward, or the baseline reward: $A = R - B$. So you shouldn't actually just maximize a reward directly; you want to maximize a reward relative to the average reward across the entire model. So it's called the baseline.
And so, this baseline $B$ is what the value function or the value model predicts. Remember, GRPO deletes the value model. This was the value model, and this value model essentially estimates what the average reward is if we just look at the current state. It does not look at what the next step is, it does not look at what the next action is; it just takes a snapshot of what you currently see.
Essentially, it looks at the current state and just guesses what the reward is—you're not supposed to give it the actual rewards (like $-10$, $+1$, $+1$, or $0$). It just looks at the current state and produces a number, and this number is called the average reward.
And so the goal is now we don't actually want to maximize just the reward, we want to maximize the advantage as well. So we multiply all this together, and the goal is to maximize this new equation. Does anyone have any questions? There is a lot of math, but any questions?
In terms of probability, is the large language model predicting an estimated probability, or is this a known probability of all the possible states? How do you get that in the practical world?
A large language model predicts the next word. For example, you take the entire Wikipedia, chunk it into small tokens, and the output is just what the next word is. My name is Daniel, but it could also be Michael, Bob, or whatever. You have all of these probabilities for every single word in the entire vocabulary (like 128,000 words), and you assign a probability for every single one.
The trick of this for language models is you can utilize those probabilities directly. That's the trick, and that essentially makes everything easier. Any other questions?
What about multimodal models? Do you do RL on multimodal models?
Oh, that is harder. I would say you could... You could look at a Sudoku puzzle, convert the image to text, and just cheat. I guess you could do that. You could give it Pac-Man—give it the Pac-Man screen—and tell the model, "What should I do next?" Vision is kind of the same thing, but it's more complex.
Does o3 do vision plus reinforcement learning? I think it does, yes. You could. For open source, I don't think I've seen open-source models do that very well yet. It is still very hard. Any other questions?
What is the $B$, the baseline? Is it the average of the reference model, or what exactly is it?
Your goal is to see this current state of the model—whatever the environment currently looks like—and you just want to produce a number that approximates what the total average reward is. I'll give you an example. Pretend you're playing chess, or Go (remember AlphaGo). You look at the board, the current state of the board, and you say: "What is the probability of the white player winning?"
You're not supposed to do any future predictions; you just have to predict the probability of winning by just looking at the current board. That's kind of the average reward.
Is it always low?
It's always low. Yes, correct. But remember, at the very, very end phases, you might get a higher reward. But that's the goal—you essentially want to predict the probability. For example, in chess, I'm sure there are some steps you can take to make the reward higher.
When the model sees this, you need to essentially say: "Is this board better than the previous boards?" And so, you have to train this model as well. It needs to output a probability of winning. That's for the chess example. Does that kind of make sense, or no?
Is the value model the same as the reference model?
No, the value model is totally different. There are three models: there is a value model which predicts the average reward of the state; the reference model is just the model that you started with; and then the policy (the actual model that you're changing) is the final result of your model, like the actual chat model. So there are actually three models.
For the baseline $B$, do you look at the current state, and then do you use the policy to output a probability of whether this chessboard is good or bad?
You look at the current state, and then you output a probability of whether this chessboard is good or bad—like a $0.8$ or an $80\%$ chance you're going to win. Yes, something like that.
I guess because the policy is predicting... is that the probability?
Yes, we will talk about that. This is just a general, simpler formula.
Because the policy is predicting a sequence of actions, how do we normalize the probability? Do we do it per token, or across the whole turn?
That is a good question, and that is an active area of research because you could either normalize by all of the tokens or the entire turn. It remains to be seen which one is better—people are still actively talking about that.
Generally speaking, normally people just assume this rollout is correct, assume this chain of thought is correct, and they just look at the very end. But then you do have to multiply probabilities, so there is a multiplication somewhere and you will get very small numbers.
You know, they get very small, but the numbers are relative, right? So everything is very small, but then the bigger ones are still very small, but it's still better. So they're all relative. Any... Was there one more? Yes.
Yes.
Oh, no, no, it's very old. Yes, very, very old.
Yes. I wonder if you can give some advice on how to think about this training on an abstract level about error propagation. If you have a trained model which does the scoring or does the value function or whatever, that itself is trained from data. It has some error margin. And if you have a softmax function, for example, only 1 in a 100 times will it produce the wrong probability.
How do we think about the development over time of these models and to what extent that error propagation is something that you can observe, measure, systematize, and engineer around? I don't really understand what the mindset is in this process right now.
In my view, I think all of these formulas are just made up. The goal is to maximize reward, but the question is, you can't just maximize reward because otherwise, you might make the model really silly. Like you might say, "What is $2 + 2$?" and it just says, "Four." Pretend your dataset was just "What is $2 + 2$?", right? You literally just cheat: "What is $2 + 2$?", "What is $2 + 2$?", "What is $2 + 2$?" and make the model just say, "Four, four, four" forever. Do you want this as a model? Definitely not.
So we want it to learn. Okay, if I give it the next question, "What is $8 + 8$?", it should not just say, "Four." Or "What is $2 - 2$?", it shouldn't say, "Four." The goal of all these algorithms is to somehow force the models not to overfit to your question. These formulations are trying to do these things to not overfit.
Well, I'm thinking about the chess example you were saying, which scores the board and produces this number—a number which is like, "this is good" or "this is bad." Sometimes these well-trained models have these novelties where they say, "Make this move." The new state is not obviously good or whatever, but they somehow figured this out.
And suppose that your training mechanism for the value function model hasn't picked up on something like that. In fact, there's some error in the tendency of the value model, and its probability of producing a perfect scoring of the board is not always exactly right into your training process. How do you think about that? What is the mindset?
The value model—you have to train it together. So it's a combination of the entire algorithm. The value model predicts what is the probability, but you actually have to train this as well, and that is actually the problem. Some people could train this separately. You can get all the chess possibilities and then output what is the final number. I think that's what some people do.
You could train this in tandem with the model, actually. I think that's more difficult. There is always error in the value model—always. But you have to train this model as well. You will reduce the error, but there's always error. I think there are some numbers where you can force the value model to be less prominent, like don't forcibly utilize the value function. But in GRPO, we just get rid of the value model anyway. It's totally gone, so you don't need to worry about that anymore.
Okay, I will keep going. Let me just check the time actually. Okay. So remember, the goal is the advantage. We want to maximize advantage, not reward anymore. Advantage is the reward minus the average reward, or the baseline reward, right? If the advantage is less than zero ($A < 0$), it means that it is worse than average. If the advantage is more than zero ($A > 0$), it means it is better than average. And so the goal is we want to do the action more if it's better than average in general.
Now, to PPO. I don't know if you guys have seen the PPO formula—it is ugly, but this is the PPO formula. It looks more confusing because there's a clip, and then there's $\epsilon$, and so on. But we could just strip everything away. It's just the probability of the action given the state times the advantage, right? We literally just discussed this. Okay, minus a log. The log is gone, but anyway, it's just that. The rest of it is trying to reduce overfitting.
And so remember, essentially, there's a thing called the division of the old model. Essentially, it's the model that created the action. And the goal is we now want to maximize this likelihood ratio. We don't just want to maximize the probability of the model. We don't want to maximize the probability of the action given the highest reward; we actually want to maximize the likelihood instead. But what is this likelihood?
So I did some numbers—I just made some numbers up. Pretend the numerator is $0.01$ and the denominator is $0.01$. $0.01 / 0.01 = 1$. If the top is $0.01$ and the bottom is $0.99$—remember, these are all probabilities—you divide the top by the bottom, and you'll get roughly $0.01$. Again, if the top is $0.99$ and the bottom is $0.01$, you'll get $99$, and so on. The last one is $1$.
And so the goal is, $0.01 / 0.99 = 0.01$. This means that the action that you do is actually very likely, right? Because the bottom is $0.99$. But we actually don't like this, right? Remember, the top is $0.01$. We do not like this, so the ratio is $0.01$.
And then if the denominator is $0.01$, this action is actually not likely. But we actually like this because the top number is $0.99$. And so when you do the division, you get $99$. This is actually good. And so we're not actually trying to maximize the probability; we're actually trying to maximize the likelihood now.
And so the question is, why don't we just maximize the probability, like the first equation? Why do we need to do this division thing? Because if we maximize just the top, you will have reward hacking. "What is $2 + 2$?" It might say, "To solve this question, we need to do blah, blah, blah, blah..." and suddenly it says, "Hello, hello, hello, hello, hello, hello," and then it says, "Four." Is this good? I don't think so. We don't want it to say, "Hello, hello, hello" or have some weird trace in the reasoning model. We don't want this to happen. And so actually, this "Hello, hello, hello, hello" is very unlikely. The goal of the division is to reduce these issues.
The $\epsilon$ part is called the trust region. Essentially, we don't want to make large steps for PPO, right? We don't want to do large steps, and the trick is we want to restrict them. You don't want to overfit the model. So now we restrict the model, and $\epsilon$ could be like $0.2$ or $0.1$. Then $1 - \epsilon$ is $0.8$ or $0.9$, and $1 + \epsilon$ is $1.2$ or $1.1$. The trick is we just want to not move the direction of the gradient that much. We don't trust the model or the algorithm that much, so we want to constrain it.
And then in PPO, there's also a KL term. Essentially, what this does is we want the model to be as close to the supervised fine-tuned model as much as possible. We don't want it to go so far away from the base model or the SFT model. So essentially, if it deviates too much, we want to tax it. This $\beta$ is like $0.05$, and the KL divergence is the distance—okay, it's not exactly a distance, but it's like the distance between the current model and the reference model. We want to shove this into the equation.
So you can see with PPO, there are many moving parts. Who cares about the equation? It's not that complicated. The point is all of these extra add-ons are just to reduce overfitting and not to make the model randomly go to some weird state that overfits to your questions. The trick of PPO is they just added all these terms to make training more stable.
And so the final equation is like this. Hopefully, to be honest, no one even calculates the formula; it's not that important. But I just tried to break it down into pieces. The goal—remember, the goal is to maximize this equation. We want to maximize it. Normally, I just like to think about this one, right? You just need to learn this one. You want to maximize the probability. So it's just this equation. Remember we did the table? Just this is enough. You don't need to learn the rest of the formulas; it's not very interesting. Any questions? Yes.
Yes, correct.
So the biggest problem is, pretend you just started RL. You have the base model or a supervised fine-tuned model, and then you do RL. The gradient updates at the very beginning are going to be gigantic, right? You ask, "What is $2 + 2$?" and if it says five, you want to penalize it dramatically. The problem is you don't actually want to do large steps. And so the goal is you want to constrain it. The constraint factor is, if the gradient update is extremely large, you just want to constrain all the numbers. The goal is just to constrain the update, not to make it too large.
What about the ratio?
Oh, the ratio—it's the KL divergence. Oh, sorry, not the KL, the likelihood ratio. To be honest, I think I need to do more research. I would ask Gemini exactly what it is. That's my answer; I'm probably not the best person to answer every single question. Yes. Any other questions? Yes.
Yes.
It's the model that actually created the action. And so the top one is all of the numbers that actually—how do I explain this? The bottom one is the model that created the action. So for example, the model says you want to go up, down, left, or right.
Is there a correct or a...
Oh, we just created the action. It could be anything. It could be—so it's whatever action the model says currently. It might be wrong; it might be good; it might be bad. It's just any action. Any other questions? Okay. Yes.
[Inaudible question about space/optimization]
I don't think I can answer that question. I don't know, maybe research papers show it. I'm not sure. Okay. Well, okay. So GRPO, the trick from PPO is we remove the value model. We get rid of it entirely. We do not want to estimate the average reward; it's totally removed. And the reward model is now replaced with a reward function.
Remember, the value model is removed. We get rid of it entirely. But what do we replace it with? The trick of GRPO is we do rollouts or inference sampling. To get the answer to "What is $2 + 2$?", you literally make four inferences. You just literally call the model four times. It could say the answer is zero, the answer is one, the answer is two, or the answer is four. You literally call the model four times, and you take the reward. The correct answer to "What is $2 + 2$?" is four, so you want the last number to be $+1$ and the rest to be $0$.
And the trick is you literally just take the statistics of your current rollout. You take the statistics of all of this. You literally take the reward minus the mean, divided by the standard deviation. You get the z-score. And this is your baseline—this is your value model. There's no more value model anymore; it's just a statistical score.
And so I did this on a table as well. "What is $2 + 2$?" If you think it's zero—remember, the prediction could be $0, 1, 2,$ or $4$, and your reward could be $0, 0, 0,$ or $1$. If you take the mean or the average of all the rewards, you get $0.25$. If you take the standard deviation, you get $0.433$. And then you do the reward minus the mean, divided by the standard deviation, and you get some numbers.
Remember, the number $4$ is correct. That is why the reward minus the mean divided by the standard deviation is $1.44$. It's the largest number. And so that is why we need to essentially maximize that good answer, and we want to reduce the bad answers.
But why is it called Group Relative in GRPO? Because it's not just one question; it's many questions. It could be, "What is $2 + 2$?", "What is $4 + 4$?"—okay, well, my graphs are all the same, my plots are all the same, but anyway, imagine there are four different tables: "What is $2 + 2$?", "What is $4 + 4$?", "How do I create this Python function?", you know, whatever. And there will be four tables. And so the Group Relative part just means that for each question, we take the statistics within each group.
For example, for "What is $2 + 2$?", you literally call the model four times and you get some answers. For "What is $4 + 4$?", you call it four times. For "Create Python code," you call it four times.
Yes, there are other factors of GRPO. Essentially, we already explained what GRPO is, right? Everything you need to know about GRPO, we already explained, and the total mathematical formula looks kind of like this. There's some rearrangement. For example, the $-\beta D_{\text{KL}}$ term is just taken out of the reward function. That's the only other difference. Hopefully, it makes more sense about the parts of the GRPO formula. It's actually not that complicated to understand.
The majority of it is just trying to reduce overfitting—that's the whole goal. $-\beta$ times the KL divergence is to reduce overfitting. $1 - \epsilon$ and $1 + \epsilon$ is to reduce overfitting. The division is to reduce overfitting. Everything is reducing overfitting, right? So that's all of machine learning and AI—it's just to make the training more stable and to reduce overfitting.
I would highly suggest these two resources. Nathan Lambert's *Policy Gradients* book—it's online, very, very, very helpful. And Yannic Kilcher's video on GRPO is very, very helpful as well. And now I will go into a Colab demonstration of GRPO. Before that, does anyone have any questions? Let me just check the time. Questions? Yes.
[Inaudible question about learning or memorization]
Memorization? You get more memorization. To answer your question another way, I think it's actually because GRPO itself has a constraint. Remember, the goal of all these algorithms is to force the model not to detract too much from the original model. With this $-\beta D_{\text{KL}}$ divergence, $1 - \epsilon$, and all of this, we are trying to make the model not go too far away from the reference model. And I think that's the constraint because you're essentially forcing the model not to go too far.
Maybe there might be some new algorithm, some other formulation where you can go very far away—you could do that. I don't know if there are any research papers about that. I don't know, but you could, yes, you could do that. Yes. Any other questions? Yes.
[Inaudible question about JEPA or energy-based models]
Yes, JEPA. Yes, energy-based models.
Yes. So what do you think about that?
What do I think about it? I can't really comment. I mean, you should definitely listen to what Yann LeCun says.
But you don't see anything like movements in open source?
I don't think so. I think in open source, we kind of got captivated by RL and GRPO. I don't think open-source people are doing whatever he's talking about, unfortunately. I think he needs to talk about it more—JEPA and energy-based models. Unfortunately, I don't think open source is there yet. Maybe we should talk about it more, but yeah, I don't think so. Yes.
[Inaudible question about removing the value model]
Yes. Yes, you got rid of it.
[Inaudible question about group size and statistics]
Yes.
[Inaudible question about group rollout size]
So this is more of an optimization question. In theory, you should make—for example, I just selected four, right? "What is $2 + 2$?" creates four examples. You should do as many as you like—$3,000$, whatever number you like. You should do as much as possible. But remember, AI is about optimization. This is going to take forever; it's all about efficiency. So probably don't do as many as you like. But in the limit, you should do that, but not everyone can just sit there waiting for the computer to spin. So yes, you should do as much as you can.
Also, for recommendations when you do inference sampling, you should set the temperature to be like $1.2$ or $1.5$, and set min-p to be $0.1$ or something like that. If you set the temperature to be zero, you'll have the same answer every single time. So definitely don't do that. You should have high temperature numbers to make the model produce new outputs as much as possible, maximizing variability.
Distribution.
You should try your best to maximize variability. Your outputs should not all be the same. If they're all the same, I don't think it is going to learn, so you should make them as different as possible. That's why you should set the temperature to be $1.2, 1.5$, some large number. Don't do it too large, though. Any other questions? Yes.
Yeah.
[Inaudible question about dealing with zero rewards or sparse rewards initially]
All the first steps—all the rewards are basically no signal at all of the underlying... How do you deal with that? I have faced that a lot of times, and moving to a larger model sometimes helps.
Yes, that's a good question. So essentially, you're saying if the model starts off with like no reward—every single update is like $0, 0, 0, 0, 0, 0$—it's not going to learn anything. Yes, that happens all the time. But just by chance, you know, you have some small little probability, and just by chance, you will get some reward. That's the trick. You will see this: after $10,000$ inferences of "What is $2 + 2$?", suddenly the model says "four" just by random probability, and then we make this more likely. That's all of GRPO.
But this is an addition—a simple task. If it was like a proof of concept, it may never come up with...
Yes, it may never. But remember, you're not doing this with just one question. You're also shoving this together with other questions: "What is $2 + 2$?", "What is $4 + 4$?", "Derive the derivative of blah," "Do this Python function." This batch is very large; you essentially shove this all together. And the trick is, in general, it works.
Maybe by bad luck it might not work, but I feel like the bad luck won't last forever because remember you're changing the samples, right? So the question "What is $2 + 2$?", you're changing that, and the next phase will be some other question. And so the trick is, just by chance, you will get a good reward, and we just force that to be more likely.
Does that kind of make sense? To be honest, it's all luck. Yes, it's all luck. We're just guessing or praying that there's going to be some positive reward somewhere in the model. There will be negative reward, right? So if your model is really, really bad, you can do negative reward, and so you just don't want to do the negative one—you want to do the negative one less.
And by miraculous probability, relying on probabilities, you will get a good reward somewhere just by chance. Does that kind of make sense? I mean, all of the large model labs are literally relying on the fact that that's what they're doing. They're just guessing. We're just praying for the GPUs to work, and then suddenly the reward comes out. I'm being serious, that's exactly what they do. They're just waiting for the algorithm to work.
That's why people do random seeds as well. For example, the initialization of the model might not be good, so you just kill the training run. You do like $500$ training runs; if $499$ of them are zero reward, you just kill them all and don't release them.
I have seen that on my training runs.
Yes, very common.
Start with a smaller step.
Yes, I was going to show you guys that. Exactly. So you could force the model to answer some question—like for example, you ask a question, "What is $2 + 2$?" It's very easy, it's four. You just force it to learn: "Okay, it should be four first." And then you do other steps. That is actually why—remember, okay, I have to go back to all the slides. I don't remember where it is. It's the same as this problem, right?
Someone asked about why don't you just start from the blue dot, the pre-trained model, to go to the green one. It's the same thing. Essentially, the trick is we want to do some supervised fine-tuning to make it know some instructions so it knows something, and then you want to go to the reinforcement learning phase. But if you want to start from nothing, like just the pre-training phase, that's the hard part, right? Your reward might be zero forever and then suddenly one.
What if you supplement...
I think if you see zero rewards, most likely either, one, your reward function is not that good, or, yes, doing priming, making the model learn a little bit about your data, does help. So there are tricks to make it work, but generally, I would just say it's bad luck. Just bad luck, and unfortunately, you can't do anything—it's not your fault. It's just unfortunate. Yes.
How should we think about like what to expect from this? Is it going to be that this is the way open source catches up with closed source, or is this a new tool for specialization where a competent ML engineer can specialize a model? Is there consensus on where this is going to bring us? Is it going to give us a really good open-source model, or is this a new tool for specialization?
The algorithm is not special. The hard part is actually the reward functions themselves and the data that you're going to shove into the model. That's the hard part. So I think there is a misconception that the algorithm is important. No, science-wise, it's useless. Who cares about the algorithm? You can literally just use the general function which I gave you; you can just use any algorithm that you like. But the problem is actually the reward function itself.
I gave you some examples, like, "What is $2 + 2$?" The answer is four. Yes, you can do distance-based, but that's just one example. Can someone make a reward function for trading stocks? Do that, right? and then there you have a model for trading. Go ahead.
So you think it's going to be more that because people are able to create reward functions, it's a little bit easier, similar to how you could create a prompt? It's an easier thing for most people to iterate on.
It's actually quite hard, but it's easier than coming up with the algorithm itself. Actually, I think collecting data—in the olden days, large model labs would ask large data providers like Scale AI or whatever to create data. Like, "What is $2 + 2$?" You literally have someone sit there and write, "Okay, the answer is four," but then you also have to do the chain of thought, like, "I think the answer is four because of blah, blah, blah..." This is my working out. You literally have to ask someone sitting there to make the data.
The trick is: no more. You don't need the data labeling step anymore; it's totally gone. You have the answer, and you have the question—the middle step is totally removed. But you still need to make the reward function. You need to verify; you need to say, "Is the answer four good or bad?" For math, it's very easy. For code, it's somewhat easy. You can check, "Oh, did you import the correct function? Did you import the library? Did your code execute?"
But it's still hard to verify if your actual function is correct. For example, let's say your task was to create a Flappy Bird game. How do you actually know that the output is good? How do you actually know? We don't. You could again ask a human to verify, or let's test a Flappy Bird game and then give it a good reward or a bad reward.
Or the trick is: did the game actually run? If it ran, $+1$. Did you see the words "Flappy Bird" inside of the functions? If yes, $+1$. Did you see the image of the Flappy Bird sprite being used? If yes, $+1$ there. Something like that. So it's still—I would say the hardest part is writing the reward functions. And for open source specifically, if the entire open-source community starts writing reward functions, we can probably beat OpenAI's o1.
Plus compute—you still need compute, that's the problem. You still need compute. But if you write good reward functions, you'll probably catch up in no time.
So the end state here is that you want an open version of the closed model. It's not—I guess my original question is: is this a thing that people are going to use like prompts to have specialized models separately, or is it more that we want a good open model?
It depends on which school of thought you're in. If you're in the school of thought that large language models already have the capability and you're just trying to accentuate it, then there will be just one model. Yes, but this model can only learn some facts because otherwise, you're overfitting.
If you're in the second camp—that the model actually learns something new... Oh, wait, did I say it right? I think it's the opposite way around. The first one is you have many models because a model doesn't actually learn that much. But if you're in the second camp—that a model actually learns everything—then you have this one gigantic model. I think OpenAI probably subscribes to that point. Most large model labs think that actually RL can get you to AGI. It will know everything about everything; any single question you ask, it already knows.
And so that's where they're trying to go. For open source, it's harder. I think the open-source community consensus for now is that the model already knows your questions—you're just trying to accentuate it. And by doing reward functions, you're trying to weight the circuits more; you're trying to weight the model to know how to do these equations and stuff like that.
So I think the goal of open source is, if the entire community comes up with good reward functions, writes them all, then the problem is we need compute. That's the second problem, right? If you shove both of them together, you will get o10 or something, I don't know. Imagine if every single person writes a reward function once per day—okay, once per day is probably too hard—but we will have seven billion reward functions, more than OpenAI can ever come up with, and we will defeat OpenAI. But you need the computer part, that's the only issue. Okay. Any other questions? Yes.
[Inaudible question about saving traces / distilling]
Yes.
How do you feel about saving those traces?
Very smart. That's what—yes, yes, yes. I don't know if large model labs do that. You could do that, yes.
I feel like we're just mining for those good examples.
The only problem I would say is, pretend the question was, "What is $2 + 2$?", right? And then the model says, "Let me work out what is $2 + 2$. I think the number two means two apples, and I want to add two more apples. I think the answer might be three... Oh, wait a second, it's four." Should you fine-tune on that? I mean, you could, but maybe it's like cheating. Maybe it just says four by chance.
Let's say that by chance, the question was, "What is $2 + 2$?", and it says gibberish like, "I like to go to Paris for fun," and then suddenly it says "four" just by chance. Remember, we're still rewarding this. We're literally rewarding this as good, but this is not good.
No, we reward it at the very end.
Remember, we see the number four—it is good. The question was, "What is $2 + 2$?" The model can generate anything it likes.
It could...
You could, but that gets harder. The trick is people don't actually reward the steps in between—they just do the final step because otherwise, it gets too complicated. What is the intermediate steps' reward? It's way too complicated. So what you do is you just reward the final step. So if you see the number four, it's good, but we don't know how we got there.
So yes, you could...
Maybe at the very end step of RL, you can then use some data to do fine-tuning. Yes, you could. But I think, in general, it's because we don't know what the process is in between.
You say we don't train on the thought?
Oh, no, we don't train on the thought. Yes, we don't.
The problem is...
We don't train on the intermediate steps in between. You don't have to. No, you don't. You could, but remember, we don't know if the traces are good or bad. We don't know. So you can't just take this trace and then do supervised fine-tuning, because pretend the answer four is good, but we don't know the intermediate steps unless you read the data.
You could ask some human labelers to verify if this trace is good, but then that kind of defeats the whole purpose of RL. So you don't want to do this. Does that kind of make sense, or not really? Okay. Yes.
The what, sorry?
Oh, yeah, yeah, we will do that. Yes, yes, yes.
Python.
Yes.
What's the normalization among the rewards that are like best practices?
What do you mean by normalization amongst rewards?
If we have four plus...
Yes.
If we're running this in one big batch, how do we normalize the fact...
Good question.
Correct. It could be like $-10$, $+100$, negative infinity, binary...
Very good question. That's your choice. Unfortunately, that's the problem of RL—it's all about human choice. You will have to decide, for example, is the Python one more important than your math "What is $2 + 2$?" Then you can weight it more. For example, you make the reward for $2 + 2$ as $-1$ and $+1$, and the Python function as $+1,000$ and $0$, right? You have to decide on the weighting functions. That is your choice. Unfortunately, it is kind of like an art.
You could, dumbly, just do everything on the same scale. I think that's what most large model labs probably do. Everything, all the reward functions, have the same scale, like $+1, -1, +1, -1$, not $+10$ and $-1,000$. So it's up to you. Okay. Yes.
And what about stuff like, you know, "Is this a good summary?" How are we creating that?
That is the question. So now you want to analyze, okay, that's where the LLM-as-a-judge comes in. There is a school of thought that you can use a language model itself to make a number. You can ask, "Is this a good summary or is this a bad summary? Please give me a score from $-10$ to $10$." You could do that. That's called the LLM-as-a-judge method.
There is a paper which shows that you can do this for some time, but then it breaks down. So you can't just keep calling the language model—it's kind of like cheating if you're trying to call ChatGPT to train ChatGPT. It will work for some time, but then it will break down.
There was a paper—I need to find the paper—but the paper showed that if you keep doing this, your actual reward actually goes backwards. You will get more and more reward, and then suddenly it just, by bad luck—again, it's always about bad luck—the reward just goes back. I'm serious, all of AI is about bad luck and good luck, and optimization, trying to do efficiency. That's what everyone is doing, unfortunately. So yes, you can use LLM-as-a-judge. Does that kind of answer...
If I use LLM as a judge, doesn't it just end up being a teacher-and-student model?
Yes, but that's why sometimes—essentially, the problem is, like I said, if you keep doing this, it will actually perform worse. Intuitively it works, but then at some point there is actually another way. You could ask a language model to generate reward functions. That is actually another school of thought. You can actually ask a language model to generate seven billion reward functions. But the question is, are the reward functions good or bad? I don't know.
So now you need to rely on the fact that the models are good or bad. You could then ask another language model to verify the reward functions. Yes, you could do this. Maybe that's what OpenAI is doing. I don't know. Maybe OpenAI's goal this whole time is like, "Oh, let's generate all these reward functions, verify each of them, and then shove them into the function. Let's see what happens." Maybe that's what they're doing, I don't know. But yes, it is a student-teacher setup. Yes.
What's your opinion on how to make models... scalable in the sense of many?
Yes. The majority of reward functions currently—that's why they are called verifiable rewards—are math and coding, to be honest. I think coding is also hard; I don't know why people lump them together. Coding you can't actually verify is technically $100\%$ correct; you can just say it ran, or the output is most likely correct for some functions.
But for example, the Flappy Bird game: tell it to create a Flappy Bird game. How do you actually verify if it even is the Flappy Bird game? I don't know. But you could, right? That's the whole point of LLM-as-a-judge. You could take the output of the Flappy Bird game, ask the language model, "Does this look like the Flappy Bird game?", and if it's yes, $+1$. If no, $-1$. You could do that. But you can only go so far. Does that kind of...
What your opinion is on the scaling...
I think most—I think large model labs currently are just trying to use their own model to literally reward it, as I described. You know, ask it, "Oh, does this look like the Flappy Bird game? If yes, $+1$; if no, $-1$." And I think maybe that's what large model labs' view is: if you keep doing this, you'll get to AGI. That's their view. I mean, you could maybe, but then I always fall back to: "Oh, but you might have bad luck; it's not going to work." So I think in general, it won't work. You will only get so far, and then suddenly it just doesn't work. Okay, any other... Yes.
How do you think about...
[Inaudible question about combining domains or legal bots]
That is your choice again. So if you want to specify—for example, you just want to make a legal bot. You're given some sort of court case, and if the plaintiff wins or the defendant wins, you could just do law. Yes, you could. You could do that.
But in my view, you should combine it with other sources. You should combine it with some math. You should combine it with some programming because the point is you don't want the model just to overfit to just law. Maybe math might be helpful, just by chance. Again, maybe programming might be helpful—probably not, but in general, yes. So you should combine other domains together.
I feel like all the large model labs—their goal is to do every single domain possible, right? Like mine every single reward function in the whole world, make all the reward functions, shove them into the model, and it just learns. So yes, you should do more domains. Yes.
That's another...
[Inaudible question about research and model size]
So the notebook I will share will showcase that you should probably do some supervised fine-tuning first—it's called the priming stage. Otherwise, remember the plot over here? You don't want to be in the situation where you're starting from some bad pre-trained state and you're trying to go to the RL stage. It's very inefficient. Remember, AI is all about efficiency. You don't want to do this step, so we do have to do some priming—the SFT stage and the other stages. If that's your question, or...
If you want to—yes, the bigger the model, the better. Yes.
Can...
That's the trick. Essentially, the research papers show that small models actually do work, confusingly enough, because essentially these small models just do longer thinking. They do longer reasoning traces if the model is smaller, and if it's a larger model, maybe the reasoning traces are smaller in general. So I feel like the small models actually do work.
They do break down, though. If you want to do very complicated reasoning traces, then maybe the small models might not work because, you know, there are only seven billion parameters—there's not that much space you can move. And the large models, you just have more space to move around, and so that's why large models are better. I don't know if that answers your question, but...
[Inaudible question about fine-tuning distilled models]
Yes, correct. Exactly. Yes, that's what you should do. Yes, you can take a distilled model, like a reasoning model, and then further fine-tune it. Yes, you could. I would say it's a bit more complicated because you could do that, but remember the reasoning model itself is already a reasoning model, and you're trying to fine-tune it to become another domain. It might be easier, it might be harder—it's all about luck again. I don't know. So you have to try. It's all trial and error. Yes.
Two questions for you. One, it's pretty empirical—just try and see what works and what doesn't.
Yes, correct.
Okay. And then the other side: how are you keeping up with all the papers and all the content that's being put out? I'm sure it's a lot. How do you learn what to follow?
To be honest, I don't think you need to follow. That's my view. Don't try to follow the latest research because sometimes the next day it's a rebuttal of the previous paper, and then the next paper says, "Oh, it's a rebuttal of the rebuttal." So I would not try to keep too much up to date with the latest research. I think the field has kind of matured, and it is mostly stable now. You might have some algorithm increasing accuracy by $1\%$ or $2\%$ or some efficiency improvement.
Remember, all of the papers are about efficiency. It's always about efficiency—making the training more stable, reducing overfitting. It's always these similar papers. So I would say don't feel pressured to keep up to date with all papers. Twitter is very good as a resource. Sometimes I tweet about papers.
I highly suggest the Nathan Lambert RLHF book. It's very good; he keeps updating it. That is very good, so definitely read that. He updates it all the time, and so maybe follow Nathan Lambert. He's actually a very good follow on Twitter for the latest research, so he's very useful. In general, there's a lot of noise in the RL space as well. You don't know if the research is good or bad—rebuttals on top of rebuttals. So I would suggest people just try. It's just trial and error, right? Try to see if your reward function is good or bad. If the loss is zero or the reward is just $0$, unfortunately, something is wrong, or it's just bad luck. Try again. So it's just empirical. Yes.
These slides, sorry—are you putting them up?
Oh, yeah, yeah, yeah. Yes, these slides should be up. I was supposed to make a Bitly link; I'll probably do that later, but I will share the slides, yes.
On Slack?
Oh, yeah. Okay, I'll do that then. Okay. Any other... Yes.
Yeah.
[Inaudible question about GRPO vs PPO models]
In the old PPO sense, the value model is a separate model, and the reward model is a separate model. Yes, remember in GRPO we delete the value model. The value model is totally gone. We create the value baseline from just statistics from the distribution. We essentially just create—"What is $2 + 2$?" creates four examples, four trials, and then we find the mean, find the standard deviation, and that is our baseline. It's not even a model anymore.
And then the reward model is gone as well—it's just reward functions. And that is why we call it reinforcement learning with verifiable rewards. It's not normal RL anymore; you replace the reward model as well. Does that kind of make sense?
That does make sense.
Okay.
There's...
[Inaudible question about model circuits and learning]
Yes. So again, there are two schools of thought. The first one is like, the question "What is $2 + 2$?"—somewhere in the model, somewhere in this high-dimensional $1.4$ trillion parameter space, it knows to calculate it as $4$. There is some sort of circuit inside the model, and the goal of RL is just to maximize this circuit somehow via these formulas. But that's one school of thought. The other school of thought is, RL is actually learning something new—it's actually learning how $2 + 2 = 4$, and it's not originally in the model. Okay, any... Yes.
But when you say capabilities of a model that already has them inside, do you mean knowing actually the answer to a question, or knowing how to reason to get to a question?
That is a good question. Maybe both. I think it depends. It probably knows how to do the reasoning. For example, a contrived example: you get all of the entire world's data, like $30$ trillion tokens, and you just make a question that is not part of the data. You could do that, right? What is some random number times some other random number? You can make a math equation which is not in the data, but somehow the model has learned to do multiplication, has learned to do addition somewhere.
So maybe this circuit for addition, for multiplication, for many, many circuits of these functions—we just want to accentuate them all, and that is what RL is trying to do. And yes, there's a reasoning circuit—somehow the model also learns how to do reasoning, and so we also want to make that more important. And so $2 + 2$ is very important, addition is very important, multiplication is very important, and so on. We're just trying to make all of these circuits more prevalent.
But that's only one half of the AI community, right? That's only one half that thinks like that. The other half is like, "Oh, but the model is actually learning. We're actually training the model to learn, and the base model actually doesn't know how to do reasoning." Does that kind of... Okay, yes, hopefully. Any other questions? Yes.
[Inaudible question about frameworks / TRL / Unsloth]
Yes, there is TRL, there is Verlin... We also show that you can do GRPO and reinforcement learning with very low resources. So we are the only package which allows you to do GRPO on a free Colab, and so that's the only difference between us and everyone else. TRL is very good for large training runs, but for now on Unsloth, if you want to do small experimentation, you want to try stuff out, you don't know what reinforcement learning is, you don't know how to make reward functions, you don't even know what reward function to use—you should utilize our notebooks, and that's what I was going to demo.
One question is, let's say you don't have a reasoning task, just a regular task. Is there any effectiveness in using this to just improve, let's say, tool use for your model?
Yes. Yes, you can do that. Exactly. It should increase accuracy by quite a bit. If your accuracy before with tool use was not very good, RL should definitely help. And I feel like the trick of RL is it reduces overfitting. I think that's the trick because you do multiple inferences—you don't know which one is correct, but you're trying to maximize some good ones.
The problem with general fine-tuning is you're kind of overfitting the model. And so the trick of reinforcement fine-tuning is you can essentially reduce overfitting. So the model actually learns how to do tool calling, not just, "Oh, I see someone is trying to do a restaurant order, I just want to call DoorDash." But it actually learns, "Okay, because the person wants to order food, I should order DoorDash." It's like reverse thinking. So it should definitely help.
In some of my experiments, I tried not explicitly asking the model for a thinking process, just to do some task, and it does not automatically start the thinking process unless you explicitly prompt it, "Okay, first think and then..." I was trying to check if, without explicitly asking it to start the thinking process, just to improve the tool accuracy itself, does it start? And that, I think, did not happen because...
Generally, people utilize GRPO and reinforcement learning algorithms to create the thinking process. That's because it's like an artifact of GRPO—just by chance, they see the reasoning process. You don't need it. So maybe by chance, by luck, somehow it learns how to do tool calling and it's not some thinking process. It could be some weird symbols, maybe. I don't know, it could be using some other—sometimes models have different languages suddenly. It could be like that, you know? Randomly it learns how to do tool calling, made some new programming language internally—I don't know, but it could have done that. In this case, there was no thinking process.
Yes, it just directly gives the output because, let's say you sample 10 trajectories and none of them have a thinking process, then it never explores those parts.
For now, it will never. But remember, it's all about luck. Over time, you will have a thinking process just by miraculous chance—there is a thinking process somewhere, and then, "Oh, you should do this more," and it will just do this more. But to make that more probable, you should prompt it.
You can prompt it. So essentially, in the system prompt, you can say, "Please put your working out between this box." You could force the model to create the working out. You could do that. But is this the most efficient? I don't know. You could say, "Oh, please create a new language that I don't understand which does tool calling," and then it does some weird symbols, and then it does tool calling. I don't know. But yes, you should prompt it—it should make it more effective. Okay. Any other... Yes.
What's the secret sauce in Unsloth?
Oh, we utilize Triton kernels. We do kernel optimizations. We reduce memory usage by $70\%$. There are lots of optimizations that we do to make training faster and more memory efficient.
For VLMs specifically...
Yes, later. It's not the main focus, but yes, we will talk about that. For VLMs specifically, we use... Okay, that's actually in the notebook. Oh, before that, do we have any other questions? Yes.
[Inaudible question about test-time compute vs training]
Actually, the DeepSeek paper talks about this. There is pass@k and majority@k. I think they said that if you do test-time scaling, it improves majority. I think that was correct. I'll have to revert back to the DeepSeek paper.
But they did say—remember, test-time scaling is different from reinforcement learning. There are different methodologies. Test-time scaling is calling the model $10,000$ times, and then you just check by average. For example, you ask ChatGPT, "What is $2 + 2$?" It might say four, and suddenly it says five by chance, or it says zero, and you just take the most likely answer. That's called test-time scaling. And then reinforcement learning is different—it's more like, we want to actually train the model to actually do the whole trace, and you don't need to output $10,000$ examples and get the best answer. You just do one.
Yes. Correct.
Yes.
[Inaudible comment on combining them]
Yeah, so the trick is you do the RL step and then you do test-time compute—it will actually make the accuracy much higher. You—that's a good question. GRPO is kind of like that, right? So in GRPO, you do test-time scaling in the actual reward function. You literally call the model "What is $2 + 2$?", do test-time scaling, and then you aggregate the results. So it's like GRPO itself is doing test-time scaling internally. You could—that sounds like a new research paper. You could do that, I guess. Yes.
Okay, I will have to go to the notebook now. In order to access the notebook, you can go to our GitHub page. If you go to Unsloth, on the GitHub page, there is a button called "Qwen 3 GRPO," and you can click "Start for free." That's how you get the notebook. Or you can go to our docs, which have the notebook. So remember, go to the GitHub page and then click "Start for free" for Qwen 3 GRPO, and then you will get this notebook.
Generally, it's in dark mode, but I know in presentations people can't see that as well, so I will change this to light mode.
So we utilize vLLM under the hood. Does anyone not know vLLM? I think that's a good question. Who does not know vLLM? Okay, $100\%$ you must use vLLM, right? For all open source, how do you serve a large language model? Please use vLLM or SGLang, or I think Hugging Face has one as well. These are the best open-source libraries to serve open-source models. You have a GPU; how do we actually serve Llama 3? How do we serve Llama 4? You use vLLM to serve it.
The trick of Unsloth is—Unsloth is a package for fine-tuning, for GRPO, for reinforcement learning, for whatever you like, continued pre-training, whatever. And the trick is we just optimize it. We make it much faster, use $70\%$ less memory, make it fit on a free Colab. Remember, please use free Colab resources. Kaggle has 30 hours for free per week of GPUs. Please utilize them. They won't be unhappy. Please utilize them.
And yeah, so you install Unsloth and vLLM. And we have this thing called the `FastLanguageModel` class, which essentially allows you to call a model. For example, we will now utilize the Qwen 3 base model, right? Remember I told you not to do this, but we are going to do it anyway. This plot—where is the plot? We are going to go from the dark blue dot to the dark green dot. Yeah, that's what we're going to do. We're actually going to do what I suggested not to do, but anyway, we're going to do that.
You also have to set a max sequence length. So for example, if you want to make it longer, you can set it for longer. If you want longer reasoning traces, you can increase the maximum sequence length. We set it as $2048$. If you set it larger, the free GPU will run out, so that's the problem.
You can also load in 4-bit. So if you want to do 4-bit quantization, you can make the model go to 4-bit. You can reduce memory usage by quite a bit, so you can do that as well. And remember, we are utilizing LoRA, which is a parameter-efficient fine-tuning method. You don't need to fine-tune every single weight inside the entire model—this will be very, very costly. Instead, we add small weights to the model to fine-tune it, and so that's a trick that we do.
And because we utilize vLLM directly, we do a trick: we actually reduce memory usage by $50\%$. The trick is we share vLLM's weights directly. Other training frameworks have to copy vLLM's weights because you have the model for fine-tuning and you have the vLLM weights. The trick that we do is we actually share the vLLM weights directly so you can reduce memory usage by a further $50\%$.
We use something called Unsloth gradient checkpointing, which reduces memory. Essentially, everything in AI is about reducing memory usage, more efficiency. Everything that we do is just efficiency, so everything that we set is for efficiency purposes. For your LoRA rank, if you do LoRA, please set the alpha to be two times the LoRA rank. It speeds up training dramatically, so please do that.
Some lots of stuff—compiling. We do like automatic compiling and stuff like that. You don't need to read all of this. And here is the bulk—this is the most important part. Someone was asking about a prompt. You make a system prompt, right? "You are given a problem. Think about the problem and provide your working out. Place it between `<reasoning_start>` and `<reasoning_end>`." Right? So the reasoning start is "start working out" and "end working out." So it should look something like this: "start working out" and "end working out."
"You are given a problem. Think about the problem, provide your working out, place it between 'start working out' and 'end working out', then provide your solution between 'solution start' and 'solution end'." So it should look something like this. And this is the system prompt that we're going to use for reinforcement learning.
Remember, you can customize this to however you like. You don't have to say, "You are given a problem." You could say, "You are given a legal case. Think about the case and provide your legal thinking," place it between whatever tags. It can be literally anything. "thinking" and "end thinking." You can even make spelling mistakes; it doesn't really matter.
The whole goal of RL is you can design your reward, you can design the system prompt to whatever you like. I think that's the main problem—people think you must follow DeepSeek's `<think>` and `</think>` tags. You do not need to follow this at all. You can make it up entirely; this is customizable to whatever you like.
The hard part is, because we are using a base model—remember, this is a base model—you have to make a chat template as well. This is the more annoying part. You can just copy and paste this chat template; you do not need to do anything else. Just literally copy and paste it.
The base model does not have a chat template. When you call a base model, you can't actually call it for conversation—it's not ChatGPT, right? It's just a base model; it doesn't do anything. So you need to specify a template for it to understand how to do conversations. And so this is kind of like a template that we did. It's very generic; you can just copy and paste this. It should be the same for anything.
And then we show—after you do the chat template, we show an example of how to actually utilize the chat template and the tokenizer. For example, if you ask, "What is $1 + 1$?", you do the reasoning process, and it will say, "You are given a problem. Think about the problem and provide your working out..." and in the question—this is a question—your question is, "What is $2 + 2$?" Remember, the answer is four. And so, `<start_working_out>` is what you give to the model. You give all of this to the entire model.
You will give this to the model, and the goal of RL is you want to create the working-out process automatically. The RL algorithm will automatically create the working-out or the thinking process, and then finally it will say four—well, hopefully it will say four. And the goal is, if you see four, you want to make the reward higher just for that.
Now, someone was talking about fine-tuning with the instruct fine-tuning first. Remember, we go back to this diagram. We want it to start from the blue dot to go to the green dot. But we found it doesn't actually work, so don't actually do this. The trick is, we go back to this diagram. We actually want to take the pre-trained model, do some fine-tuning, do some supervised fine-tuning, and then go to the green dot.
And so this part shows that you should actually do some supervised fine-tuning. You need to do some fine-tuning to prime the model. The goal is you want to make the model not just output zero reward forever, and so this dataset allows you to prime the model to do supervised fine-tuning.
So for example, the problem is, "What is the sum of all the real numbers..." and then you use DeepSeek-R1. This is a trick, this is a hack: you use DeepSeek-R1 to create some examples, and you shove this in during the fine-tuning step, and essentially the model already learns how to do some reasoning, and that's the trick. This dataset is very small—it's only $7,000$ rows. You don't need to have that much data for just this first step. I think I only use $600$ rows—very, very little data.
So this is just a data preparation step, not that important. I need to skip to the reward function—this is the most important part of the model. So this is the supervised fine-tuning step. All of this is the SFT step, so this part of the model is that part. Not that important. Okay, we skip all of that.
This is the most important part: the reward function creation is the most important part. And I feel like the majority of people neglect this part—it is the hardest part to do. Okay, let's see where the reward function is. Oh, here it is.
For example, this one is a regular expression to match if your format is correct. For example, remember we ask the model, "Please put your working out between 'start working out' and 'end working out'." This regular expression essentially rewards the model for having "start working out" and "end working out." If it doesn't have this, you will actually penalize it. And so this is one reward function that I created.
For example, I give it an example: if you say, "let me think, end working out," it extracts two. Yes, that's good. Remember, we force the model to say, "You must generate the answer between this and this," and it successfully extracted two, so that's good. But also sometimes a model might generate some random spaces—it's possible. The model might not follow your exact format. We still try to match it; even if it generates extra spaces, we still successfully match the number two, so that's good.
This is a reward function. Essentially, what we see is: if it matches the format exactly, we increase the score by $+3$. If not, we just put $0$. And remember, this `match_format` essentially matches the regular expression. We had to create it by hand for matching the format.
This number does not have to be $+3$. It can be $+300$ or whatever you like. It can be $+1$, I don't know, it can be anything that you like, but I just found $+3$ to work fine. So you can do whatever you like—anything. And remember, the score is $0$ if you don't see it. You can also do negative rewards. For example, if it's otherwise, you can also do score minus three—you can subtract three points from it. So it is up to you; you can design your reward function however you like.
But remember, if the model output is not exactly following your format, we should still at least reward it a little bit, right? Otherwise, the reward will just be $0, 0, 0, 0$.
So the trick is, if we see a keyword, we plus one—or sorry, plus $0.5$ if you see the keyword. If you see the keyword in the output, you should at least plus $0.5$. But if you don't see it, then you minus one. This essentially allows you to partially reward the model.
More reward functions now get more complicated. This larger reward function essentially allows you to calculate the distance-based scoring. For example, remember we said over here—where is it, this one, right? "What is $2 + 2$?" Four is correct, but three is also a better answer than D, right? If you output D, it's definitely wrong. If you output five, it's okay, but it's wrong.
And so this function essentially allows you to take the good answer. This is the guess divided by your true answer, and it's like a ratio. This ratio essentially allows you to reward if your number is close to the actual answer—you give it a higher reward. And if your answer is very, very, very far off, then you penalize it by minusing reward. So this essentially allows you to do that. If it's exactly correct, you also add five points.
So this is probably the most important reward function, but this is only for math. For other things like code and stuff like that, you have to create more reward functions.
Now we test if our reward functions actually work. And yes, it extracts the numbers. This is just format reward—sorry, this is just extracting the solution. And you can see that it extracted $0.34$. It extracted this number, it extracted this, and it extracted this. If your reward function is not working very well, you probably did something wrong in the regular expression. So, please edit that.
And then these are helper functions. Oh, this is another reward function. If you see the number 12,345, we want to remove the comma because you can't actually convert this into Python. So, you want to remove the comma. And then if it's equal to the true answer, you plus $3.5$ reward, and if it's not, then you minus $1.5$ reward.
More dataset preparation functions, not that important. Here is the meat of the code for training, right? We call the VLM. Top-$p$ is $1.0$. It's probably not a good idea—$1.0$ just means you're sampling the entire space, so that's good. You can set this to like $0.8$ or something else, up to you, but I generally set it to be $1.0$ to be a full sampling of the entire space. Min-$p$ is $0.1$. I suggest people use this because otherwise, the model might go into doing inference of random outputs, so use $0.1$.
And temperature—I did suggest people increase the temperature to $1.2$, or you can do $1.0$. Try to increase your temperature as much as possible. The more you increase temperature, the model becomes very creative; it creates random outputs. If you increase the temperature too much, like $2.0$, your model will output gibberish. So probably don't do too large numbers.
I normally suggest $1.0$, $1.1$, or $1.2$ or somewhere around there. You should utilize min-$p$ together with high temperature numbers. There is a paper about using temperature $1.5$ and min-$p$ of $0.1$; you should utilize that.
There are some other things that we utilize. `num_generations` is very important. I set this to four. This number represents how many rollouts do you want to do? How many inference steps do you want to do for the GPU, right? We chose four. So four just means, "What is $2 + 2$?" It will create four options. That is this number. If you increase this number too much, you will use much more memory. You should increase this as much as possible if you can.
There's also batch size. We set this to be one. The trick is that the batch size times the gradient accumulation is equivalent most of the time—GPU-wise it is not, but essentially what this does is, if you do one, it just means we're doing one. What is $2 + 2$? If you set batch size to be three, then we shove all of these three examples together into one.
Generally, you should set batch size to be much larger. The problem is, if you set batch size to be too big, you're going to use more memory. So the trick is, instead, you do gradient accumulation. You set this to be 16. That's a trick. Gradient accumulation essentially allows you to do addition of gradients over time, and you can skip using too much memory.
And then there's evaluation. If you want to do evaluation, there are some functions for that. And then we shall see the training.
You will get a large table of numbers during the training process. This took 2 hours and 54 minutes on a free Colab. Look at the reward column, right? The reward column: $-7.5$, $-5.5$, $-5.5$, all very bad. Oh, and then suddenly plus $13$ just by chance. Suddenly it's plus $13$. Remember, with GRPO, the trick is, if you see this plus $13$, let's maximize this even more.
And then it didn't really work, so it goes back to $-7.5$, $-5.5$, $-7.5$, and so on. And then plus $11$. You see another good reward; we want to maximize this as well, and so on. That's the trick of GRPO. By luck, by chance—literally by luck—you will have good answers. With good answers, you want to maximize them.
If you keep looking down, if you keep scrolling down, in the end—okay, I need to make a plot, but in general, your reward will increase over time. Look, these are all positive numbers now. These are all positive numbers. Your minus numbers are getting less and less and less.
But essentially, if you plot this over time, the reward will actually increase over time. There are also other numbers, like completion length. Essentially, remember, when you use a reasoning model, the reasoning trace can be extremely long. So this column just tracks how long the reasoning process is. Over time, in general, the reasoning length should get longer and longer, but sometimes that's not always the case.
There is also another column called KL divergence. This essentially tells you how far the final model is from the original model. The larger the number, it means it's getting very, very, very far away from the original model. In general, this number should get bigger over time. Sometimes it doesn't move, but you should make this number go as much higher as possible.
We also made separate reward functions. Each of those reward functions also has their own reward. The most important one is the last column, or the second-to-last column. These are the two numbers.
In RL, there is a problem: most RL training runs just follow the format, and the model doesn't actually learn the task. So the format columns are not important. Do not look at the format columns; they are useless. You need to look at the last two columns.
And if you look at the last two columns, rewards and check numbers—this essentially checks if the output is good or bad. You see that it's $-2.5$, $-2.5$, not very good, and then suddenly $3.5$. $3.5$ is good, we want to maximize this. If you keep looking over time, in general, if you take a rolling average, the model gets better and better and better. Obviously, we only trained this for 2 hours and 50 minutes. You know, if you train it for 20 days, it might actually do very well. But remember, this is a free Colab GPU.
So in general, remember, the goal of GRPO is: suddenly we see a good answer with a good reward, we want to maximize that. And that's the whole point of GRPO. It's nothing fancy; it's just like, by luck we see it, and we just want to maximize it.
We can also see some output from the model. Right at the very beginning of the model, let's see an example: "Compute the number of positive integers that divide at least two of..." some question, and then it does some reasoning trace. Remember, we already fine-tuned it a little bit. So it does something, but the answer—it just goes on and keeps blabbering on.
But then if you look at the actual answer, where is it? We print out a lot. Okay, it just keeps going on and on. This is the output of the GRPO algorithm. You will see over time, if you inspect this, that the model actually gets better and better.
For just an example, let's say we ask the model, "What is the square root of $101$?" We don't just say, "What is the square root of $100$?" That's just $10$. We say, "What is the square root of $101$?"
If you do not train the model, this is what you get: it will say, "Answers, education, math and arithmetic. What is the square root of $101$? Wiki User... oh, Wiki User." This is what it actually will say. Where do you think this data comes from? Does anyone know? Can you take a guess? Where do you think this data comes from? Probably Wikipedia, right? So if you ask the question, "What is the square root of $101$?" it doesn't do anything. Remember, this is the base model. The base model is useless; you're not going to get it to answer the question.
But after we do GRPO, we ask the question again: "What is the square root of $101$?" It says, "Okay, so I need to find the square root of $101$. Hm, let me think. I remember that the square root of numbers between perfect squares are irrational..." and so on. And it says, "Solution: $10.049875$."
$10.049875$—I think that's correct. I don't know if that's correct, but probably it's very close. So the whole point is, GRPO produced all of this reasoning trace. In the olden days, you actually had to have a human write all of this, and then you had to fine-tune the model. With GRPO, you skip that; you don't need to make this anymore. It's automatic. That's the trick of GRPO and reinforcement learning: all of this reasoning phase is automatic, totally produced from nothing, and in the end, it gets a solution.
Yes, you had that.
Yes, that's the trick. If you do the base model, the trick of doing this is: if you just let the base model go into the space, you'll still get this, but it'll take too long. Otherwise, you'll wait there for like 20 days. For demonstration purposes in a Colab, you have to do the supervised fine-tuning step. That's the trick. Yeah. Yes.
What is the advantage of doing this with 7,000 examples versus using a model out of the box?
You can use an instruct model; we actually have notebooks for that. So if you go to GRPO in general, we have notebooks for using an instruct model. Okay, the internet is very slow. We actually have other notebooks. For example, if you use Llama 3.2 3B, that is using an instruct model. You don't need to use a base model, but we showed that you can use a base model. Yes, I suggest people use instruct; you probably shouldn't use base. It's all about efficiency as well.
Yes.
[Inaudible question]
Yes, yes. Yes, correct.
Yes. So the goal of KL divergence is you want the model not to stray too much away from the original model, right? KL divergence is—I shouldn't say distance, but KL divergence is like a distance between the current model that you're training and the previous, very, very beginning of the model, right? And so essentially, if the model is too far away, your KL divergence will be very large.
If you look at the plot—where is the table? I'll scroll up a bit—which one is the KL divergence? Oh, here, this column is the KL divergence column. Over time, it should get larger and larger because the model is straying away from the original model.
If you set the beta to be zero, then you remove this term. Maybe this might make the capabilities of the model more, because you're essentially not forcing the model to be as close as possible to the base model. This is an active area of research. So yeah, some people might set it to zero, some people might not. I think the default is $0.05$ or $0.03$ if that.
Okay, any other questions? Yes.
[Inaudible question about SFT loss and reasoning capabilities]
So for fine-tuning, actually fine-tuning is very helpful already. Where is the loss? The base model already is very bad. Here, there's a loss. This is using the fine-tuning step—the priming stage, right? You use a dataset to firstly prime the model. The loss does decrease. Remember, if you see a loss of $0.64$, that's good. If you see a loss higher than $3.0$, definitely something is wrong. Higher than three is very bad. You can see the loss definitely decreases over time. So yes, doing the fine-tuning stage does teach the model a little bit to do reasoning, and it learns how to do some stuff.
Also, a very interesting fact is we used DeepSeek-R1, some of the reasoning process, to do the fine-tuning step. And interestingly, if you just call the model without doing GRPO, it kind of does reasoning already by doing 7,000 examples, right? It already says—remember, the question was... oh, wait, this is just a general question. It kind of learns how to do reasoning somewhat, but it's not perfect. And so the goal of GRPO is to forcibly make it perfect—okay, not perfect, but as much as possible. So actually, the fine-tuning step already kind of learns a little bit.
Did you use all 7,000 examples?
No, actually, you don't need to use 7,000. I think I only used 118. It's two training epochs. Yeah, it's 118. I only used 118 rows. You don't need to—you can use 10 rows. Yeah, use as little as you want.
You must use more than three rows, though, because when you do LoRA, the gradients become zero otherwise. So you must use more than three, but anything more than three is fine. Yeah, so even if you use 118, it does fine. Yeah.
Yes. [Inaudible question about model scaling and tips for larger training runs]
Do you mean like a small model versus a big model? What's the difference? Can I tell you any tricks if you want to do this training on a bigger model?
Oh, yeah. Go ahead, you can take the notebook. You will need a better GPU, though. Take the notebook, edit this here. Not 4B; you can do 14B. I think there's a 14B, or is it 12B? I can't remember. You can do whatever you like. You could even do, you know, Llama 3.3 70B, up to you. Do whatever you like. But the goal for a Colab demonstration, because it's a small GPU, is to use a small model.
We actually have notebooks for free Colab which fit 14B. So Qwen 14B actually fits in a free Colab, so you can do big models in a free Colab. For Kaggle, again, I said use Kaggle's free GPUs. This whole page has notebooks. Kaggle has notebooks for GRPO as well, so you can do whatever you like for large models.
[Inaudible question about VLM rollouts and server collocation]
Oh, do you mean like for VLM rollouts? So, the trick we do is we colocate. You use the same machine for inference and fine-tuning, and the trick is you can reduce memory usage because you're sharing the VLM weights. For some other trainers like Verl and TRL, you do have to put the inference on another server, and then your training is on a separate server, and they have to do communication. For us, there is no communication. There is none. So we do it very close to asynchronous training—nearly no delay in training. But yes, we don't support multi-node yet, but we do plan to support larger training runs. Yeah.
What's...
Yes, people have asked that. No roadmap yet, I don't know if we're going to support it. It's a bit more complicated. You could use XLA; they do have PyTorch compiled down to TPU, so maybe it might work. I don't know if it works, I've never tried it. Maybe later. Yeah, maybe later.
[Inaudible question about bypassing SFT completely]
Oh, you don't have to. The whole point then, like this thing here, we chose a base model because we can show that you can do a base model going to the green dot. But then unfortunately, in the Colab, we do have to do some supervised fine-tuning, otherwise you'll wait there forever and the reward again will be all zeros. We just want to remember that all of AI is about efficiency and speed. So we just want to showcase: okay, you do need to do the SFT step. You still need to do the supervised fine-tuning step. Yeah.
[Inaudible question about instruct models]
Oh no, you don't need to SFT first. You can take the instruct model. So the notebooks over here—for example, if you go to the Llama 3.2 3B notebook, we don't do any fine-tuning step at all. You skip directly because it's an instruct model already. It already learns how to do chat; it already learns how to answer some questions. You can skip directly to GRPO. If it loads—but yes, the notebooks... okay, you'll have to wait for it to load. Whatever, the internet's very bad. It is loading. Yes, any other questions?
Yes. Over in the REINFORCE algorithm, you had the log probability of a state or an action. Is that happening inside the sampling model? Where is that in the notebook?
Oh, the algorithm itself of GRPO? Yes, is that happening over in the GRPO trainer? Yes, it's inside the trainer itself. Somewhere in the code, it does that.
It's figuring out what is the probability of a token versus all other tokens?
Oh, the calculation is inside the trainer. So, somewhere on the GPU, you're doing this calculation, but you do get the log probabilities. Remember, from the language model, you get the probabilities already. You just get the reward function, and you just want to maximize it.
So do you take the logits that come out of the large language model, turn them into pseudo-probabilities, and then just assume that's...
I think so. Yes, that's correct. I think it's the exponential of the logits. Yes, I think that's correct. If you go to the code, there is some derivation for it, but yes, you're correct.
Anyway, the notebook loaded. But yes, there is another notebook which does the instruct model here. There is the instruct model, and there is no fine-tuning step at all; it just does the reward function and stuff like that. Will's notebook, for example, is also very good. So if anyone wants to check other notebooks out, Will's notebook also utilizes, I think, the instruct model and then does GRPO.
Okay, time is running out, but technically the GRPO portion is done. Oh, there are actually more portions. I will have to breeze through them; there are only 10 minutes left. Whoops. Any other questions? I will take questions at the very end. Anyway, I'm going to stay here afterward.
We'll now shift over to quantization. I don't know if you guys know about the DeepSeek-R1 1.5-bit quants that we did, but you can essentially download these models. DeepSeek-R1 is, I think, 730 GB. You can quantize them down to 140 GB without that much loss in accuracy. Okay, there's obviously a loss in accuracy, but the trick is you can quantize them down to be very small, and miraculously, they work.
Llama-3-70B-Instruct, for example—you can't really see the accuracy plot clearly here. The smallest number is 80% accuracy on MMLU 5-shot. The highest accuracy is 81-point-something percent, so it's actually only a 1% difference, and the one on the left is a 1-bit quant. It's very small; it's tiny in comparison to the full precision like FP8. So essentially, you can make the model eight times smaller and you only decrease accuracy by 1%. That's very interesting.
Essentially, we showcase that you can actually quantize the Mixture of Experts (MoE) layers very heavily, but you must leave the attention layers, the shared experts, and other layers in higher precision. That's what we call the dynamic quantization methodology.
If you see, there was a benchmark of Llama-3-70B-Instruct, for example. If you use a 2-bit quant, it actually gets higher accuracy than other full precision providers, which is very interesting. For example, the 2-bit quant gets 73% accuracy, and other inference providers get 65% or 67% accuracy. There is a very large difference. Okay, there might be some bugs in their models, or maybe they quantized it incorrectly, but the goal is to show that if you quantize a model down to very low bits, it still works.
We showcase this with an example. For example, if you take a vision model like Qwen2-VL-2B: if you naively quantize all the layers to be 4-bit, you ask the model, "What does this image show?" It will say, "The image depicts a vibrant and colorful scene of a coastal area," which is totally wrong. The answer should be, "The image shows a train traveling on tracks," or something like that. If you quantize everything to 4-bit, it's 1.36 GB, but it's definitely bad.
So, the trick is you must leave some layers in higher precision, and you only need to increase it by 500 MB or so to 1.8-bit, and it works. "The image shows a train traveling on tracks." It suddenly works.
But the question is: which layers do you not quantize? That's the question, right? Which layers? You could do an exhaustive search. You can check, "Oh, let's not quantize layer 0. Let's check layer 1, layer 2," checking every single one, but it'll take forever. So definitely don't do that. You have 70 choose 1, or 70 choose 1 plus 70 choose 2—horrible, right? And remember, all of AI has a bad efficiency, so don't do that.
The trick is you can check the activation quantization error and the weight quantization error, and you will see these large outliers. For example, for Qwen, if you quantize the first few layers, it's extremely bad. So you must leave the first few layers unquantized. And also, this gigantic jump for the weight quantization error: this means you probably shouldn't quantize that layer as well.
There are some other plots that we show. For example, for Llama 3.2, it's interesting; all of these graphs are very different for each model. You will notice Llama 3.2 has these weird, continuous spikes. It's because they use attention and then they put the attention back to the vision module, I think every three layers. So every three layers, it has these big jumps. This means you should not quantize those layers.
Pixtral, for example, is also a different graph again. Pixtral seems like you can't quantize many layers, unfortunately. So the whole vision module must not be quantized.
There is a very important paper talking about why, which layers you should quantize and which layers you should not quantize. It's called the "Super Weights" paper. You guys definitely should read that. Essentially, it says that in all language models, in the first few layers of the down-projection, there is a very, very, very important number—one of the numbers in the model is very, very important, and you should never quantize it, ever.
But the interesting finding is it's not actually a very large number. There is a trend in the language model space where people think that you should not quantize outliers. The problem is, these models have these big outliers—suddenly in the model there's a big number like 3,000, and if you quantize it, it essentially ruins the model.
But actually, this paper shows that it's not actually the outliers that are the problem. These numbers could be very small, and if you look at the plots, if you select these numbers and make them zero, the accuracy decreases dramatically. And so, if you see—for example, if you remove one of the numbers, the activation value totally decreases very badly. They have very large activation values, and if you remove them, it's very, very bad.
There is another trick that you can do. If you have a model that has 7 billion parameters, make every single number go to zero. The first parameter, make it go to zero, check accuracy. The second number, make it go to zero, check accuracy. The third number, go to zero, check accuracy. You can do this 7 billion times and you can see which number is the most important. You could do that as well, but remember, AI efficiency-wise, not a good idea.
More recent research: for example, the new Blackwell chips. Instead of doing quantization to like 1-bit, 2-bit, 3-bit, or 4-bit, Nvidia chips also have this new architecture, this new format called FP4 or MXFP4. Essentially, this is float 4, and float 4 is most likely going to be used a lot in the future. There are these new formats for quantization as well, which essentially allow you to train models in very low precision.
I also made this plot going from FP32. The question people always ask is: why are GPUs getting faster and faster and faster? My take is actually that this year is probably the last year you're going to get GPUs that are actually faster. There are no more faster GPUs. Why? Because the majority of GPU speed-up is because of numerical precision. From FP32 to FP16, you get five times faster.
Why is it five times faster? Because when you use transistors, the calculation complexity is the exponent plus the mantissa squared, right? In FP32, you have to use 23 numbers for the mantissa, and $23^2$ is very large. In FP16, you reduce the mantissa to 10, and that is why you get a five-times speedup from FP32 to FP16. It's because the number itself is getting smaller. The representation inside the models for each of the weights is getting smaller.
And then we moved from FP16 to BF16. It is again maybe around two times faster than FP16.
We then have FP8. FP8 is even faster and uses even less space. But then there is a problem. We get to FP4, and it's around two times faster than FP8. The problem is, what's next? Do we go to float 2, float 3, float 1? You can't push any more in terms of numerical precision. There is not much more to go in terms of that space.
So you can only get maybe 180 times faster than FP32, maybe 200 times faster, but essentially, my take is FP4 might be the final precision that gets faster—the final numerical precision. In the future, GPUs are not going to get faster. So if people want to buy Blackwell GPUs, you should probably buy them. It's most likely not going to get faster anymore. That's kind of my take.
And also—okay, I was going to talk about kernels and stuff, but I don't think I have enough time. You must use `torch.compile`. Every single function that you see, wrap it in `torch.compile`. Try it out. I always tell the PyTorch team, please make it the default. Definitely use `torch.compile`.
Why? Because it makes your training faster—sometimes, only sometimes, not all the time. It reduces memory usage most of the time. If you see bugs, they'll probably fix it.
But remember, `torch.compile` is not as easy as you think. You don't just do `torch.compile(model)`. There are actually many options you can tune. I just listed a few options, and this is literally just a few. There are like 10 more pages of options, I'm being serious—10 more pages you can tune. Imagine if you can use `torch.compile` and tune every single one. That's why I highly suggest people use `torch.compile` more effectively. It's probably the biggest thing that can change your entire training run, make it more memory-efficient, and make it faster. So definitely look through this.
Okay, so in general, yes, thank you. Definitely star us on GitHub, join our Discord if you have any questions on RL and stuff. We have a website as well. And finally, we have stickers. Yes, there are some limited-edition stickers as well that we have somewhere, I think over there. And remember, if you have any questions, I'm still going to stay around, so feel free to ask. Yeah, thanks a lot.