Stanford CS25: Transformers United V6 | From Next-Token Prediction to Next-Generation Intelligence
Watch the original video here: https://www.youtube.com/watch?v=e_H_tkpCAK4&list=PLoROMvodv4rNiJRchCzutFw5ItR_Z27CM&index=46
I'd like to introduce our speaker for today. Shrimai is an AI scientist at Mistral AI, as well as an adjunct professor at Boston University, and her research focuses on advancing large language models, or LLMs, particularly improving their reasoning capabilities. Prior to this, she was a lead contributor to Nvidia's NeMoTron family of models, working on data curation, pre-training, and scaling.
Her work emphasizes optimizing pre-training pipelines, including data selection, blending, and ordering strategies to maximize downstream model performance. So without further ado, I'll hand it off to Shrimai.
Hello everyone. Thank you for that introduction. I joined Mistral this Monday, but today I'll be talking about a lot of the work I did during my time at Nvidia. If any of you are interested in learning about Mistral AI or opportunities at Mistral AI, please feel free to email me, and I'm also here after the class to answer any questions. So let's get started.
Today I'll be talking about going from next-token prediction to the next generation of intelligence, and about the future of pre-training in general. Before I get started: this, in my view, is the recipe to build a SOTA LLM, and it has four key components. You need smart data: a lot of high-quality data that is diverse, and you need to be able to do good filtering, deduplication, and so on.
You need smart architecture: as you know, architectures are evolving, and beyond transformers we now have Mamba-2 hybrid architectures and so on. You need smart algorithms: how can you build advanced training recipes? And you obviously need smart collaboration between pre-training and post-training teams, research and engineering, and so on.
So what work have we done on building smart data for LLMs? I just want to give a brief overview. During my time at Nvidia, I was able to contribute to MIND, which is synthetic dialogues that are all related to math, and NeMoTron-CC-Math, which is math content parsed from Common Crawl documents.
And all of this data, including the NeMoTron Nano-2 dataset, is available open source on Hugging Face. We also developed Prismatic Synthesis, which encourages diversity in synthetic data generation, and NeMoTron-CrossThink, which is a dataset geared more toward reasoning beyond math and code. This NeMoTron-CrossThink dataset became one of the most trending Nvidia datasets on Hugging Face the week it was released.
But today I'll be talking about building smart algorithms for LLMs. In particular, I want to talk about three of our works: maximizing your data's potential, front-loading reasoning, and RLP, which is using reinforcement as a pre-training objective.
So, before I get started, I want to give you like an overview of what I'm going to be talking about and what are the key components of what I'm going to be talking about. Um so, let's say there are these four kids: Pascal, Volta, Ampere, and Hopper. They all have access to the exact same data. The exact amount of data, quality of data, everything is the same.
But what is different for these four kids is how they learn from this data, and there are three key components there. One is curriculum, which is the order in which you see the data. Second is front-loading reasoning. What I mean by that is that all four of them have access to high-quality reasoning data, but how do they use it? Do they use it early on, or do they wait until later in life to use it?
I like to give an analogy here with the American education system, where there is this concept of taking AP classes, which are college-level classes during your schooling years. The idea is that if you take those classes during school, you'll not only do well in school, but you'll also do well in college. So that basically captures the idea behind front-loading reasoning.
Um and then finally, learning through thinking, not just observing. So, these are the three key strategies, and let's see how these four kids learn using these three strategies. So, Pascal, unfortunately, does not follow a curriculum. The data that's available to him, he is just reading it randomly. He may read one text from history, one very advanced math document, and then a very basic math document. He does not utilize the reasoning high-quality data that is available to him. He just never reads it. And finally, he does not learn through thinking.
Volta, on the other hand, follows a curriculum, that is, she reads the documents in a certain order and finds that beneficial, but she decides not to read the high-quality documents available to her. And she also doesn't learn through thinking.
Ampere follows a curriculum and uses the reasoning documents that are available to him, but he also doesn't learn through thinking. And finally, Hopper learns through a curriculum. She also makes good use of the high-quality documents that she has and she also learns through thinking and engaging with the material. She's not just reading the material.
So, this is what we will go over throughout the whole talk. I'm going to be showing comparisons between Pascal, Volta, Ampere, and Hopper.
So first, let's cover maximizing your data's potential and enhancing LLM accuracy with two-phase pre-training. As you all know, LLMs today are trained on trillions and trillions of tokens, and these come from a vast range of diverse sources: legal documents, books, papers, web crawl, which is everything and anything, and math documents.
This is an analysis done by Epoch AI that shows how much data is consumed by LLMs today, and it projects that somewhere around 2030, LLMs will have consumed more than 95% of human-generated data. The blue curve is what I want you to follow: around 2021, GPT-3 was trained on hundreds of billions of tokens, but Llama 3, in late 2024, was trained on tens of trillions of tokens, and that's where we are today. Soon we will consume all of the human-generated data.
Now that you have all these different data sources, there are two important questions to answer. First, how do you weigh these different data sources? That is, how do you decide which document is higher quality and which is not? And second, how do you order these different data sources? Again, like a curriculum: how do you know which documents to read first and which to read later on?
To create an optimal blend, or data mixture, we follow this simple pipeline. You have all this data and you do quality estimation: you build quality classifiers and try to estimate the quality of each of your data sources. The idea, when you're creating your optimal data mixture, is that data sources of the same quality should be weighed similarly, and higher-quality datasets must be weighed higher than medium- and low-quality data sources.
Next, you do epoch estimation, which is basically estimating how many repeats of a data source you want to see. To get the most juice out of your high-quality data sources especially, you want to estimate the maximum number of repeats of a particular data source you can see before it starts giving diminishing gains on your downstream tasks. So you estimate the quality, you estimate the number of repeats, and based on this, you create your optimal blend, or optimal data mixture.
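To make that concrete, here is a toy sketch of how quality tiers and epoch caps could turn into blend weights. All source names, token counts, and multipliers below are hypothetical illustrations, not the actual NeMoTron recipe.

```python
# Hypothetical sketch of deriving blend weights from quality tiers and epoch caps;
# the source names, token counts, and multipliers are made up for illustration.
sources = {
    # name: (tokens_available_in_billions, quality_tier, max_useful_epochs)
    "web_crawl_low":  (2000, "low",    1),
    "web_crawl_high": (500,  "medium", 2),
    "wikipedia":      (5,    "high",   4),
    "math":           (30,   "high",   4),
    "code":           (200,  "high",   2),
}
tier_multiplier = {"low": 0.5, "medium": 1.0, "high": 2.0}  # upweight higher quality

def blend_weights(sources):
    """Give each source a share proportional to its quality, capped by its repeat budget."""
    raw = {}
    for name, (tokens, tier, max_epochs) in sources.items():
        desired = tier_multiplier[tier] * tokens       # quality-scaled demand
        raw[name] = min(desired, tokens * max_epochs)  # never exceed allowed repeats
    total = sum(raw.values())
    return {name: round(share / total, 4) for name, share in raw.items()}

print(blend_weights(sources))
```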
Then, for the curriculum, we follow a two-phase pre-training approach. In phase one, the idea is to encourage diversity in data: you want to expose the model to as much data, and as diverse data, as you can. Here the data mixture mostly consists of web crawl, including both medium- and low-quality crawl, and you show a lower number of epochs of the high-quality data sources at this time.
Then in phase two, the emphasis is only on high-quality data, that is, on a higher number of epochs of high-quality data sources like math, Wikipedia, code, and so on. So this is the two-phase approach, and this is what we follow in our pre-training.
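A similarly hypothetical sketch of the two-phase ordering, reusing blend weights like the ones above; the boost factor and the weights are illustrative choices, not the published recipe.

```python
# Hypothetical two-phase curriculum: phase 1 stays diverse and crawl-heavy,
# phase 2 upweights high-quality sources. All numbers are illustrative only.
def normalize(weights):
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()}

def two_phase_mixture(base_weights, quality, hq_boost=3.0):
    phase1 = normalize(base_weights)  # close to the base blend: diversity first
    phase2 = normalize({
        name: w * (hq_boost if quality[name] == "high" else 1.0 / hq_boost)
        for name, w in base_weights.items()
    })
    return phase1, phase2

quality = {"web_crawl_low": "low", "web_crawl_high": "medium",
           "wikipedia": "high", "math": "high", "code": "high"}
base = {"web_crawl_low": 0.45, "web_crawl_high": 0.25,
        "wikipedia": 0.05, "math": 0.10, "code": 0.15}
phase1, phase2 = two_phase_mixture(base, quality)
print("phase 1:", phase1)
print("phase 2 (final portion of training):", phase2)
```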
And what are the baselines for this sort of approach? The first is the natural distribution. You have three axes: quality, epochs, and order of the data. In the natural distribution, you don't care what the quality of the data is or how many repeats of a particular dataset you should see. The way you create the blend is that you just draw samples from each data source in proportion to how many tokens it contains. So if you have a data source that is low in quality but very large in number of tokens, you will see more of that data source simply because it has more tokens available. And in this baseline, you also don't impose any ordering on the data.
The second baseline is where you create an optimal blend. Now you've figured out which data sources are higher quality and which are low quality, and you've created a nice data mixture using this information about quality and about how many repeats you should see, but you're not showing it to the model in any particular order during pre-training. You're just showing it in a random order.
And finally, our two-phase approach, where you create the optimal blend but also make sure you show it in a particular order, with more of the high-quality data sources in phase two. So you take care of quality, epochs, and order in this.
And so this is the difference between Pascal and Volta that I was talking about earlier: Pascal neither estimated quality nor epochs, and he didn't follow any particular order while learning, whereas Volta follows the two-phase approach and creates an optimal data mixture from the data available to her.
If you do a head-to-head comparison, Volta is on average 17% better than Pascal, and 3.4% better than using the optimal data mixture with a random ordering.
The next part of the talk is front-loading reasoning, and here we talk about the synergy between pre-training and post-training data. The idea, or philosophy, behind this is that today's LLMs follow this particular pipeline: pre-training, where you learn world knowledge and facts; supervised fine-tuning (SFT), where you learn to mimic reasoning-like answers, or basically learn reasoning formatting; and RL, where you finally, actually learn reasoning. In our opinion, doing only general knowledge during pre-training and adding reasoning skill as a post-hoc thing in post-training creates an unreasoning, or weak, foundation.
So what we propose is to front-load reasoning: in pre-training you not only learn general knowledge, but you also learn the reasoning skill itself, and then in post-training you can amplify and refine that skill. We believe this leads to a stronger foundation for reasoning models specifically.
So what is the study we've done on front-loading reasoning? We systematically inject reasoning-style data at different phases of training, like pre-training, SFT, and so on, and we study the impact of adding reasoning data at these different phases. This is the general pipeline: you have pre-training, SFT, and then RL, and now you have access to this reasoning data.
We categorize this reasoning data along three axes: diversity, quality, and quantity. In this study, we try to understand the impact of the diversity, quality, and quantity of the reasoning data added in pre-training versus in SFT. So in the next few slides, you have this reasoning data and there are two conditions: you either add it in pre-training or you don't.
If you don't add it in pre-training, I'm going to call it the no-reason base, which means that when you pre-trained your base model, it didn't see any reasoning data. The second condition is the reason base, where some amount of this reasoning data, of some quantity, quality, or diversity, is added in pre-training itself.
I just wanted to show this slide on what evaluation metrics we use to evaluate pre-trained and post-trained models. The reason I'm showing you this is that there are different benchmarks used to evaluate base models, that is, only pre-trained models, and different benchmarks used to evaluate models that have undergone SFT, RL, and so on. So when you see the numbers in the next few slides, they may be different for pre-training and post-training.
So, what did we learn from this study? The first lesson is that including reasoning data in pre-training is beneficial. This is the comparison between Volta and Ampere right after pre-training finishes. Ampere, the reason base, saw this reasoning data during pre-training; Volta did not. And Ampere is doing 16% better on average than Volta right after the pre-training phase. That shows us that adding reasoning data in pre-training leads to improvement in downstream accuracy.
But it's not enough to just see gains after pre-training. What happens after SFT and after post-training? Do these gains just get washed away because you used reasoning data in pre-training and then also used reasoning data in your SFT, and so on? This is the comparison for both cases, no-reason base and reason base, after doing SFT. And after SFT as well, Ampere is still doing 9.3% better than Volta. This tells us that the advantage of adding reasoning data to pre-training doesn't get washed away after SFT; it actually grows.
For the third lesson, I want you to come back to the three axes for categorizing reasoning data. Based on these, we have three datasets. One is SHQ, which is small in quantity and diversity but high in quality. The second is LDQ, which is large in diversity and quantity but lower in quality. And LMQ is the concatenation of SHQ and LDQ: you just combine them.
What we observe is that if you look at the bar charts on the left side, LMQ and LDQ perform the same on average. That is, adding the high-quality SHQ data to LDQ to form LMQ looked as if there was no benefit from this high-quality data right after pre-training.
But if we post-train the same models, we see that LMQ has a 4.25% boost compared to LDQ. What this tells us is that high-quality data in pre-training can unlock hidden gains after SFT, and it doesn't necessarily lead to overfitting.
The fourth lesson is about how early reasoning builds a very strong foundation. One can argue, "Hey, you have this no-reason base, but what if you do twice the amount of SFT? Would that catch up? Or what if you use more data during SFT? Would you catch up then?" And so on. So there are two cases we discuss.
In the first, you use more compute during SFT: you use 2x the number of epochs of SFT for the no-reason base, while for the reason base you do just one epoch of SFT. And the reason base still shows 3% gains even after the no-reason base gets 2x the number of SFT epochs. So this is using more compute in SFT.
The second case is where your total data budget is fixed. You say: I have this fixed number of documents; should I use all of them in pre-training, all of them in SFT, or should I divide them? The graph on the right shows this data-matched experiment: you keep all of your reasoning data for SFT, versus you split it and use some of it during pre-training and some during post-training, with the highest-quality data used in the SFT phase.
Even when you do that, the reason base performs 12% better on average than the no-reason base. What this tells us is that pre-training without this reasoning data cannot catch up by getting more SFT compute, and even when you have a fixed reasoning data budget, it's better to always use some of that data in pre-training.
The final lesson here is that front-loading reasoning creates a durable advantage. This is the final comparison between Ampere and Volta after going through all the stages: pre-training, SFT, and RLHF in this case. After all of these stages, Ampere is still 19% better on average than Volta. Interestingly, the gain actually balloons on very complex math benchmarks like AIME, where Ampere is 39% better than Volta.
If we don't have any questions on this part, then I'll move to the last part of the talk, which is RLP: using reinforcement as a pre-training objective.
Yeah. So, when you say small high-quality data and when you say medium high-quality data, how do you actually differentiate between them in practice, within this framework?
How do we categorize quality? We have used all open-source datasets. There's OpenThoughts, and there is the NeMoTron SFT dataset that has been released. The NeMoTron SFT dataset has a lot of datasets mixed into it, so it's diverse, and it is not very heavily filtered like OpenThoughts. That's why it is lower in quality in some sense, and it has diversity because it has many domains; it's not just math and not just code and so on. And yeah, it didn't go through heavy filtering, so it's lower in quality compared to, at least, OpenThoughts.
It also has far more tokens and it's larger. So that's how we categorized it. But you can also do filtering in terms of complexity, like which questions are more difficult, based either on length or on difficulty labels if we already have them. So you can use these different metrics to define quality. Thank you.
Same question for classifying data as reasoning data. Is there any methodology for deciding what counts as reasoning data? That might be orthogonal to quality, or probably not completely orthogonal. And is the notion of reasoning in the experiments you've done domain-independent? Legal reasoning, the way a lawyer reasons, might be slightly different from how a mathematician or an engineer approaches a problem. What is the difference in the characteristics?
Right. For reasoning data, we use whatever definition the community is using, and as I said, we have used datasets that are already released in the community as reasoning datasets. But from my understanding, a lot of that data typically looks like this: you have questions, then long reasoning traces, and then a final solution. Math olympiad data is a very good example; a lot of it looks something like that.
And for domain, yeah, for the NeMoTron SFT data we didn't do any domain filtering, but I don't think we necessarily have legal domains and things like that. A lot of it is STEM focused: math, code, and other STEM areas that are relevant to MMLU, which is a benchmark, and so on. So that's the composition of the data.
Is the data ordering automated, like automated data mixture or recipe optimization, or are these recipes sort of handcrafted?
It's a mixture of both. All of the quality estimation and epoch estimation is automated, and then we build a data mixture based on those. So that part is not completely automated, I must say.
So, by like epoch estimation, do you mean like how much of the total token budget for each data set, or do you mean like—
How many repeats of that data source do you want to see before you get diminishing returns? There are datasets where if you see more than just two repeats, it will give diminishing returns, whereas for some datasets you can do four repeats or six repeats, and so on.
Oh, that sounds amazing. How do you estimate that?
So, you can design an ablation study where you have a base data mixture where you're just using one repeat of that data. And then, you keep increasing it like to two repeats, four repeats, and so on. And as you increase it, the weight that that data source gets, that increases. But then, you have to take away that much weight from some other data source, which is typically a low-quality crawl or something like that. So, you'll take weight away from that and give it to this data source and try to see if you do two repeats of this, then what's the impact? Four repeats of this, then what's the impact? And so on.
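To make that ablation concrete, here is a toy sketch of how each blend in such a sweep might be constructed; the source names and weights are hypothetical, not the actual study. Each blend then trains an otherwise identical model, and the downstream accuracy shows where extra repeats stop paying off.

```python
# Toy sketch of the epoch-ablation described above; names and numbers are hypothetical.
def ablation_blend(base_weights, source, weight_per_epoch, epochs, donor="web_crawl_low"):
    """Repeat `source` more times, moving the extra weight away from the donor crawl."""
    blend = dict(base_weights)
    extra = weight_per_epoch * (epochs - 1)  # additional share contributed by repeats
    blend[source] += extra
    blend[donor] -= extra                    # keep the total sampling weight fixed
    assert blend[donor] > 0, "donor source exhausted"
    return blend

base = {"web_crawl_low": 0.60, "wikipedia": 0.20, "code": 0.15, "math": 0.05}
for epochs in (1, 2, 4):
    print(f"{epochs} epoch(s) of math:", ablation_blend(base, "math", 0.05, epochs))
```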
I see. And do you do this for each domain in your corpus?
Yeah, we try to do it for domain categories rather than individual data sources. So, for example, for math, you'll just do it for all of math, even though there are many data sources within math.
I see. I see. Thank you so much.
Dr. Shrimai, have you tried the opposite approach of quality first and diversity afterwards, to see if— because you said first we go for—
You sort of— Yeah.
Like Markov chain Monte Carlo, you're doing an exploration and exploitation kind of thing. And you got better results by doing diversity first and quality afterwards?
Yeah, so in the two-phase approach, we first wanted to do diversity and then quality, and we did have an ablation where we swapped that, and it didn't give as good results. Yeah.
So it's the opposite of human beings, where if you train a child well first, after that they can deal with rogue behaviors. I guess exploration first seems to be doing the better job here. Yeah.
Yeah.
Awesome, then we can move to the last part, which is using reinforcement as a pre-training objective. Before I get started, I wanted to talk about the motivation for this particular approach, and I'd like to tell the tale of two learners.
One is Leo, who learned by doing things, and the second is Bolt, who learned by observing things. Now, one day there was a task to build a bridge, and you need to make sure that a toy car is able to pass across the bridge. Leo quickly grabs a few stacks of blocks, sort of puts them together, and creates this very simple bridge, which is not at all fancy, but it works, because the car is actually able to pass.
Bolt, on the other hand, analyzes every bridge that has ever been made and builds a magnificent suspension bridge with a very fancy design, flawless precision, and so on. But it was so tiny that the toy car could not pass through it. So the lesson here is about learning by doing and not just watching. Today's models learn by watching text, that is, just predicting the next token, whereas the motivation behind RLP is to teach models to reason through their own thoughts and not just by observing text.
So, the problem with standard pre-training. I've shown this pipeline earlier: you have pre-training, where you're gathering world knowledge, and SFT, where you're mimicking the reasoning format. You're essentially doing imitation learning: if you predict the next token, you get some kind of signal, not an RL reward, but a yes or a no, and you're doing this pattern-matching thing. Only later, during the RLHF and RLVR phases, are you actually doing reasoning and giving the model the ability to do any kind of exploration, and that comes much later.
So reasoning is used as an afterthought and added as a post-hoc thing. The questions we want to answer are: can reasoning be baked in earlier, during pre-training itself? And then, of course, do these gains actually last, or are they washed away after you do the post-training?
This is also very important because, as I showed earlier, models today are consuming trillions and trillions of tokens, and we have sort of hit a wall: we are already using close to the most tokens we can. So now the question is how to use them effectively and efficiently, and RLP is one of the solutions for that as well, where maybe you need to develop data-efficient algorithms.
So what is the difference between vanilla pre-training and RLP pre-training? You have this context about photosynthesis, "the process plants, algae, and some bacteria use to make their own food using...", and the question is, what's the next token? In standard pre-training you would just predict the next token given this context and say it's "sunlight", and that's the vanilla pre-training way.
But in RLP you give the model an opportunity to think, an opportunity to explore, and it reasons that photosynthesis actually relies on solar energy, so the next token must be "sunlight". Here you're conditioning the probability of the next token not only on the context but also on the thought that was generated by the model itself, and this is reasoning-driven prediction. This is RLP training. So the key difference is that RLP produces an explicit reasoning trace before predicting the next token, and this makes the "why" of it visible and trainable, not just the final answer.
This is a brief overview of RLP, and I'm going to go through each of the components step by step. But before that, I wanted to show you the prompt we used to generate the thoughts, where we ask the model to focus specifically on next steps rather than jumping to the final boxed answer, and we also ask the model not to restate the question or add metadata commentary and so on.
First, what is our thought policy? What do I mean by that? You have the prompt I showed on the previous slide and the input, which is the context, and this goes into the thought policy. The thought policy then does rollouts: it gets the opportunity to explore different thoughts before it predicts the next token. So the thought policy does both: it generates a thought and then it generates the next token.
Now, the same input context, without the prompt, is passed to the no-think baseline. This is our standard next-token prediction model: given this input, it calculates a probability distribution over the vocabulary and predicts the next token, "sunlight". So from the thought policy we get $P_\theta(x_t \mid c, z)$, the probability of the next token $x_t$ given the context $c$ as well as the reasoning trace, or thought tokens, $z$ generated by the model.
And from the no-think baseline we get $P_\phi(x_t \mid c)$, the probability of the next token given only the context, which is the standard setup.
Based on these two, $P_\theta$ and $P_\phi$, we calculate an information-gain-based reward, $r = \log P_\theta(x_t \mid c, z) - \log P_\phi(x_t \mid c)$. What's interesting is that this term is positive only when the thoughts actually, meaningfully contribute to improving the next-token prediction. If the thought was garbage or didn't contribute, your reward can be zero or even negative.
And as you can see, because this is not your standard RLVR kind of framing, where you get a zero or one at the end, this reward is dense: it can take any value. It can also be applied at every position in the document without needing any external selection process. So based on this information gain, you get rewards for each of your rollouts, and as I mentioned, they are not binary, they are dense.
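As a minimal sketch of that reward, assuming Hugging Face-style causal language models for the thought policy and the no-think baseline (the function and interface below are illustrative, not the released RLP code):

```python
# Minimal sketch of the information-gain reward; names and interface are illustrative.
import torch
import torch.nn.functional as F

def info_gain_reward(policy, baseline, context_ids, thought_ids, target_id):
    """r = log P_theta(target | context, thought) - log P_phi(target | context).
    Dense, applicable at any position, and positive only when the thought helps."""
    with torch.no_grad():
        # Thought policy conditions on the context plus its own sampled thought.
        theta_input = torch.cat([context_ids, thought_ids], dim=-1)
        theta_logits = policy(input_ids=theta_input).logits[:, -1, :]
        log_p_theta = F.log_softmax(theta_logits, dim=-1)[0, target_id]

        # No-think baseline conditions on the context alone (plain next-token setup).
        phi_logits = baseline(input_ids=context_ids).logits[:, -1, :]
        log_p_phi = F.log_softmax(phi_logits, dim=-1)[0, target_id]

    return (log_p_theta - log_p_phi).item()
```

The reward itself is computed without gradients here; during training, the learning signal would reach the thought tokens through a policy-gradient update over the rollouts, in the GRPO style discussed later in the Q&A.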
Finally, the last component of this pipeline is the exponential moving average. What is this no-think baseline? The no-think baseline is the same as the thought policy, but updated with a delay via an exponential moving average. The reason we choose this is that we want the no-think baseline to be current enough to provide informative comparisons during training, but also intentionally lagged so that we can mitigate reward hacking.
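The lagged baseline itself can be sketched as a simple exponential moving average of the policy's weights; the decay rate below is an arbitrary illustrative choice, not the paper's setting.

```python
# Sketch of the lagged no-think baseline update; the decay value is illustrative.
import torch

@torch.no_grad()
def ema_update(baseline, policy, decay=0.999):
    """Drag the baseline's parameters slowly toward the current thought policy:
    current enough for informative comparisons, lagged enough to resist reward hacking."""
    for p_base, p_policy in zip(baseline.parameters(), policy.parameters()):
        p_base.mul_(decay).add_(p_policy, alpha=1.0 - decay)
```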
So that's the whole pipeline of RLP, and now to our experiments. The question is: can you improve the reasoning ability of a base model without any task-specific tuning? Here we use the Qwen 1.7B base model, the final checkpoint, we do RLP training for 1 billion tokens, and we use general pre-training corpora. We don't use any reasoning-specific data here; you can use your web crawl, books, papers, whatever you want.
We show comparisons with the baseline, of course, and also comparisons after doing post-training, which is SFT and RLHF. And we show one additional baseline: if you use 1 billion tokens to do RLP, what if you instead use the same 1 billion tokens with the next-token prediction loss, that is, you just continue pre-training the base model? We show comparisons with that as well.
Here are the results. RLP does show significant improvement on the Qwen 1.7B base model, and this is the final comparison between Ampere and Hopper. In the base model comparison, which is the bar chart on the left, RLP outperforms the base model by 19%, and if you use those same tokens for next-token prediction, RLP still does better than that by 17% on average.
And when you do identical post-training, that is, SFT and RL, the gains from RLP compound, and Hopper does 8% better relative to Ampere. So the lesson here is that RLP is able to establish robust reasoning foundations during pre-training that are not washed away by downstream alignment with SFT and RL.
In the previous slide I showed a token-matched comparison: if you do 1 billion tokens of RLP versus 1 billion tokens of next-token prediction, you see improvements with RLP. But the question is, would these gains sustain with a compute-equivalent baseline? What I mean is that RL typically takes more compute than next-token prediction, because you need to do n rollouts and generate completions for the thoughts, and so on.
So we calculate the number of FLOPs needed using the equation shown here, and if I train RLP on 170 million tokens, the flop-matched equivalent is 6 billion tokens: I use 6 billion tokens for the next-token prediction loss, but only 170 million tokens for RLP.
So that's the comparison here, and here are the results. TM is token-matched, that is, continued pre-training with the next-token prediction loss, and FM is flop-matched. Even in the flop-matched case, RLP outperforms next-token prediction by 14% on average, even though next-token prediction is exposed to 35x more data.
And so, those were the results if you took the final pre-trained checkpoint and then did RLP. Now, the question is, what if you take an intermediate checkpoint of your pre-training run and use limited data for RLP, but compare it with the final pre-trained checkpoint? Then, would these gains sustain and how would that work?
We have intermediate checkpoints for the NeMoTron Nano 12B V2 model, so we take a checkpoint that is trained up to 19.8 trillion tokens and we apply RLP with only 250 million tokens. Again, we use general pre-training corpora here. We then compare it with the base model that was trained up to 20 trillion tokens. So RLP in this case sees almost 200 billion fewer tokens.
Here are the results. RLP is scaling with LLM size in this case, because this model is much larger than the Qwen model from the results I showed earlier. We have also scaled across architectures, because Nano V2 is a hybrid Mamba-2-based model, whereas Qwen was a transformer model.
So again, the base has seen 20 trillion tokens; base plus RLP has only seen 19.8 trillion tokens plus 250 million tokens. In spite of being trained on roughly 200 billion fewer tokens, Hopper is still able to get a 35% average gain over Ampere, and you see the largest boost in domains like science. After identical post-training, RLP still outperforms the base by a 3% absolute margin. So the benefits of RLP persist, and maybe even amplify, when scaling to larger models and across architectures.
Here is a little background on the relevant literature in this space, the rise of early reasoning: how early can we introduce explicit reasoning the way we have in RLP? There were at least three papers in this space at the time our paper came out: Quiet-STaR, Reinforcement Pre-Training (RPT), and Reinforcement Learning on Pre-Training Data (RLPT). And our method is called RLP.
Here is a qualitative comparison between our technique, RLP, and RPT and RLPT. For the source of rewards: in next-token prediction you have no reward, whereas in RPT and RLPT you have an external verifier, meaning you need a different model to calculate the rewards, whereas RLP is verifier-free with an intrinsic reward. For the granularity of the reward: it's sparse and binary for RPT and RLPT, because they use RLVR-style rewards, whereas for RLP, as I showed, it's a dense reward. And finally, reasoning emergence is explicit but weak for RPT and RLPT, whereas it's explicit and strong for RLP.
And here is a quantitative comparison with the RPT technique. We used a setting similar to the RPT paper: the Qwen 1.7B base model and the OmniMath dataset, and we train with both the RPT technique and the RLP technique on 170 million tokens. As you can see, RLP is on average 4% better than the RPT technique.
The reason behind this is that RPT uses an external filter: a completely different model first goes through the pre-training data and selects the tokens to which the RPT reward can be applied. We do no such selection; you can apply the RLP reward to any token in your document. RPT also uses a sparse binary reward, just a yes or no, zero or one, reinforcing only selected tokens and ignoring the reasoning steps themselves. In RLP, we apply a dense per-token reward based on information gain, as I showed earlier, and that captures the full reasoning signal, because we take the reasoning trace into account when calculating the reward. That, we believe, is why we get better results.
The key ablations and insights for RLP: first, we enable the model to explore and think for itself during pre-training, so it gets an incentive, a reward, for exploring and thinking. This information-gain reward on intermediate thoughts outperforms simple next-token prediction. The reward in RLP is dense, not sparse, and it can be applied at any position, or at all positions, in your document. And there is token efficiency: RLP on 250 million tokens boosts NeMoTron 12B by 35%, and the comparison is with a 20-trillion-token baseline.
So the key takeaways: finally, what is the comparison between Pascal, Volta, Ampere, and Hopper? This is a very rough relative percentage improvement you will see between these four kids. As I said, they have the same data, but they use different learning strategies. And if you look here, for Hopper, who uses all three strategies (curriculum, front-loading reasoning, and learning through thinking), you can get as high as a 60% relative improvement over Pascal.
Finally, some tangible takeaways: the two-phase approach is very effective for pre-training, where phase one focuses on diversity and phase two focuses on high-quality data. Front-loading reasoning creates a durable and compounding advantage, and sometimes the gain is unlocked only later, after the SFT phase.
RLP, on the other hand, reframes RL for reasoning as a pre-training objective, and the goal there was bridging the gap between next-token prediction and the emergence of useful chain-of-thought reasoning. And it works even on unannotated text streams: in our paper we also show ablations on what happens if you use only web crawl data, only academic data, and so on.
And typically we do see a 7 to 9% gain no matter what kind of data you're using. So, even if you use unannotated text streams like web crawl, you can still teach reasoning-like behavior while strengthening the foundation. And the post-training can then build deeper expertise or amplify that reasoning skill.
And finally, RLP brings exploration and reasoning incentives into pre-training, allowing a model to think through problems while still building its knowledge and understanding of the world. In pre-training, you want the model not only to gather knowledge and an understanding of the world, but also to explore and learn to reason during pre-training itself. We believe this opens a new axis for scaling in terms of how models learn to reason.
So finally, I would like to leave you with these four key components of pre-training: smart data, smart architecture, smart algorithms, and smart collaborations. And I'm happy to take any more questions.
Do you have any questions?
Hi, I'm wondering about RLP. As you know, GRPO is a popular framework for post-training. You said RLP generates multiple different samples and then calculates information gain, which you could say is analogous to the advantage in GRPO. So I'm wondering what similarities and contrasts there are between RLP and the post-training GRPO technique.
Yeah, we actually did use the GRPO technique itself. In standard GRPO you get a reward that is zero or one, but here the reward is the information-gain-based reward, and then the rest of the advantage calculation is very similar to how you use GRPO in post-training, where you calculate advantages based on the probabilities of the reasoning tokens as well, and so on. All of that remains the same; the way you calculate the reward is the thing that has changed.
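A rough sketch of that connection, with the dense information-gain values standing in where a 0/1 verifier reward would normally go (hypothetical code, not the actual RLP training loop):

```python
# Hypothetical sketch of the GRPO-style group-relative advantage, applied to
# dense information-gain rewards instead of 0/1 verifier scores.
import statistics

def group_relative_advantages(rewards):
    """Normalize each rollout's reward against its group's mean and std."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Example: information-gain rewards for four sampled thoughts at the same position.
rewards = [0.8, -0.1, 0.3, 0.05]
print(group_relative_advantages(rewards))
# These advantages then weight the log-probabilities of the thought tokens in the
# policy-gradient update, exactly as in post-training GRPO.
```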
Just a quick question: have you tried doing RLP, or any of the other techniques like the first two you mentioned, on vision-language alignment?
No, not yet. Yeah.
Okay, sounds good. Thank you.
All right.
A quick question on RLHF, independent of the RLP work. Is RLHF still a preferred approach for LLMs? With RLHF there's some kind of bias toward making humans feel better, right? So is it still used for the latest models in general?
From my understanding, yes. RLHF data is still used and mixed into the RLVR stage and so on. Different model families use it in different ways, but I would think that it is still part of the RL stage, basically.
So is there a way we can reduce this bias in models trained with RLHF, so that they're not too nice to humans, and instead disclose information in a way that helps the user rather than just pleasing them?
The question is: can we reduce the hallucinations, or will RLHF increase hallucination? Oh—
What I mean is, in the future, when we have more models coming up, how can we reduce this bias that gets introduced by RLHF into the models? Is there a better way to eliminate it when interacting with users, instead of pleasing the users? Is there a way we can have better responses that still help users instead of just making them feel better?
I see. So, in the future, how do we mitigate these issues? I think that's ongoing work, from my understanding; I don't work in that space. People are trying to build classifiers for how helpful something is, also from the point of view of safety, and so on. I'm not following that work completely, but my understanding is that you need to keep updating those classifiers, because you can always get adversarial data and so on. And also, RLHF, as I said, is still a part of the post-training pipeline, but it has become a much smaller part.
Makes sense. Thank you.
Yeah, I think that has a lot to do with alignment research, which folks at Anthropic and so forth are focusing on. RLHF is definitely prone to things like reward hacking, but there's a lot of ongoing work on how to reduce that. I can't think of anything immediately off the top of my head; there are things like RLAIF, which uses AI instead of humans and might be less subjective and biased to some extent. I would check out some of the work around alignment and reducing reward hacking.
Hello, thank you for the talk. Could you speak a little more on how you quantify the quality of data? In, say, a mathematical context, that's quite clear, but you also spoke a lot about using web crawl or other more general bodies of data. How do you go about quantifying quality in that sort of context?
Right. So, there are many approaches here, the most popular being FineWeb-Edu-based classification. Hugging Face released this: they built classifiers to tell how educational a piece of content is. Given a web crawl document, it scores it on a scale of 1 to 5. Even at Nvidia, we have adapted that scoring. So that's one way.
There's also Essential-Web, which did more types of classification, not just educational quality or an educational score, but domains, whether a document belongs to science or math, and even within education, whether it's intermediate or college-level material, and so on. So it depends on what you want to extract, and you can build an LLM-based classifier on a rubric for that and classify your documents that way. But FineWeb-Edu is the most popular one, and it's at least the basic one that is needed in some sense.
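As an illustration of that kind of educational-quality scoring, here is a sketch assuming the publicly released FineWeb-Edu classifier on Hugging Face; the model ID and score handling below are my assumptions about the usual usage, not NVIDIA's internal pipeline.

```python
# Sketch of FineWeb-Edu-style quality scoring; model ID and thresholding are assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "HuggingFaceFW/fineweb-edu-classifier"  # assumed public checkpoint name
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)

def edu_score(text: str) -> float:
    """Return a rough educational-quality score for one document (higher = better)."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.squeeze().item()

doc = "Photosynthesis converts light energy into chemical energy in plants."
print(edu_score(doc))
# Documents above a chosen threshold go into the high-quality bucket of the blend;
# the rest stay in the medium/low-quality crawl pools.
```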
Any more in-person questions?
When doing RLP, what do you choose for your prefixes to start with? I'm assuming you do it on the highest-quality data, but do you choose specific tokens, or just uniformly across that dataset?
Right, that's a good question. Theoretically, RLP can be applied to any token in a document. In practice, when we run the experiments, we take a document and just randomly choose a token; we don't select it using fancy techniques based on entropy and so on. We randomly select a token, apply the RLP technique on it, backpropagate the reward, then throw away that document and get a new one.
So that's how it works. And also, to correct you, we are actually not doing RLP on the highest-quality data. As I've been saying, it's done on the pre-training data mixture, so there is a lot of web crawl and a lot of other stuff in it. It's not done only on reasoning data.
We have an online question that asks: how far back can we likely go in the RLP experiments and still beat the CPT baseline, given that the 19.8-trillion-token checkpoint is already at over a 1,000x data-to-parameter ratio?
That's a great question. In the paper we present results with the 19.8-trillion-token checkpoint, but we did do experiments at just 20% of pre-training. So if 20 trillion is 100% of pre-training, then we do RLP on the 4-trillion-token checkpoint, and we still see gains using RLP compared to the baseline in some sense.
Great. Any other questions?
All right. So, let's give another round of applause to our speaker.