
Stanford CS336 Lang. Modeling from Scratch | Spring 2025 | Lec. 3: Architectures, Hyperparameters

Stanford Online
For more information about Stanford's online Artificial Intelligence programs visit: https://stanford.io/ai To learn more about enrolling in this course visit: https://online.stanford.edu/courses/cs336-language-modeling-scratch To follow along with the course schedule and syllabus visit: https://stanford-cs336.github.io/spring2025/ Percy Liang Associate Professor of Computer Science Director of Center for Research on Foundation Models (CRFM) Tatsunori Hashimoto Assistant Professor of Computer Science View the entire course playlist: https://www.youtube.com/playlist?list=PLoROMvodv4rOY23Y0BoGoBGgQ1zmU_MT_
Hosts: Lecturer, Speaker 2
📅April 16, 2025
⏱️01:27:03
🌐English

Disclaimer: The transcript on this page is for the YouTube video titled "Stanford CS336 Lang. Modeling from Scratch | Spring 2025 | Lec. 3: Architectures, Hyperparameters" from "Stanford Online". All rights to the original content belong to their respective owners. This transcript is provided for educational, research, and informational purposes only. This website is not affiliated with or endorsed by the original content creators or platforms.

Watch the original video here: https://www.youtube.com/watch?v=ptFiH_bHnJw

00:00:05Lecturer

Um, as you may have noticed, I'm a little bit less innovative in my lecturing than Percy. So, you're going to get PowerPoint slides rather than executable Python ones, but you should be able to find the PDFs on the website as well. So I've titled this lecture "Everything you didn't want to know about LM architecture and training" because we're going to get into some of the nitty-gritty details that I think most other classes would spare you the details of, you know, like what should my hyperparameters be and those kinds of questions.

00:00:34Lecturer

Some minor logistics: also if you're doing the assignments, we are updating assignments as we find some, mostly minor bugs. Make sure you pull updates to the assignments as you go along.

00:00:46Lecturer

Okay. So, what we're going to do, we're going to start with a quick recap of a transformer. And I'll give you two variants of a standard transformer. One that's, you know, probably coming from the standard transformer, you know, lectures that you might see in 224N. And then I'll talk about what you implement, and kind of the modern consensus variant of a transformer.

00:01:09Lecturer

And then we're going to take a much more kind of data-driven perspective to understanding transformer architectures. So the question that we're going to ask is: people have trained lots of LLMs at this point, and you can go and read, you know, all of those papers and try to understand what has changed, what has been in common. And from that kind of almost evolutionary analysis, you know, try to understand what are the things that are really important to make transformers work. Right? So the theme of the class is that the best way to learn is hands-on experience, but the theme of this lecture, because we can't train all these transformers, is to learn from the experience of others.

00:01:43Lecturer

So the starting point is the original transformer, right? So just as a review, right? Hopefully you all remember this from 224N or your other NLP classes. You know, you've got sinusoidal position embeddings at the bottom. You've got multi-head attention, you've got layer norms afterwards, you've got a residual stream going upwards, you've got an MLP, and then a softmax at the very end. And we're going to see variants to all these different pieces until we get to basically the most modern variants of the transformer, and the latest one I'll talk about will be from just, you know, a few months ago.

00:02:17Lecturer

So what you implemented is not, you know, the vanilla transformer variant from the original paper. We've modified a few things. You know, we've put the layer norm in front of the block. So you can see on this slide over here that, you know, the norm is over here right before each of these blocks in the residual stream. We've asked you to implement rotary position embeddings. The feed forward layers use something called a SwiGLU. And then linear layers, you know, now omit these bias terms.

00:02:49Lecturer

And you might ask, why have you forced us to implement this weird variant of a transformer instead of the original 'Attention is All You Need' transformer? And so we're going to go through some of those questions. And then yesterday I was thinking, okay, I should catch up on all the developments that have happened in architectures over the last year. And Percy warned me about this because he said, "You're going to have to redo the lecture every year." And so I started looking and I was like, all right, yeah, there's a couple good papers recently. There's Command R, there's OLMo 2, there's, you know, SmolLM and Phi-4. And then you go looking and you're like, wow, yeah, there's Gemma 3 and Qwen 2.5 and InternLM and then there's, you know, more. I can't even sort of, you know, cover the screen with these guys, right?

00:03:28Lecturer

There are a lot of models. There were about 19 new dense model releases in the last year, many of them with minor architecture tweaks. And on the one hand, it's kind of annoying to go through all these papers and say like, you know, what is happening in all of these? But also it's like an actual wealth of information because not all of them do the same thing. And you can kind of see, you know, not all of you, especially in the back, can see the details of this slide. But I put together a little spreadsheet of, you know, what all these models are doing, starting with, you know, all the way from 2017, the original transformer, all the way to 2025, what the newest models are doing. And we'll talk about this as we go. But you kind of see sort of certain kinds of architecture changes sort of being explored. Like so here on this column is position embeddings. People used to do all sorts of stuff like absolute, relative, RoPE. There was a sort of ALiBi phase for some people, but then now starting around 2023, everyone just does RoPE. Right? So you can kind of see this convergent evolution almost of neural architectures, and we're going to talk about all of these different kinds of things.

00:04:29Lecturer

Right? So the parts that I'll cover, so this is a preview of the three major sections of this lecture. And if I have time, I'm also going to talk about different attention variants at the end. The first thing is going to be architecture variations. That's what I'm going to talk about. So activations, feed forwards, attention variants, position embeddings, all of those things.

00:04:48Lecturer

And then having nailed down the architecture, what do we have to do? Well, we have to pick hyperparameters, right? Like how big do we make the hidden dimension? How big do we make the sort of inner projection layer inside of an MLP? What do we do about the number of dimensions? How many vocab elements? Those are all sort of important things that you have to choose when you're actually training your language model. And you don't want to just sort of pick these out of a hat, right? You want to select them in some fairly intelligent way. So, we're going to start with architecture variations.

00:05:19Lecturer

And the two things that I'll, you know, mention right here, and I'll, you know, go back to them as I talk. The first one is, you know, there's not that much consensus in a lot of the choices. There's been sort of convergent, you know, evolution in the last few years. What I'll call like Llama-like architectures at the very bottom here, but people do all sorts of things. They swap between layer norm and RMS norm. They do serial versus parallel layers. There's one choice that basically everyone does since the very first GPT, and I'll talk about that in a bit. But there's, you know, lots of different variations that we can learn from here.

00:05:56Lecturer

The big one, I've already talked about this guy in 224N. So, if you remember that lecture, this will be review for you rather than being totally new. I think the one thing basically everyone agrees on and agreed on almost from the very start is the use of pre-norm versus post-norm. That terminology will get a little bit more confusing, but the original transformer paper did, you know, this thing on the left over here, where you had your residual stream in the gray, and you know, in addition to the residual stream, you had these layer norms after sort of every subcomponent. So you would do your multi-head attention, you would add back to the residual stream, and then you would layer norm that. And then you would do the same thing with your fully connected layer and then you would layer norm it.

00:06:37Lecturer

And very, very early on, people realized that moving this layer norm to the front of this sort of non-residual part, so this block on the right, did much better in many different ways. And basically almost all modern LMs that I know of use this kind of pre-norm. There have been some sort of new innovations recently that I'll touch on in two slides, but lots of, you know, models have moved to this. The one exception is OPT 350M, which I'm guessing, you know, they kind of messed that one up and that was sort of orphaned when they were training. That was a fun find in my survey of architectures.
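
To make the two layouts concrete, here is one sublayer written out as a schematic of what the figures show (shown for attention; the same pattern applies to the feed-forward sublayer):

```latex
\text{post-norm (original):}\quad h = \mathrm{LN}\big(x + \mathrm{Attn}(x)\big)
\qquad\quad
\text{pre-norm:}\quad h = x + \mathrm{Attn}\big(\mathrm{LN}(x)\big)
```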

00:07:16Lecturer

So this pre versus post-norm thing, if you look into why it was originally developed, the arguments were that, you know, if you wanted to use this post-norm stuff, it was much less stable. And so you would have to do some careful learning rate warmup style things to make it train in a stable way. And so if you look at some of the earlier papers, you know, arguing for this pre-norm approach, Nguyen and Salazar, and also this Xiong et al. 2020 paper, you almost always see sort of this comparison of, "Hey, if we use pre-norm and we do some other stability inducing tricks, then we can remove warmup and these systems work just as well, if not better, than sort of the, you know, the post-norm layer norm with careful warmup type approaches." And you see this in sort of a machine translation setting here. You see this as well on the right, on, you know, various other tasks, especially using BERT, which was trained with post-norm.

00:08:13Lecturer

So there were many arguments about why this was helpful. There were arguments about gradient attenuation across layers. Like if you do pre-norm, then the gradient sizes would remain constant, whereas if you did post-norm without warmup, then it would sort of blow up in this orange way. It's a reasonable argument, but I think maybe closer to the modern intuition would be this argument that pre-norm is just a more stable architecture to train. And so some of the earlier work by Nguyen and Salazar identified all these loss spikes: if you were training with post-norm, kind of in blue here, you would see a lot more loss spikes and the training would be kind of unstable as you were training. So you see the gradient norm here, you know, is spiking and generally higher than the one with pre-norm. And so today, you see pre-norm and other layer norm tricks being used essentially as stability inducing aids for training large neural networks.

00:09:11Lecturer

And so this brings us to one new fairly, I think recent innovation. I think this didn't exist when I gave this lecture last year. Which is this variant that I don't think really has a great name, but I'm just going to call it the double norm for the moment here. So this is the original figure that I showed you at the very beginning, and we know that putting layer norms in the residual stream is bad. But actually someone in 224N this year asked, "Well, but why do you have to put the layer norm in the front? Why can't you put it, you know, after the feed forward network?" And of course you can, and not only that, sort of recent people have gone around and just added the layer norm after the, you know, the blocks as well. And so Grok and Gemma 2 both take this approach of layer norms both in front and after. OLMo 2 does only the layer norm after the feed forward and the multi-head attention. And so this is actually kind of an interesting change. Pre-norm has just been kind of dominant and the only thing for a while, but things have been changed up a little bit. So now there's a new variant, and this is actually, you know, there's been some evaluations of this kind of approach. People have argued it's a little bit more stable and nicer to train on these larger models.
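
Schematically, the variant being described puts norms on both sides of each sublayer while keeping the residual path itself untouched (a sketch of the idea rather than any particular model's exact code; per the description above, OLMo 2 keeps only the second of the two norms):

```latex
h = x + \mathrm{LN}_{\text{post}}\Big(\mathrm{Attn}\big(\mathrm{LN}_{\text{pre}}(x)\big)\Big)
```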

00:10:24Lecturer

By the way, feel free to stop me and ask me questions as well. I have a tendency to sort of keep going if no one stops me. So yes.

00:10:31Audience Question

Why is layer norm in the residual bad?

00:10:34Lecturer

Why is layer norm in the residual bad? That's a good question. I don't think I can give you like a, you know, this is the proof of why it's bad. I think one, you know, intuitive argument for why this might be bad is that the residual gives you this identity connection all the way from almost the top of the network all the way to the bottom. And so if you're trying to train really deep networks, this makes gradient propagation very easy, right? So there's lots of arguments about how, you know, LSTMs and these other kinds of, you know, state space models have difficulty propagating gradients backwards. An identity connection does not have any such problems. And so putting layer norms in the middle, you know, might mess with that kind of gradient sort of behavior. And that, of course, you see back here, right? This is exactly the kind of plot you expect to see if that's happening. Okay, cool.

00:11:18Lecturer

Um, the other thing that people now do is in the original transformer, people did, you know, layer norm. And so layer norm is this equation over here. What you do is you have, you know, the activations X coming in, you subtract the empirical mean. So that's the average of the X's up top, and then you divide by, you know, the standard deviation, or really the variance plus a little fudge factor epsilon, and then you square root that. So you can roughly think of it as a standard deviation, right? So that's going to, you know, standardize your activations X. You're going to scale it up by a gamma that's a learnable parameter and then shift it by a beta, right? So this makes sense. You're going to normalize, you know, your activations, and then you're going to shift them around to whatever point you want.
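
Written out, with the mean \mu and variance \sigma^2 taken over the hidden dimension of x, the equation being described is:

```latex
\mathrm{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^{2} + \epsilon}} + \beta
```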

00:11:59Lecturer

And many models use this layer norm thing and it worked quite well, but many models have sort of now moved on to RMSNorm. And this is one of the consensus changes. Like basically all the models have switched to using RMSNorm. And now what do you do? You just drop the mean adjustment. So you don't subtract the mean, you don't add a bias term, and many notable models do this: the Llama family, PaLM, Chinchilla, T5. They've all moved to RMSNorm.
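
A minimal PyTorch-style sketch of RMSNorm as just described: no mean subtraction, no bias, just a rescale by the root-mean-square and a learned gain (class and variable names here are illustrative, not taken from any particular codebase):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.gain = nn.Parameter(torch.ones(d_model))  # learnable scale, no bias term

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the root-mean-square over the hidden dimension.
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.gain * (x / rms)
```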

00:12:26Lecturer

And what's the reason for this? Um, one reason is that it doesn't really make a difference. Turns out if you train models with RMSNorm, it does just as well as training, you know, layer norm. And so there's a simplification argument. But really, I think the argument that's often given in these papers, and I think it's good to appreciate kind of the details of this argument, is that going to RMSNorm is, you know, it's faster and just as good. So in what way is it faster? Well, if I don't subtract the mean, it's fewer operations. If I don't have to add that bias term beta back, it's fewer parameters that I have to load from memory back into sort of my compute units, right? So I don't have to, you know, retrieve this sort of state.

00:13:09Lecturer

And some of you might be thinking, "But wait, you told me in 224N that nothing but matrix multiplies matter for the purpose of runtime, right?" And this is not a matrix multiply, and so I shouldn't care about, you know, any of this. And that's a reasonable perspective to take if you think about, you know, the percentage of flops that is taken up by different operations in a transformer. This table is from a nice paper by Ivanov et al., I think the title is "Data Movement Is All You Need" or something, that does profiling of all the different components of a transformer. And you see that, you know, tensor contractions, which are like matrix multiplies, that's like 99.88% of the flops that happen in a transformer. And so, you know, saving 0.17% of your flops doesn't seem like a huge win.

00:13:56Lecturer

But I think one of the things that's important for architecture design now is to not just think about flops, because you know, flops are important, but that's not the only resource that you have to think about. It's also that you have to think carefully about, you know, memory movement. And so even though, you know, tensor contractions, so this is things like matrix multiplies, that's like 99.88% of the flops, you know, if you have things like the softmax operation or layer norms, all these like normalization operations that happen in a transformer, they're 0.17% of the flops, but actually they're 25% of the runtime. And a big reason for that is because, you know, these normalization operations still incur a lot of memory movement overhead, right? And so it does actually matter to try to optimize some of these like lower level things because it's not just about flops, it's also about memory movement. I'm going to emphasize this quite a bit more as I get into the systems lecture. Like when we talk about GPU architectures, it's going to become very, very, very important to think about memory, not just about flops.

00:15:00Lecturer

And so this is one of the reasons why RMSNorm has now become sort of much more popular. And so I went back and looked at some of the earlier RMSNorm papers. I think the sad thing is that there aren't quite as many papers published by industry labs with, you know, big nice ablations. And so many of the ablations that I'll show you are going to be from a couple years back. But Narang et al. in 2021 had this very nice ablation showing, you know, here's the vanilla transformer, here's the RMSNorm version, and you kind of see the exact thing I told you. You know, the number of steps per second that you can do in a vanilla transformer is 3.5 per second. With RMSNorm, you get 3.68. You know, not a huge gain, but that's in some sense for free. And you get, you know, a final loss that's lower than the vanilla transformer. So that's great, right? In some sense, we've gotten runtime improvements and we've gotten, in fact, at least in this case, loss improvements. And so that's a win-win for us.

00:16:00Lecturer

The final thing that I'll say, which is very much in line with this RMSNorm thing in terms of theme, is that most modern transformers do not have bias terms. So the original transformer, if you look at the FFN, will look something like this, right? You have your inputs X, you're going to do a linear layer with a bias term, and then you'll ReLU it, and then you'll have a second linear layer wrapping around it. But most implementations, if they're not gated units, which I'll talk about in a moment, look actually something like this. They've just dropped the bias terms.
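
Side by side, the two feed-forward variants being contrasted look roughly like this:

```latex
\text{original:}\quad \mathrm{FF}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2
\qquad\quad
\text{bias-free:}\quad \mathrm{FF}(x) = \max(0,\, xW_1)\,W_2
```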

00:16:30Lecturer

You can just make this argument from basically the same kinds of underlying principles. You know, they perform just as well. Matrix multiplies are apparently all that you need to get these guys to work. And the other thing, which is maybe more subtle, is actually optimization stability. I don't quite have the deepest understanding of why the bias terms are particularly bad for stability, but there's been sort of really clear empirical observations that people have made that basically dropping these bias terms often stabilizes the training of these largest neural networks. And so now a lot of the implementations now omit bias terms entirely and train only on these like pure matrix multiply kind of settings.

00:17:14Lecturer

So that's the layer norm bit. And so there's kind of two things that, you know, you should kind of think of. This is nice because the story is pretty clear. Everyone does something and so you should just kind of know this, right? Basically, everyone does pre-norm, or at least they do the layer norms outside of the residual stream. Like that's kind of the iron rule, right? You know, you get nicer gradient propagation, you get much more stable training. It just doesn't make sense to do it the other way.

00:17:39Lecturer

Most people, or almost everybody, does RMSNorm. In practice it works just about as well and has fewer parameters to move around. And this idea of dropping bias terms just broadly applies. A lot of these models just don't have bias terms in most places. I think the one exception to this RMSNorm one, as I was reading yesterday, is I think Cohere, both Command R and Command R+, use layer norm. Not quite sure why.

00:18:06Lecturer

Okay. Any questions on kind of the layer norm, RMSNorm, and bias term stuff before I move on? Yes.

00:18:14Audience Question

Do you think there are some long-term lessons you can take away from these details that are more future proof potentially? Or do you think these are...

00:18:22Lecturer

Yeah. So the question was, is there something more future proof? And I think it's hard to have like the biggest picture. In many ways, deep learning has been very empirical and like bottom up rather than top down. But I do think there's some generalizable lessons that you could sort of draw from here. I think the lesson of, you know, have very direct identity map residual connections is sort of a story and a lesson that has played out in many, many different kinds of architectures, not just, you know, in these kinds of architectures.

00:18:48Lecturer

LayerNorm, as we'll see once again later on in this lecture, has been very effective. And so not letting your activations drift in sort of scale is another thing that I think generally has been very effective for training stability. Those two seem like fairly generalizable lessons. We will also kind of see sort of the systems concerns come into play again. So this is another generalizable lesson of sort of thinking really carefully about the impact of your architecture on the systems components of your design.

00:19:20Lecturer

Okay. So now, there's this other component which is the activations. And there is a whole big zoo of activations: ReLU, Swish, GeLU, GLU. And then there's, I mean these aren't activations, they're different kinds of MLPs: GeGLU, ReGLU, SwiGLU, and BiLU. And yeah, I think this is exactly the kind of thing that I didn't originally want to learn when I got into doing deep learning. I was like, I don't care about activations, it's going to train anyway. Okay.

00:19:51Lecturer

But it really does matter, unfortunately, for both you and me, that SwiGLU and other GLU variants just consistently work well. And so I will explain those to you, and you should think about them carefully because they do work, and internalize that, right? So I think the ReLU and maybe the GeLU, you all should already know, right? The ReLU you learn in like some of the most basic deep learning classes, right? You just take the max of zero and in the case of an MLP, right, I've dropped the bias terms here, you know, XW1, you take, you know, the ReLU and then you do W2. Fairly easy, right?

00:20:26Lecturer

A GeLU is a Gaussian Error Linear Unit. This one multiplies the linear with a CDF of a Gaussian. And so it's basically going to be like the ReLU but with a little bit of a bump here. Hopefully you can see that over here. This is not just flat at the very bottom. This makes things a little bit more differentiable, which may or may not help. And the GPT family of models, 1, 2, 3 and GPT-J and so on, all use the GeLU. And the original transformer and some of the older models used the ReLU. And really almost all the modern models have switched to the Gated Linear Units like SwiGLU and the GeGLU and others, right?
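
For reference, the two activations just mentioned, where \Phi is the standard Gaussian CDF:

```latex
\mathrm{ReLU}(x) = \max(0, x), \qquad \mathrm{GELU}(x) = x\,\Phi(x)
```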

00:21:11Lecturer

And really I think this is, you know, the Google folks really pushed for this like PaLM and T5 and others. But since it's sort of been tried and true, basically almost all the models post 2023 use a Gated Linear Unit. And so, you know, going back to that earlier question of like what generalizable architecture things can we learn from this lecture, you know, there are some things that have really consistently been very useful: residual connections, layer norms, gating is yet another one, right? And so this is another place where gating just appears and is a very good way of doing things.

00:21:47Lecturer

So originally, this is our fully connected layer right here, right? This is with a ReLU. Now instead of doing just linear and a ReLU, what I'm going to do is I'm going to gate, you know, the output here with an entry-wise linear term. So X.V, V is going to give me a vector and I'm going to multiply that entry-wise with my original inside term of the MLP. And then I'm going to multiply the whole thing with W2, right? So the way to think about this is I've gated sort of the hidden part of the MLP, right? So I've got my original activation that takes my inputs and puts it into the sort of hidden space, and then I'm going to gate that with X.V. And then, you know, I'm going to project that back into sort of the model dimensionality using W2, right? So there's this gating operation that happens entry-wise. And that's really, you know, the basic thing that's happening here. And this is the GLU plus the ReLU. And then we have an extra parameter that we've added here for the gating. This is V.
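
So the gated version of the ReLU feed-forward layer being described (the ReGLU from the list above) can be written as:

```latex
\mathrm{FF}_{\mathrm{ReGLU}}(x) = \big(\max(0,\, xW_1) \odot (xV)\big)\,W_2
```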

00:22:49Lecturer

And so when someone says something like, "Oh, it's a GeGLU fully connected layer," there's nothing to laugh about that. This is the GeGLU fully connected layer. What I've got here is, you know, I've got the GeLU sort of for the nonlinearity, and I've still got the exact same gating here of X.V, right? And this is the architecture that was used by many of the Google models like T5 v1.1, Gemma 2, Gemma 3. And then another variant, there's a SwiGLU. And this has been very, very popular. Swish is x times the sigmoid of x. And this is the nonlinearity and you can kind of, you know, a sigmoid is like this and x is like this, so it will look, you know, just like the Gaussian error linear unit. And then, you know, you do the same thing here. You have a gating over the swish and then you get a fully connected layer here.
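
A minimal PyTorch-style sketch of the SwiGLU feed-forward block just described (names are illustrative; real implementations differ in details such as fusing the two input projections):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)  # branch that goes through the Swish nonlinearity
        self.v = nn.Linear(d_model, d_ff, bias=False)   # gating branch
        self.w2 = nn.Linear(d_ff, d_model, bias=False)  # projection back to the model dimension

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # F.silu(z) = z * sigmoid(z) is the Swish nonlinearity; the gate is applied entry-wise.
        return self.w2(F.silu(self.w1(x)) * self.v(x))
```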

00:23:36Audience Question

Yes, I have a question. Below a certain negative value, the Swish function and also the GeLU function are not monotonically increasing. In fact, they're decreasing, right? And a lot of the argument about how gradient descent works in machine learning is that, okay, you want to do gradient descent, but here it seems like it would go in the opposite direction if you use GeLU or Swish or their gated versions.

00:24:00Lecturer

So yeah, so the question was, you know, this isn't monotonically increasing. You know, there's a bit to the left of zero here where the derivative kind of flips. And isn't that going to be a problem? I think intuitively you could have argued that this would be a problem. You might trap a bunch of activations at zeros. I think in practice, you know, if you look at kind of like neural network optimization dynamics, what's actually happening is often you're throwing very high learning rates with momentum into the optimizer. And so you're not really going to converge to this zero point, right? Like these activations are going to be all over the place. And so in practice, I don't think this little tiny negative piece is really an effect that's going to be huge for the model, if that makes sense.

00:24:49Lecturer

Okay. And then going back to this, the SwiGLU is what basically most models use today, like the Llama family, PaLM, OLMo. And I'll show you the big table later, but you'll see that the SwiGLU is very, very popular. And one thing to note, I'll talk about this again in the hyperparameters part, is, you know, now remember I've added this V term, this extra parameter, right? And so I want to, you know, think about how to size this extra parameter. And what people do is gated models usually make this hidden size, you know, basically the output dimensionality of W1, slightly smaller by a factor of 2/3 in order to make sure that the total number of parameters of this whole thing remains the same as the non-gated counterparts. And that's a convention thing that most people do. If you don't quite understand what that is, I'll go back over that again later. But you can just kind of keep in mind that basically for the gated linear units you just make everything a little bit smaller to make sure things remain parameter matched.

00:25:49Audience Question

So, this may be obvious, but in the past, one of the benefits of ReLU was that it's very easily differentiable with respect to the input. But if you have the derivative of the CDF of the Gaussian, you have something like an x squared in there, does that not really slow things down?

00:26:09Lecturer

That's a very good question. I'm not 100% sure what the internal like CUDA implementation of the SwiGLU or the GeGLU is. I think it's entirely possible that like internally they might be implemented with like lookup tables.

00:26:25Speaker 2

I mean, what really matters is the memory pressure here and like it will be the exact same because you're reading the same amount of elements. So the extra compute is negligible on that...

00:26:35Lecturer

That's actually a yeah, that's probably a better argument that like basically flops wise, this is negligible anyway and that actually the memory calculus is the same.

00:26:43Lecturer

So okay, cool. Alright, so do gated linear units work? I will have more modern evidence for this as well, but I thought, you know, I should take you straight to the horse's mouth, Noam Shazeer's original paper, where he, you know, evaluates all these GLU variants. And you know, this is somewhat older stuff, so you're seeing CoLA and SST-2 performance, but you do see basically that the GLU variants consistently perform better, right? The GLU variants are at 84.2, 84.12, 84.36, 84.67. And, you know, wow, it's the 2020s. They even give you the standard deviations so you can sort of figure out how significant those results are. And they in fact are significant, right? And so this is some nice evidence to see here.

00:27:31Lecturer

There was also, you know, the Narang et al. 2021 paper, which is a very nice paper studying all sorts of architecture variants, I think in the context of T5 style models. And once again, you see that the gated linear unit variants consistently achieve kind of lower losses than their counterparts, right? Like you see that the bolded lines are exactly at the GLU variants. And this pattern has basically held up.

00:27:58Lecturer

So for gating and activations, you know, there are lots and lots of variants across different models. But the Gated Linear Unit has become basically widespread and dominant, and I think for good reason. Of course, the GLU isn't necessary for a good model. Like, it's important to separate the two, right? Just because it's probably slightly better and everyone does it doesn't mean it's necessary. And you do see examples of very high performance models not using a GLU. Like GPT-3 is one example. A more recent one, Nemotron-3 40B uses a squared ReLU, which I had not seen before, and Falcon 2 11B uses a ReLU. Uh, both of those are relatively high performance models. So you can kind of see that it's not really necessary. And so, you know, evidence does point towards consistent gains from SwiGLU and GeGLU, and that's why we ask you to implement exactly that variant.

00:28:51Lecturer

Cool. Okay. The final thing that I want to talk about for architectures, and this is one kind of final major, I want to say, variation that we've seen. Normally the transformer block is serial, right? In the sense that you know, for each block, the outputs come in from the bottom, and then you do your attention, and then you pass the result of that computation forward, and then you do your MLP, and then you pass that computation forward, right? And so this is inherently serial. You do attention and then MLP.

00:29:25Lecturer

But of course this might have certain like parallelism constraints. So if you want to parallelize this over gigantic, you know, sets of GPUs, it might be harder to do so if you have this serial connection. You know, the systems concerns might also be more difficult, right? You might get lower utilization from your GPUs. And so a few models have done this thing that I'll call parallel layers, where basically instead of having serial computation of attention and then MLP, they will do them both at the same time, right? So you will get your X, you know, from your previous layer. You will compute both the MLP and the attention side by side, and then you will add them together into the residual stream and then that will be your output, right?
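
Schematically, with pre-norm in both cases, the two block layouts being contrasted are:

```latex
\text{serial:}\quad x' = x + \mathrm{Attn}\big(\mathrm{LN}_1(x)\big), \quad y = x' + \mathrm{MLP}\big(\mathrm{LN}_2(x')\big)
\qquad
\text{parallel:}\quad y = x + \mathrm{Attn}\big(\mathrm{LN}(x)\big) + \mathrm{MLP}\big(\mathrm{LN}(x)\big)
```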

00:30:04Lecturer

And this was pioneered by GPT-J, which is kind of this open source replication effort, and the folks at Google doing PaLM were kind of bold enough to do this at the really big scale. And many others have kind of followed since. So if you're implementing this, right, you can share a lot of stuff like the layer norms and the matrix multiplies can get fused together and you can get some systems efficiency out of that. It hasn't been quite as popular since then, at least in the last year. I think most of the models that we've seen have been serial layers rather than parallel ones. I think the only exceptions to this are like Cohere Command R, Command R+, and Falcon 2 11B.

00:30:43Lecturer

So now I think we have the ability to kind of go back to, you know, this big, hard to see chart and then see what I was sort of pointing at at the very beginning. So this column here, you know, you don't really need to be able to read any of the text because the colors will tell you everything you need to see. This check mark here, this is basically pre versus post-norm. The only models I really know of in the early days that did post-norm are the original transformer and GPT, and BERT if you want to include that in this table. And then almost everybody else, I think basically everyone else has done pre-norm. The only other non-checked boxes here are models that are proprietary and I don't have details for.

00:31:24Lecturer

This column here on the leftmost, this is RMSNorm versus layer norm. The gray boxes are the layer norm. The blue ones are RMSNorm. Basically, most people have converted to RMSNorm. As I said, this column next to it is serial and parallel layers. Once again, most people do serial, but you see other variants. What I'm going to talk about next is going to be position embeddings, and that'll be kind of more interesting in a moment here. Any questions about any of this architecture stuff before I move on? Hopefully that gives you a bit of an overview of at least the major variations in architectures that we see.

00:31:56Audience Question

Yes. Is serial layer computation more efficient than parallel?

00:31:59Lecturer

So the question was whether serial is more efficient than parallel. It should be actually the reverse, that parallel is more efficient than serial, and that's why you're kind of willing to do this. So in some sense, you might expect serial to be more expressive because you're composing two computations rather than just adding them together. But the benefit of parallel in theory is that if you write kind of the right kinds of fused kernels, a lot of these operations can be done in parallel or the computation is shared across the different parallel parts.

00:32:30Lecturer

Okay. So cool. So the last thing I want to talk about in architecture land, I think this is the last thing, is variations in position embeddings. And I think this one's interesting because in the first few years of sort of LM land, there were a lot of different things that people were trying. Sinusoidal embeddings were from the original transformer. You know, you should have learned this in 224N. There's sine and cosine positions. Many others did absolute embeddings like the GPTs and OPT all basically just added a learned position vector to the embedding. Some others like T5 and Gopher did various kinds of relative embeddings that add vectors to the attention computation.

00:33:16Lecturer

And then I think most models have converged to RoPE, which is, you know, rotary position embeddings. And this I think actually started in GPT-J, once again another open-source contribution, and has really rapidly been picked up by most of the models. And so the high level thought process behind RoPE is that the thing that matters is relative positions of these vectors, right? And so if I have an embedding f(x, i), where x is, you know, the word I'm trying to embed and i is my position, then I should be able to write things down in this way, right? So there should exist an f such that f(x, i) and f(y, j), if I take the inner product of these embeddings, then I can write this down as some different function g, which is a function of the two words and the difference in their positions, right? So this is a definition that enforces basically position invariance, or absolute position invariance. So you only pay attention to how far apart these two words are.
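
In symbols, the property being asked for is that the attention inner product depends only on the two tokens and their offset:

```latex
\big\langle f(x, i),\; f(y, j) \big\rangle \;=\; g(x,\, y,\, i - j)
```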

00:34:17Lecturer

And so you can, you know, do a brief check and see okay, what happens with sines? Well, you get these cross terms that are not relative. So you do still leak absolute position information. Absolute positions, like it's in the name, you know, it's not a relative position embedding. And relative embeddings, well it is relative but it's not an inner product, so it sort of violates this constraint.

00:34:39Lecturer

And so RoPE is this kind of clever observation that we do know one thing that is, you know, invariant to sort of absolute things, which is rotations. And so we're going to exploit that structure to come up with our position embeddings. Right? We know that inner products are invariant to arbitrary rotation, so we're going to leverage that. So on the left, this is the starting point. Let's say my embedding for the word "we" is this arrow over here. And my embedding for the word "know" is this other arrow over here. Now I want to embed this sequence, "we know that." And I only, you know, I look at the words "we" and "know". So how do I do that? Well, "we" is in position zero, so I'm not going to rotate that guy at all. "Know" is in position one, so I'm going to rotate him by, you know, one unit of rotation. And so now I have this embedding for "we know."

00:35:32Lecturer

And now let's say I want to embed this sequence, "of course we know." Now "we" and "know" have the same relative positioning to each other. And so let's look at what happens. "We" gets shifted by two positions. I rotate "we" by, you know, I start in this vertical position and I rotate them twice, one and two. And then I rotate "know" by three positions because it's 1, 2, 3... sorry, 0, 1, 2, 3 position. Right? And so now if you look at these two arrows, they have the same relative angle, right? So their inner products are preserved. And so this is kind of the nice fun idea about RoPE. You just rotate the vectors and the rotation angle is determined by the position of each word. And rotations, you know, the inner products don't care about absolute rotations. And so these inner products are only going to look at sort of the difference in distance.

00:36:21Lecturer

Now, it's easy to think about in 2D because rotations are kind of obvious in 2D. There's only one way to rotate a vector. But in high dimensional spaces where we operate, it's not obvious at all how we are going to do this rotation. So the RoPE folks came up with, you know, in some ways the simplest but also effective way of doing this. And the way to do it is you take your high dimensional vector, in this case D, and I'm just going to cut it up into blocks of two dimensions. And every two dimensions is going to be rotated by some theta. So there's going to be a rotation speed. And I'm going to rotate the pairs of dimensions. And so now every pair of dimensions is encoding, you know, all these relative positions.

00:37:04Lecturer

And much like in sine and cosine embeddings, I'm going to pick some set of thetas such that some embeddings are rotated quickly and others are rotated much more slowly. So they can capture both high frequency information or like close by information and very far away sort of lower frequency positioning information, right? And the actual RoPE math here is, you know, if you're going to think about rotations, it's just going to be multiplying with various sine and cosine rotation matrices. Hopefully you remember this kind of from linear algebra and trig. And so you can think about this as an operation where you multiply, you know, your embedding vectors with these, you know, block-diagonal 2x2 rotation matrices. And there's no sort of additive or cross terms that sort of appear here. This is all purely relative.
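
Concretely, for position m, the k-th pair of dimensions gets multiplied by a 2x2 rotation matrix, with the rotation speeds theta_k set on a geometric schedule (the base of 10000 below is the standard choice from the RoPE paper):

```latex
R^{(k)}_m = \begin{pmatrix} \cos m\theta_k & -\sin m\theta_k \\ \sin m\theta_k & \cos m\theta_k \end{pmatrix},
\qquad \theta_k = 10000^{-2(k-1)/d}
```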

00:37:55Lecturer

One thing that is different, if you're used to sort of absolute position embeddings or sine and cosine embeddings here, is that the RoPE is going to operate at the actual attention layer, right? You're not going to add position embeddings at the bottom. Whenever these attention computations are going to be done, you're going to intervene on that layer, and then that's going to give you your position information.

00:38:15Lecturer

And so, you know, I pulled this from, I think, the Llama implementation of RoPE. You know, you've got the initial normal attention stuff at the very top like query, keys, and values. These are, you know, your normal linear projections. And then, you know, you're going to come up with cosine and sine angles. These are rotation angles telling you how much to rotate different blocks of the query and key. And then so you take your query and your key and you're going to rotate them by the cosines and sines. And now you've gotten rotated query and rotated key. And that's going to be what's going to go into the rest of your attention computation, right? So you don't do this at the bottom, you do it whenever you generate your queries and keys. Hopefully that's clear. That's really critical to enforcing kind of this relative positioning only information. Okay, good.
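
A minimal sketch of that rotation step, roughly in the style of the Llama-family implementations (function and variable names are illustrative; cos and sin are precomputed from the fixed thetas, one rotation angle per position and dimension pair):

```python
import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    # Pair up dimensions by splitting into two halves and swapping them with a sign flip.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(q: torch.Tensor, k: torch.Tensor,
               cos: torch.Tensor, sin: torch.Tensor):
    # q, k: (..., seq_len, d_head); cos, sin: (seq_len, d_head).
    # Each query/key vector is rotated by angles that depend on its position,
    # so the attention inner products depend only on relative positions.
    q_rot = q * cos + rotate_half(q) * sin
    k_rot = k * cos + rotate_half(k) * sin
    return q_rot, k_rot
```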

00:39:03Lecturer

So one of the things I want to highlight is that RoPE is actually one of the things that it seems like everyone has converged on. I, you know, went through all 19 of those papers over the weekend and basically all of them now use RoPE for various different reasons. One reason, you know, is that RoPE now has many different algorithms for extrapolating context length, and that's an important part of sort of the modern productionized language model. But also, it seems to be empirically quite effective, even at fairly small scales and small context lengths. So it's kind of won out on this position embedding battle.

00:39:38Audience Question

Okay. Any questions before I move on to some of the hyperparameter stuff? Yes. Is the rate of rotation consistent across all these models?

00:39:46Lecturer

Um, I don't think they're all the same. There's some variation in the thetas.

00:39:53Audience Question

Oh yes. Are the thetas for each pair, um, are those hyperparameters or are they trained?

00:39:56Lecturer

They're not trained. It's the thetas that determine the rotation angles, and they're fixed rather than learned. Much like with the sines and cosines here, there's kind of a schedule to the rotation angles that's predetermined. And it's the same intuition as the sines and cosines: you want to cover different frequency ranges in order to get higher or lower frequency information.

00:40:24Audience Question

Yes. Oh, do the rotations create any difficulty with, like, training? I'm wondering about these angular rotations.

00:40:28Lecturer

Um the rotations themselves don't really create any issues because one way of thinking about a rotation is that it's just a matrix multiply, right? Since thetas are fixed, right, and the M's here are fixed, this is really just a fixed matrix that multiplies your vector. And so in that sense, it's not really an issue. If you were learning the thetas, then maybe you have issues because you're, you know, maybe differentiating through trig functions, but you're not doing that here. So, okay.

00:40:55Lecturer

Cool. So now I think we go even one more level into the details here, and we're going to talk about hyperparameters. I feel like when you have to, you know, you're dropped in and you're asked to train, you know, a new language model, there's a lot of questions you have about hyperparameters because there's quite a few of them. And one of the things that I've realized is that actually only a few of these really get changed across different successful models. There's actually like fairly clear rules of thumb and fairly clear guidelines that people seem to be following. So, you know, there are some things like how much bigger should the feed forward size be, or how many heads should I have, or what should my vocab size be? And so we'll talk about each of those things and we'll try to constrain the space of hyperparameters that people have.

00:41:43Lecturer

So, you know, the starting point, we're going to look at a simple feed forward layer, you know, just the basic one with the bias, let's say. This is a ReLU version of it. And so there's two hyperparameters here. There's d_model, which is the dimensionality of x, right? That's the input coming into your MLP. And then you've got d_ff, so this is the feed forward dimension. This is kind of the output hidden dimension of your MLP. And from there, you're going to project back onto d_model, right? So what should d_ff be?

00:42:11Lecturer

In general, you know, these things are going to be up projections, right? You're going to have more hidden units than there were inputs. But how much bigger? Well, there is actually just like a consensus. Almost everybody that uses, you know, ReLU style MLPs is going to pick d_ff equal to four times d_model. I will show you some empirical evidence for why this is a sane number later, but as far as I can tell, there's no like, you know, law of nature that says you have to pick four. This is a convention that has really held up.

00:42:47Lecturer

Now, there are a few exceptions to this rule. Remember that the GLU variants are going to scale this down by a factor of 2/3, right? And if you scale it down by a factor of 2/3, you're going to have roughly the same number of parameters. You can do a little bit of math, and if you scale the GLU variants down by a factor of 2/3, you'll come to the conclusion that the way to do that is to set d_ff equal to 8/3 d_model, right? That's going to be the number that you end up at. And you can sort of convince yourself that that will give you the same number of parameters, and that's the ratio that you would get if you started with a ratio of four.
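
The parameter-matching arithmetic, ignoring biases: a non-gated FFN has two weight matrices and a gated one has three, so setting the parameter counts equal gives

```latex
3\, d_{\text{model}}\, d_{\text{ff}}^{\text{GLU}} \;=\; 2\, d_{\text{model}}\, d_{\text{ff}}^{\text{ReLU}}
\;\;\Rightarrow\;\;
d_{\text{ff}}^{\text{GLU}} \;=\; \tfrac{2}{3}\, d_{\text{ff}}^{\text{ReLU}} \;=\; \tfrac{2}{3}\cdot 4\, d_{\text{model}} \;=\; \tfrac{8}{3}\, d_{\text{model}}
```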

00:43:21Lecturer

So if you look at many of the models, they actually do follow this rule of thumb. PaLM, Mistral, and Llama, for example, are slightly larger. These are GLU models, but they don't follow this 2.66 rule. But if you look at, for example, Llama 1, Qwen, DeepSeek, and T5, they all roughly follow this kind of 2.66-ish rule. And I can sort of put up the big table of LMs that I made later with hyperparameters. Many, many, many of them fall into this roughly 2.66 range, and that's the standard parameterization of a GLU unit.

00:43:59Lecturer

I'll go through one other exception. I really like this exception because I think in many ways, you know, big large language model training is a game of copying of hyperparameters from other people, and so we don't learn very much, right? Like it's very conservative. But T5 I really like because in some sense it's really bold. And I think Google people actually do some pretty bold stuff. And so if you look at the 11 billion parameter T5 model, they have a pretty incredible setting. Their hidden dim is 1024, but their d_ff, you know, their up projected dimension is 65,000, right? And so that's going to give you a 64 times multiplier on the ratio of d_ff to d_model. And of course, you know, you compare to this where PaLM is like a factor four and everyone else is, you know, much smaller. This is a very large difference. And there's some other recent examples of using much bigger, you know, multipliers like Gemma 2 kind of follows in these footsteps and does a factor of eight. And I'll talk a little bit about this exception later. Of course T5 was a totally fine model, so this should tell you it is possible to train a model with, you know, such a much larger ratio.

00:45:11Lecturer

So one of the things that I think is, you know, quantitative evidence, you know, I saw that 4x multiplier and I thought, "Is that really the right thing to do, or is there some more quantitative experiment someone's done to convince me that that is a good idea?" So one of the figures from Jared Kaplan's sort of scaling law paper, and most people know this paper for the scaling law component, but actually there's also some really useful hyperparameter components to this paper. You'll actually see that they do exactly this thing that I'm talking about, the d_ff to d_model ratio. And they plot essentially how much the loss increases as you vary this. And you kind of see that there's kind of a sweet spot. This is, you know, a ratio of 1, 2, 3, 4, and then up to like 10 or so here, right? And so there's a pretty wide basin here anywhere between 1 to maybe up to 10 where, you know, you can pick whatever feed forward ratio you want and it'll be roughly optimal. And four is not too far off from your optimal choices over here. It's like one, two, three, four. It's like right here or maybe right here, right? So that's a pretty reasonable choice.

00:46:20Lecturer

So what can we learn from all this hyperparameter stuff? I think a lot of the evidence points towards, you know, you can pick the same defaults of, you know, if you're not using a GLU, you can multiply by four. If you're using a GLU, you can use roughly 2.66. And they can work pretty well for mostly all the modern LMs. T5 once again does show that you don't have to follow these rules, right? You can be a rule breaker and do whatever you'd like. There's no hyperparameter choice written in stone. You can get reasonable LMs at many other hyperparameters.

00:46:52Lecturer

That said, I think the really funny epilogue to this story, right, is that T5 has a follow-up model called T5v1.1 that's improved and it uses a much more standard 2.5 multiplier on GeGLU, right? So, you know, you can read between the lines and say like, "Maybe they looked at, you know, the original T5 and said actually maybe we want to walk back that 64 times multiplier and pick a more standard one." And they did end up with a better model. So cool.

00:47:25Audience Question

Yeah. Okay.

00:47:25Lecturer

So I think that's a good question. So the question was, "What's the ratio, or sorry, what's the relationship between, you know, this ratio that I'm talking about here and generally the impact on the model, right?" And so if we go all the way back here... here, you know, the ratio is controlling essentially how wide, you know, the hidden part of this MLP is. And so the original justification in the T5 paper for picking 64 was to say, "Actually, we can get bigger and fatter matrix multiplies if we make that dimension really, really large." And while that is a kind of a true statement, you know, the wider it is, you know, you're getting more parallel computation, so to speak, rather than serial computation. So you're spending your flops and your parameters in a slightly different way than if you made your hidden units bigger, which would let you pass more information, or if you used more layers, which would give you sort of more serial computation, right? So you're spending your parameters and your flops in a slightly suboptimal way from expressive power, but you might get systems gains if sort of your matrices are wide enough.

00:48:32Lecturer

Okay. Excellent. So another thing that is a surprising, or maybe not surprising, consensus hyperparameter is the ratio between the model dimension and the head dimension times the number of heads. So I clipped this from 224N, right? But really, the basically canonical choice is to pick things so that the dimension D, that's the hidden dimension, gets split up across heads: if you have multiple heads, you're just going to split up the number of dimensions each head gets, right? So you're going to keep the dimensions fixed as you add more heads. And you don't have to do that, right? As you add more heads, you could just keep the same number of dimensions per head, and you could just let the attention part take more and more parameters, right? You could do that. That's an option that you have.
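
In other words, the canonical setting is the following; the example numbers are the published GPT-3 values:

```latex
d_{\text{head}} = \frac{d_{\text{model}}}{n_{\text{heads}}}
\qquad\text{e.g. GPT-3: } d_{\text{head}} = 12288 / 96 = 128
```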

00:49:19Lecturer

But most models once again do follow this guideline. We see GPT-3, T5, LaMDA, PaLM, and Llama 2. They all have a ratio of one or almost exactly one. T5 is the one exception that breaks this rule. They tried the big ratio of 16. But otherwise, it is all, you know, fairly following this consensus. There's been a couple papers that have argued against this 1:1 ratio. You know, there's a notable one by, I don't know how to pronounce this, Bhojanapalli et al. 2020, who have argued that, you know, if you have more and more heads, they're going to have lower and lower rank. And if you have very few dimensions per head, that's going to start affecting the expressiveness of the attention operation. But in practice, it doesn't really seem like we see too many significant low rank bottlenecks. And most of the models with this ratio of one seem to do just fine, right? This is really a parameter that's generally been held constant by most of the models that we've seen. If I have time, I'll talk a little bit about different optimizations that people have made on this like multi-head component. But hyperparameter-wise, things have stayed fairly similar.

💬 0 comments
Add to My Notes
00:50:29Lecturer

I think one of the big ones in terms of hyperparameters is the aspect ratio, the ratio of the hidden dimension d_model to the number of layers. We can make networks deeper by adding more layers, or we can make them wider, and if you want one knob to control the width, that would be the hidden dimension of the residual stream, since it controls the width of almost all the operations at once. So this seems like a pretty critical thing to tune: you might think that deeper networks are smarter and more expressive, or that wider networks are more efficient.

💬 0 comments
Add to My Notes
00:51:01Lecturer

There is generally a sweet spot of ratios that people have picked. There have been some outliers: some of the early models used much smaller ratios here, meaning they were much deeper than they were wide, and some models have gone the other way, really wide, with way more d_model relative to n_layer. But there's generally been a sweet spot of about 128 hidden dimensions per layer, and that has been generally stuck to by a lot of the GPT-3 and Llama variant models. I'll talk a little bit about the evidence for that in a second.
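To make that concrete, here's a quick sketch checking the aspect ratio, d_model divided by the number of layers, for a couple of published configs; the numbers are quoted from memory, so treat them as approximate:

```python
# Aspect ratio = d_model / n_layers; the consensus sweet spot is roughly 100-200.
configs = {
    "GPT-3 175B":  (12288, 96),   # ~128
    "Llama 2 70B": (8192, 80),    # ~102
}
for name, (d_model, n_layers) in configs.items():
    print(f"{name}: aspect ratio ~ {d_model / n_layers:.0f}")
```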

💬 0 comments
Add to My Notes
00:51:47Lecturer

There's considerations about aspect ratio that are quite important. They will control the amount of sort of parallelism that we can do. So, if you're doing something called pipeline parallelism, what you're often going to do is you're going to take your different layers and you're going to cut them up and you're going to put them on different devices or different blocks of devices because you'll parallelize, you know, within each layer as well. And so there's going to be certain kinds of constraints that you're going to put on your model. And also, you know, if you have really wide models, then you can do something called tensor parallelism where you slice up the matrices and then you distribute those on GPUs. And one thing that we'll learn in, I think, one, two, three, four, or five lectures is that these different parallelism paradigms are going to have different constraints, right? You need really fast networking for tensor parallelism, and you can sort of maybe get away with slower networking or higher latency networking for pipeline parallelism. And so your networking constraints might in turn drive some of these like width-depth considerations.

💬 0 comments
Add to My Notes
00:52:48Lecturer

But setting that aside, you might abstractly ask: what is the impact of aspect ratio on model performance? And once again, Kaplan et al. have a really nice visual aid showing how aspect ratio impacts performance. This is three different scales: 50 million, 274 million, and 1.5 billion parameters. The x-axis is the aspect ratio, the y-axis is the loss difference in percentage terms. And you see that around 100, which, as I told you, is around the consensus choice of hyperparameters, is the minimum across the different scales. So this is backed by some of the large-scale hyperparameter data that's been published by Kaplan et al., and it roughly matches that intuition. And a really nice thing here is that the aspect-ratio optimum does not seem to shift too much across several orders of magnitude. So if this holds up at even larger scales, that's very good news: you can keep using one fixed aspect ratio as you scale.

💬 0 comments
Add to My Notes
00:53:51Lecturer

One thing I will note, which is quite an interesting result: Tay et al. at Google had this very interesting paper studying the impact of depth versus width, both upstream and downstream. One of the things they found was that if you're looking at losses, then it doesn't really matter: parameter count is the only thing that matters, and deeper models don't help you. But the story is less clear if you're looking at downstream accuracy. At the time, they were looking at fine-tuned SuperGLUE accuracy, and they argued that for the same amount of flops, deeper models might be better. So I'll just leave it at that. There hasn't been as much follow-up to this work, at least in the open, that I've seen, but downstream performance may actually behave slightly differently with respect to the aspect ratio considerations here.

💬 0 comments
Add to My Notes
00:54:41Lecturer

Okay, cool. The final thing that I want to talk about in this sort of very low-level hyperparameter world is what are kind of the vocabulary sizes that you might want to pick. And in general, vocabulary sizes have been trending upwards. And I think a big part of why is because, you know, LLMs are being deployed out in the wild. They're becoming more useful services. And when that happens, you're going to interact with people speaking different languages, people using emojis, all sorts of other kinds of, you know, almost modalities or languages than what you might expect.

💬 0 comments
Add to My Notes
00:55:17Lecturer

And so I think some of the earlier models, and especially the monolingual models, were in the 30,000 to 50,000 token vocabulary range. You can see this in the GPTs and the early Llamas. But if you look at the multilingual, or what I would call production, systems that have come out, they've all been shifting towards the 100,000 to 250,000 range for their vocabulary sizes. I looked at Command R, which is one of Cohere's models; they're a company that emphasizes a lot of multilingual work, and you see very large vocab sizes from them. Even GPT-4, and the many others that have copied the GPT-4 tokenizer, are around 100k tokens. So that's the standard a lot of people are operating at, roughly 100k to 200k tokens. And I think there's been work showing that as models get bigger, they can in some sense make good use of more and more vocabulary elements. So you might see vocabulary sizes keep trending upward as models get scaled up or more data is used to train them. Cool.

💬 0 comments
Add to My Notes
00:56:25Lecturer

Okay. So the last thing. This is no longer about specific hyperparameters, but two other things you might need to decide before you set your model off to run: dropout and other kinds of regularization. And I think this one was really interesting to me when I was originally doing the research for putting this lecture together. If you think about pre-training, pre-training is about the furthest place from regularization that you might think of, right? Because in pre-training you usually do one epoch; you can't even go through all of your data because you have too much of it. So you're going to do one-epoch training, and you're almost certainly not overfitting the data in that one pass. And so you might think, "All right, we don't need regularization for pre-training. Let's just set the optimizer loose; it's all about minimizing loss." And those are really good arguments for why you shouldn't need to regularize.

💬 0 comments
Add to My Notes
00:57:24Lecturer

But then if you look at what people actually do, the story is kind of mixed, maybe even more mixed than where things have ended up today. In the early days, people did a lot of dropout, and there's a lot of weight decay that also seems to be happening. These days, a lot of people have stopped publishing the precise details of their training hyperparameters, but dropout has largely gone out of fashion, while weight decay is something a lot of people continue to do.

💬 0 comments
Add to My Notes
00:58:00Lecturer

And why is that? That's like a really odd thing to be doing, right? So I'll give you a moment to just kind of think about this state of affairs, right? If you're doing, you know, training a really large neural network for one pass on SGD on vast amounts of data, why would you use weight decay when you're doing that, right? So maybe some of you know the answer, but I think that's a kind of interesting thing to think about. It's very intuition sort of violating, at least for me.

💬 0 comments
Add to My Notes
00:58:27Lecturer

So, okay. The reason is not to control overfitting: if you look at weight decay, different amounts of weight decay don't really seem to change the gap between training loss and validation loss. You can train with different amounts of weight decay, and if you train for long enough and control your other hyperparameters appropriately, you end up with the same train-to-val loss gap. So in terms of overfitting, nothing is happening here, even with zero weight decay.

💬 0 comments
Add to My Notes
00:58:58Lecturer

But what is interesting is that the weight decay seems to be interacting in a somewhat strange way with the learning rate schedules of the optimizers. So what's happening is: look at a constant learning rate, so this is a model trained at a constant learning rate, and then you suddenly decrease the learning rate partway through training, and you see this drop-off when you decrease it. Now look at the different amounts of weight decay you could use. With high weight decay, the model doesn't train very well at the high learning rate, but when you decrease the learning rate, the loss very rapidly drops off. And when you look at cosine learning rate decay, the models with high weight decay start out very slow, but as they cool down, that is, as their learning rate decreases, they very rapidly optimize.

💬 0 comments
Add to My Notes
00:59:51Lecturer

And so there's some fairly complex interaction happening here between the optimizer and the weight decay, some implicit acceleration near the tail end of training that ends up giving you better models. So the answer to the question I posed is: you don't weight decay because you want to regularize the model, which is what it was designed for. You weight decay in order to get better training losses, and that ends up happening because of the learning dynamics at the tail end of training as you decrease your learning rate towards zero. It's a very interesting, complex, and in some ways troubling thing to be doing with language models. But now you see why, if you look at a lot of the technical reports, you'll see "we use weight decay." This is why that ends up happening.
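To make the setup concrete, here's a minimal sketch of the kind of configuration being discussed, assuming PyTorch's AdamW (decoupled weight decay) and a cosine schedule with warmup; the specific values (weight decay 0.1, decaying to 10% of peak, the step counts) are typical choices I'm assuming, not numbers from the lecture:

```python
import math
import torch

model = torch.nn.Linear(512, 512)  # stand-in for a transformer

# Decoupled weight decay (AdamW) is what most LM reports mean by "weight decay".
opt = torch.optim.AdamW(model.parameters(), lr=3e-4,
                        betas=(0.9, 0.95), weight_decay=0.1)

max_steps, warmup = 10_000, 500

def cosine_with_warmup(step: int) -> float:
    """LR multiplier: linear warmup, then cosine decay to 10% of the peak."""
    if step < warmup:
        return step / max(1, warmup)
    t = (step - warmup) / max(1, max_steps - warmup)
    return 0.1 + 0.45 * (1 + math.cos(math.pi * t))

sched = torch.optim.lr_scheduler.LambdaLR(opt, cosine_with_warmup)
```

The interaction described above is what you would see if you swept weight_decay while holding this kind of schedule fixed.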

💬 0 comments
Add to My Notes
01:00:44Lecturer

Cool. Okay. So, putting all that together, there are certain things that I think are just no-brainers. If you're picking these hyperparameters for your model, you don't really need to think too deeply about them, in the sense that they've been validated and basically everyone else does them. This is things like the hidden size of the MLP, the head dimensions of your multi-head attention, your aspect ratio, and your choice of regularization through weight decay. For all of those, there's fairly good consensus evidence for how to pick them, and those defaults are roughly the kinds of values we suggest in the assignment. So you can follow along and they'll give you something similar to this.
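Pulling those no-brainer choices into one place, here's a hedged sketch of the kind of defaults they imply; the concrete numbers are illustrative and not the exact assignment specification:

```python
# Rough consensus defaults implied by the discussion above (illustrative only).
d_model      = 2048
n_layers     = d_model // 128            # aspect ratio of ~128
n_heads      = 16
head_dim     = d_model // n_heads        # keep n_heads * head_dim == d_model
d_ff         = int(8 / 3 * d_model)      # ~2.7x multiplier for gated MLPs
weight_decay = 0.1                       # used for optimization dynamics, not overfitting
dropout      = 0.0                       # largely out of fashion for pretraining
vocab_size   = 128_000                   # modern tokenizers: roughly 100k-250k
```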

💬 0 comments
Add to My Notes
01:01:28Lecturer

So okay, any questions about the hyperparameter piece?

💬 0 comments
Add to My Notes
01:01:34Audience Question

Yeah. Is there a reason why dropout's like gone out of fashion?

💬 0 comments
Add to My Notes
01:01:38Lecturer

That's a good question. The question was, "Why did dropout go out of fashion?" I haven't seen a deep analysis of why dropout is or isn't helpful; I haven't seen any result that, for example, shows it helps training loss. And as this paper argues, and as logic would dictate, there's not really an overfitting issue with models that can't even do one epoch over their training data.

💬 0 comments
Add to My Notes
01:02:05Audience Question

Yes. Um, do multilingual vocabularies actually contribute to improved performance in one language?

💬 0 comments
Add to My Notes
01:02:12Lecturer

So, yeah, the question was: do multilingual vocabularies contribute to improving performance in one language? When you say one language, do you mean, do larger multilingual vocabularies help performance in English? Is that the right question? Yeah. So, I think in your high-resource language, the impact is smaller. If you're only thinking about English language modeling, you can get away with smaller vocabularies; that much is true. But the place where larger vocabularies really help is when you start getting to, I wouldn't say the tail of your distribution, but languages that are more minority. One great example: if you look at any of the Cohere announcements about their models or their tokenizers, they basically always argue that because they have larger vocabularies and because of the way they train their tokenizer, non-English and low-resource languages get packed into far fewer tokens. So people using those languages pay a much lower cost at inference time, which is a great benefit.

💬 0 comments
Add to My Notes
01:03:20Audience Question

Oh yes, question. In these plots, if weight decay doesn't have a significant impact on the val loss, why do we care about the training dynamics, or the favorable optimization dynamics?

💬 0 comments
Add to My Notes
01:03:30Lecturer

Right. Okay, so the question was, if it doesn't have an impact on val loss, why do we care about training dynamics? The goal is still, I want to get, you know, good training loss, right? This is the game that we're playing. And the surprising thing about weight decay is that somehow it gets us better training losses. Like I think the intuitive thing that makes sense is you do weight decay, it gives you better val losses. But that's not what happens. What it's getting you is better training losses, which are also the same as val losses.

💬 0 comments
Add to My Notes
01:04:01Audience Question

Yes. Are there differences in the architecture hyperparameter choices people make as they move towards like multimodal architectures if they're images, text?

💬 0 comments
Add to My Notes
01:04:10Lecturer

Yes. So the question was about multimodal models. That is a great question. My survey of multimodal models is very incomplete. What I can say is that a lot of the academic and open work that I've seen does what you might call shallow or late fusion of the modalities, where you kind of bolt the vision modality onto an existing language model. In those cases, the hyperparameter and architecture choices are already fixed, right?

💬 0 comments
Add to My Notes
01:04:40Lecturer

Um, one thing I will note, and I will talk about this in just a few slides, is that the multimodal models pioneered some pretty interesting techniques in stabilizing language model training. And that's been a really big theme, and I'll talk a little bit about those. So what is different is often when you like bolt on this new kind of vision piece and you like retrain with that, that's a big shock to the model. And so you have to think carefully about how to stabilize that training process, and those innovations have actually seeped back into like pure text language model training. Okay, cool.

💬 0 comments
Add to My Notes
01:05:16Lecturer

So I went back through and, you know, I looked through all these new papers and as I was trying to think about, okay, what's been new in the last year and sort of what new architecture and related things have happened, actually, you know, the core architecture hasn't changed much, but I think the one thing that stood out as being very emphasized in a lot of the releases has been what I would call stability tricks. And so these are things where you would like to train your model in much more stable ways. And as you make bigger and bigger models or you train for longer and longer, these kinds of issues start to appear more and more.

💬 0 comments
Add to My Notes
01:05:51Lecturer

So I've taken this from the OLMo 2 paper, which is actually a great set of academic results on LLM training stability. One thing they start with is this figure: look at this blue curve over here, the L2 norm of the gradient, and it's a terrifying graph to look at. Your loss curve seems to be behaving okay, aside from some bad spikes every now and then, but you open up your gradient norm and it's this horrible plot with spikes everywhere, where your norms are completely blowing up. If you're training models like this, you're going to have a really tough time getting them to converge reasonably. At some point the gradient norm explodes, you can't do anything, and your training is done; you can't train any further.

💬 0 comments
Add to My Notes
01:06:43Lecturer

And so, there's been a lot of emphasis basically trying to turn this blue curve into something that looks a lot like the orange curve. And of course, this loss is higher, but ignore that fact because I think they just switched datasets in between these two training runs. But this orange curve, you know, has nice low gradient norms throughout. And that's really the kind of plot that you would much rather see.

💬 0 comments
Add to My Notes
01:07:03Lecturer

And so you might ask, where do stability issues arise in transformers? Of course, they can arise basically anywhere. But if you look at the kinds of interventions people are making, there's one place that really stands out as the problem child, and that's the softmaxes. The softmax can be a problem because you're taking exponentials, which can be numerically badly behaved, and you're dividing by a normalizer, so you can effectively run into a division by zero. For many different reasons, this softmax piece is the part you might have lots of issues with.

💬 0 comments
Add to My Notes
01:07:41Lecturer

So where are the softmaxes in a transformer? Well, there's one at the very end, so you've got to be careful about that output softmax. And there are also softmaxes in your self-attention. So there are two softmaxes that we're going to think a little bit about, and for each one, I'm going to mention a stability intervention that has generally seemed to be effective.

💬 0 comments
Add to My Notes
01:08:03Lecturer

Okay. So the first one is called the z-loss. And in my desire to cite an older paper, I've gone back to Devlin et al. in 2014, a machine translation paper where the goal was to make sure this normalizer was near one. So if you look at P(x), the output softmax over here, it has two pieces: you exponentiate your logits, and then you divide by the normalizer Z(x), which just sums the exponentiated logits across the whole vocabulary. If you want the network to have Z(x) close to one, you can rewrite your loss and add a second term that pushes log Z(x_i) towards zero. So you end up with an auxiliary loss term of the form alpha * log^2 Z(x_i); you can see that derivation on the right here.

💬 0 comments
Add to My Notes
01:08:59Lecturer

And this is, in some sense, what people often call the z-loss. Jacob Devlin and others did this for machine translation for totally different reasons than what it's used for today. The first instance of this in language modeling land, I think, was PaLM, which used, as they called it, an auxiliary z-loss of 10^-4 * log^2(Z) to encourage the softmax normalizer to behave nicely. And you can reason through the behavior of this regularizer: if it succeeds and forces log Z(x) to always be zero, then the log and the exponential cancel, and the log-probability is basically just the logit itself. That's a good place to be: a nice, numerically stable operation, so all of these problematic operations kind of go away. So you can think of the softmax as being well behaved when Z(x) is close to one, or equivalently, when log Z(x) is close to zero.
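Here's a minimal sketch of what that auxiliary loss looks like in code, assuming PaLM's 10^-4 coefficient; the function name and the mean reduction over tokens are my own choices:

```python
import torch
import torch.nn.functional as F

def lm_loss_with_zloss(logits: torch.Tensor, targets: torch.Tensor,
                       alpha: float = 1e-4) -> torch.Tensor:
    """logits: (batch, seq, vocab); targets: (batch, seq) token ids."""
    ce = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
    log_z = torch.logsumexp(logits, dim=-1)   # log Z(x) at every position
    return ce + alpha * (log_z ** 2).mean()   # push log Z towards 0, Z towards 1
```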

💬 0 comments
Add to My Notes
01:09:55Lecturer

And you know, PaLM in some sense is a very much a pioneer because they did this z-loss trick. And many others didn't really do it for a long time, or at least the ones that had open papers. But then there was a kind of sequence of papers that have done this. PaLM 2 is actually the earliest follow-up that I know of. And then DCLM and OLMo, and now several others have basically picked up on z-loss as a very nice convenient intervention for improving stability.

💬 0 comments
Add to My Notes
01:10:24Lecturer

And then the other trick that we see: so that was how to stabilize the output softmax, but we've got another softmax to deal with, the one in the attention operation. This is from an NVIDIA paper; I forgot to put the citation marker. Here's a block diagram of how attention works: you've got your layer norm at the beginning, you've got your Q, K, and V projections (ignore this extra piece for the moment), you multiply your Q's and your K's, you softmax it, you multiply by the V's, and then you project. And that goes into your fully connected layer and your output. So if you ignore this little piece over here, this looks just like your normal multi-head attention operation.

💬 0 comments
Add to My Notes
01:11:12Lecturer

So what's kind of the difference here? So several folks came up with this idea or this approach called the QK-norm, where you take the queries and the keys and you pass them through a layer norm layer before you, you know, take their inner product for the softmax operation. And this is a, you know, very different kind of approach to controlling the behavior of the softmax. Here you're not controlling the normalizer Z. Instead, you're controlling the inputs to the softmax to be kind of bounded in size, and that's going to naturally control the bad behaviors of the softmax.
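A minimal sketch of QK-norm, assuming a per-head LayerNorm applied to the queries and keys right before the attention scores (some models use RMSNorm instead, and the exact placement varies); the class name and sizes are mine:

```python
import torch
import torch.nn.functional as F

class QKNormAttention(torch.nn.Module):
    """Toy causal multi-head attention with QK-norm (sketch, not a full block)."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, d_model // n_heads
        self.qkv = torch.nn.Linear(d_model, 3 * d_model, bias=False)
        self.out = torch.nn.Linear(d_model, d_model, bias=False)
        self.q_norm = torch.nn.LayerNorm(self.head_dim)  # learned norm for queries
        self.k_norm = torch.nn.LayerNorm(self.head_dim)  # learned norm for keys

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        q, k = self.q_norm(q), self.k_norm(k)  # the QK-norm step: bound the softmax inputs
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(y.transpose(1, 2).reshape(B, T, D))
```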

💬 0 comments
Add to My Notes
01:11:48Lecturer

And as I said before, this is originally an innovation from the vision and multimodal model community. Dehghani et al. in 2023 — this was a paper on training very large vision transformers — and then Chameleon and Idefics from Hugging Face used these tricks for their multimodal training. And then it got picked up by several others like Gemma 2, DCLM, and OLMo 2, which all basically use this kind of technique to stabilize their training.

💬 0 comments
Add to My Notes
01:12:21Lecturer

And I think I'm allowed to add one joke per lecture, and so this is the one I'm going to go with here. I think one of the things that really has stood out in terms of stability interventions has been just how strikingly effective layer norms are, right? So we've seen, you know, going from layer norms just in the pre part of the block to both the beginning and the end of the non-residual component, and now we've also thrown it into the Q and the K component. At least in terms of improving stability, layer norms have been shockingly effective without affecting performance too much.

💬 0 comments
Add to My Notes
01:12:57Lecturer

The last trick I'll note, which hasn't been used quite as frequently, is to soft-cap the logits that go into the softmax. QK-norm is in some sense a heavy-handed intervention, because we operate over the entire query and key vectors. The other thing you can do is, after you take the inner products for self-attention, pass them through a kind of soft clipping operation: you divide the logits by the soft-cap value, take a tanh, and multiply back by the soft cap. What does that do? Well, if your logits start exceeding the soft cap by a lot, the tanh clips them off towards one, so the output maxes out at the soft-cap value. So this gives you a soft clipping of the logits. Gemma 2 and, I think, OLMo 2 do this; it hasn't been quite as popular otherwise. And the other piece of evidence against it: the NVIDIA folks I mentioned earlier tried quite a few different stability-improving interventions, and what they find is, you have your baseline model over here with a perplexity of 11.19; soft-capping makes it slightly worse, while QK-norm actually makes it better, because you can use more aggressive learning rates and push the optimizer further.
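A minimal sketch of that soft-capping operation; the cap value of 50 is roughly what Gemma 2 reports for attention logits, as far as I recall, so treat it as an assumption:

```python
import torch

def soft_cap(logits: torch.Tensor, cap: float) -> torch.Tensor:
    # tanh saturates at +/-1, so the result is smoothly bounded to [-cap, cap].
    return cap * torch.tanh(logits / cap)

scores = torch.randn(2, 8, 128, 128) * 100   # attention logits, some of them huge
capped = soft_cap(scores, cap=50.0)          # magnitudes now bounded by 50
```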

💬 0 comments
Add to My Notes
01:14:24Lecturer

Cool. Okay. So that's the end of sort of the stability improving intervention stuff. Does anyone have any questions? I think that's been kind of the new development over the last year.

💬 0 comments
Add to My Notes
01:14:36Audience Question

Yes. So, for the QK-norm, um like I understand that during training you will have the layer norm being applied. At inference time, is the layer norm still being kept?

💬 0 comments
Add to My Notes
01:14:44Lecturer

Yes. So the question was: at inference time, do you still use the norm? And the answer is yes, because the layer norm has learned parameters. The whole action of the layer norm is that it takes an activation, normalizes it, and then rescales it with a learned scale. If you take that out, that's a huge change to the model; it will have no idea what to do with those unnormalized activations.

💬 0 comments
Add to My Notes
01:15:08Lecturer

Okay, cool. All right. So I have this last bit, the last few slides that I want to end with. If we go over, we can always push this into the next lecture, but I think we also have a lot of content next time because I have to cover DeepSeek v3.

💬 0 comments
Add to My Notes
01:15:29Lecturer

So the last thing I want to cover is variations on the attention heads. So attention heads I think haven't had as much, you know, work done to them. But there have been a few I think important changes that you need to know about in order to understand the models that are being trained. So the one thing I'll talk about, the first thing I'll talk about is GQA and MQA. And these aren't really critical to kind of the training time behavior of the models, but they're very important in understanding the inference cost and inference behavior of the models. And because this is an important architecture change, I'll mention them here in addition to probably being mentioned by Percy in some of the inference lectures.

💬 0 comments
Add to My Notes
01:16:07Lecturer

The other thing, a kind of new development I'll mention, is how the most recent models like Llama 4, if you've heard of it, supposedly support 10 million tokens of context. How do they do that? Well, they do it by messing with the attention pattern in very structured ways. So I'll talk about that as well.

💬 0 comments
Add to My Notes
01:16:27Lecturer

So, GQA, MQA. If you've looked at like some of the larger models like the big Llama models or others, you'll have heard or seen this term GQA or MQA. And I'll talk through what that sort of means. So, to set the stage, let's think about the compute that you need to do attention, right? So, this is once again 224N slides here. You're going to take your, you know, XQ, your query, and your XK, and then you're going to form your big quadratic attention matrix. And you can sort of walk through each of these matrix multiplies and you can convince yourself that the total number of arithmetic operations is going to be B * N * D^2. So that's going to be, B is the batch dimension, N is the sequence length, and D^2 is going to be the hidden dimension squared.

💬 0 comments
Add to My Notes
01:17:16Lecturer

And you can ask about the total memory accesses, and this is going to be B * N * D, and this is going to be, for example, accessing just this matrix here, this XQ is going to be that size. And then the softmax is going to be B * H * N^2. And you can kind of convince yourself of that by just thinking about the size of the softmax matrix, which is going to be batch * number of heads * all the different softmax activations that you have. So that's N^2 of them, right? And you've got a projection, and you've got D^2 projection operations at the very end over here.

💬 0 comments
Add to My Notes
01:17:48Lecturer

And so we can take the ratio of total memory accesses and arithmetic operations. And this is going to be something that will be very important in a couple lectures. This idea called arithmetic intensity, right? So we want our arithmetic intensity to be high. What that means is we want to be doing a lot of compute for every single memory access that we do. And this is going to be because memory accesses are very expensive on a GPU, relatively speaking, and compute is relatively cheap.

💬 0 comments
Add to My Notes
01:18:18Lecturer

And so in this batched computation I'm showing you here, the arithmetic intensity, if you take the ratio of those two things, is going to be roughly (1/k + 1/(bn))^-1, where k is the dimension per head. And so this means we can keep our GPUs busy: as long as the per-head dimension, the batch size, and the sequence length are all reasonably large, both of those terms stay small and the intensity stays high.
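Just to put numbers on that ratio, here's a tiny sketch plugging representative sizes into the (1/k + 1/(bn))^-1 expression; the sizes are illustrative, not any particular model:

```python
# Training-time arithmetic intensity of attention, roughly (1/k + 1/(b*n))^-1,
# with k = per-head dimension, b = batch size, n = sequence length.
def train_intensity(k: int, b: int, n: int) -> float:
    return 1.0 / (1.0 / k + 1.0 / (b * n))

print(train_intensity(k=128, b=32, n=2048))   # ~128 ops per memory access
```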

💬 0 comments
Add to My Notes
01:18:48Lecturer

Of course, this is what happens at training time, right? So the issue is that at inference time, we do not have these big chunky matrices to multiply together. And so that's going to really change the nature of the behavior of our algorithms. So when we're generating text, right, remember that we have to generate a token and then the transformer has to read that token and then it has to process it. And now we can get the next token distribution and then we do the things autoregressively one token at a time, right? And by doing this, we can't parallelize this generation process. We need to go step by step for every single new token.

💬 0 comments
Add to My Notes
01:19:23Lecturer

And when we do this, we're going to need to incrementally compute attention, an idea that people call the KV cache. So what do you do? This is a lovely animation explaining the KV cache. If you look at this figure: you've got a query token, meaning you've generated a new token, you're conditioning on it, and now you want to ask what information you should look up in the past for that query. Your query position shifts from one through n as you generate new tokens one at a time. Meanwhile, you're building up this key cache over here, accumulating all of the past tokens' keys, and the past tokens' keys don't change, because they only depend on things in the past. So as I generate tokens, I'm incrementally building up all of these past keys, and each time I can compute one new row of QK^T. The big attention matrix is this lower-triangular matrix; I'm computing one row at a time, and that row is exactly what's necessary to generate the next token.

💬 0 comments
Add to My Notes
01:20:28Lecturer

So this KV cache idea, if you've not seen this before, is this idea of saying I'm going to generate the K's and the V's incrementally as I go, as I generate each token, and I'm only going to compute Q that's absolutely necessary to do my operations.
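Here's a minimal single-head sketch of that bookkeeping, with pre-allocated cache tensors; real implementations batch this and handle all heads and layers, and every name here is mine:

```python
import torch

def decode_step(q_t, k_t, v_t, k_cache, v_cache, t):
    """One autoregressive step of single-head attention with a KV cache.
    q_t, k_t, v_t: (d,) projections of the token at position t.
    k_cache, v_cache: (max_len, d) buffers holding keys/values for positions <= t.
    """
    k_cache[t], v_cache[t] = k_t, v_t            # append this step's key and value
    keys, values = k_cache[: t + 1], v_cache[: t + 1]
    scores = keys @ q_t / keys.shape[-1] ** 0.5  # one new row of the Q K^T matrix
    attn = torch.softmax(scores, dim=0)
    return attn @ values                          # context vector for position t

d, max_len = 64, 16
k_cache, v_cache = torch.zeros(max_len, d), torch.zeros(max_len, d)
for t in range(4):  # pretend these vectors came from the model's projections
    out = decode_step(torch.randn(d), torch.randn(d), torch.randn(d),
                      k_cache, v_cache, t)
```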

💬 0 comments
Add to My Notes
01:20:45Lecturer

And so once again, you can go through the various pieces of arithmetic: how many flops do we do, what's the total number of memory accesses. If you think about the KV cache, I'm only multiplying the absolutely necessary keys and values; since I'm saving all of the intermediate computations, I'm not wasting any matrix or vector-vector multiplies. So the total number of arithmetic operations remains essentially the same, on the order of B·N·D². But the memory access pattern is now different. Why? Because when I do this KV caching thing, I have to move things in and out of memory repeatedly: at every step I load the growing K and V caches and the weight matrices, do a small amount of computation, put them away, and load them again for the next token. And so I'm repeatedly loading these matrices, which gives a much higher total memory access, on the order of B·N²·D + N·D².

💬 0 comments
Add to My Notes
01:21:44Lecturer

And so when you take this ratio now, the arithmetic intensity is not so good: you get (n/d + 1/b)^-1. If we reason through this: for the arithmetic intensity to be high, I want the thing inside to be very small, so I need really large batches, and I need n/d to be small, which means really short sequence lengths or a really big model dimension. And this n/d term is really unfavorable, because I don't want to pay for a bigger model just for this, and I don't want shorter sequence lengths either. So this is, in some sense, the core inference cost trade-off that people face: you have this very bad memory access pattern, with this one term, n/d, that's really killing you in terms of the throughput of your system.
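And plugging representative numbers into the incremental-decoding ratio (n/d + 1/b)^-1 makes the problem concrete (again, illustrative sizes):

```python
# Incremental-decoding arithmetic intensity with a KV cache, roughly
# (n/d + 1/b)^-1, with d = model dimension, n = sequence length, b = batch size.
def decode_intensity(n: int, d: int, b: int) -> float:
    return 1.0 / (n / d + 1.0 / b)

print(decode_intensity(n=2048, d=4096, b=8))   # ~1.6: badly memory-bound
print(decode_intensity(n=8192, d=4096, b=8))   # ~0.5: worse as context grows
```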

💬 0 comments
Add to My Notes
01:22:34Lecturer

And so this motivates this thing called MQA. The key idea, as you hopefully see from the figure back here, is that the part that's really bad is the keys and the values: they have this KV cache being built up, and that's the memory that keeps moving in and out. So what you do is keep multiple heads for the queries, but use only a single head for the keys and values. This immensely simplifies things: once you do this, you're moving much less data for the K's and the V's. So K and V are shared, but the query still has many heads. You still have multiple queries, but only single K's and V's; that's why it's called multi-query attention.

💬 0 comments
Add to My Notes
01:23:21Lecturer

And now when you do the same kind of arithmetic, we have far fewer memory accesses because we've shared the K's and the V's, and the arithmetic intensity is much better behaved. We've decreased the first term by a factor of n, so longer sequence lengths are now viable, and the second term is now divided by the number of heads, so it's also not so terrible. All the different terms are controlled now, and MQA can give you much better inference behavior.

💬 0 comments
Add to My Notes
01:23:49Lecturer

GQA, or grouped-query attention, changes this slightly. Instead of going all the way from multiple query heads and multiple key/value heads down to a single key/value head, you reduce the number of key/value heads by some factor. This lets you trade off between the inference-time behavior and the expressiveness of the model, because maybe going from multi-head all the way to multi-query is a little too aggressive. Some works show that GQA doesn't hurt quality, but multi-query attention does. I'm not going to get into that.
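A minimal sketch of the key/value sharing, where a smaller set of KV heads is repeated to match the query heads; setting the number of KV heads to 1 gives MQA, and setting it equal to the number of query heads recovers standard multi-head attention (the function name and sizes are mine):

```python
import torch
import torch.nn.functional as F

def gqa_attention(q, k, v):
    """q: (B, n_heads, T, hd); k, v: (B, n_kv_heads, T, hd), n_kv_heads <= n_heads."""
    group = q.shape[1] // k.shape[1]      # query heads sharing each KV head
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

B, T, hd = 2, 16, 64
q = torch.randn(B, 8, T, hd)   # 8 query heads
k = torch.randn(B, 2, T, hd)   # 2 shared KV heads (GQA); 1 would be MQA
v = torch.randn(B, 2, T, hd)
out = gqa_attention(q, k, v)   # the KV cache only ever stores the 2 KV heads
```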

💬 0 comments
Add to My Notes
01:24:25Lecturer

I'm just going to close off with this very last thing, which I think is a really interesting development in the last few months. So back in 2019, OpenAI had this kind of cool paper basically arguing how to build longer attention models. And they were basically arguing, "Well, one way to do that is to come up with sort of sparse attention patterns, right?" So instead of paying attention to all of the sequence, I'm going to pay attention to, let's say, a local window at each sort of chunk. And then I can have sort of other sort of attention patterns that are like diagonals that help propagate information across. So you can build sparse or structured attention that trades off, you know, various kinds of expressiveness versus runtime. GPT-3 uses exactly these kinds of tricks when they originally released it to get larger attention windows.

💬 0 comments
Add to My Notes
01:25:13Lecturer

Sliding window attention is another variant of this idea where, at each layer, you only pay attention to a small window around your current position. This also controls the total amount of resources you need in order to handle longer context. Your effective receptive field is now roughly the local window size times the number of layers.
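A minimal sketch of the sliding-window causal mask being described; the window size and sequence length are just for illustration:

```python
import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """True where attention is allowed: causal, and at most `window` tokens back."""
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    return (j <= i) & (j > i - window)

mask = sliding_window_causal_mask(seq_len=8, window=3)
# Stacking L such layers gives an effective receptive field of roughly window * L.
```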

💬 0 comments
Add to My Notes
01:25:38Lecturer

The final trick: those were the older ideas, but the modern instantiation is that some of the recent models, like Llama 4, Gemma, and Cohere Command R, have come up with this very clever trick of structuring the transformer blocks in groups, in this case a set of four. The very bottom block in each group uses full self-attention with no position embedding, so no RoPE, nothing; it doesn't know about position at all, but it's full self-attention, and it only happens once every four blocks. And then the three blocks above it use sliding window attention with RoPE.

💬 0 comments
Add to My Notes
01:26:20Lecturer

And this is actually a really clever trick for both the systems aspect, because the full attention only happens every so often, and the length-extrapolation aspect, because RoPE only ever sees local context windows, and anything really long-range is handled by the layers with no position embeddings at all. So it can extrapolate very aggressively, because you don't have to do the kind of position extrapolation you'd need with something like RoPE. That's a really cool development that we've seen in the last couple of months.
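A schematic sketch of that interleaving, assuming one full-attention, no-position-embedding block per group of four, with sliding-window RoPE blocks in between; the 1-in-4 period matches the description above, while the window size is just a placeholder:

```python
# Schematic layer schedule: every 4th block is full attention with no position
# embedding ("NoPE"); the other blocks use sliding-window attention with RoPE.
def layer_schedule(n_layers: int, period: int = 4, window: int = 8192):
    plan = []
    for i in range(n_layers):
        if i % period == 0:
            plan.append({"attn": "full", "positions": "none", "window": None})
        else:
            plan.append({"attn": "sliding", "positions": "rope", "window": window})
    return plan

for i, cfg in enumerate(layer_schedule(8)):
    print(i, cfg)
```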

💬 0 comments
Add to My Notes
01:26:48Lecturer

So, all right, I think we're coming up on time. Feel free to ask any questions about architecture or hyperparameters; I'll be happy to answer questions after.

💬 0 comments
Add to My Notes