Stanford CS25: Transformers United V6 I On the Tradeoffs of State Space Models and Transformers

Stanford Online
For more information about Stanford’s graduate programs, visit: https://online.stanford.edu/graduate-education

April 16, 2026

This seminar covers:
• A high-level overview of a recently popular subquadratic alternative to the Transformer, the state space model (SSM)
• The core characteristics and design choices of SSMs and other related modern linear models

Follow along with the seminar schedule. Visit: https://web.stanford.edu/class/cs25/

Guest Speaker: Albert Gu (CMU, Cartesia AI)

Instructors:
• Steven Feng, Stanford Computer Science PhD student and NSERC PGS-D scholar
• Karan P. Singh, Electrical Engineering PhD student and NSF Graduate Research Fellow in the Stanford Translational AI Lab
• Michael C. Frank, Benjamin Scott Crocker Professor of Human Biology; Director, Symbolic Systems Program
• Christopher Manning, Thomas M. Siebel Professor in Machine Learning, Professor of Linguistics and of Computer Science, Co-Founder and Senior Fellow of the Stanford Institute for Human-Centered Artificial Intelligence (HAI)
Hosts: Albert Gu, Anaiya Raisinghani, Patrick Steen
📅April 27, 2026
⏱️01:17:07
🌐English

Disclaimer: The transcript on this page is for the YouTube video titled "Stanford CS25: Transformers United V6 I On the Tradeoffs of State Space Models and Transformers" from "Stanford Online". All rights to the original content belong to their respective owners. This transcript is provided for educational, research, and informational purposes only. This website is not affiliated with or endorsed by the original content creators or platforms.

Watch the original video here: https://www.youtube.com/watch?v=OyimE74UMF8&list=PLoROMvodv4rNiJRchCzutFw5ItR_Z27CM&index=44

00:00:05Stephen

All right. So now, it's my pleasure to welcome our speaker for today, Albert Gu. He's an assistant professor in the machine learning department at Carnegie Mellon University and the chief scientist of Cartesia AI. And his research broadly focuses on theoretical and empirical foundations of deep learning. And he's particularly known for new approaches to deep sequence modeling and neural network architectures, and was recognized on the Time AI 100 list of most influential researchers in 2024. And previously, he completed his PhD here at Stanford. So without further ado, let's welcome Albert.

00:00:46Albert Gu

All right. Thank you, Stephen. Does this work? Okay. So I'll just get started. It's good to be back. I think I haven't been back on campus in a while, even though it's only been like two years since I graduated.

00:00:59Albert Gu

So this talk, just as a disclaimer, the talk is called "On the Tradeoffs of State Space Models and Transformers." I've given versions of this talk over the last year or so, and I turned it into a blog post as well. So if you've seen those, this talk won't be too different. There is a little bit of new content, but don't expect anything that new. And afterwards, the blog post is also a good supplement to this, if you want to take a look at it.

00:01:27Albert Gu

So in the last two or three years, there's been a big surge in popularity of alternative architectures to transformers, particularly models that are so-called sub-quadratic or linear in complexity. Some examples of these include the original Mamba model, which came out just over two years ago and arguably popularized a lot of the subsequent line of work. There have now been follow-ups to Mamba: Mamba 2, and just a month ago we published Mamba 3. So there's continual progress being made here.

00:01:57Albert Gu

Other models you may have heard of include xLSTM, which is a refinement of the original LSTM, in some ways the original and most popular RNN, the canonical sequence model used before transformers. Other models that are popular are DeltaNet and the follow-up Gated DeltaNet, which combines DeltaNet with Mamba 2. This is another very popular model that is now pretty widely used. And there's also another paradigm called test-time training, which views recurrence as a form of test-time updates that optimize an objective function.

00:02:35Albert Gu

So these are a couple of umbrella families of models that are all quite popular now. And we're seeing them used more and more in actual large-scale, production-quality models. After Mamba came out, it became used in a bunch of hybrid models that combine these linear layers with attention layers, including Jamba from AI21, Zamba from Zyphra, and Samba from Microsoft.

00:03:00Albert Gu

More recently, there are models that include Hunyuan from Tencent, which was actually several hundred billion parameters. The latest Qwen models are now based on Gated DeltaNet. Kimi Linear is another hybrid model that's based on a follow-up to Gated DeltaNet. I think very recently AI2 also released a new OLMo model that is hybrid, and it also uses Gated DeltaNet combined with attention. And one thing I forgot to list here: the NeMo-Megatron models from Nvidia are also very new. I think the latest ones are the NeMo-Megatron 3 models, which have been scaled to, again, hundreds of billions of parameters, and those models use Mamba 2. So again, these models are actually very, very widely used now.

00:03:51Albert Gu

And so today I'm going to be talking about a lot of these models. And rather than focusing on any particular one, I'm going to kind of zoom out and talk about general characteristics of these models, how I think about them, how I think about transformers, and how these models relate to each other.

00:04:08Albert Gu

There's a lot of different models here, like these and many, many more, and there's a lot of names used to refer to them. So in this talk, I'm going to call them state space models, which is what Mamba was considered to be. But there are actually many different lineages of work that have appeared kind of in parallel, and there are other names for related models that, for the most part, at least for the purposes of this talk, are largely interchangeable.

00:04:38Albert Gu

So other things that you can call these include linear attention. They're all kind of recurrent models, so you can call them kind of just recurrent models. I sometimes call them modern recurrent models, as opposed to kind of the older ones like the LSTM. You can call them linear RNNs or just broadly linear models, right? So usually these days when people say any of these words, it kind of broadly refers to really any of these models. And when we say hybrid models, we mean kind of one of these combined with quadratic attention.

00:05:12Albert Gu

Okay. So these again are actually generic architectures and they can be used pretty much anywhere. But for the most part in this talk, we will focus on autoregressive modeling. This is something you're probably all familiar with: the paradigm used for language models, where you have a model of the next-token probabilities of a sequence, and it can then act as a generative model, where you can sample from it by repeatedly sampling a token and feeding it back into the model.

00:05:41Albert Gu

We focus on this for two reasons. First of all, it's in some sense the most important modeling paradigm right now, especially for language modeling. And secondly, for understanding the tradeoffs of these different sequence models, it's actually helpful to think about autoregressive modeling.

00:05:58Albert Gu

And so just to put us on the same page, we'll really quickly go over what does inference or sampling look like for these models. So I'm not going to redefine attention and everything, but from kind of a zoomed-out point of view, what attention does at inference time is that you give it a prompt, which is a set of tokens. And then the model tries to predict the next token or word by doing these pairwise comparisons against every single previous token that it's seen, right?

00:06:34Albert Gu

So you do these comparisons, you run some calculations, and then you can predict the next word. And then you can put this back into your prompt and then repeat this, right? So what characterizes attention is this ability to look over every single past token that you've seen in the context.

00:06:51Albert Gu

So even without defining exactly what attention is, this I think is a very useful characterization of it. Because I kind of think of transformers as the canonical model that caches every past token, because it needs to cache every token in order to do these sort of comparisons.

00:07:08Albert Gu

To get a little bit more specific, this is called the KV cache, which I think everyone probably knows. A lot of the work around transformers basically has to deal with the KV cache, right? Ways of compressing it and so on and so forth. And again, it doesn't actually matter that it's called the KV cache. That's an artifact of the way transformers are defined, but at any higher level, basically a transformer is just a model that stores this cache.

00:07:33Albert Gu

And because it stores this cache, this really defines its computational characteristics. In particular, it's heavily dependent on the size of its context, and the memory and computation scale with the context. Each generation step costs time proportional to the context seen so far, so summed over an entire sequence, the compute scales quadratically. And so this is why attention is considered a quadratic model.
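To make the per-step cost concrete, here is a minimal sketch of KV-cached decoding, assuming a toy single-head attention with random weights (not any real trained model): each step appends one key/value pair to the cache, so step t does work proportional to t, and generating T tokens does on the order of T² work in total.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                                    # hypothetical model width
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

def attend(x, k_cache, v_cache):
    """One decoding step: compare the new token against every cached token."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    k_cache.append(k)                     # the KV cache grows by one entry per step
    v_cache.append(v)
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = K @ q / np.sqrt(d)           # work proportional to the current cache length
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

k_cache, v_cache = [], []
x = rng.standard_normal(d)
for t in range(8):                        # per-step memory and compute grow with t
    x = attend(x, k_cache, v_cache)       # stand-in for "sample next token, embed it"
print(len(k_cache))                       # 8 cached keys after 8 steps
```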

00:07:58Albert Gu

And at a high level, what attention does is all of these comparisons. What state space models do is that, again, at inference time, as your tokens stream in, instead of storing all of them explicitly, the model squishes them all into a state, represented by this blue ball here. And then the individual tokens are kind of thrown away and the model only interacts with the data through the state.

00:08:26Albert Gu

And then it does this repeatedly online, right? So you can generate a new token and fold it back into your state. The state is highly compressed, which means that every single step of inference takes constant time, and summing this over a sequence takes linear time. So this is why we call them linear models.
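For contrast, a minimal sketch of recurrent generation with a fixed-size state, assuming a toy diagonal linear recurrence (illustrative only, not the exact parameterization of any particular SSM): the only thing carried between steps is the state, so every step costs the same no matter how much context has already been consumed.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 16, 64                            # hypothetical channel count and state size per channel
A = rng.uniform(0.9, 0.999, (d, n))      # decay rates (fixed here; input-dependent in Mamba)
B = rng.standard_normal((d, n)) * 0.1
C = rng.standard_normal((d, n)) * 0.1

def step(x, state):
    """One decoding step: fold the new token into the state, then read the output out."""
    state = A * state + B * x[:, None]   # h_t = A * h_{t-1} + B * x_t, elementwise
    y = (C * state).sum(axis=-1)         # output read out from the state, one value per channel
    return y, state

state = np.zeros((d, n))                 # the model's entire memory; its size never changes
x = rng.standard_normal(d)
for t in range(8):                       # constant time and memory per step
    x, state = step(x, state)
print(state.shape)                       # (16, 64), unchanged no matter how long the context
```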

00:08:41Albert Gu

And so this is really just kind of a high-level way to think about it, and these are the ways that really distinguish these families of models. So now let me talk a little bit about the actual definition of these models and what I think are kind of the most important ingredients that go in.

00:08:58Albert Gu

So first of all, state space models are basically defined by this equation down here. So it is essentially a special type of recurrence with a few key characteristics. We think of this recurrence as a sequence mapping from an input $X$. And when you unbind this along the sequence length, you can think of $X$ as basically a one-dimensional number or a scalar.

00:09:24Albert Gu

It gets blown up with a vector $B$ and then summed into some transformation of a hidden state. So every step, just like a normal recurrence, you update the hidden state and then you incorporate the input. And then there's this recurrent hidden state denoted by $H$ that kind of characterizes recurrent models.
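For readers following without the slides, the equation being referred to is the standard discrete state space recurrence. Per channel, with a scalar input $x_t$ and an $n$-dimensional hidden state $h_t$, it reads

$$h_t = A\,h_{t-1} + B\,x_t,$$

with an output read back out of the state through another projection, commonly written $y_t = C^\top h_t$. In the selective models discussed below, $A$ and $B$ additionally become functions of $x_t$.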

00:09:42Albert Gu

Right? Some differences between SSMs and some of the older RNNs are that, first of all, this recurrence is linear. There's no non-linearity applied after this equation, like there is in GRUs, LSTMs, and some of the older RNNs. Second of all, some of the big differences include these ingredients.

00:10:08Albert Gu

So, the first kind of main one that differs a lot from past traditional RNNs is that the state size is much larger. So, the way I define this is that $X$ here is a one-dimensional input, and for a one-dimensional input, the hidden state becomes $n$-dimensional where $n$ is usually on the order of 64 or 128 or so. So kind of the state is 100 times bigger than the input.

00:10:32Albert Gu

And so, this is very different than how LSTMs were defined. The reason this is important is because your autoregressive state here is a bottleneck of your context, right? Because all the context got compressed into the state. So, the larger your state is, the more information your model can remember, which is really important on information-dense modalities such as for language modeling.

00:10:54Albert Gu

The second key ingredient is that not only does the state have to be large, but it has to be expressive enough to remember exactly what the model wants to remember. So, abstractly you can think of any RNN as a function that combines the previous hidden state with a new input. And so, you have a black box function and different parameterizations of the function define different RNNs. And making this more expressive allows the model to be more precise about the information it's modeling.

00:11:24Albert Gu

So, one key difference with some of the later SSMs such as Mamba, compared to older models, is that the parameters here are defined to be functions of the input itself. And this is again different not only from older SSMs, but also from LSTMs. We call this selectivity because, by having its own parameters be functions of the input, the model can decide, based on the input, how to control the recurrence, which lets it be more precise in selecting what information to remember.

00:12:00Albert Gu

Because, for example, if $A$ here is a function of the input and I don't want to remember the input, then I can look at the input and then decide that I want $A$ to be like one and $B$ to be zero. Or if I really want to remember the input, maybe throw away the previous context, I can set $A$ to be zero. Right, so this selectivity makes the state update function much more expressive, and this was really a key ingredient in getting these models to be effective.
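A minimal sketch of this mechanism, with a made-up parameterization chosen only to illustrate the idea (Mamba's actual parameterization of the input-dependent $A_t$ and $B_t$ is more involved): the gates are computed from the current input, so the model can decide per token whether to keep its state or overwrite it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64                                  # state size for a single input channel
w_a = rng.standard_normal(n) * 0.1      # hypothetical projection producing the gate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def selective_step(h, x):
    """State update whose parameters depend on the current (scalar) input x."""
    a = sigmoid(w_a * x)                # near 1: keep the previous state; near 0: reset it
    b = 1.0 - a                         # simple coupling: whatever is forgotten gets overwritten
    return a * h + b * x                # h_t = A(x_t) * h_{t-1} + B(x_t) * x_t

h = np.zeros(n)
for x in [0.5, -1.2, 3.0, 0.0]:         # toy input stream
    h = selective_step(h, x)
print(h[:4])
```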

00:12:28Albert Gu

The final ingredient is that the first two ingredients, the larger state size and the more expressive state update function, give the model much more expressivity and capacity, but also make it much harder to compute. In particular, unlike with a traditional RNN, it's basically infeasible to compute this by naively stepping through the recurrence. And so, a lot of the work involved in this family of models was to find clever ways of rewriting the computation so that the training pass can be computed much, much faster.

00:13:03Albert Gu

Some ideas here include associative scan, which is what the original Mamba model used. Basically, by exploiting the fact that this recurrence is linear, you can parallelize the computation using a specific algorithm. And later versions of this, including Mamba 2 and including many of the other variants like the Gated DeltaNet, completely rewrite this in terms of this chunked matrix multiplication.
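A minimal sketch of the associative scan idea for a scalar linear recurrence $h_t = a_t h_{t-1} + b_t$ (the real kernels are fused and chunked, but the math is the same): each step is an affine map, composing affine maps is associative, so a prefix scan computes every $h_t$ in a logarithmic number of parallel passes instead of a sequential loop.

```python
import numpy as np

def combine(left, right):
    """Compose two affine maps h -> a*h + b, applying `left` first, then `right`."""
    a_l, b_l = left
    a_r, b_r = right
    return a_r * a_l, a_r * b_l + b_r

def scan_recurrence(a, b):
    """All states of h_t = a_t * h_{t-1} + b_t (with h_0 = 0), via a Hillis-Steele
    style inclusive scan: about log2(T) passes of fully vectorized work."""
    A, B = a.copy(), b.copy()
    offset = 1
    while offset < len(a):
        A[offset:], B[offset:] = combine((A[:-offset], B[:-offset]),
                                         (A[offset:], B[offset:]))
        offset *= 2
    return B                            # after the scan, B[t] equals h_t

rng = np.random.default_rng(0)
a = rng.uniform(0.5, 1.0, 8)
b = rng.standard_normal(8)

h, ref = 0.0, []                        # check against the plain sequential recurrence
for t in range(8):
    h = a[t] * h + b[t]
    ref.append(h)
print(np.allclose(scan_recurrence(a, b), ref))   # True
```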

00:13:28Albert Gu

So, really I think these are kind of the three most important ingredients. There's a lot more details that are involved in how exactly you parameterize all of these, how to initialize all of these, and so on. But I think these are kind of the main reasons that made these models effective.

00:13:45Albert Gu

And all of these ingredients actually existed prior to this recent line of work. For example, linear attention was first published, I think, in 2020, and it basically used something similar to this that had this state size expansion. Classical RNNs already had some form of selectivity or gating. And many models had already used these ideas for more efficient computation, including these older linear RNNs.

00:14:13Albert Gu

So, all these ingredients were used before, but Mamba can be viewed as the first model that actually combined all three of these, and that was critical for it to be actually really effective. And it was perhaps the first model that demonstrated that recurrent models can actually be competitive with transformers on information-dense modalities like language.

00:14:35Albert Gu

We now know that this statement has a bit of a caveat. There is much more nuance to modeling than just perplexity, for example, but this certainly showed that from a coarse-grain perspective like perplexity, these completely different architectures could actually be competitive.

00:14:53Albert Gu

Since then, there's been again a lot of work. This is a nice table from the DeltaNet paper, but since then there's been even more models. But basically, there's a lot of different variations of these models that are really effective, but I think all of them are actually much more similar to each other. They're all variants that all include these three ingredients and tweak different definitions of them. So, many of these vary the state update a little bit. Some of them change the computation a little bit, but all care a lot about efficiency.

00:15:23Albert Gu

And to give kind of an off-the-record idea of what I think are the most effective ones: probably right now the most tried-and-true variants of these are Mamba 2 and Gated DeltaNet. Those are also the ones that are most popularly used in large-scale hybrid models. Gated DeltaNet is a little more powerful but a little slower than Mamba 2. So, depending on your needs, both of these are pretty good; it depends on what exactly you want. Mamba 3 just came out and it's a little bit less tested at scale, but it will probably also be up there.

00:15:56Albert Gu

But more importantly, all of these models I think are actually much more similar to each other than they are to attention. And it's more important to, instead of talking about the exact details between all of these, I think it's more interesting to think about the higher-level tradeoffs between these and attention. So, that's what the rest of the talk will be about.

00:16:15Albert Gu

So again, we're going to examine this through the nature of autoregressive modeling. And one of the key points here is that I claim that the tradeoffs of sequence models can be understood through examining their autoregressive state. What is the autoregressive state? So, in some sense, every autoregressive model can be viewed as carrying an implicit state, right?

00:16:39Albert Gu

This can literally be defined as, if you're examining an autoregressive model during generation time, like in this animation, the state is what the model stores in memory in between each step of generation. So, for an SSM, the state is just going to be its fixed-size, kind of a matrix-valued state. For attention, the state is going to be its KV cache.
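As a rough worked example of how differently these two states behave, with made-up but representative numbers (a hypothetical 32-layer model, not any specific released one): the KV cache grows with the context, while the SSM state does not.

```python
# Hypothetical model: 32 layers, 16 heads of dimension 128, fp16 (2 bytes per number).
layers, heads, head_dim, bytes_per = 32, 16, 128, 2

def kv_cache_bytes(context_len):
    # Keys and values for every past token, in every layer and head.
    return 2 * layers * heads * head_dim * context_len * bytes_per

def ssm_state_bytes(state_size=128):
    # One fixed (head_dim x state_size) state per head per layer, independent of context.
    return layers * heads * head_dim * state_size * bytes_per

for T in (1_000, 100_000):
    print(f"context {T:>7}: KV cache {kv_cache_bytes(T) / 1e9:6.2f} GB, "
          f"SSM state {ssm_state_bytes() / 1e9:6.2f} GB")
```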

00:17:05Albert Gu

And the definitions of these states really inform the tradeoffs of these models and their inductive biases. We previously talked about how these models generate: transformers are quadratic because they look at every single past token, and SSMs are linear time because they compress everything into a state.

00:17:27Albert Gu

A simple analogy that I like is that I think of transformers like a database. So, if you think about the nature of its state, which is the KV cache, what it's doing is it's storing a representation of every single token or every single element of the sequence it's seen before. Right, and it's basically writing it down to its database, and as the sequence length grows, it's kind of expanding its database. And it's able to attend or look back very precisely onto any element of this database.

00:17:56Albert Gu

On the other hand, SSMs are more like a brain, in that they have a fixed-size state, and they compress all the information they've seen into that fixed-size state. So, this has completely different characteristics than a database. And many of the tradeoffs of these models can be kind of intuited from this analogy. So, for example, one of the weaknesses of SSMs is that they are not so good at retrieval. And this is exactly the way that brains work as well. Humans are notoriously bad at remembering exact strings of numbers and so on. But it also comes with advantages.

00:18:40Albert Gu

But yeah, so at first pass, talking about the tradeoffs of SSMs: they're kind of the canonical stateful and compressive model. Statefulness is really powerful because it's what allows these models to be really efficient in online settings. Again, just like a brain, such a model is always online, always able to consume information at a constant rate and interact with the world in real time. These are settings where these sorts of linear models are likely to be very good.

00:19:11Albert Gu

The compressiveness of it also has subtle benefits, which I'll talk about later in the talk. And the downsides are that because of the fixed-size state, these lack fine-grained recall and retrieval abilities. So, there's many kind of synthetic tasks like the haystack, associative recall, these sort of things that involve asking a query and digging out precise bits of information from your context. These models are not so good at it, just like brains are.

00:19:38Albert Gu

Now again, the analogy has more implications because, for example, if you think about human intelligence, it's actually not just the brain. Humans use a lot of external tools to assist, and this really augments our ability to process and it augments intelligence. Right? So, if you think of human-like intelligence really as the combination of our processing unit as well as external scratch pads, then this analogy also predicts the behavior of models. And so, this is kind of one high-level inspiration for hybrid models.

00:20:16Albert Gu

Starting from the advent of these state space models, they've been used in hybrid models for this reason. There's a few references here that refer to some of the older ones, but again, there's now many, many more versions of these. Some of which are very, very large scale. And one thing that's been pretty remarkable is that many different groups independently kind of verified the optimal ratios of hybrid models.

00:20:47Albert Gu

So, the simplest way to define a hybrid is just to interleave your linear layer with your quadratic attention layer. Right? And then the question is what ratio of these layers should you include? Many of these older papers found something remarkably consistent, that it's something close to a 10:1 ratio of SSM layers to attention layers seemed optimal, at least from a perplexity standpoint. Nowadays, this ratio seems to have changed a little bit as models are getting better. But I think that it's still very consistent that people typically use something like at minimum a 3:1 or 4:1 ratio of linear layers to quadratic layers.
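As a tiny illustration of what that interleaving looks like as a layer schedule (the ratios are the ones quoted above; the helper itself is just a sketch, not anyone's actual architecture code):

```python
def hybrid_schedule(n_layers, ssm_per_attention):
    """Interleave linear (SSM) layers with full attention layers at a given ratio."""
    block = ["ssm"] * ssm_per_attention + ["attention"]
    return [block[i % len(block)] for i in range(n_layers)]

print(hybrid_schedule(12, 3))
# ['ssm', 'ssm', 'ssm', 'attention', 'ssm', 'ssm', 'ssm', 'attention', ...]
```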

00:21:26Albert Gu

And so, what's interesting about this is that if you use the analogy and think about human intelligence, you might also predict this. Because we tend to think of the brain as the main processing unit, and the external databases, scratch pads, and so on as a supplement that helps you store information but is not actually doing the main computational lifting. Right? And so the fact that when you actually do these very careful ablations, you find that you want more linear layers than quadratic ones, and this is not accounting for computation, it's just what gives the best performance, is pretty interesting; it kind of follows the analogy as well.

00:22:03Albert Gu

And so, kind of a thought here: these linear models are sometimes viewed as having a critical drawback, because the finite state means you can't remember everything, so the compression seems like a weakness. But it's actually not so clear, because even if we're just looking at performance without looking at speed, you actually do want a lot of linear layers here. And so, I think a general theme is: is the compression actually fundamentally important to intelligence, and what role does it play there? And once again, there will be a very interesting experiment I'll show later that hopefully touches on this again.

00:22:41Albert Gu

Okay. So, now we're going to switch and talk about just transformers. So, although there's a lot of work here on transformer alternatives, transformers are still kind of the go-to model for many things, especially language, and for good reason. They're very, very powerful. And there's kind of a prevailing mindset that they basically work on everything and they're really used in pretty much all fields, from language to vision to many more. And many people think that you basically should just use a transformer and just put your data into it and it kind of just works.

00:23:20Albert Gu

And this is kind of true, actually. But there are nuances to it. So, the way I think about it is that attention is actually most effective when your data is at the right level of abstraction. What this means is that in any pipeline you actually see where transformers are used, there are pretty substantial encoder-decoder layers that transform the data to a form that is suitable for a transformer to process.

00:23:46Albert Gu

So, to give some examples, when you use these models in vision domains, for example, this figure is from the original Vision Transformer paper for classification. What they do is they take your image and divide it into these patches, which are basically vision tokens, and then encode it a little bit and then pass it into the transformer. Right? And so, basically it's kind of making your data much coarser and in a form that's more suitable for the transformer. And I claim that this sort of encoding step is actually critical for the transformer to work well.

00:24:25Albert Gu

In language, this encoder is called the tokenizer. So, you know, if you have a string of characters, like if you're modeling English, technically you could have just put this into your model and it should work. But we actually process it using a tokenization step that splits it into these coarser chunks. And many people actually don't really think about this step. We kind of always just use off-the-shelf tokenization. But again, I claim that this step is actually really critical for transformers to work, not just for efficiency reasons, but actually from kind of a modeling perspective in terms of what sort of features and transformations the transformer can capture.

00:25:06Albert Gu

So, tokenization is notorious because it has a lot of problems. So, we can ask what happens if you don't tokenize. There are a number of outspoken critics of tokenization. This is a tweet taken from Andrej Karpathy where he enumerates a bunch of edge cases with tokenization. It causes issues, you know, with spelling. It has many sorts of edge cases. If you ever prompt your model ending in a space, or not ending in a space, one of these two will screw up your model. So, you have to be very deliberate, and these are all edge cases in tokenization.

00:25:48Albert Gu

However, you know, most of these edge cases are understood and most people can engineer away a lot of these with good data or with other tricks. And so, many people think that this is not a real problem. But if you could solve tokenization, if you can get rid of it, you know, you would solve these for free. And I think that even if the engineering solution kind of just works, it's still a worthwhile problem to address for philosophical reasons.

00:26:14Albert Gu

Because AI has always been about learning automatically from raw data, and models that are more and more end-to-end generally tend to be better. This also reflects on the bitter lesson. If you can find models that are more generic and work with less feature engineering, but benefit more from scale, you'll eventually get better models that learn better features from data.

00:26:38Albert Gu

So, what happens if you actually train models without tokenization? So, one simple thing you can do is just look at language modeling, but not tokenize your data before. So, basically you can just pass in raw characters into your model. So, this is a plot from a recent paper of ours, but it was first done by this paper called MambaByte, which basically just tried to compare Mamba against transformers on byte-level modeling.
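Concretely, "not tokenizing" just means the model's vocabulary is raw bytes. A minimal sketch of the preprocessing difference (the BPE-style chunking shown is illustrative only, not the output of any real tokenizer):

```python
text = "state space models"

# Byte-level: no tokenizer at all. The vocabulary has 256 symbols and sequences are long.
byte_ids = list(text.encode("utf-8"))
print(len(byte_ids), byte_ids[:6])     # 18 tokens: [115, 116, 97, 116, 101, 32]

# BPE-style: an offline tokenizer first merges characters into coarser chunks, so the
# same text becomes a much shorter sequence over a much larger vocabulary.
bpe_like = ["state", " space", " models"]
print(len(bpe_like), bpe_like)         # 3 tokens
```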

00:27:06Albert Gu

So, again, here you basically just don't use your BPE tokenizer on the data. And these are some training curves for models that are matched in size and just trained for longer. So, what you see here is that the SSM here is actually much, much better than the transformer. So, the purple line here is a sliding window attention model. And the blue line is just the Mamba model, and these are kind of isotropic models that are basically equivalent in every way.

00:27:41Albert Gu

And you can see there's a very, very large gap here. If you kind of draw the horizontal line to see the data efficiency difference, it's like basically 2x or something. And even if you let the attention be global, and in this case because byte sequences are so long, this is really increasing the computation, so that the global attention model in the dashed line here is doing 2x as much compute as the Mamba model, it's actually still not as good.

00:28:09Albert Gu

And so, what this means is that there's actually a fundamental difference here. And it's not just because attention is slower. Because the comparison between these two bottom lines is kind of the same model size, same data, same sequence length, and it's simply letting the attention do its quadratic thing and using much more compute, but it's still a bit worse.

00:28:33Albert Gu

Okay. So, this phenomenon occurs in other settings, too. So, we can look at sequence-like data that is not just language or kind of language with alternative vocabularies. This is a figure from the original Mamba paper that we investigated the model on DNA modeling. So, again, you can basically just take DNA sequences and do autoregressive modeling on them.

00:28:56Albert Gu

And here the comparison between the orange and the red line is the Mamba versus transformer, where once again, we see that there's actually a really big gap. And if you try to match for compute or match for parameters, the Mamba model is about 3x more efficient. So, these are settings that basically don't have tokenization. And so, what is going on here?

00:29:22Albert Gu

So, when I think about what attention is good at, there's a couple of heuristics that I think are really helpful for thinking about what it does and the types of settings that it excels at. So, first of all, once again, thinking about the autoregressive state. So, we kind of thought of attention as storing a database of every token it's seen.

00:29:46Albert Gu

That means that it should make sense to actually cache every single token, right? And you can think of many types of data where it actually doesn't make sense to cache every token. Obvious examples include like if your data's noisy and there's spurious tokens that you don't need, or you can think of perhaps in the language setting, you know, we can often when we read English, we can often identify words even if you drop out a lot of the tokens, right? And that's kind of just a very coarse analogy that shows that you don't need to spend all the memory, all that computation to cache a representation of every token because not every token is meaningful.

00:30:31Albert Gu

Another heuristic is that, in some ways, softmax attention was developed as an alternative to hard attention. With softmax attention, you basically take a weighted combination of every token in your context. Hard attention means that you can only look at a single token at a time. So this is also, I think, a useful proxy for what attention does.

00:30:57Albert Gu

And once again, it's helpful to think about this because attention excels at data where hard attention makes sense. And so this is why again applying attention to say character-level data is much less effective. Because if you're going to think about the nature of the data and how we process it, you're never going to look at a single character, right? You're kind of paying attention to the data at the semantic level of words or even coarser. Not at the level of characters. And this is kind of a very heuristic explanation for the phenomenon that we just saw.

00:31:38Albert Gu

And so if we kind of look at the different types of data that you can use, this really I think is useful for thinking about where attention is useful. So if you have things like words or subwords, which is what we typically apply transformers on, they really excel because these sort of tokens are intrinsically semantically meaningful. They're modular and composable. It makes sense to want to, you know, look at a single word or store a representation of a word. When we start looking at other settings, they make less sense.

00:32:07Albert Gu

If we're modeling characters or like DNA base pairs, every individual character basically doesn't have any meaning, every individual DNA base doesn't have any meaning. And so this is one explanation for why attention does so much worse here than other models. And many other settings are not so clean.

00:32:30Albert Gu

For example, in vision, if you look at a patch of an image, it's not really clear which category they fall into. Some of the patches may have a lot of information, some of them may have very little, like you might just get a patch of sky that's kind of useless to your model. And so the tradeoffs there are a little bit less clear. But it goes to show why when we move away from kind of pure tokenized language, actually these alternative models become more and more popular.

00:32:58Albert Gu

And I think it's because of basically this reason, that when you don't have very well-defined tokenizers, the power of transformers goes down, and the power of models that are doing this implicit compression, which in some ways can be viewed maybe as sharing a similar role to tokenization, these models get better. So this is a graph taken from pretty soon after the Mamba paper came out about where it was most popularly used in applications.

00:33:28Albert Gu

And even though the model was kind of pitched as a language model, actually language only took up about 10% of the applications. And instead, it was viewed as just a generally useful sequence model for all sorts of modalities from time series to audio to especially vision.

00:33:49Albert Gu

All right, so now while we're on the topic of tokenizers, a very recent application of these models has been again in the domain of tokenizer-free models. So very recently my group published a paper called HNet, which is maybe the latest model that tries to do actual tokenizer-free modeling of language.

00:34:14Albert Gu

So in a nutshell, the way I'll describe HNet is that it's an end-to-end hierarchical network, and that's what HNet stands for, hierarchical network, that operates on raw data and compresses it through a data-dependent chunking process. So I'll unpack this in a little bit, but basically you think of this as a model that is operating on raw data and implicitly trying to tokenize it by chunking it inside the model.

00:34:45Albert Gu

This work was largely led by my student Sukjun Hwang here. And this is the name of the paper. But this is a really general neural network architecture that's really about the idea of compression. Where it's most obviously useful is simply on tokenizer-free language modeling. So to contrast, again, the way that all of our language models work is through an explicit tokenization step. If you're familiar with this, the most common one is called BPE, or byte pair encoding. And it's basically some form of heuristics that take your sentence of low-level data, like characters, and then implement some merge rules that group them together into bigger chunks, right? So for example, this sentence might be tokenized exactly like this.
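A minimal sketch of the BPE idea just described, on a toy corpus (real tokenizers add many details on top of this loop): repeatedly find the most frequent adjacent pair of symbols and merge it into a new symbol.

```python
from collections import Counter

def learn_bpe(corpus, n_merges):
    """Greedily learn merge rules from a list of words (each treated as a list of characters)."""
    words = [list(w) for w in corpus]
    merges = []
    for _ in range(n_merges):
        pairs = Counter()
        for w in words:
            pairs.update(zip(w, w[1:]))          # count adjacent symbol pairs
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]      # most frequent pair becomes one new symbol
        merges.append((a, b))
        new_words = []
        for w in words:
            merged, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == (a, b):
                    merged.append(a + b)
                    i += 2
                else:
                    merged.append(w[i])
                    i += 1
            new_words.append(merged)
        words = new_words
    return merges, words

merges, segmented = learn_bpe(["lower", "lowest", "newer", "wider"], n_merges=4)
print(merges)          # the learned merge rules, most frequent pair first
print(segmented)       # the words rewritten in terms of the learned chunks
```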

00:35:38Albert Gu

Where characters got merged together into these, I'm calling them chunks, but we usually call them tokens. What the HNet does is that instead of doing this as a separate offline step and then passing these into the model, the HNet only sees the original characters. So for example, the vocab size would be just say like 256 if you're working on bytes.

00:36:03Albert Gu

And then the model is implicitly looking for boundaries like this, but this is all happening inside the model. And this is a visualization of what the model actually might do during training. So this depicts the training stage of the model.

00:36:18Albert Gu

I think the animation started partway through for some reason. But basically, the model is modeling just bytes, and the green denotes where the model is trying to put boundaries. So at the beginning of training, the model is doing a lot of exploring because it doesn't know how to chunk yet. And then it eventually stabilizes on boundaries that are quite aligned with where we would expect a good tokenizer to put them. In particular, on English, it happens to coincide that your tokens generally want to be where spaces are, but if you look closely, you can see that the model is actually finding semantically meaningful subwords as well. And if you compose this, you'll find that the model finds semantically meaningful groups of words.

00:37:04Albert Gu

So this was a major step towards actually doing tokenizer-free modeling without, you know, the separate offline step. So I won't go fully into the details because the model's pretty complicated, but I'll try to give a high-level idea of how it works.

00:37:20Albert Gu

So this here on the left depicts the kind of general structure of the model. And so we think of it as hierarchical because there are two main stages. There's parts of the model that interface with kind of the fine-grained data. So what happens is that, let's say we're operating directly on bytes or characters here. The sequence gets passed through a lightweight encoder which will just be a generic sequence model or several layers of a sequence model.

00:37:55Albert Gu

Then there is a special routing mechanism that for every single character predicts, "Do I want this character to be a boundary or not?" Or kind of like, "Do I want it to be the end of a chunk?" Then for the ones that are decided to be chunks, which are highlighted in the colors here, we kind of summarize each chunk and then compress them down into one representation. And so you can kind of think of the encoder and the chunking as playing the same role as the standard BPE tokenizer.

00:38:26Albert Gu

And then once that is compressed down, then you have a much shorter sequence. Then you pass this into what we call the main model, and this model you can just think of as again a generic sequence model that is effective on chunks. And so here, for example, by default we will just use a transformer because they're a tried and true model that operate well on words or chunks or tokens.

00:38:51Albert Gu

And so this is kind of the inner stage. Then after your main model, you take your output representations and then you expand them back out to the original resolution, pass them through a decoder, and then do your autoregressive prediction there. So at a very coarse level, it's really kind of just honestly similar to the existing pipeline with a tokenizer and then a transformer, except the tokenization happens inside the model through an encoder and then a routing mechanism to predict the boundaries.
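A very rough sketch of that dataflow, with stand-ins for the real encoder, router, main network, and decoder (the actual HNet routing, smoothing, and upsampling are considerably more involved, so treat this purely as a picture of the plumbing):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32

def encoder(x):        # stand-in for a few SSM layers over the fine-grained sequence
    return x + 0.1 * rng.standard_normal(x.shape)

def main_model(z):     # stand-in for the transformer operating on chunks
    return z + 0.1 * rng.standard_normal(z.shape)

def boundary_scores(h):
    # Stand-in router: one score per position, "does a chunk end here?"
    return 1.0 / (1.0 + np.exp(-(h @ rng.standard_normal(d))))

def hnet_like_forward(x, threshold=0.5):
    h = encoder(x)                                     # (T, d) fine-grained features
    ends = np.where(boundary_scores(h) > threshold)[0]
    ends = np.unique(np.append(ends, len(x) - 1))      # always close the last chunk
    chunks, start = [], 0
    for e in ends:                                     # pool each chunk into one vector
        chunks.append(h[start:e + 1].mean(axis=0))
        start = e + 1
    z = main_model(np.stack(chunks))                   # (num_chunks, d) coarse sequence
    out, start = np.zeros_like(h), 0                   # broadcast each chunk's output back
    for zi, e in zip(z, ends):                         # to the positions it came from
        out[start:e + 1] = zi
        start = e + 1
    return out                                         # a decoder would refine this further

x = rng.standard_normal((20, d))                       # e.g. 20 byte embeddings
print(hnet_like_forward(x).shape)                      # (20, 32)
```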

00:39:22Albert Gu

Now, yeah, so there's a lot of details here, but the one thing I'll emphasize for this talk is that the outer stages of this model, which are the ones that interface with the byte-level data, strongly benefit from being SSMs. So just like in the previous experiment I showed, where running an SSM versus a transformer on character-level data shows the SSM is way better, similarly here, you'll find that any part of the model that touches byte-level data strongly benefits from being an SSM.

00:39:55Albert Gu

And then outside of this, there are a lot of details in the exact chunking mechanism and other parts of the model that needed to be optimized for this to work. So this is a pretty non-trivial problem, because this is kind of like a discrete optimization problem that's really unstable and difficult. But if you get it right, it actually works and you actually can train a model that operates fully on untokenized data. And actually seems to scale pretty reasonably and can actually outperform your standard tokenized pipeline.

00:40:31Albert Gu

So these are some of the results that we see. On the left here is again training plots where we're fixing a model size and we're just showing how the validation perplexity or bits per byte decreases as a function of the data seen. And one thing to also emphasize is that, sorry, here in this diagram, the main stage I said can just be a generic sequence model. That means that the main stage itself can be another HNet. So that means that you can basically compose this model to have multiple stages of chunking. Right?

00:41:07Albert Gu

And so what HNet really means is that it's not per se about tokenization-free modeling; it is about dynamic chunking. So it's about the idea of taking any data and compressing it into larger chunks that ideally represent higher-level semantic meaning. And because it's done completely end-to-end, this is kind of the first model that can not only chunk things, but also chunk things multiple times to get more and more levels of abstraction.

00:41:38Albert Gu

And what we see on the plot on the left is that the transformer baseline here is the black line. And then if you train an HNet with just a single stage of chunking like in the previous diagram, it starts off worse than a transformer, but after seeing enough data it crosses over and seems to be scaling better.

00:42:02Albert Gu

And the reason for this is that feature engineering generally is a form of inductive bias that can help your model in data-constrained regimes, but usually benefits less from scaling. Right? So this is the common theme that we see in deep learning. This is the bitter lesson. And so we see this exactly in this plot. If you apply a stage of chunking, then it takes some amount of data before the model learns how to create good chunks like we saw in the animation. But once it does, it should theoretically be able to create better chunks than what your hard-coded BPE algorithm gave you.

00:42:41Albert Gu

And then if you compose the model by nesting it again and doing two stages of chunking, the model seems to scale even better. Although there are other challenges that occur in this case because you'll have more parameters, it's harder to train, and so on. But at least these plots kind of track what we'd intuitively expect.

00:43:01Albert Gu

And, it's not shown in this plot, sorry, but you could also ask yourself what happens if you run the HNet directly on BPE tokens, and then the behavior again tracks what you expect. So compared to an HNet on bytes, BPE gives you some free boost, because you're giving the model chunks based on these kinds of hard-coded features.

00:43:28Albert Gu

But if you just added another stage of chunking to the HNet and operate on bytes, then the model learns better features than your BPE tokens and this is what it looks like. And so if you look at these comparisons, especially these two circled in the red boxes, the theme is that no matter what setting you're in, if you train a model operating on BPE tokens, it works pretty good. But if you can manage to throw away the BPE and learn these chunks end-to-end, then your model will do even better.

00:44:00Albert Gu

One variation of this that is particularly interesting is that, so again so HNet is just a generic architecture, it doesn't have to be applied on bytes. This plot shows what happens if you apply it on the BPE data. And I think this one's really interesting because it relates back to this compression argument that I talked about earlier. So what we did here is that we applied a single stage HNet to BPE tokens, and we varied the architecture in the encoder and decoder.

00:44:34Albert Gu

So in this plot, if we just look at the top line and the bottom line: the top line is where the encoder and decoder are only transformer layers. The purple line and the red line are where the encoder and decoder are only Mamba layers, and in between you can vary combinations of these. And then what we see here is that if you only have pure transformer layers in the encoder and decoder, the model is noticeably worse. And as soon as you start putting Mamba layers into the encoder and decoder, the model gets much better.

00:45:06Albert Gu

And so what is really interesting here is that previously we kind of said that you know, transformers are great when your data is already compressed, like BPE tokens, and they're less good when your data is not compressed. So on characters, Mamba is much better than transformers. But in this experiment, we are entirely operating over BPE tokens even in the outer stages. Right? And so one might expect that putting transformers in the outer layers is completely fine, maybe even better because they're only seeing BPE tokens.

00:45:42Albert Gu

This plot is showing that even when you're compute matching everything, it's better to put Mamba in the outer stages. And so the lesson here to me is that the benefit of these SSMs is not just about the resolution of the data they're seeing, but also about this implicit inductive bias toward compression.

00:46:04Albert Gu

So I guess if you look at the diagram here, the role of the encoder layers is not just to interface with the given resolution of the data; it's also to pass these representations into the chunking layer, right, which is deciding how to chunk the data. And so you can imagine that the encoder layers are playing some role of compression as well, because their goal is literally to produce a compression of the data in a temporal sense.

00:46:30Albert Gu

And so even when the data comes not as characters but as BPE tokens, it seems that applying linear models, or you could call them compressive models because of the fixed-size state, is really critical. And so this is, I think, the best evidence I've seen so far that the finite-size state of recurrent models is again not just about efficiency, but actually carries fundamentally different inductive biases. And perhaps in a setting like this it's actually doing something like... I think of the HNet as having the goal of creating better abstractions. Right?

00:47:04Albert Gu

That's the goal of chunking. You want to take your characters and chunk them into abstractions like words. You might want to turn your words into phrases that have meaning, and toward this goal of creating abstractions, that's fundamentally related to compression, and somehow these models actually empirically seem very important for that.

00:47:20Albert Gu

A final result I'll show on this topic is again, you can apply these to different types of data. This is again another very recent paper that was just put on ArXiv a month ago, where we tried applying a couple of different established architectures to DNA modeling again. And once again, DNA is interesting because in language, you know, modeling characters is a bit of an academic exercise. It's like a long-term direction, because we do have tokenizers and most people are like, "Why not just use the tokenizer?" In DNA, it's actually really difficult to define tokenizers, at least in a semantically meaningful way like you can for English.

00:47:59Albert Gu

And we see even bigger gaps generally for these kinds of alternative models. This plot here shows, instead of just fixing a model and running out the training curve, actual formal scaling laws. So each point here is a different FLOP budget, and then the tradeoff between model size and data amount was swept. And so yeah, these are kind of standard scaling laws, but for DNA. And we found remarkably linear trends here.
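For readers unfamiliar with the procedure: each point is the best loss achievable at a given FLOP budget, and a "linear trend" means a straight line on a log-log plot, i.e. a power law $L(C) \approx a\,C^{-b}$. A minimal sketch of how such a fit is made, on made-up numbers rather than the paper's data:

```python
import numpy as np

# Hypothetical (FLOPs, best loss) frontier points, for illustration only.
flops = np.array([1e17, 1e18, 1e19, 1e20])
loss = np.array([1.30, 1.10, 0.95, 0.82])

# A power law L = a * C^(-b) is a straight line in log-log space.
slope, intercept = np.polyfit(np.log10(flops), np.log10(loss), deg=1)
a, b = 10 ** intercept, -slope
print(f"L(C) ~= {a:.2f} * C^(-{b:.3f})")
```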

00:48:32Albert Gu

And what's interesting again is that these trends actually are so different from language. Usually in language these lines are kind of just shifted and you get kind of a constant magnitude of improvement, but when it comes to DNA, the HNet is actually discovering patterns that you can't really find with off-the-shelf tokenizers, and it's actually scaling fundamentally better.

00:48:59Albert Gu

Okay. So hopefully all of these have kind of painted a picture that the different types of models really are doing very different things. They have very different inductive biases. And this goes back to my main thesis that I claimed, which is that transformers work really, really well, but they're particularly effective when the data has already been encoded properly, and you often need different types of models when your data is not encoded or it's very difficult to encode. And so I think the future of modeling will require not just transformers, but a lot of other ideas.

00:49:35Albert Gu

So far, you know, most people have viewed transformers as basically just like a very powerful tool. We don't really look at it or touch it, and we kind of design the whole pipeline around it instead through a lot of feature engineering, through the way we process the data and so on. But hopefully I've kind of conveyed that there really is a lot of different alternatives out there. There's really a lot of room for improvement. I personally believe that there's actually still a lot of room for major improvements in architecture design.

00:50:06Albert Gu

Okay, so just to come back now to the title of the talk, "On the Tradeoffs," just to summarize. SSMs are largely thought of as an alternative model that is more efficient than transformers, but potentially weaker. So just by the way they're defined, they're obviously more efficient. But people often point out the specific problems they have, especially with retrieval-like abilities.

00:50:35Albert Gu

But I actually think that the efficiency is a bit of a red herring. These models are not just more efficient, but they're actually doing a different form of modeling. So again, I think of them as being a stateful and a compressive model, and these lead to very different types of benefits that they have, from things like state tracking, which I didn't talk about today, to things like building abstractions, which we kind of saw through the HNet. And both of these pros and cons are two sides of the same coin, which is the way its autoregressive state is defined in terms of this brain-like state.

00:51:13Albert Gu

Similarly for attention: the strength is that it gets the ability to pay attention to very specific details in its context. It's exceptionally strong at recall and retrieval, which is where SSMs struggle. And usually people think of the downside as the efficiency, the quadratic scaling of attention.

00:51:35Albert Gu

But once again, I think that the efficiency is a bit of a red herring, and the way I really think about attention is that the downside is that it is highly dependent on the resolution and the semantic meaning of the data. So this is through the examples we saw. When you change the resolution even in language modeling from BPE tokens to bytes, the performance of attention changes dramatically. And another way to put this is that attention is beholden to the tokens it's given. Because of the way it's kind of caching every single token, it means that it's very sensitive to the resolution of the tokens it's given, right? And has no ability to change what a token means to it.

00:52:18Albert Gu

And once again, these pros and cons are both two sides of the same coin, which is the way that its autoregressive state is defined through this database-like cache. And so I think that the efficiency arguments for both of these models are again a little bit of a distraction, because at the end of the day, we just want the models to model the things we want, and the exact efficiency there actually kind of, it's not the right thing to look at. Because there may be tasks, for example, where you really do need the quadratic attention. There may be tasks, you know, where you actually have to cache and memorize every single thing in your context and you can't get around that, right? And then so there's no point arguing that the quadratic scaling is bad.

00:53:06Albert Gu

And to give kind of a final picture here. So one way to think about architecture design in general is that there's kind of a picture where we think of models as this black box where during training you give it compute, you know, and data, and then you create an intelligent model that has, you know, a wide range of capabilities on the other side. And in this picture, you basically think of your model training as a vehicle that converts compute or FLOPs into intelligence or capabilities. And what you want is to find kind of the most efficient vehicle that does this conversion.

00:53:50Albert Gu

And so the central question to ask is simply, is my model using every single FLOP it's given wisely? Right? And so again, like there are settings where you actually need to spend a lot of compute in order to do some capabilities, but there are also settings where your model is doing a lot of compute that doesn't really make sense. For example, if your transformer is given character-level data, it doesn't need to be caching every single character into its KV cache. And so this is the central question of architecture design, is designing better and better black boxes here that perform this conversion, and I think there's still a lot of room for improvement there.

00:54:31Albert Gu

Okay, that's the end of the talk. Questions.

00:54:45Stephen

Okay, awesome. Thanks Albert for the very insightful talk. We're going to be soliciting some questions now. We're going to be balancing between our in-person questions and those online. So we can get started with some in-person questions if anybody here has any.

00:55:08Audience Question

Thank you. I really admire your work. I'm a PhD student in Chris Ré's lab as well, and we're constantly looking through all your old papers, so I really appreciate it. A lot of the current SSM inspiration is, you know, trying to trade off some recurrent expressivity for trainability under the condition of using backprop or backprop through time. You mentioned the brain a bunch of times, but there's very little evidence that we're keeping copies of the brain's state at every timestep and doing backprop through time. It'd be very unrealistic that we're doing that. So there must exist some other solver that is happening inside your brain right now. What are your thoughts on that?

00:55:48Albert Gu

Yeah, that's a great question. So I'm actually a big fan of some of these works that try to look at completely different computation paradigms. I think a lot of what drives my work is the idea that we are probably in some sort of local optimum in the entire way that all of our training is set up. From, you know, the architecture, which is where a lot of this is focused, but also even to bigger things like, you know, backprop and so on. It is also not clear to me that backprop is the best thing to do.

00:56:21Albert Gu

And there's a lot of codependence between the model design and just the physical hardware we have and the realities of these things. I personally haven't thought that much about backprop myself, but I have thought about how it could cause serious issues. For example, one issue is that it puts a physical memory constraint on your long-range dependencies. Because if you're using backprop and you can't fit your entire sequence in memory, you literally can't learn those dependencies. You can, you know, cook up pathological examples where you just literally can never learn dependencies that are long enough. So it could be a problem.
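
As a toy illustration of that memory constraint (my own sketch, not from the talk), here is truncated backprop through time on a simple linear recurrence: once the state is detached at a window boundary, because the full sequence doesn't fit in memory, the gradients for everything before the boundary are exactly zero, so those dependencies can never be learned.

```python
import torch

# Linear recurrence h_t = a * h_{t-1} + x_t, trained with truncated BPTT.
a = torch.tensor(0.9)
x = torch.randn(8, requires_grad=True)   # an 8-step input sequence
window = 4                                # only 4 steps fit "in memory" at a time

h = torch.tensor(0.0)
for start in range(0, len(x), window):
    h = h.detach()                        # truncation: the graph before this point is dropped
    for t in range(start, start + window):
        h = a * h + x[t]

h.backward()                              # pretend the loss is just the final state
print(x.grad)                             # grads for x[0..3] are exactly zero: that dependency is invisible
```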

00:56:59Albert Gu

It could be the case that for any realistic task we care about, we can engineer it to be good enough, which is kind of the way the world is going. But I do think there's probably something to getting out of the local optimum and making much bigger changes to these models. Another example is that, as I mentioned, a lot of the work that's been done here is on making these models efficient on current hardware like GPUs.

00:57:25Albert Gu

In doing that, you actually sacrifice a bunch of expressivity that you could gain back by adding true recurrence, and there has been another line of work on trying to do this, but it really struggles because it's just so hard to compute. I actually think that's another line of work that's quite promising, and I have some of my own ideas there. But again, I think you might be able to get fundamentally better models by completely reimagining, you know, the constraints that we currently have.

00:57:53Stephen

Great. Any other in-person questions?

00:58:01Audience Question

Thanks for the amazing talk. Yeah, it's really interesting to see your results with HNet and the dynamic chunking mechanism, and yeah, it's really cool that it sort of gets better performance on DNA sequence modeling and stuff. I was curious if you feel the chunking mechanism is sort of elegant and like will play a substantial role in future models.

00:58:21Albert Gu

Yeah, so it's hard to say. So we worked on this, which took a long time, because I did fundamentally believe that chunking is a really critical primitive that will lead to better models in the long term. The version that we published is, you know, just the first step there. I think it was the first really end-to-end model that works.

00:58:42Albert Gu

I don't think that is the endgame at all. There are improvements that you can probably make to all parts of it, including the chunking mechanism itself.

00:58:52Albert Gu

And so far it's not really been validated at larger scales, but the way I think about it is that, even philosophically, it just feels so important to learn these things from scratch that I do think coming up with better and better chunking mechanisms could be really important for future models.

00:59:20Questioner

Thank you. So just one technical question. I'm curious, since you're setting the separators at arbitrary points: typically with a tokenizer you have a set dictionary, so how do you feed that sequence in when the chunks can vary so dynamically?

00:59:43Questioner

But I guess more generally, you've drawn several comparisons to like how the human brain works. This is something I'm also interested in. Like this aspect of are we doing some sort of dynamic analysis of incoming data? What data do we need to remember or compress or curate? I've been interested in that towards improving the amount of memory that models have access to. So, I'm curious what your perspective would be on using these types of state-based models perhaps to curate memories or curate more efficient memory for like an attention-based or a lookup-based model to use down the line to extend the amount that we're able to actually memorize in a continuous interaction.

01:00:39Albert Gu

All right. So, two questions there. The first one, I think, was about how the model works a little more mechanically. This is actually one key advantage of the model. So, the question is that normally, when you use BPE, you have kind of a fixed vocabulary, and for each vocabulary item you look up an embedding that you pass to your model. One of the challenges of tokenizer-free models is that if you have an infinite-size vocabulary, you can't have infinite embeddings.

01:01:08Albert Gu

So what happens here is that we don't need to think about those embeddings at all because everything is happening inside the model end-to-end. There is no lookup table at all once you get past the very first encoding stage. So, here you think of this model as just operating on a vocabulary of whatever fine-grained units you have, which might be, you know, a size of 256 if it's a byte. You get embeddings here, then you pass this into your model, and at every future stage, you're just working in embedding space.

01:01:35Albert Gu

And for example, the way you create this embedding: let's say this embedding corresponds to this chunk here. The simplest way is to literally just copy whatever that embedding is over to the other side. It just gets passed through, and we ignore all the ones in between. Or you can pool over the chunk. So basically, at the end of the encoder stage, you have embeddings in some latent space. Then you decide where you want the chunks to be, and the embedding that gets passed into the main stage is just, say, a pooling over that chunk.
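
Here is a minimal sketch of that pooling idea. It is my own illustration, not HNet's actual implementation: the boolean boundary mask, the mean pooling, and the assumption that the first position always starts a chunk are all simplifications for clarity.

```python
import torch

def pool_chunks(byte_embeddings: torch.Tensor, boundaries: torch.Tensor) -> torch.Tensor:
    """Mean-pool fine-grained (e.g. byte-level) embeddings into one vector per chunk.

    byte_embeddings: (T, d) encoder outputs, one per byte.
    boundaries:      (T,) boolean mask, True where a new chunk starts (position 0 must be True).
    Returns:         (num_chunks, d) chunk embeddings for the main-stage model.
    """
    chunk_ids = boundaries.long().cumsum(0) - 1                    # (T,), chunk index for each position
    num_chunks = int(chunk_ids.max()) + 1
    d = byte_embeddings.shape[1]
    sums = torch.zeros(num_chunks, d).index_add_(0, chunk_ids, byte_embeddings)
    counts = torch.zeros(num_chunks).index_add_(0, chunk_ids, torch.ones(len(chunk_ids)))
    return sums / counts.unsqueeze(1)

# Toy usage: 8 byte positions grouped into 3 chunks starting at positions 0, 3, and 6.
emb = torch.randn(8, 16)                                           # encoder outputs, one per byte
bounds = torch.tensor([1, 0, 0, 1, 0, 0, 1, 0], dtype=torch.bool)  # True where a new chunk starts
chunks = pool_chunks(emb, bounds)                                  # shape (3, 16)
```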

01:02:10Albert Gu

The second question is about the analogy to brains and so on. To be clear, these analogies are very, very coarse. I'm not a neuroscientist; they might yell at me if they heard these analogies. But I think a lot of this is just kind of inspiration, as I use humans more as a proof of concept of what things should be possible with intelligence and what are some really high-level ways that these might be achievable.

01:02:40Albert Gu

I will say that chunking, I think, is a concept that was directly inspired by cognitive psychology. But one of the things I do think it will be useful for is as a more clever way of creating these databases or caches, right? Because one of the key challenges of long context is knowing which things to store and which to drop. And I think one of the potential use cases of things like HNet is long-context modeling.

01:03:12Albert Gu

Because you can imagine that if you keep iterating this, you can basically try to compress your context into higher- and higher-level abstractions or memories, which are like coarse summaries of the context, and that then allows an inner model to attend over much fewer things and in a much higher-level way. So, I think that's directly an inspiration for this. I don't have further progress on that particular idea, but I do think it's clearly related to the whole philosophy of why one might want to do this.

01:03:45Host

I'll read a couple questions from online. Someone asked, do SSMs outperform transformers in data-constrained or small language model settings, or vice versa, given their different inductive biases?

01:04:00Albert Gu

Good question. I don't know the answer to that off the top of my head. Yeah, I think for language modeling, a lot of people care about scaling. I actually don't know. I will say I found out a little tidbit of information yesterday, which is I think OpenAI recently released this parameter golf challenge. I don't know if you guys have seen it, but people have been trying to train like really small models.

01:04:21Albert Gu

I saw that somebody was trying to train a Mamba-3 base model there, which was doing decently well, but they had a lot of ablations on their submission. One thing they said is that even though the model works well, including at least a little bit of attention was actually pretty critical. So for these really tiny models, that's actually something I didn't know about before. I don't really know the answer, but that's as far as I know.

01:04:48Host

One last one. With transformers, people look at attention maps as a form of interpretability. Is there similar work or a proxy for SSMs, given the information compression?

01:04:59Albert Gu

I think this one is harder. There actually is (I don't think I have the slides for it) a kind of correspondence between some flavors of SSMs and attention. As I mentioned at the very beginning, there are many names for these models, and one of the names is linear attention. So there's a much deeper connection between SSMs and linear attention, and that's implicitly related to attention.

01:05:24Albert Gu

Basically, one thing you can do is visualize an attention-like map. It's not literally an attention matrix, but you can visualize the dependence of every token on every other token in a similar way to attention. And that visualization, I mean, it's basically an attention map, and you can do the same sort of comparisons.
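
As a rough sketch of the kind of map being described (my own illustration; the scalar per-step decay is an assumption, and real SSM variants use richer gating), you can materialize the implicit token-to-token weights of a gated linear recurrence and compare them against an ordinary softmax attention map:

```python
import numpy as np

T, d = 6, 8
rng = np.random.default_rng(0)
Q = rng.standard_normal((T, d))
K = rng.standard_normal((T, d))
decay = np.full(T, 0.8)                      # assumed scalar forgetting gate at each step

# Implicit "attention-like" map of a gated linear recurrence:
# M[t, s] = (q_t . k_s) * product of decays between step s and step t.
logcum = np.cumsum(np.log(decay))
M = (Q @ K.T) * np.exp(logcum[:, None] - logcum[None, :])
M = np.tril(M)                               # causal: token t cannot depend on the future

# Softmax attention map for comparison: its rows can be nearly one-hot ("hard" attention),
# while the recurrent map tends to smear weight across many past tokens.
S = Q @ K.T / np.sqrt(d)
S = np.where(np.tril(np.ones((T, T), dtype=bool)), S, -np.inf)
A = np.exp(S) / np.exp(S).sum(axis=-1, keepdims=True)

print(np.round(M, 2))
print(np.round(A, 2))
```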

01:05:47Albert Gu

And it's actually pretty revealing of the type of behavior that it can do. For example, that visualization shows that attention sometimes can really act like hard attention and only pay attention to specific tokens, while SSMs generally always squish everything together, and it can be pretty diffuse. This is probably not the only visualization that's relevant, but yeah, I think a lot of this is ongoing work too, to do more mechanistic interpretability of these models.

01:06:19Host

Okay, great. Let's give another round of applause for our speaker, Albert.

01:06:29Host

All right. So, now we have some brief information from some of our sponsors, in particular, MongoDB.

01:06:41Anaiya Raisinghani

Thank you all so much for being here. I promise I'll be very quick, about 10 minutes or so. But hi, everyone. My name is Anaiya Raisinghani. I'm a senior technical evangelist at MongoDB. And we are so excited to be sponsoring CS25 this quarter. And I'm here right now because there's a genuinely interesting engineering problem that I think is worth about 10 minutes of your time, and because MongoDB happens to be a really great way of solving it.

01:07:07Anaiya Raisinghani

So, you're all in this room because you're thinking about transformers, about multimodal models, and about what comes next in AI, right? Here's what I want to add to that conversation. The latest multimodal models have gotten really good at understanding a document the way a human being does. The question now is whether your data infrastructure is actually keeping up.

01:07:30Anaiya Raisinghani

Today I want to show you what it looks like when the model and the database are actually designed to work together. So, I'm going to start off with a concrete scenario, walk you through why the standard approach breaks down, and then show you what a better architecture actually looks like end-to-end. So, let's get into it.

01:07:49Anaiya Raisinghani

First off, let's set the scene. We're going to be talking about an insurance company—not the sexiest example, but a really good example. So, let's say that your insurance company gets about 10,000 claims a day, and each one of these claims is a PDF. These claims include damage photos, tables of repair estimates, signatures, and the key here is that no two claims look the same.

01:08:15Anaiya Raisinghani

This is actually a real dashboard that was built by MongoDB. What you're seeing on the left is an active claim, which is a smashed windshield. On the right are claims like this one: five similar claims pulled back instantly with loss amounts, dates, and notes attached. And they just asked, "Show me crashes like this one," and the system knew exactly what to return.

01:08:37Anaiya Raisinghani

So, the question is, how do you build that? And how do you build that with retrieving back accurate information? Because the hard part, as most of us probably know, isn't the search. It's what happens to those PDFs before retrieval even starts.

01:08:52Anaiya Raisinghani

Until recently, the standard approach looked a lot like this. You would OCR the document and extract whatever text you could, chunk it into pieces, pass those chunks to an embedding model, store the vectors, and then retrieve against a query. And this works completely fine for very clean text documents, right? But how many documents out there, especially when you're dealing with insurance claims, health care, or law, are truly perfectly clean text? A lot of context can get lost even before you start searching.

01:09:26Anaiya Raisinghani

And the moment you OCR a damaged photo, the tables in the documents, the handwritten notes—so much important information that can completely alter the whole trajectory of your claim can be accidentally omitted. And in situations like these, you don't want any crucial information to be left out, right?

01:09:44Anaiya Raisinghani

So, Vision RAG makes one fundamental change. Instead of parsing the document into text first, you actually embed the entire page itself as a single multimodal vector. So, let's chat about the new pipeline a little bit. You have text and image documents that come in. They go straight to a multimodal embedding model with no OCR step. The references and embeddings go straight into MongoDB Atlas as your vector store. And then at query time, the same model is able to embed the query, search finds the closest pages, and a vision-capable LLM reads the actual images to help generate your answer.
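
As a rough sketch of what that ingest path could look like in code (my own illustration; the embed_page stub, field names, file names, and connection string are all hypothetical, and the real multimodal embedding API isn't shown in this talk):

```python
from pymongo import MongoClient

# Stand-in for a multimodal embedding call (e.g. a Voyage multimodal model).
# The exact client API isn't covered here, so this stub only marks where it would go.
def embed_page(page_image: bytes) -> list[float]:
    raise NotImplementedError("call your multimodal embedding model here")

client = MongoClient("mongodb+srv://<your-atlas-uri>")   # hypothetical connection string
claims = client["insurance"]["claims"]

claim = {
    "claim_id": "CLM-001",                                # illustrative claim fields
    "loss_amount": 4200,
    "adjuster_notes": "Smashed windshield, rear impact",
}
# No OCR step: the whole page image is embedded and stored in the same document
# as the claim's operational fields.
claim["page_embedding"] = embed_page(open("claim_001_page1.png", "rb").read())
claims.insert_one(claim)
```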

01:10:22Anaiya Raisinghani

So, what makes this work? Have any of you in the audience heard of Voyage AI? I know Tengyu is a professor here, so maybe some of you have. No worries if not, you can look him up after. To ensure your important data isn't getting omitted in the process, it all really comes down to the embedding model.

01:10:41Anaiya Raisinghani

Earlier multimodal models, which were CLIP-based architectures, ran separate encoders for text and images. So, a vision trunk on one side, text on the other, and then the two were just routed together. The problem is that those outputs are independently biased: your text query and your image query aren't really in the same vector space, which means cross-modal retrieval is completely unreliable.

01:11:07Anaiya Raisinghani

The Voyage Multimodal 3.5 model uses a single encoder for both. So, text and visual inputs go in together and come out in one unified vector space. This is what really closes that modality gap. So, a text query like "smashed windshield from rear impact" and an image of exactly that damage are now directly comparable one-to-one. It also supports interleaved text and images. So, for a PDF page where a table might sit next to a paragraph, they both get captured together, whereas before they might get routed separately.

01:11:43Anaiya Raisinghani

So, now let's go back to our insurance example using an image example and the actual data flow. A claims adjuster uploads a photo, the query image, which is the cracked windshield. That image goes to the embedder, which produces a vector. Then that vector goes into MongoDB Atlas, where it's compared against your entire vectorized claims collection using vector search.

01:12:04Anaiya Raisinghani

And what comes back from that? The top five most visually similar claims from your history. This means that you're not dependent on just keyword matches, but actual visual similarity. And the system found these because of the damage patterns, the angle, the type of impact. All of these live inside of the embedding.

01:12:25Anaiya Raisinghani

And because MongoDB stores the vector in the same document as everything else—because we are a vector database—it's storing that vector next to your claim ID, the loss amount, the adjuster notes, the repair history. You get all of that context back in one query, not in a separate system, but the exact same record.
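
Continuing the ingest sketch above (same claims collection and embed_page stub), here is roughly what that single query could look like with Atlas Vector Search. The index name, field names, and query file are assumptions for illustration, and the sketch assumes a vector search index has already been created on the page_embedding field.

```python
# Embed the adjuster's query photo with the same multimodal model used at ingest time.
query_vector = embed_page(open("new_windshield_photo.png", "rb").read())

results = claims.aggregate([
    {"$vectorSearch": {
        "index": "claims_vector_index",       # assumed Atlas Vector Search index name
        "path": "page_embedding",
        "queryVector": query_vector,
        "numCandidates": 100,
        "limit": 5,
    }},
    # The operational fields live in the same document as the vector,
    # so one aggregation returns similarity hits with their claim context attached.
    {"$project": {
        "claim_id": 1,
        "loss_amount": 1,
        "adjuster_notes": 1,
        "score": {"$meta": "vectorSearchScore"},
    }},
])
for doc in results:
    print(doc)
```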

01:12:43Anaiya Raisinghani

Let's zoom out a little bit and look at the architecture in question. On the left side, we have what most RAG stacks look like today. You have your operational database for your application data and a completely separate vector database for retrieval. You have embedding models running in between, a reranker layer—so many moving pieces.

01:13:02Anaiya Raisinghani

Now let's take a look at the right side. MongoDB Atlas is a single platform with full-text search, vector search, embedding models, rerankers, everything all in one. This means that your multimodal data, your vectors, are all in the same document. And for those 10,000 insurance claims, your vectors actually live right next to your claim notes and your other source data, which means that there's nothing that you need to sync or glue together.

01:13:28Anaiya Raisinghani

And this leaves us with a genuine architectural shift, right? It changes how you handle your unstructured data, where the old pipeline would start leaving out crucial information. It changes how you think about the modality gap in your retrieval: once again, you have one encoder, one vector space, and your text queries and your visual documents are directly comparable.

01:13:51Anaiya Raisinghani

And then it also changes how you design the database layer beneath your AI applications. When your operational store and your vector store are the exact same system, you eliminate an entire category of infrastructure complexity. Just to sum it all up, when your database thinks in documents the same way that your models think in tokens, you stop fighting your infrastructure and you start to get results.

01:14:17Anaiya Raisinghani

For a deeper dive into Vision RAG and also Video RAG, please check out the article this talk is based on, by my fantastic coworker, Thibaut Gourdel, who's a senior technical PMM at MongoDB. I've linked the article right up here. And I'm going to quickly hand it over to my other coworker, Patrick, who's just going to sum up MongoDB for Startups for you all and then also talk a little bit about the Frontier Lunch Club. But thank you so much for having me.

01:14:45Patrick Steen

Hello. Thank you so much, Anaiya. Hi, everyone. I just wanted a quick second to introduce myself. My name is Patrick Steen. I'm on our Ventures team at MongoDB.

01:14:55Patrick Steen

I just want to quickly say thank you for having us here today and getting to talk with everyone, and also to introduce our MongoDB for Startups program. Being at Stanford is a really strong signal that a lot of you are interested in being future founders, and I feel like, on Zoom and in this class, we've got some here today. So I just wanted to offer all of you the chance to sign up for our startup program. This will give you access to Atlas credits, special one-on-one technical expertise and advice, as well as some go-to-market opportunities.

01:15:28Patrick Steen

We host a lot of different events. I feel like, as a founder, having a great product is only half the battle, so building relationships and getting to use them is something we would love to help you with. So, yeah. Feel free to hit me up on LinkedIn if you have any ideas or if I can help connect you at all.

01:15:46Patrick Steen

And then secondly, I believe you've seen this before: our Frontier Lunch Club, which we are going to be sponsoring this year. It's got some really great opportunities in partnership with the AGI House. It allows you to connect into the recruiting pipeline and helps you get a leg up on future opportunities if you're looking to connect with different AI researchers and research institutions.

01:16:08Patrick Steen

Also, we are going to be hosting a couple of events. One's going to be a dinner on the 30th. If you'd like to scan the QR code, it just takes 30 seconds to access that WhatsApp. And then also in the back, Alexa is going to be helping out passing out some pizza. She's also a great connection at the AGI House. They have a ton of different event opportunities, as well as connections to pretty much everyone you would need to know to be successful in the Bay.

01:16:35Patrick Steen

And then lastly, we're going to be hosting an event later in October, where we're going to be doing a project competition run on MongoDB, where you could potentially win $1,000. So, yeah. Just wanted to introduce myself and tell you a little bit about those two things, but thank you so much for having us. And yeah, thanks again.
