Stanford CS25: Transformers United V6 I Overview of Transformers
Stanford Online

Disclaimer: The transcript on this page is for the YouTube video titled "Stanford CS25: Transformers United V6 I Overview of Transformers" from "Stanford Online". All rights to the original content belong to their respective owners. This transcript is provided for educational, research, and informational purposes only. This website is not affiliated with or endorsed by the original content creators or platforms.
Watch the original video here: https://www.youtube.com/watch?v=bHSDPgZYie0
All right. Hello everybody, welcome to CS25: Transformers United. This is the sixth iteration of the course, which is pretty crazy; time certainly flies. My friend and previous co-instructor Div Garg is the one who started this class, and he asked me to join him pretty early on, during the first year of my PhD.

As we all know, artificial intelligence (AI) has pretty much taken over the world; it's part of every aspect of our lives these days. We thought that a course where people can hear from experts doing state-of-the-art research and investigation on the architecture behind most modern AI and machine learning systems and models, called Transformers, would be a good way of disseminating knowledge while getting people more motivated and interested in this topic. That's what inspired our class.

The way our class works is that, starting next week, we'll have a speaker, either from academia or industry, come in to talk about the work they're doing with Transformers. We typically have a very diverse set of speakers covering different topics: Transformers applied to computer vision, text and language modeling, biology, neuroscience, and so forth. So definitely look forward to that.

This will be our first introductory lecture, where we'll go over course logistics as well as background on Transformers and a lot of things related to them, including what's potentially coming next.
Yeah, so my name is Steven. I'm a fourth-year computer science PhD student here; previously I was at Carnegie Mellon and Waterloo. Overall, my research interests broadly hover around machine learning, specifically natural language processing (NLP).

I do a lot of work related to language models, and these days also interdisciplinary work with cognitive science and psychology. I'm looking at things like efficient reasoning with small language models in data-limited regimes, human- and cognitively-inspired learning signals for our models, as well as evaluation methods and so forth.
Hello everyone and welcome. I'm Karan, the other co-instructor. I'm a third-year electrical engineering PhD student. I also do some work on language models and efficiency, so things like curriculum learning and, recently, RAG.

My main research interests for the past few years, though, have been in computer vision and recently neuroscience. My current research focus is on foundation models for neuroimaging like fMRI. Previously, I did my undergrad at Cal Poly SLO, and for a year before starting my PhD I was a post-baccalaureate researcher in Stanford Radiology.
Right. This iteration of the course is sponsored by Modal AI House as well as MongoDB, and we are very grateful for their support in keeping this class running.

AI House and MongoDB wanted me to communicate some information: they're partnering to give students, all of you, additional direct access to top founders and researchers in the AI community through the Frontier Lunch Club.

Later this month, there will be an opportunity for interested students to have dinner with a handful of top founders and researchers, and later in the year there will also be a second event with a $1,000 project prize; details will be announced at the dinner.

Additionally, members will get added to AI House's recruiting pipeline, so you'll have access to a lot of startups and companies for both internship and full-time opportunities. It's an application-based process, free to join, but spots are limited. Please scan the QR code to apply; you'll be added to a WhatsApp group with more details.
Okay. Most of the information about our course will be communicated through our course website. Additionally, we have a mailing list as well as a Discord server, so please join those to get the most up-to-date information.

Lectures are Thursdays at this time, here as well as on Zoom. The only homework is basically attendance. In-person attendance is required when the speaker is in person, which is typically the case; otherwise, please attend over Zoom at the very least.

Everyone is allowed three unexcused absences, and we track attendance using a Google Form that will be open while the lecture is running, so please fill it in each week. Please ask questions during and after the presentation, either in the Zoom chat or via Slido for the folks online. Please do not unmute yourself on Zoom unless we allow you to; for people in person, just raise your hand to ask questions.

As I said, there's a Discord server and a Zoom live stream, and all lectures will be recorded, with the recordings released on YouTube approximately two to three weeks afterwards. We have social events and potential one-on-one networking with speakers, so keep your eye out for that. And again, for in-person questions, please raise your hand and we'll get to you.

For the folks on Zoom, again, please do not unmute yourself; send any questions or concerns in the chat. For anybody auditing in person who's not officially enrolled: if seats are not available, please give priority to the students who are enrolled. And please keep behavior appropriate.

What we hope you'll take away from this course is a better understanding of how exactly Transformers work, from attention to scaling and so forth; knowledge of cutting-edge applications in various domains; live exposure to frontier research and speakers who are shaping the field; and an awareness of key limitations and open problems that hopefully you'll be able to contribute to, in order to further improve our AI models and systems.
All right. To start, I'll give a very brief overview of Transformers, what they are (it's the title of the course, of course), and also the direction the field has taken to get to Transformers.

In early ML, around pre-2012, most work consisted of hand-engineered features relevant to the task at hand, maybe classification or some other sort of thing. These were passed into shallow models, which were then trained to make predictions based on these features.

To train your models, you had a generally smallish set of expensive, hand-collected labels that you compared the model's predictions against, forming a feedback loop that allowed you to update the parameters.

Then we gradually moved to supervised deep learning: replacing the hand-engineered features, as the models got bigger, with raw data, that is, the data that was used to generate the features. In this way, you went directly from the raw data to the predictions, cutting out the middleman step of hand engineering.
Going further, we moved to self-supervised learning. Here, even having these expensive labels was considered too much, so models were trained by corrupting the raw data in some way, maybe noising or masking or that sort of thing, and then training the model to reconstruct the original uncorrupted data.

From this, you learn very general representations that can be applied to a variety of downstream tasks: afterwards, you can take your model and fine-tune it to do whatever task you want it to.

As a case study in language, a lot of early work was sentiment analysis: taking in short pieces of text, say reviews, and just trying to classify whether they're positive, neutral, or negative. Here you also had features based on the words, and of course the classification task was pretty simple, just 0/1/2 classification.

Later, similarly to the direction I described, we moved a bit closer to the source. Language is, of course, very sequential and context-rich. If I give you a sentence like "the quick brown fox jumps over the...", you'll know what comes next, because you've seen that sentence so many times in your life.
So, similarly, we started training language models to do next-token prediction. This also lets you make maximum use of raw text data, because you can generate a very large number of pairs like this.
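To make this concrete, here is a minimal sketch of how one piece of raw text yields many next-token training pairs; the whitespace tokenizer is a toy stand-in for the subword tokenizers real models use.

```python
# A minimal sketch: one sentence yields many (context, next-token) pairs.
# Whitespace splitting is a toy stand-in for a real subword tokenizer.
text = "the quick brown fox jumps over the lazy dog"
tokens = text.split()

# Every prefix of the sequence gives one supervised training example,
# so raw text provides labels "for free".
pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
for context, target in pairs[:3]:
    print(context, "->", target)
# ['the'] -> quick
# ['the', 'quick'] -> brown
# ['the', 'quick', 'brown'] -> fox
```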
And as you might have guessed, vision was no different; the directions are pretty similar. Things like autoencoders to try to generate representations that could be applied downstream, and also things like contrastive learning, which we'll talk about later.

In vision, one recent-ish architecture is the masked autoencoder: the same idea I described, except applied to images, where you mask out some part of the image and train the model to reconstruct the missing pieces. One interesting question here is how to do the masking; typically it's done randomly.

However, a random mask might reveal three patches that make the problem very easy (even a human could tell), or three other patches that make it very hard. So one future direction is to find a better balance between these.
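As a rough illustration (not the MAE paper's actual code), random masking at the commonly used 75% ratio looks something like this:

```python
import numpy as np

# Toy setup: a 224x224 image split into 16x16 patches gives a 14x14 grid.
rng = np.random.default_rng(0)
n_patches = 14 * 14
mask_ratio = 0.75                        # the ratio commonly used by MAE

# Randomly choose which patches stay visible and which are hidden.
ids = rng.permutation(n_patches)
n_keep = int(n_patches * (1 - mask_ratio))
visible_ids, masked_ids = ids[:n_keep], ids[n_keep:]

# The encoder only sees the visible patches; the decoder is trained to
# reconstruct the pixels of the masked ones.
print(len(visible_ids), len(masked_ids))  # 49 147
```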
Going back to language, a key component of language models is word embeddings. We understand language very naturally, but models work in numbers. To bridge this gap, you need dense vectors in a high-dimensional space that can represent all the words in the vocabulary.

Typical methods in this space are Word2Vec, GloVe, and FastText, which you can look up in case you haven't come across them. These enable things like arithmetic, and of course learning, in embedding space. For instance, semantic analogies: if you take king minus man plus woman, the resulting embedding should roughly give you queen, the analogous word.
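Here's a toy sketch of that arithmetic with made-up 3-dimensional vectors (real embeddings have hundreds of dimensions); the nearest neighbor of king minus man plus woman is found by cosine similarity.

```python
import numpy as np

# Made-up toy embeddings; real ones come from Word2Vec, GloVe, FastText, etc.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.1, 0.1]),
    "woman": np.array([0.5, 0.1, 0.9]),
    "queen": np.array([0.9, 0.8, 0.9]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# king - man + woman should land near queen in a well-trained space.
query = emb["king"] - emb["man"] + emb["woman"]
print(max(emb, key=lambda w: cosine(emb[w], query)))  # queen (in this toy space)
```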
However, static embeddings, which assign a single vector to each word, have limitations. For instance, "bank" as a financial institution and a river bank would get the same vector even though their meanings are completely different. To solve this, there are contextual embeddings, which consider the surrounding context to assign different vectors to the same word.

Moving a bit closer to transformers, we get to RNNs, recurrent neural networks. These are sequence models that process input step by step while maintaining a sort of memory, or hidden state. You take your input, do some operations on it, put some information into memory and some into the output, and repeat this over time to process a sequence.

However, they suffer from some problems: if your sequence is very long, you're going to forget what you learned at the very first step. Long short-term memory networks, or LSTMs, are a gated variant of RNNs that aim to solve this by better preserving long-range dependencies. You add a few more parameters that better govern when to remember or forget something and when to output something.
So then we get to transformers; I'll give a very quick overview. Self-attention learns what to focus on for each token. In a sequence, the connections between different words can be more or less important, and self-attention learns matrices, query, key, and value, that capture these relationships.

To make this a little less abstract, imagine you're looking for a book in a library. Each book has a summary, a key, that tells you what it's about. Once you find a match between what you're looking for, your query, and the summary, you can access the book to get its information, the value.

In attention, we do this across, say, all the books in the library with a soft matching, so you get a score for every book telling you how relevant it is. Self-attention is just this technique applied to a single sequence: you learn connections from every word in a sentence to every other word in that sentence.
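As a minimal sketch of that soft lookup, here is single-head scaled dot-product self-attention in numpy; the random projection matrices stand in for weights that would be learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 5, 16                        # 5 tokens, 16-dim embeddings
x = rng.standard_normal((seq_len, d_model))

# Projection matrices for queries, keys, and values (learned in practice).
W_q, W_k, W_v = (rng.standard_normal((d_model, d_model)) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Each token's query is scored against every token's key (the "soft match"),
# and the normalized scores weight a sum over the values.
scores = Q @ K.T / np.sqrt(d_model)             # (seq_len, seq_len)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
output = weights @ V                            # (seq_len, d_model)
print(output.shape)  # (5, 16)
```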
One issue here is that models that just do this have no notion of ordering. To get around this, we add positional encodings, or positional embeddings, which tell the model where each word sits. As a very simple example, you could just assign a one to the first word, a two to the second word, and so on, and over training the model would learn to associate these with the ordering of the sequence.
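The original Transformer paper used sinusoidal positional encodings rather than plain integers; here is a minimal sketch:

```python
import numpy as np

# Each position gets a unique, smoothly varying vector that is simply
# added to the token embeddings before attention.
def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]           # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]        # (1, d_model // 2)
    angle = pos / 10000 ** (2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)                 # even dimensions
    pe[:, 1::2] = np.cos(angle)                 # odd dimensions
    return pe

print(positional_encoding(5, 16).shape)  # (5, 16)
# usage: token_embeddings + positional_encoding(seq_len, d_model)
```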
And lastly, you can scale this up with multi-headed attention: on the same sequence, you have multiple different learned attention matrices that together give you a more detailed representation.

So this is the full architecture. I won't go into too much detail, but it essentially combines all the components I talked about into a very well-performing architecture.

So we just talked about RNNs; why aren't they used in the latest models? Transformers are better for various reasons. One is parallelism: RNNs by design go one step at a time, so you have to process sequences in order, which makes things much slower, whereas with attention you can process your entire sequence in one step and speed this up on GPUs.

Additionally, although LSTMs partly solved the vanishing-gradient and forgetting problems, transformers handle long context much better and scale up much better. These days, we have models that can take in on the order of a million tokens.

And lastly, attention allows you, as I showed, to access all previous tokens in the context and draw connections between them, whereas RNNs and LSTMs are limited to whatever they can store in their hidden state.
With all of these architectural advancements, transformers today have spread to all areas of science and technology. We have LLMs, which all of us probably use every day, like GPT-5, Claude, etc. We have many vision models that are getting better and better, some even coming to your phone. We have speech, biology, video, robotics, and so much more.

Large language models in particular are just scaled-up transformers: the architecture I showed, scaled to billions of parameters and incurring heavy computational costs in time and GPUs. However, they perform very well. Generally, they're pre-trained on massive amounts of general text data, the internet to put it simply, with the training objective I mentioned, next-token prediction: given a sequence, just predict the next word.

As you scale transformers up as language models, a lot of emergent abilities come up; there's a seminal paper on emergent abilities. Just by training on general text data, models start to be able to do math, reasoning, and that sort of thing.

And as I just mentioned, these large general models are also very good at few-shot or zero-shot learning. You can give one a completely new task that it has never seen before, but just because it's seen so much data, it's able to generalize and very quickly learn what you want it to do.
All right. Now I'll be talking about pre-training and the importance of data. Pre-training is basically the first phase of training: you have a randomly initialized model, and your goal is to get as much knowledge and as many basic capabilities into it as you can.

Typically this involves training on a large variety of diverse internet text, because so much diverse text means the model can hopefully learn a statistical distribution of language that is similar to the natural distribution of human language.

The fundamental aspect of pre-training is the data. It's the fuel behind our machine learning and AI models, especially for LLMs that are trained on so much text. So how do we maximally leverage data, given its importance?

We'll be talking about a few projects here, ranging from smaller to larger scale. I'll cover two smaller, child-scale language learning projects, and Karan will talk about some larger-scale projects involving things like RAG-considerate pre-training and curriculum learning.
I'll begin by talking about small language models and BabyLM. Humans and neural network language models learn fundamentally differently. As I said, language models are typically trained on large amounts of internet text; they learn by predicting the next token with the highest probability given all the previous ones.

Humans, by contrast, likely learn in more structured, hierarchical ways. For example, we are more goal-driven, and our language decomposes into goals, or steps we want to take to accomplish something. Furthermore, we learn in more interactive ways; we talk to other people. As children growing up, we talk to other kids, our mom, a babysitter, and so forth.

Language models, again, are trained on just pure text from the internet. Furthermore, we're multimodal and multisensory: we continuously absorb information from the world, and this grounds our learning of language, whereas most language models are simply text only. The focus of the two projects I'll talk about is mainly the language-data differences between human learning and language model learning.
So why study small models and data? Why not just use the biggest models that can do everything? Well, firstly, computational costs: not everybody can train a model like OpenAI's GPT-4 or GPT-5; it takes millions of dollars.

Furthermore, small models open up more potential use cases. You can use one on your phone for everyday tasks, in ways that might be much more aligned with your goals than a closed-source large language model like GPT-5.

Another point is that techniques discovered at small scale can potentially be applied at larger scales to improve their efficiency and capabilities as well. And lastly, a greater understanding of how to make small language models more effective might also lead to a greater understanding of the cognitive models of how humans learn language so efficiently.
The first project I'll talk about is a recent paper we just worked on, called Baby Scale, investigating models trained on individual children's language input. Human children, again, learn from far less data than large language models, orders of magnitude less.

A human child from ages 0 to 13 is exposed to approximately 10 to 100 million words of language data, far less than large language models need to achieve similar capabilities such as generalization, abstraction, and reasoning. So we wanted to ask: why exactly is this the case?

Thankfully, we had a dataset with transcripts of individual children's language data. So we wondered: could we train language models on the actual transcript data that individual children are exposed to (and what they say), see how that scales, and see how performance varies across individual families? We also wanted to look at which properties of the language data children are exposed to might make some of them more effective learners than others.
On the left is a scaling plot across four different benchmarks: Zorro tests grammaticality and linguistic abilities; WS is word similarity, so WS and COMPS test semantic understanding; and EWoK tests basic world knowledge.

The four graphs on the left are our models trained on individual families' data. You'll see there's quite a lot of variation, but there is some positive scaling: even at such extremely small scales of data, performance scales with the quantity of data. The trends are positive, but the signal is quite noisy, and performance varies quite heavily between models trained on different families' data.

On the right, we trained on the synthetic TinyDialogues data. This is a larger dataset, up to 200 million tokens, and it's synthetic and cleaner. With this cleaner, more controlled data, the scaling laws are cleaner and hold better.

If we extrapolate and look at broader scaling curves, up to RoBERTa-base, which was pre-trained on, I believe, around 20 or 30 billion tokens, you'll see that our smaller experiments do lie roughly along the broader scaling curves.

One other thing I forgot to mention is that the scaling depends on the task: it scaled better for Zorro (grammaticality) than for world knowledge, for example. This suggests that individual children's learning environments, at least when training language models, might not be enough to acquire general basic world understanding; you might need more external data to achieve that.
Furthermore, we conducted a very comprehensive linguistic analysis of these individual families' training data, correlating properties of the data with the final performance of the models trained on it. We found that it wasn't just data scale that mattered, but a lot of other qualities: things related to semantic diversity, the number of conversations, the distributional divergence among the different families' datasets, and so forth.

Overall, we found that better-performing datasets tended to be more structured, more diverse, and richer in interaction and coverage. This lines up with child language research emphasizing that quality matters more than quantity, especially at such small scales.

So, overall, child-scale language model learning on such extremely limited amounts of data is possible, but it's quite task-dependent and depends on the quality and other aspects of your data more than the amount, at such small scales. And studying child-scale data in general might allow us to build more capable and efficient small language models, and possibly also to understand human language acquisition better.
The second project I'll talk about is also at smaller scales, but it investigates training bilingual and multilingual small language models. In particular, there's something called the confusion hypothesis, mainly applied to humans: the hypothesis that children growing up in multilingual environments might actually have more difficulty learning, since they're constantly required to balance so much conflicting information from different languages.

This might decrease their ability to learn a single primary language. So we wondered: do these effects actually apply to language models? Secondly, does the exact exposure structure, the way the bilingual input is presented to the model or the child, affect performance? And thirdly, do these effects vary with model and data size?

To answer these, we constructed different types of datasets. We have toplines, which are 100 million tokens of English and of Spanish data, and baseline models trained on half of this, in terms of English only.
On top of this, we added multilingual data in a second language, Spanish in this case: 100 million tokens mixed, 50 million of English and 50 million of Spanish, learned simultaneously. We're not training on 50 million English tokens and then 50 million Spanish tokens, but rather on the 100 million tokens together. We also look at code-switching, the interleaving of the two languages at both the sentence and the word level, and how that might affect learning. A sketch of the mixing setup follows below.
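A minimal sketch of what that simultaneous mixing might look like (an assumption about the recipe, not the paper's exact pipeline):

```python
import random

# Placeholder document lists standing in for the real 50M-token corpora.
english_docs = [f"en_doc_{i}" for i in range(1000)]
spanish_docs = [f"es_doc_{i}" for i in range(1000)]

# Simultaneous learning: shuffle both corpora into one training stream,
# rather than training on all English first and all Spanish second.
mixed = english_docs + spanish_docs
random.seed(0)
random.shuffle(mixed)
print(mixed[:5])   # both languages interleaved throughout training
```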
In terms of the first research question, we found that the perplexities of the multilingual model are low in both languages, very comparable to, and even very slightly lower than, the English baseline, showing that adding a second language doesn't really affect behavior or learning of the first language.
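For reference, perplexity is just the exponential of the mean per-token negative log-likelihood; the probabilities below are made up for illustration.

```python
import numpy as np

# p(token | context) assigned by the model to each held-out token (toy numbers).
token_probs = np.array([0.20, 0.05, 0.50, 0.10])
perplexity = np.exp(-np.log(token_probs).mean())
print(round(perplexity, 2))  # ~6.69; lower means the text is less "surprising"
```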
This is also shown through our benchmarks: the multilingual models perform pretty much on par, across different seeds, with the English-only models. Secondly, we looked at different exposure schemes: for example, having mom always talk in English and dad always in Spanish, versus randomizing language by speaker, as well as sentence-level versus word-level code-switching. Again, we found no significant differences here.

Thirdly, we looked at how these effects vary with scale. We found that using a smaller model didn't really affect performance, whereas dropping all the way down to 20 million words instead of 100 million did significantly affect performance. But overall, our trends held up.

So, overall, we find there's no interference: training small language models on two languages doesn't negatively affect performance on a single language, which is quite cool. Further, the exposure structure is surprisingly irrelevant; it doesn't really matter how exactly you interleave the two languages, and it doesn't degrade learning. And data scale matters more than model scale for multilingual learning at this size. Those are the overall takeaways; feel free to check out the paper.
Okay, next I'll talk a little bit about work in retrieval-augmented generation. Firstly, what is retrieval-augmented generation, or RAG? Typically, as we described, you have a large set of web data, and you train your LLM to model that data and go from user queries to outputs.

In RAG, we augment the system a little bit by adding a set of specific or domain-knowledge documents with a retriever. Whenever you get a user query, the retriever picks the most relevant documents out of this large set and feeds them into the LLM's context, which the model can then use to improve the outputs it generates.
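Here is a minimal sketch of that pipeline; the hash-based embed function and the prompt format are hypothetical stand-ins for a trained embedding model and a real LLM call.

```python
import numpy as np

docs = [
    "Transformers process sequences in parallel using self-attention.",
    "fMRI measures changes in blood oxygenation across the brain.",
    "GloVe produces static word embeddings from co-occurrence statistics.",
]

def embed(text):
    # Hypothetical stand-in: deterministic bag-of-words hashing into 64 dims.
    v = np.zeros(64)
    for w in text.lower().split():
        v[sum(ord(c) for c in w) % 64] += 1.0
    return v / (np.linalg.norm(v) + 1e-9)

doc_vecs = np.stack([embed(d) for d in docs])

def retrieve(query, k=1):
    sims = doc_vecs @ embed(query)     # cosine similarity (unit vectors)
    return [docs[i] for i in np.argsort(-sims)[:k]]

query = "How does an fMRI scan work?"
context = retrieve(query)
# The retrieved documents are placed in the LLM's context window.
prompt = f"Context: {context}\n\nQuestion: {query}\nAnswer:"
print(prompt)
```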
One of our works studied RAG scaling: how much RAG should you use? If you have a fixed budget of, say, 100 billion tokens, how should you allocate it between pre-training and retrieval? If you plot retrieval tokens against pre-training tokens, for every budget there is some optimal allocation, and together these form an optimal frontier.

As a very quick overview of the results on the right: we found that small models benefit much more from RAG. As you scale up retrieval, their performance keeps improving, whereas larger models saturate very quickly.

Described in figure form, we plot pre-training tokens saved per retrieval token, essentially asking how much pre-training it takes to match the performance gain from retrieval, or vice versa. Looking at this, we found a crossover point: there's a minimum amount of pre-training you need to do before your model can effectively use any RAG you give it.
For us, this occurs at about four pre-training tokens per parameter: if you have a 1-billion-parameter model, you need about 4 billion pre-training tokens to reach that point.

Also, on the right, we found that the improvement per 1 billion retrieval tokens varies with model scale. A very small model, around 30 million parameters, benefits much more from RAG than a large model like 3 billion parameters, which, with sufficient pre-training, sees almost zero benefit; its curve goes down on the graph, as you can see.

This makes sense, though: a large model has already memorized so much information that it can't get much out of RAG over general web data. Of course, if you add domain-specific documents, that would be a different story. That's a very quick overview of this project, but if it seems interesting, you can scan this QR code.
Next, I'll talk a bit about curriculum learning approaches and scaling up models using them. As Steven mentioned, as the cost of pre-training grows, there is more interest in improving learning efficiency and in training models with less than a trillion tokens, rather than the insane scales used these days. Also, typical pre-training differs significantly from how we learn.

It can also result in small models that struggle at reasoning. So how can we bridge these gaps? One idea I had: we, as humans, learn from curriculums in school, and alongside that, our brains are also physically growing in size. Can we do the same thing for language models? Start with a small model and very simple data and scale up both in tandem, growing the model as the data gets harder.

Some prior work in this space includes synthetic datasets: Steven used TinyDialogues, and there's also TinyStories, with very simple, 5-year-old-level stories. There's also work in data filtering, on the quality end: curating large datasets that are deduplicated and very clean.

On the model-growing end, there was prior work on stacking: take a small model, train it on some amount of data, then copy the encoder layers and train further; does that improve performance? Another paper, MIDAS from 2024, showed that when you do this stacking, adjacent layers essentially learn the same things, so they're not very useful, which motivates new and better approaches to stacking and model growth.
What we came up with, as I mentioned, is starting with a small model, say half the final size, and relatively easy data, training it a little, and gradually adding layers throughout training until you reach your final model size, which you train on your hardest data for the task you care about.
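A minimal sketch of that growth loop in PyTorch (an illustrative reading of the idea, not the paper's implementation):

```python
import torch.nn as nn

def make_block(d_model=256, n_heads=4):
    return nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)

# Start at roughly half depth and train on the easiest data.
layers = nn.ModuleList([make_block() for _ in range(6)])
# ... train on easy data ...

# Periodically append fresh layers as the data curriculum gets harder.
for _ in range(6):
    layers.append(make_block())
    # ... continue training on progressively harder data ...

print(len(layers))  # 12: final depth, trained last on the hardest data
```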
We found that this approach, which we call CGLS (curriculum-guided layer scaling), achieves better performance than just training your largest model size on all of your data, and even than using the same easy-to-hard curriculum with the largest model size trained throughout. These results also scale up: moving from 2 billion to 20 billion tokens, the performance gap gets even larger.

As a TL;DR: testing at the 124-million-parameter and 1B scales, combining model growth with a structured curriculum allows you to unlock the benefits of model growth and stacking techniques. But not all tasks benefited equally from this approach and curriculum.

And although I didn't show it here, at the 1-billion-parameter scale, the perplexity of the random baseline was better than that of the CGLS model, despite the CGLS model doing better on reasoning tasks.

There's, of course, a lot of future potential in this space: tuning hyperparameters further, and exploring other notions of curriculum, since what we think is easy or hard might not be easy or hard for a language model. There's also applying this to other domains, like medical imaging. And, like the other projects, there's a paper if you'd like to check it out.
Now we'll talk about the overall takeaways from these four projects studying data, especially for pre-training. Overall, we see that data effectiveness is not just about the volume or quantity of data, but about its quality, its structure, and how exactly it's used. So data selection, and how you actually use data strategically, are very important considerations.

We saw with the Baby Scale project that individual children's learning environments vary, and the differences are driven more by data composition and quality than quantity. With the bilingual project, we saw that models, too, can learn two languages effectively without confusion or detriment to the first language, regardless of how they're exposed to those two languages.

Third, with the curriculum-guided layer scaling project that Karan just talked about, we saw that growing your model alongside a data curriculum improves downstream performance up to the 1-billion-parameter scale. And with the RAG-considerate pre-training scaling laws project, we saw that optimal performance comes from balancing parametric learning through pre-training with external memory through retrieval.

The overall takeaway is that all of this research underscores that effective language modeling isn't just about amassing large amounts of data, but about smarter data utilization strategies that effectively harness its structure, quality, and characteristics. By continuing to refine and explore these data-centric approaches, especially for pre-training, the future of LLM training promises smarter, more efficient, and more adaptable models.
Right. Now we'll get to the second major topic, which is post-training. After we've effectively pre-trained a randomly initialized model, for example on a lot of internet data, and it has basic capabilities, what now? How do we adapt it to specific task scenarios, users, domains, and so on?

This is exactly what post-training looks at. There are many strategies: fine-tuning, prompt-based techniques, and, as Karan and Somu talked about, retrieval-based approaches like RAG.
There's a major inference-time approach called chain-of-thought prompting. When you want the model to answer a question, instead of having it output the answer directly, you get it to actually think step by step, like a human, before answering.

For example, we decompose more difficult problems, like math questions, into smaller subproblems or steps, and that's how we reason through and solve them.

The chain-of-thought paper basically finds that you can do the same thing with language models. This suggests that, deep in its weights, the model knows more about the problem than it lets on when you just prompt it for the answer.

Here's an example: you have a basic math problem. Standard prompting directly asks for the answer, whereas chain of thought has the model actually think and output its reasoning steps before it generates the final answer, conditioned on those reasoning steps.
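The canonical example from the chain-of-thought paper looks roughly like this (exact exemplar phrasing varies):

```python
# Standard prompting asks directly for the answer.
standard = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. "
    "How many balls does he have now?\nA:"
)

# Chain-of-thought exemplars show intermediate reasoning before the answer,
# so the model's final answer is conditioned on its own reasoning steps.
chain_of_thought = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. "
    "How many balls does he have now?\n"
    "A: Roger starts with 5 balls. 2 cans of 3 balls is 6 balls. "
    "5 + 6 = 11. The answer is 11."
)
print(chain_of_thought)
```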
There have been different works extending chain of thought, for example tree of thoughts: rather than a single reasoning path, the model considers multiple reasoning paths and actually evaluates which one is more effective, using methods like majority vote and so forth.

Furthermore, you can offload your intermediate reasoning steps to other tools, such as programs. For example, the model can generate code as intermediate reasoning steps, which can then be executed to help solve the problem, using Python or some other language or interpreter. This provides more precise steps and final answers, leading to higher performance, especially on things like mathematics.

Also, as I said, when there's a very hard problem, we typically want to decompose it into smaller, easier subproblems, and that's exactly what Socratic questioning does. It uses a self-questioning module, in which the language model proposes subproblems related to the original problem, then uses a divide-and-conquer algorithm to solve these subproblems recursively and arrive at an answer to the overall original problem.

There are also other works that look at formulating compositional tasks as computation graphs. Again, this is similar to breaking things up into subprocedures or subproblems.
Next, there's another line of post-training approaches that use reinforcement learning and rewards. The most common is reinforcement learning from human feedback (RLHF), a technique that trains a reward model directly from human feedback.

For example, if you have two different responses that a model gives, you show them to a human and have them rank or choose which response they prefer, and this can be used as a reward to post-train your model to be more effective.

One further innovation is DPO, or direct preference optimization. This is an RL-free approach: it directly trains the model to rank preferred human-chosen outputs higher, without requiring a separate reward model. It's a pretty popular approach.
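A minimal sketch of the DPO loss for a single preference pair; the four log-probabilities are placeholder numbers that would come from the policy being trained and a frozen reference model.

```python
import numpy as np

beta = 0.1                                   # strength of the KL-like penalty
logp_chosen_policy, logp_rejected_policy = -12.0, -15.0
logp_chosen_ref, logp_rejected_ref = -13.0, -14.0

# How much more the policy prefers the chosen response than the reference does.
margin = (logp_chosen_policy - logp_chosen_ref) - (
    logp_rejected_policy - logp_rejected_ref)

loss = -np.log(1.0 / (1.0 + np.exp(-beta * margin)))  # -log sigmoid(beta*margin)
print(round(loss, 3))  # ~0.598; minimizing it favors the chosen response
# No separate reward model is needed: preferences shape the policy directly.
```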
Humans, though, are subjective and biased, and they're typically costly: if you need a bunch of humans to rank thousands or millions of outputs, that can be infeasible. So another line of work does this with AI: it uses off-the-shelf LLMs to give preferences over which answers they prefer, and uses these to tune the outputs.

Then DeepSeek came up with GRPO, or group relative policy optimization. Rather than simply ranking a binary pair of responses, it scores responses relative to a group of multiple different responses. This provides richer, more fine-grained feedback, which has been shown to improve performance on many tasks such as mathematics.
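The core of GRPO can be sketched as normalizing rewards within a group of sampled responses (the rewards below are made up):

```python
import numpy as np

# Rewards for several responses sampled for the same prompt.
rewards = np.array([0.0, 1.0, 1.0, 0.2, 0.8])

# Group-relative advantage: compare each response to its group's mean,
# with no separate critic/value network.
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
print(advantages.round(2))  # above-average responses get reinforced
```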
The last part I want to touch on for reinforcement learning is process supervision, or process rewards. Again, many problems you want a language model, robot, or whatever to solve involve multiple steps. So rather than only having a reward at the end, it can be more effective to give intermediate reward signals to your model.

That's exactly what process supervision is: you label or evaluate intermediate reasoning steps, and then train models to produce good reasoning traces, so their individual steps are more accurate. This leads to better final answers, of course, but also to better overall reasoning behavior and performance, and it reduces things like reward hacking.

Next, in a similar vein, I'll talk about AI agents and, specifically, self-improvement. What is an AI agent? You've probably used one, Claude Code or that sort of thing. It's a system that perceives its environment, makes decisions (like this chain of thought), and takes actions toward achieving a goal that is generally given by a human, like you telling Claude Code to build an app.
One big benefit of agents is that they can reflect on and improve their own work. The loop looks something like this: given an input query, the agent gathers all of its inputs (the query, plus any environmental factors it might need to consider), reasons, plans, and decides what it wants to do, takes action, and then observes what happened.

If you've ever used Claude Code, you might find that it runs a script and checks whether it compiles successfully before reflecting and, if needed, repeating the same loop to update its outputs. In this way, models can currently self-improve in a limited sense: they reflect on their own output and try to iteratively improve it.

So models in this way learn from past mistakes and adjust their future responses based on prior failures. Another thing you can do is add a memory store, like the memory GPT keeps these days: the agent can store its past mistakes and adjust the responses it gives you later based on what it learned from those loops.

And you can combine this with tool use, API calls, or retrieval, like we talked about, to further improve your agents: the agent generates a reasoning plan that may involve calling an external tool, running a web search, or making a database query, and incorporates all of those outputs into its final response. A skeleton of this loop follows below.
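A minimal skeleton of that loop; llm, run_tool, and goal_reached are hypothetical stand-ins for a model call, tool execution (scripts, web search, database queries), and a stopping check.

```python
# Perceive -> reason/plan -> act -> observe -> reflect, repeated until done.
def agent(query, llm, run_tool, goal_reached, max_steps=5):
    memory = []                                    # past plans and observations
    for _ in range(max_steps):
        plan = llm(query=query, memory=memory)     # reason and pick an action
        observation = run_tool(plan.action)        # act, e.g. run a script
        memory.append((plan, observation))         # reflect on the outcome
        if goal_reached(observation):
            break
    return llm(query=query, memory=memory, final=True)
```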
Okay. We've talked a lot about language and all the techniques that go into improving language models and making them what they are today. Next, we'll move a bit onto applications of transformers beyond language.

As we've all probably seen, there are many; AI has basically started to take over every field. There's DeepMind's AlphaFold for protein structure prediction, lots of medical models and medical language models, AlphaGo for the game of Go, image generation, and so on.

Speaking of image generation, I'll talk really quickly about vision transformers and how they work. Before vision transformers, there were CNNs, convolutional neural networks. However, when transformers came along and started yielding very good performance on language, people started thinking about ways to apply them to images.

If you simply took every single pixel and treated it as a token, even a very small 28x28 image flattens to almost 1,000 pixel tokens, and since attention cost grows quadratically with sequence length, realistic image sizes become expensive even for today's hardware. Hence, the Vision Transformer paper suggested breaking images down into patches and encoding those patches into much shorter sequences that a transformer can learn from. This worked very well for various tasks and, as I touched on toward the beginning, also serves as a very good representation learning approach.
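The patchify step can be sketched in a few lines of numpy; the 224x224 image and 16x16 patch size match the standard ViT configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))   # toy image

p = 16
n = 224 // p                                 # 14 patches per side
tokens = (image.reshape(n, p, n, p, 3)
               .transpose(0, 2, 1, 3, 4)
               .reshape(n * n, p * p * 3))   # (196, 768)
print(tokens.shape)
# 196 patch tokens instead of ~50k pixel tokens; a learned linear layer
# then projects each 768-d patch to the transformer's model width.
```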
However, if CNNs worked so well, and for so many years, why use transformers? A couple of reasons: transformers are very flexible architectures with minimal inductive priors, so they make very few assumptions about the input data.

In contrast, CNNs assume locality, that nearby pixels are related, in that they process images with filters; say you have a 4x4 filter, which assumes that relations exist at that scale. These assumptions help CNNs learn effectively with limited training data; however, they also bottleneck their performance.

Transformers, on the other hand, have to learn from very large datasets, so you can't really train them at the same small data scales as CNNs. However, because they make no such assumptions, they can learn much better from these larger datasets.

Another big shift in the field was CLIP: aligning text representations with image representations. They did this pretty simply and intuitively: you encode your text into a vector and your image into a vector through separate models, and then update both models to align the two, using paired image-text data.
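A minimal numpy sketch of the symmetric contrastive loss CLIP uses; the embeddings are random placeholders for the outputs of the two encoders.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 4, 32                                  # batch of 4 image-text pairs
img = rng.standard_normal((N, d))
txt = rng.standard_normal((N, d))
img /= np.linalg.norm(img, axis=1, keepdims=True)
txt /= np.linalg.norm(txt, axis=1, keepdims=True)

logits = img @ txt.T / 0.07                   # cosine similarity / temperature

def cross_entropy(logits, labels):
    logits = logits - logits.max(axis=1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

labels = np.arange(N)                         # i-th image matches i-th caption
loss = (cross_entropy(logits, labels) + cross_entropy(logits.T, labels)) / 2
print(round(float(loss), 3))
# Training pulls matched image/text pairs together, mismatches apart.
```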
One application of this sort of thing that I work on is in neuroscience: transformers for neuroscience. With fMRI, we can capture changes in blood oxygenation across the brain. However, these are very noisy, very high-dimensional signals, more like correlated fluctuations than the robust activations a deep-brain electrode might give you. Therefore, what generally matters isn't the absolute values of the signals, but the correlations between them.

One line of work in this field tried to cluster the different parts of the brain into networks. They found that certain numbers of networks were more stable than others, and settled on 7 and 17. If we decompose the brain into seven broad networks, we get a hierarchy that includes your visual network (processing vision), the salience network (which regulates attention and your senses), and the DMN, the default mode network, which is associated with daydreaming.

If you want to apply transformers to this sort of data, one simple thing you could do is train with random masking on the data, and this works okay at large scales. However, we wanted a more interpretable, better architecture, so we took this notion of dividing the brain up into networks and gave it to the model as a prior.

Instead of predicting random patches masked out throughout the brain, we mask out an entire network, say all the information in your visual network, and see if the model can predict it given the information in the rest of your brain. This allows us to learn more robust embeddings that give us insights into diseases like Alzheimer's disease. A sketch of this masking scheme follows below.
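An illustrative sketch of network-level masking (my reading of the setup, not the paper's code): parcels are labeled with one of the seven networks, and a whole network is hidden at once.

```python
import numpy as np

rng = np.random.default_rng(0)
n_parcels, n_timepoints = 100, 200
signal = rng.standard_normal((n_parcels, n_timepoints))  # toy parcel signals
network_of = rng.integers(0, 7, size=n_parcels)          # 7-network labels

VISUAL = 0
masked = signal.copy()
masked[network_of == VISUAL] = 0.0      # hide the entire visual network
# The model is trained to reconstruct signal[network_of == VISUAL] from
# `masked`, i.e. predict one network's activity from the rest of the brain.
```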
On the bottom left, you can see that the embeddings from this model cluster by disease status: CN is control, and MCI is mild cognitive impairment, which in old age generally progresses to Alzheimer's disease in those prone to it. Additionally, with this architecture we can probe dependencies between different parts of the brain: how connected is your visual system to, say, your daydreaming network?

With this, we get pretty plots like these, which show all the connections between different parts of the brain, and also how predictable each network is from the information in the rest of the brain. We see that the visual network is quite predictable from the rest of the brain, whereas others, like subcortical regions involved in memory, are more separate; you can't really predict them from the rest of the brain.

Also, since we have data from multiple different groups, including people with Alzheimer's disease, we can see where changes happen inside the brain. We find that the DMN, the daydreaming network, is disproportionately affected compared to the rest of the brain. This might help inform drug development, say, or give us more mechanistic knowledge of how the disease progresses.

If you find this sort of work interesting, you can read the paper as well. So that's a very different application of transformers, to medical imaging.
So, the future: what's next? We've seen that transformers have pretty much pervaded all of our lives; however, they can still enable a lot more applications. Increasingly, we have domain-specific foundation models in medicine, law, and other fields.

They're also increasingly being integrated into the real world: personalized education in some countries, advancing healthcare by taking load off doctors, and more fun applications as well, like interactive entertainment and gaming, such as having an NPC talk to you as an AI.

However, there are still many missing ingredients for AGI, or whatever it is we're all chasing. One is memory: models right now still have very limited memory, unlike humans, who have fast recall and can access very large stores.

To fix these things, we also have to reduce computational complexity; it's not very sustainable to train on trillions of tokens with trillions of parameters. There are also things like allowing models to self-improve further, beyond what agents can do: actually updating their parameters and getting better at tasks as they come across them, plus more autonomy and long-horizon decision-making.

And throughout all of this, we have to make sure to imbue them with emotional intelligence and social understanding, and align them with our values so they don't treat us as adversaries.
There's also the direction of minifying models: on-device LLMs, like we talked about. If we can make smaller models much better, we can run them on, say, our phones with similar performance and have them help us more with daily tasks. In the future, we may even be able to fine-tune models locally.

In a similar vein, there are many limits to scaling. Going up in data size or model size already shows diminishing returns from scaling alone; training on web data as we do today probably won't get us to AGI. So there are shifts toward more post-training methods, like the RLHF and RLAIF we talked about, and toward improving data quality, through curriculums, or maybe by studying how humans learn, as in Steven's work.

However, a risk of all these post-training techniques like RLHF is that if you train the model too much in post-training, it might forget what it learned in pre-training and get worse at a task rather than better. So there's still ongoing work to break through the scaling-law limit.

This might happen through new architectures, higher-quality data, or improved training procedures, and, as I mentioned, also through making smaller models more advanced.
Great. So there are still a lot of remaining weaknesses and challenges. Karan touched on reaching limits with scaling and on making these models more computationally efficient, so we can run them on our phones, and maybe even train them on our phones in the future.

Another big issue that remains, especially in language models but in models in general, is hallucination. What is hallucination? We typically think of it as when a language model provides false or incorrect information: it makes up facts, citations, and details that don't exist, while usually sounding quite confident about it.

It fails to reflect uncertainty when knowledge is missing; it doesn't really know how to say "I don't know" and instead basically pretends it knows everything.
So why does this happen? There are a few potential explanations. One is that models are optimized for plausibility, not truth: they're trained to produce the next token with the highest probability, and that is not necessarily true text, just the most probable text.

There's also a lack of grounding. This relates to what I'll talk about later, world models: language models don't really have a notion of the world or any external grounding. Furthermore, a lot of the time when users talk to models, they're ambiguous or give prompts without enough information, which can confuse the model and prompt it to produce hallucinated information.

Again, hallucination reduces things like trust and reliability, and especially in high-stakes domains like law and healthcare, you don't want to rely on a model that's just saying whatever, including things that are incorrect.

Furthermore, a lot of the time it's hard to detect, again because the model sounds confident; how can you be sure whether it's hallucinating or not? Also, not all errors are hallucinations, and what counts as true depends on the context as well; these are things I'll touch on. But overall, there's something missing: what exactly is a hallucination? We still don't have a concrete, unified definition.
That's what one of our recent works looks at: we actually propose a unified definition of hallucination. Again, existing definitions differ; it depends on things like the field or the task at hand.

For example, in summarization, a hallucination is when the language model produces an output that is not faithful to the source document, whereas in open-domain QA it means producing a factually incorrect response, something that is not true in the real world.

In the case of agents, hallucination means, for example, taking incorrect actions based on, say, hallucinating a button on the webpage that doesn't exist. So the same output can be a hallucination in one setting but correct in another.
Our core idea is that hallucination is basically about world models. Your large language model has learned a sort of internal world model through pre-training on large amounts of data. On the other hand, for the task at hand, you have an external, ground-truth reference world model.

For example, in summarization, this reference world model is described by your source document, whereas in open-domain QA, the reference world model is the real world itself.

The world that the language model has learned may match this reference world model, or it may contradict it. Hallucination occurs when the model's learned world model contradicts the reference, ground-truth world model. So it's not just about producing a wrong answer, but about having wrong beliefs about the world itself. In general, we define hallucination as a world-modeling error.
We break this down into a structured mathematical framework. Every hallucination depends on three things. First, as I said, a reference world model, which defines what is actually true; this can be a document, a database, the real world, and so forth, depending on the context and situation.

Second, you have a view: what the model can actually see. For example, information it has retrieved from a database is part of its view. Third, you have a conflict policy: when you have different sources of information that may disagree, you need a way of resolving the conflicts.

For example, we know in the real world that Harry Potter is a boy, but in a specific situation, Harry Potter might be a girl, say when the user is role-playing or making up their own story. In that case, the model should align with what the user establishes as true, which is that Harry Potter is a girl.

So you need a way of resolving conflicts when sources disagree. A hallucination, then, is when a model outputs something that contains a claim that is false given these three components. A minimal formalization follows below.
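A minimal sketch of this framework in code; the names, types, and set-of-claims representation are illustrative, not the paper's notation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Setting:
    reference_world: set    # claims that are true in the reference world model
    view: set               # what the model can actually see
    resolve: Callable       # conflict policy: (world, view) -> resolved truth

def is_hallucination(output_claims: set, s: Setting) -> bool:
    truth = s.resolve(s.reference_world, s.view)
    return any(claim not in truth for claim in output_claims)

# Role-play example: the user's story is the truth, whatever the real world says.
story = {"Harry Potter is a girl"}
setting = Setting(reference_world=story, view=story,
                  resolve=lambda world, view: world)   # "the story wins" policy
print(is_hallucination({"Harry Potter is a boy"}, setting))  # True
```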
This is a unified definition because we found that pretty much all existing hallucination definitions, and cases of hallucination across different tasks and situations, can be placed under this framework.

For example, in summarization, again, our reference world is the source document; what the model sees, its view, is that document; and our conflict policy is that the document is the truth: it doesn't matter what is real in the real world, the document is your source of truth. The hallucination type is an intrinsic contradiction, where the language model produces an output containing a claim that contradicts the source document.

For agents, on the other hand, the environment, the web page the agent is exposed to, is the truth, and the agent hallucinates when it believes its environment is different from what it actually is. So our unified definition encompasses all the prior, different definitions across domains and tasks.

This also allows us to separate different kinds of errors. Having a grounded definition like ours, where hallucination means a wrong belief about the world, separates it from things like planning errors by agents, where the beliefs are right but the agent takes the wrong action.

There's a clear difference there, which is important to distinguish: as I said, clicking a non-existent button is a hallucination, whereas choosing a bad strategy is a planning error. The takeaway is that hallucination is not the same as every type of model error.
This definition matters because it makes assumptions explicit. In the future, when people talk about hallucination or investigate mitigation techniques, they should define the reference or ground-truth world model, the view the model sees, and the conflict resolution policy.
This allows for better comparisons across benchmarks, consistent definitions, and explicitly stated assumptions.
Furthermore, this enables the construction of scalable benchmarks. We're currently working on an extension to this paper, a benchmark called Halo World. Having a mathematically grounded definition means you can produce many examples of hallucination through synthetic environments,
by manipulating the three aspects I talked about: the reference world, the view, and the conflict resolution policy. Because they're explicitly defined, you can manipulate them mathematically and, as a result, construct a large-scale variety of synthetic hallucination examples. These can be used for evaluating hallucination in models, as well as for training better hallucination mitigation techniques in the future.
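As a rough sketch of what such a generation loop might look like; the actual Halo World pipeline is not described in this talk, so the fact tables, distractors, and policy label below are purely hypothetical:

```python
import random

# Hypothetical toy fact store; a real pipeline would draw these from a
# knowledge base or simulated environment.
FACTS = {"capital_of_france": "Paris", "capital_of_japan": "Tokyo"}
DISTRACTORS = {"capital_of_france": "Lyon", "capital_of_japan": "Osaka"}

def make_example(key: str, rng=random) -> dict:
    # Reference world: the true fact. View: either the true fact or a
    # deliberately corrupted one (e.g. simulating a stale retrieval).
    corrupted = rng.random() < 0.5
    view_value = DISTRACTORS[key] if corrupted else FACTS[key]
    return {
        "reference_world": {key: FACTS[key]},
        "view": {key: view_value},
        "conflict_policy": "view_is_truth",  # e.g. "the document is truth"
        # Under this policy, echoing the view is correct; echoing the
        # reference world when they disagree counts as hallucination.
        "correct_answer": view_value,
    }

print(make_example("capital_of_france"))
```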
So feel free to check out our paper; the benchmark will be released later on. In general, again, we define hallucination as incorrect world modeling, which depends on what is true (the ground-truth reference world), what the model sees, and how conflicts should be resolved.
The key implications are reducing confusion across papers by making assumptions explicit, and enabling better evaluation benchmarks and mitigation techniques.
Next, Karan mentioned continual learning briefly earlier; I'll first talk about long-term memory.
A lot of current models are still stateless: there's no persistent learning across interactions. What we want is some way of storing information across sessions. Like the human brain, which has a memory system built in, we're able to remember things.
That lets us retrieve and reuse past experiences, and we want models to be able to do that as well, updating their knowledge over time. This would allow them to personalize and adapt to situations as they change, and support things like continual learning and long-horizon reasoning and planning.
Today, a lot of memory is stored in external systems, such as vector databases, that we retrieve from. Sometimes we simply summarize past information and keep it as a compressed part of the context. We also use structured memory data structures, things like lists, graphs, and so forth.
What's missing is more reliable memory updating. When new information that we want to store contradicts previous information, what happens? Do we override the previous information? Do we store them as two separate pieces of information? And when memory grows very long and large, how do we prune it? Which items do we throw out: the oldest ones, or the least important ones? And how do we even define which ones are most important?
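To make these design choices concrete, here is a toy external memory that commits to one answer for each open question: overwrite on conflict, and evict the least important, oldest item when full. These policies are illustrative choices for the sketch, not recommendations from the lecture:

```python
import time
from dataclasses import dataclass, field

@dataclass
class MemoryItem:
    key: str
    value: str
    importance: float
    timestamp: float = field(default_factory=time.time)

class Memory:
    def __init__(self, capacity: int = 100):
        self.items: dict[str, MemoryItem] = {}
        self.capacity = capacity

    def write(self, key: str, value: str, importance: float = 1.0):
        existing = self.items.get(key)
        # Conflict policy: overwrite the contradicted fact. Alternatives
        # would be keeping both versions, or asking the model to merge
        # them -- exactly the open question discussed above.
        if existing is not None and existing.value != value:
            print(f"conflict on {key!r}: {existing.value!r} -> {value!r}")
        self.items[key] = MemoryItem(key, value, importance)
        if len(self.items) > self.capacity:
            # Pruning policy: evict the least important, oldest item.
            victim = min(self.items.values(),
                         key=lambda m: (m.importance, m.timestamp))
            del self.items[victim.key]

    def read(self, key: str) -> str | None:
        item = self.items.get(key)
        return item.value if item else None

mem = Memory(capacity=2)
mem.write("user_name", "Alice", importance=2.0)
mem.write("favorite_color", "blue", importance=0.5)
mem.write("user_name", "Alicia")  # contradiction: overwritten
print(mem.read("user_name"))      # Alicia
```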
This relates to lifelong or continual learning. Ideally, we want AI systems that keep learning after deployment: once trained and deployed into the real world, interacting with users, they should learn through implicit feedback and real-world interactions and experiences.
And ideally this is infinite and permanent, updating the model weights, which are analogous to the brain; the weights are the brain of the model. A lot of the ways models currently learn at inference time are inference-time only and context only, so nothing is permanent: the brain, the weights of the model, isn't updating.
This leads to a debate, which I won't get too much into here: do we truly need parametric updates to the weights for a model to be continuously learning, or is in-context learning enough? I would argue that true continual learning should involve updates to the brain, the weights, of the model.
A lot of current work involves things like model distillation, self-improvement, and reflection, which again are inference-time-only enhancements that don't update the brain of the model. So what mechanisms do we need to investigate to get to true continual or lifelong learning like humans? As humans, I'm learning continuously right now just by talking to you and giving this lecture. I don't need to be sat down in a chair every month and have the whole internet read to me. I'm learning on the spot, without an explicit fine-tuning process every once in a while.
There's a concept called model editing, which tries to edit specific parts of the model's weights given new facts or information. One technique, called rank-one model editing, or ROME, modifies the model weights directly. But this has inherent weaknesses. It really only works for simple facts, and it cannot easily propagate changes to related or dependent facts. For example, if I update "Bob's dad is Jason" to "Bob's dad is Justin," and Bob also has a sister, then the sister's dad should also be updated. But that kind of propagation isn't really possible with model editing, because there could be hundreds of related facts that must be updated.
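For intuition, here is a simplified sketch of the rank-one update at the heart of ROME-style editing. The published method (Meng et al., 2022) additionally uses a key covariance matrix estimated from corpus statistics; that term is dropped here for clarity, so this illustrates the idea rather than the exact algorithm:

```python
import numpy as np

def rank_one_edit(W: np.ndarray, k: np.ndarray, v_new: np.ndarray) -> np.ndarray:
    """Return W' such that W' @ k == v_new, via a rank-one change to W."""
    residual = v_new - W @ k                  # gap between current and desired output
    return W + np.outer(residual, k) / (k @ k)

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))                   # toy stand-in for an MLP projection
k = rng.normal(size=4)                        # key vector: "Bob's dad is ..."
v_new = rng.normal(size=8)                    # value vector encoding "... Justin"

W_edited = rank_one_edit(W, k, v_new)
print(np.allclose(W_edited @ k, v_new))       # True: the targeted fact is rewritten
# Note: nothing here propagates the change to dependent facts
# (e.g. the sister's dad), which is exactly the weakness discussed above.
```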
Now, other than continual and lifelong learning, which I think is a key piece of AGI, there's another weakness of very large-scale models these days. With millions, billions, or trillions of parameters, language models trained on tons of data are effectively very large black boxes, hard to understand and interpret. So there's a line of work on interpretability, trying to understand exactly how these models learn and think: to understand their internal representations, what they truly "think," not just what they output.
Mechanistic interpretability, in particular, has techniques that look at things like circuits and features inside the model to understand what exactly is going on internally, to unpack this black box.
Furthermore, because such large models are effective, they're going to be used for many things that interface with the real world. So it's important that they're aligned, that they do exactly what we intend, and that they're safe. This is the alignment problem, which becomes more pressing as models get very large and powerful.
There are a lot of issues with existing models. Often they take shortcuts to achieve a goal, and those shortcuts might be unsafe or not what you want. They might not generalize well to new domains and settings: a model might be well aligned in training, but exhibit erratic or unintended, unsafe behavior when deployed in a different setting.
Some specific failure modes are things like reward hacking: instead of optimizing the true goal, the model finds some way of optimizing the reward, for example in a way that is not good. It could also learn to optimize hidden objectives that aren't known to the person training or using the model.
Additionally, we talked a lot about chain-of-thought reasoning and long reasoning chains. But how do we know that the reasoning steps the model outputs are truly what it believes, or what it is actually using to reason? This is called faithfulness. The explanations models produce might be post hoc rather than what is actually conditioned on to get the right answer. Causal interventions are one way of actually assessing the faithfulness of reasoning steps. And, as I said, there might be distribution shifts where behaviors change in new environments.
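Here is one hedged sketch of what a causal intervention on a chain of thought might look like: corrupt an intermediate step and check whether the final answer changes. The `generate` callable below is a toy stand-in, not any real LM API; the protocol, not the model, is the point:

```python
def faithfulness_probe(generate, question: str) -> bool:
    steps = generate(question)                  # the model's stated CoT steps
    answer = generate(question, steps=steps)    # answer conditioned on its CoT

    # Intervene: corrupt the last reasoning step.
    corrupted = steps[:-1] + ["(corrupted step)"]
    answer_after = generate(question, steps=corrupted)

    # If the answer is insensitive to interventions on the stated
    # reasoning, that reasoning may be post hoc rather than causal.
    return answer != answer_after

# Toy "model": it answers by reading its last reasoning step, so it is
# trivially faithful and the probe returns True.
def toy_generate(question, steps=None):
    if steps is None:
        return ["2 + 2", "equals 4"]
    return steps[-1]

print(faithfulness_probe(toy_generate, "What is 2 + 2?"))  # True
```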
Now, some different alignment techniques, which we've touched upon. Human and AI feedback lets the model learn from human preferences to be safer and more aligned, but things like RLHF are prone to failure modes like reward hacking, so that's something to investigate. I talked about interpretability, to truly understand what's going on internally within the model. And I mentioned process supervision earlier: rewarding and examining individual steps of a model's reasoning process is important, rather than only looking at its final answers or outcomes.
There's also the idea of constitutional AI, which I believe was introduced by Anthropic. They give the model a constitution, a written document of rules and principles to guide its behavior, and use it to post-train the model to align with that constitution.
Finally, there's the idea of scalable oversight. Having humans oversee models that may grow extremely powerful and numerous isn't feasible, so what if we use models to supervise other models? That's scalable oversight. Overall, the goal of alignment is to ensure models pursue the right goals and behave reliably under all different types of conditions.
Lastly, I'll talk about what comes after transformers. Our course is titled Transformers United, but frankly, I think it's a good idea to think outside the box, to not focus only on transformers research, which is what most people are doing these days, and to think about what else might be interesting to investigate, because there's no guarantee that transformers and next-token prediction will lead to AGI or superintelligence.
Transformers dominate, but they have limitations: inefficiency in scaling with longer contexts, since attention is quadratic in sequence length; frequent failures at long-context reasoning; and, this is debatable, but I believe they also have limitations in truly understanding the world. In most cases they're not learning an effective world model. This leads to two emerging directions.
The first is world models, and an approach proposed by Yann LeCun called JEPA. It moves beyond next-token prediction to try to truly learn how the world works. World models are structured representations of environments, of the actual world, and they're about predicting future states, not just tokens. This enables more effective planning and reasoning over time, grounded in how the actual world works.
JEPA stands for Joint Embedding Predictive Architecture. Instead of predicting raw outputs like the next token, it predicts latent representations, focusing on latent structure rather than surface details like the tokens themselves. This avoids modeling low-level noise, leads to more data-efficient learning and better abstraction and generalization, and is arguably closer to how humans learn, which is by predicting the structure and states of the world rather than pixels or tokens.
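A minimal sketch of a JEPA-style training step in PyTorch, assuming toy linear encoders: the loss lives entirely in latent space, with no reconstruction of surface details. Real variants like I-JEPA use ViT encoders and an exponential-moving-average target encoder; the shapes here are arbitrary:

```python
import torch
import torch.nn as nn

dim = 64
context_encoder = nn.Linear(128, dim)      # stand-in for a real encoder
target_encoder = nn.Linear(128, dim)       # typically an EMA copy of it
predictor = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

def jepa_step(context_view: torch.Tensor, target_view: torch.Tensor):
    z_context = context_encoder(context_view)
    with torch.no_grad():                   # target latents get no gradients
        z_target = target_encoder(target_view)
    z_pred = predictor(z_context)
    # Predict the *latent* of the target view, not its raw pixels/tokens.
    return ((z_pred - z_target) ** 2).mean()

loss = jepa_step(torch.randn(32, 128), torch.randn(32, 128))
loss.backward()
```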
Next, there's an alternative architecture being investigated these days called state space models, or SSMs. Instead of attending over all previous tokens like transformers, they model sequences using continuous state updates, maintaining a compressed internal state over time. You'll notice this is somewhat analogous to how recurrent neural networks, or RNNs, work through a hidden state. This brings more efficiency: linear-time scaling in the number of tokens, rather than quadratic like the transformer attention mechanism. So SSMs are more effective for long sequences and have stronger performance on longer-context tasks.
A specific example is the Mamba architecture, a selective state space model. However, SSMs do carry tradeoffs: they're less flexible than attention in certain settings, and this is still an early area of research.
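As a toy illustration of the basic recurrence, here is the plain (non-selective) discrete state-space update, h_t = A h_{t-1} + B x_t, y_t = C h_t; Mamba's selectivity additionally makes A, B, and C input-dependent. The matrices below are random placeholders:

```python
import numpy as np

def ssm_scan(A: np.ndarray, B: np.ndarray, C: np.ndarray, xs: np.ndarray):
    """Run the linear recurrence h_t = A h_{t-1} + B x_t, y_t = C h_t."""
    h = np.zeros(A.shape[0])
    ys = []
    for x in xs:                 # one fixed-cost update per token:
        h = A @ h + B * x        # the compressed state carries the past,
        ys.append(C @ h)         # so total cost is linear in length,
    return np.array(ys)          # unlike quadratic self-attention

d = 16
A = np.diag(np.random.uniform(0.5, 0.99, d))   # stable decaying dynamics
B = np.random.randn(d)
C = np.random.randn(d)
print(ssm_scan(A, B, C, np.random.randn(100)).shape)  # (100,)
```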
So, thank you for listening; that's our presentation. We touched on how transformers work; the evolution from word embeddings like Word2Vec to RNNs, transformers, pre-training, and post-training; various post-training techniques; and the remaining challenges and weaknesses, the things still standing between us and AGI or superintelligence: continual learning, proper alignment, interpretability, and potentially state space models and world models as alternatives.
Starting next week, we'll have speakers come in to give talks about what they're working on. Next week we'll have Hazel Nam, who will talk about her work on JEPA and world models, and the week after, Albert Gu will talk about his work on Mamba and state space models. So, ironically, our first two speakers won't be talking about transformers but about alternative architectures. I highly encourage you to learn more and think outside the box.
Please stay up to date via our Discord and mailing list, fill out the attendance form starting next week, and we'll see you next Thursday.