Adv. LLM Agents MOOC | UC Berkeley Sp25 | Reasoning, Memory & Planning of Language Agents by Yu Su

Berkeley RDI Center on Decentralization & AI
📅February 18, 2025
⏱️01:32:39
🌐English

Disclaimer: The transcript on this page is for the YouTube video titled "Adv. LLM Agents MOOC | UC Berkeley Sp25 | Reasoning, Memory & Planning of Language Agents by Yu Su" from "Berkeley RDI Center on Decentralization & AI". All rights to the original content belong to their respective owners. This transcript is provided for educational, research, and informational purposes only. This website is not affiliated with or endorsed by the original content creators or platforms.

Watch the original video here: https://www.youtube.com/live/zvI4UN2_i-w

00:00:00Yu Su

I'm super excited here today to talk about a topic that is really close to my heart: language agents. I'm very excited about this topic in the past two years. It's really fascinating, and I've devoted most of my time thinking about this, so I have to share some thoughts here today.

00:00:16Yu Su

Right, whether you like it or not—well, since you are in this lecture, so most likely you like it—but you probably have heard about the term "agents" a lot. Many people are super enthusiastic about it. Bill Gates said that it will bring about the biggest revolution since essentially GUI. Andrew Ng said that agentic workflows will drive massive AI progress. And then Sam Altman said that 2025 is the year when agents will work.

00:00:49Yu Su

However, there's definitely also a lot of voices from the other side of the aisle. Many people think that the current agents are just thin wrappers around LLMs, so there is nothing fundamental here. Others think autoregressive LLMs can never truly reason or plan.

00:01:15Yu Su

AutoGPT was one of the early prototypes of these agent systems. It was very popular—it was the fastest-growing repository in GitHub history in terms of number of stars—but it didn't take long for people to realize that it was, at least at the time, mostly more of a prototype, and there were a lot of limitations for a production system.

00:01:44Yu Su

If we take one step back and think about it—hey, what are "agents"? It's really not a new thing, right? It's one of the very first topics we cover in AI 101. Agents—especially in this classic schematic illustration from Russell and Norvig—have been part of the core pursuit of AI since the very beginning. So why, all of a sudden, did they become so much more popular again?

00:02:15Yu Su

So let's try to define these modern agents. Many people hold this view that, hey, you have a language model, it has a text-in/text-out interface, so the kind of things it can do is quite limited. It's still a lot, but it's limited. It's not connected to an external environment. And then once you connect that to an environment—you can perceive information from the environment and it can exert impact on the environment—then it becomes an agent.

00:02:48Yu Su

That seems reasonable, but at the same time, it also seems a bit oversimplified and incomplete. We often heard things like self-reflection—that an agent or LLM looks at its own reasoning process and then decides to autonomously do something else. Or you can do multi-agent simulation or a lot of other things that don't necessarily involve an external environment, or at least that's not part of the core defining trait.

00:03:22Yu Su

Fundamentally, I think there are two main competing views in the community. The dominant view, I call it an "LLM-first" view. So we first have an LLM. It's amazing, it's very powerful, it knows a lot of things. So let's make that into an agent. Then, if you take this view, naturally the implications are you think of building a scaffold on top of an LLM, and it will be a lot of prompting, and it will be heavy on engineering. I think that's also where a lot of the sentiment that "agents are just thin wrappers around LLMs" comes from.

00:04:05Yu Su

I tend to take a different view. I take an "agent-first" view. As I said, AI agents have been the core pursuit of AI since the very beginning. Now, they just got an upgrade: we integrate one or more LLMs into these agents, so they gain a new capability—to use language for reasoning and communication.

00:04:31Yu Su

If you hold this view, then the implications are: all the same challenges faced by the classic AI agents—like how to perceive the environments, how to reason about the environmental states, how to build world models to model the state transition dynamics in the environment, and how to use that for better planning—all of these challenges still remain. But we need to re-examine them through the new lens of LLMs. And also we need to tackle the new challenges and opportunities, like now we can do synthetic data at scale, we can do self-reflection, and we can also do o1-style internal search.

00:05:12Yu Su

Okay, if we agree on that view—an agent-first view—then we can look at what's really fundamentally different now once we're integrating these LLMs. I think it's really the capability of using language as a vehicle for reasoning and communication.

00:05:31Yu Su

The power of using language for communication is probably more clear. All of us vividly remember the viral success of ChatGPT. It was the fastest app to grow to 100 million users in human history. But we also know that in terms of fundamental capabilities, ChatGPT was not significantly better than the previous generation. But OpenAI decided to tune it to be more chatty and release it to the public. So I think that really shows you the power of using language for communication. You get instruction following, you get in-context learning, and you can easily customize the output format.

00:06:16Yu Su

But perhaps a bit more unique this time around—because using language for communication, we have been doing that for decades in dialogue systems—but this new capability of using language for reasoning is probably a bit more unique this time around. It used to be the case that I had to do a lot of justification about why this is unique, but now with all these new reasoning models, I think this has become much easier.

00:06:47Yu Su

But let's still consider this example. Most people would look at this example and think GPT-4 was being stupid and contradicting itself by first thought the answer is no, but after a bit more thinking, then it realized that the answer is actually yes. But the way I see this example, it really shows the power of using language for reasoning. So you can decide—the agent, the model can decide on the fly—to use an adaptive amount of compute for different problems. So it got a second chance, unlike in traditional ML models where you pay a fixed amount of compute for any decision. You only have one shot and you have to commit to it.

00:07:33Yu Su

And specifically in the context of agents, reasoning is not just for the sake of reasoning. It's not just for math or coding. Reasoning is for better acting. So you can use reasoning to infer about the environmental states, you can do self-reflection, you can do dynamic replanning—like, all for the same purpose of better acting, better planning in the environment.

00:07:59Yu Su

Right now, we're ready to reconcile this new generation of language agents with the previous agents. All we need to do is to add a new type of action called "reasoning by generating tokens." This is in contrast with actions in external environments. Then for that, we also need to add an internal environment, just like an inner monologue where reasoning happens.

00:08:29Yu Su

With that change, now we can also reconcile things like self-reflection. Now it becomes just a meta-reasoning action; it's reasoning over the reasoning process, just like meta-cognitive functions. And reasoning is for better acting. And at the same time, due to the power of these LLMs, the percept and external action spaces are just totally expanded as well.

00:08:55Yu Su

Okay, before I proceed, let me just provide a little bit of justification of the use of "reasoning" here. Because this is such an overloaded term, if we don't properly define it, then people will frown upon it. I'm sympathetic to the position of calling all of these as reasoning because for people who are familiar with the dual-process mental model of human cognition—like by the famous late Daniel Kahneman in Thinking, Fast and Slow—you'll be familiar with concepts like perception, then intuitive inference, and then symbolic reasoning. The intuitive inference is fast, effortless, and then symbolic reasoning is slow and effortful.

00:09:45Yu Su

But LLMs don't have all of those different cognitive substrates for these different types of cognitive functions. All they have is essentially one mechanism, which is token generation. Mostly—we will talk about this later, when we discuss implicit reasoning—but mostly, reasoning happens through token generation, through explicit language production.

00:10:16Yu Su

So if you look at this example from GPT-4o, when it looks at this image, its generation incorporates all of these different mental processes: like what we would call perception (what's in there), and intuitive inference (what you immediately infer from what you see), and also some symbolic reasoning, some more complex reasoning. So I think for this reason—because LLMs really only have this one mechanism and that blends all this together—I think it's proper to call this reasoning. One may alternatively call these "thoughts" to avoid this overloaded term, but then the risk is that we further anthropomorphize these machines. So I prefer reasoning over thoughts.

00:11:05Yu Su

Right, another thing that I want to get behind us before we proceed is the name. There are so many different names for these agents right now: AI agents, autonomous agents, LLM agents, and so on and so forth. But I truly think "Language Agents" is probably the most descriptive, characteristic name for this current generation of agents, because the language is really their most salient trait.

00:11:36Yu Su

And what about multi-modal agents? Well, there is perception in other modalities and it's very important, but language is still doing the heavy lifting—the reasoning and communication part. And what about LLM agents? That's probably the most popular name out there. I think what's really needed here, what's characteristic of this generation of agents, is this capability of universal language understanding and production, which turns out to be extremely important for agents. But this capability doesn't have to come from an LLM. Maybe one day we will go beyond LLMs, but this need, this capability, will remain. So LLMs could be a means to an end. In that sense, "Language Agents" is still a more appropriate name.

00:12:33Yu Su

Right, if we take one step back and think about the evolution of AI, I think we're really entering a new evolutionary stage of machine intelligence. Human intelligence is truly a marvel made by nature. Somehow our brain can take raw inputs from different sensory organs and represent them in a unified neural representation to reconstruct the world around us and also to support symbolic reasoning and decision making. That's truly fantastic.

00:13:06Yu Su

Then throughout the history of AI, we have made many attempts to approach human intelligence and manifest machine intelligence into these AI agents. But earlier generations of AI agents were only able to capture some limited facets of human intelligence, like symbolic reasoning or perception in single modalities.

00:13:30Yu Su

Only recently, with these multimodal LLMs and the language agents built on top of them, do we for the first time have a model that can encode multi-sensory inputs into a unified neural representation that is also conducive to symbolic reasoning and communication. This drastically improves the expressiveness, the reasoning ability, and the adaptivity of agents. And that's, I think, what makes this current generation of agents so exciting.

00:14:06Yu Su

We can do a more detailed comparison of these different generations of agents, but in the interest of time, I will just quickly go through this and just focus on language agents. The expressiveness is pretty high. For example, you can compare with the logical agents where the expressiveness is bounded by the logical language we use. Here, the agent—the model—can essentially encode almost everything, especially the verbalizable parts of the world. But for the non-verbalizable parts, like how to recognize a face, we still have multi-modal encoders to capture those.

00:14:44Yu Su

And then for reasoning, now we do language-based reasoning instead of logical inferences. So it's fuzzy, and it's flexible, and it's semi-explicit. So it's not entirely implicit; you still have this chain of thoughts that you can see what's going on. And fuzziness, I want to emphasize that it's not necessarily a bad thing, because the world around us, it has a lot of fuzziness in it. If we resort to something strictly very sound and very rigid, then it comes at a cost of sacrificing the expressiveness a lot. And then the adaptivity, which is a hallmark of intelligence, is also very high in language agents because of this strong prior captured by an LLM, and the language use in general is very flexible.

00:15:39Yu Su

Right, so before we proceed, let me share with you my own conceptual framework for language agents. This is what I have been using to guide my own research agenda in the past two years or so. The most important things are what I call core competencies, and I try to arrange them in a loose hierarchy. So each box here you can roughly find the corresponding cognitive function in a human brain. And then the bottom ones are the more fundamental ones, like perception, memory, and embodiment. And the upper ones act on top of the bottom ones, like planning built on top of reasoning and world models, and reasoning built on top of perception and memory, and so on and so forth.

00:16:23Yu Su

We also have many cross-cutting issues like safety and evaluation. Synthetic data is both a challenge and an opportunity. And efficiency, and many new applications. With this framework, you can now look at any new paper about agents and try to map its main claims and main contributions into this framework.

00:16:50Yu Su

All right, so that's the introduction. For the rest of the talk, we will try to cover three aspects of language agents: how we model long-term memory, for which I will use our recent work HippoRAG; how these language models truly reason, for which I will use our work on grokked Transformers; and finally planning, especially planning with a world model.

00:17:25Yu Su

Okay, so let's start with memory. We'll talk about HippoRAG. The main content will be from HippoRAG, but we'll talk about a lot of other things as well. It's a neurobiologically inspired long-term memory mechanism for large language models. It's led by my student Bernal, and in collaboration with Michi from Stanford.

00:17:46Yu Su

Right, so if we look at how humans—or really most other animals that have a nervous system—how they learn, it's really fascinating that we are all like these 24/7, non-stop lifelong learners. And all we learn is stored in our memory.

00:18:10Yu Su

Eric Kandel, the Nobel Prize winner who got the prize for his contribution in the study of memory, especially the neurobiological foundation of memory, once said that "Memory is everything. Without it, we are nothing." Which I think is very profound, because really anything we learn has to be encoded in our memory through a process called synaptic plasticity.

00:18:43Yu Su

Right, so basically we can change our synapses to memorize things and to capture all the things we're experiencing and we're learning. And there are different ways of changing the synapses. For example, you could change the strength of a synapse by either adding more receptors here or releasing more neurotransmitters. Or you can do structural changes to the synapses, like by growing new synapses for the same neuron, for example. So this is used more often in forming long-term memory.

00:19:23Yu Su

Right, so we are really 24/7 learners. Even when we are sleeping, we are replaying what happens during the day. That's how this long-term memory gets consolidated. Ideally, we want that same kind of learning capacity in machines, or especially in agents as well, because these agents, they are supposed to explore the world, do things, and then accumulate a lot of learning from that and self-improve in some sense.

00:20:00Yu Su

However, that's very hard with current technologies. These neural networks in general, and these gigantic LLMs in particular, have the notorious issue of catastrophic forgetting. Because of the highly distributed representation in these models, when you're learning new things, there are often unintended side effects: some other things unexpectedly get changed during that new learning process.

00:20:40Yu Su

Right, and to illustrate this in the LLM context, we can consider what people have been doing when trying to edit an LLM to inject or alter a specific fact. That's called knowledge editing, or model editing. And a sub-topic in model editing is the so-called ripple effects. If you make just one counterfactual change—like you want to say, hey, Lionel is a citizen of Syria instead of the United States—then you have some expected ripple effects: because the citizenship changed, he should now speak Arabic instead of English.

00:21:35Yu Su

But if you actually look at what has changed after this edit, you will see that a lot of other unexpected things have changed, or that things that should have changed didn't change. If you look at the negation of the original statement, the model's prediction is still wrong—you want it to be the United States because of the negation—and the language here also becomes Syria, but you want it to be Arabic, and so on and so forth. All of this just tells you that the highly distributed representation of these artificial neural networks makes continual learning very hard. The human brain and animal brains somehow figured out a way to do it, but we don't quite understand how that happens yet, so we cannot replicate it in machines. But this kind of continual learning is highly desired for agents and for LLMs.

00:22:47Yu Su

So what can we do? Good news is that for LLMs, it's possible to use an alternative form of memory called non-parametric memory. So instead of directly changing the parameters of a model with the new experiences, we can just hold the new experience external to the model—so it's non-parametric—and then just retrieve them using some mechanism as we go. But for that to work—of course, this is called RAG (Retrieval Augmented Generation)—but for that to work, there is a prerequisite condition. That is, when you retrieve something, some external information, and you say, "Hey, this is your memory so you should take it, all your decision making should be based on it," then it's based on the condition that these LLMs actually will be receptive to such external information.

00:23:53Yu Su

So we did this study called Adaptive Chameleon, where we specifically study this behavior. When you have this external evidence, like from RAG, and then when that directly conflicts with an LLM's parametric memory, what would happen? Will they resist that or will they be receptive to that? Perhaps not too surprisingly at this point—but what was very surprising at the time—is that these LLMs turn out to be highly receptive to external evidence even when that conflicts with their parametric memory.

00:24:36Yu Su

So for the two examples here, you have a question and the ground truth answer. In this case, the parametric memory of the LLM is correct. But if you give it a coherent counter-memory like this, then the LLM will happily accept this and change its answer. And in this case, the parametric memory was wrong, and you can correct that by giving it the correct context memory. But that kind of paves the way for non-parametric memory for LLMs because of them being highly receptive to this external evidence. Of course, this has other implications like safety—maybe this means that these LLMs could be highly gullible—but that's not the focus today.

00:25:24Yu Su

Okay, right now let's talk about how to really make long-term memory work for LLMs. Most people know that RAG is the de facto solution today. Right, given something that the LLM doesn't know or is beyond its knowledge cutoff date, then it will retrieve from the internet and then use retrieved information as a kind of long-term memory to answer the question.

00:25:55Yu Su

But if you think about how RAG works: okay, you embed this evidence into vectors, then you do vector-based similarity to retrieve them. That seems far simpler, far less sophisticated than the human memory system where we can recognize patterns in massive data, raw experiences, we can create a lot of associations across them, and we can dynamically retrieve them for the current context, and so on and so forth.

00:26:31Yu Su

So to illustrate some of the limitations of the current embedding-based RAG system, let's consider this example. Let's say you have a query: "Which Stanford Professor works on the Neuroscience of Alzheimer's?" So you have two starting concepts: Stanford and Alzheimer's. Then let's assume, just hypothetically, you have a bunch of passages. If each passage only contains part of the information—so you don't happen to have a passage that tells you "this person is both a Stanford professor and works on Alzheimer's"—instead what you have is these separate passages. Okay, this passage says this person works at Stanford; this passage says that this person works on Alzheimer's. And then you will see there is one person who works at Stanford and on Alzheimer's, but it's in separate documents. You don't happen to have one document that tells you all.

00:27:37Yu Su

Right, so what would happen if you just use embedding-based RAG? You embed each of these passages into a vector, you embed your query into a vector, and you compare the similarities one by one. Then you will find all of these passages are equally likely, because each of them captures precisely 50% of the information. And as a result, the model has to go through all of these passages to figure out which ones are the correct ones. And you can imagine there could be thousands of professors working at Stanford and even more people working on Alzheimer's. So it's a one-to-many, many-to-one relationship.
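
To make this failure mode concrete, here is a minimal sketch using an off-the-shelf embedding model; the model name, the toy passages, and the "Prof. Thomas" entity are illustrative assumptions, not the actual data from the talk.

```python
# A minimal sketch of the partial-information failure mode of embedding-based RAG.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "Which Stanford professor works on the neuroscience of Alzheimer's?"
passages = [
    "Prof. Thomas is a professor at Stanford University.",    # half of the answer
    "Prof. Thomas studies the neuroscience of Alzheimer's.",   # the other half
    "Prof. Lee is a professor at Stanford University.",        # distractor: Stanford only
    "Prof. Chen studies the neuroscience of Alzheimer's.",     # distractor: Alzheimer's only
]

q_emb = model.encode(query, convert_to_tensor=True)
p_emb = model.encode(passages, convert_to_tensor=True)
scores = util.cos_sim(q_emb, p_emb)[0]

# Each passage covers only part of the query, so the scores come out close together:
# nothing tells the retriever that the two "Prof. Thomas" passages should be combined.
for passage, score in sorted(zip(passages, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {passage}")
```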

00:28:26Yu Su

But our human memory doesn't work that way. Somehow, if we have come across "this person works on Alzheimer's" and "this person also works at Stanford," then we will build some sort of association between them, such that when this query comes, we can very easily and quickly find that, hey, this person is connected with both Stanford and Alzheimer's, so it's the answer. We don't have to go through recalling all of this other information to arrive at that conclusion. So in that sense, embedding-based RAG works very differently from human memory.

00:29:10Yu Su

Now, to study exactly how human memory works in this regard, we turn to a well-established theory of human long-term memory called the Hippocampal Indexing Theory. It basically goes like this—I'll summarize it in an overly simplified way—your raw memory is stored in your neocortex, particularly where the memory was first generated. So for an episodic memory, the auditory part will be stored in the auditory cortex, the visual part in the visual cortex, and so on and so forth. That's why when you are trying to recall something, it feels like reliving the moment: you're just re-triggering the same neurons as if you were perceiving it.

00:30:09Yu Su

But importantly, you also have a structured index stored in the hippocampus that essentially creates shortcuts associating all these disparate memory units together. And that's what we are trying to mimic here. This kind of separation and structured index gives you two important faculties of human memory. It allows you to do pattern separation, so you can differentiate memories in a very fine-grained way, certainly beyond just vectors, at least at the concept level. If you think about your episodic memory, this second and the second before are very similar to each other. How can you tell them apart? That requires very fine-grained separation of patterns.

00:31:07Yu Su

But more importantly, which is more relevant here, is pattern completion. We can recover complete memories using just partial cues. Like the previous example, the partial cues are Stanford and Alzheimer's, and then you can quickly recall this whole fact that this person is associated with both of them. And that's due to this structure index in the hippocampus.

00:31:36Yu Su

Right, so that's what we're trying to mimic in HippoRAG. We're trying to build a similar structured index for RAG systems to enjoy some of the same benefits of the human long-term memory system. I won't get into too much detail, just the high level. In the Hippocampal Indexing Theory, there are three main parts: the neocortex, the hippocampus, and the parahippocampal regions that connect the two. The neocortex is more about pattern separation: it processes the raw experience and extracts patterns—concepts and so on—out of it. The hippocampus is more like the structured index: it handles the indexing and auto-association. And the parahippocampal regions are more like the working memory that connects the two and supports more iterative processing.

00:32:44Yu Su

So to mimic this process, we have two phases: the offline indexing phase and the online retrieval phase. For offline indexing, say you have these passages as inputs. We use an LLM to serve as a neocortex that will do open information extraction to extract these triplets—like the concepts, noun phrases, and their relationships like the verb phrases. So extract these triplets. Then we try to build a Knowledge Graph, particularly a schema-less Knowledge Graph. So we don't have an ontology or the predefined schema or anything like that; everything is extracted by an LLM from the raw experiences. Right, so we built a Knowledge Graph by consolidating all these new extracted concepts and phrases as nodes and edges. That's the offline indexing phase, and this becomes our artificial hippocampal index.
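
As a rough illustration of this offline indexing phase, here is a minimal sketch, assuming an OpenAI-style chat model for the open information extraction and networkx for the schema-less graph; the prompt, model name, and graph layout are simplifying assumptions, not the official HippoRAG implementation.

```python
# A minimal sketch of offline indexing: an LLM extracts (subject, relation, object)
# triples from each passage, and the triples are consolidated into a schema-less graph.
import json
import networkx as nx
from openai import OpenAI

client = OpenAI()

def extract_triples(passage: str) -> list[tuple[str, str, str]]:
    """Ask the LLM for triples in JSON form (assumes the model returns clean JSON)."""
    prompt = (
        "Extract (subject, relation, object) triples from the passage below. "
        "Answer with a JSON list of 3-element lists only.\n\n" + passage
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return [tuple(t) for t in json.loads(resp.choices[0].message.content)]

def build_index(passages: list[str]) -> nx.Graph:
    """Consolidate extracted phrases into nodes/edges and remember source passages."""
    kg = nx.Graph()
    for pid, passage in enumerate(passages):
        for subj, rel, obj in extract_triples(passage):
            subj, obj = subj.lower(), obj.lower()
            kg.add_edge(subj, obj, relation=rel)
            # Track provenance so graph scores can later be mapped back to passages.
            kg.nodes[subj].setdefault("passages", set()).add(pid)
            kg.nodes[obj].setdefault("passages", set()).add(pid)
    return kg
```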

00:33:55Yu Su

Then we use a dense retriever here as an encoder to consolidate things, like to identify which concepts are similar to others or synonymous to each other.

00:34:09Yu Su

Then in the online retrieval phase, when a query comes in—like the Stanford Alzheimer's example—we identify the query concepts. We use named entity recognition, so now we get Stanford and Alzheimer's. We again use a dense retriever to find similar nodes in the index. Those nodes then become the seed nodes for the graph search process: we use them as seeds to search the graph for the most related things. The particular graph search algorithm we're using here is Personalized PageRank.

00:34:58Yu Su

Okay, so for people who don't quite remember how personalized PageRank works, I'll do a quick recap. It's a random walk process on the graph where you start with some seed nodes, so those start with probability one. Then you use random walks starting from the seed nodes to disperse the probability mass to their neighboring nodes. So the nodes that are closer to this seed node, or specifically those in the intersection of multiple seed nodes, will naturally end up with higher weights. Then in this case, Professor Thomas, who is connected to both Stanford and Alzheimer's, will naturally stand out and get the highest weights. Now we can use these weights over concepts to re-weight the original passages, then to retrieve the most highly weighted passage. Right, so that's how HippoRAG works in a nutshell. Of course, there are a lot of details that we're not covering, but this is the main gist.
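
Here is a minimal sketch of this online retrieval step under the same assumptions as the indexing sketch above: the query concepts seed a Personalized PageRank over the graph, and node scores are aggregated back onto the passages they came from. Entity extraction is stubbed out for brevity.

```python
# A minimal sketch of seed-node Personalized PageRank retrieval over the index.
import networkx as nx
from collections import defaultdict

def retrieve(kg: nx.Graph, query_entities: list[str], top_k: int = 2) -> list[int]:
    # Seed the random walk with the query concepts that actually appear in the index.
    seeds = {e.lower(): 1.0 for e in query_entities if e.lower() in kg}
    if not seeds:
        return []
    # Personalized PageRank: restarts are biased toward the seed nodes.
    node_scores = nx.pagerank(kg, alpha=0.85, personalization=seeds)
    # Re-weight passages by the scores of the concepts they mention.
    passage_scores = defaultdict(float)
    for node, score in node_scores.items():
        for pid in kg.nodes[node].get("passages", ()):
            passage_scores[pid] += score
    return sorted(passage_scores, key=passage_scores.get, reverse=True)[:top_k]

# e.g. retrieve(kg, ["Stanford", "Alzheimer's"]) should surface the passages about
# the one professor connected to both seed concepts.
```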

00:36:07Yu Su

And it turns out a simple strategy like this—a biologically plausible strategy—also works extremely well in practice. It's not enough, I think, for it to be a neat idea—"hey, this is biologically inspired, how cool is that"—it's equally important for it to actually work, and to work better than existing solutions in practice. And that's what we showed with HippoRAG. We compared HippoRAG with the state-of-the-art dense retrievers at the time and showed that on three standard multi-hop QA datasets, it performs much better, by large margins. You can also compare HippoRAG with iterative retrieval methods like IRCoT, and it works better than those iterative methods as well. And because of its nature—really the structured index and the graph search—it's highly complementary to these existing methods. You can easily integrate it with other methods like IRCoT, and then you get a big boost as well.

00:37:32Yu Su

Okay, and to better understand where the power of HippoRAG comes from, we can consider a type of question we call "path-finding" questions, like the running example we have been working with. If you think about this information space—a huge graph where everything is connected to everything—what does the solution path structure for this question look like? You will see it's not just a one-to-one-to-one kind of path. It starts with a one-to-many relation—Stanford: there are many professors working at Stanford, and among those is our answer, Professor Thomas. And the people who research Alzheimer's form a many-to-one relation—there are many people working on Alzheimer's.

00:38:35Yu Su

If you don't have prior knowledge about the answer you're looking for, then naturally you'd have to search through all these hundreds or thousands of candidates to find Professor Thomas. HippoRAG, by explicitly extracting and building these associations from the raw inputs, allows you to create shortcuts to quickly find the true answer, while if you use ColBERT or IRCoT, they won't be able to find it efficiently.

00:39:10Yu Su

Okay, so that was HippoRAG. It was published at NeurIPS just two months ago, but at today's pace that's already ages ago. So I'm very excited to share that HippoRAG V2 is coming very soon. A problem with HippoRAG, and with all these recent structure-augmented RAG methods—GraphRAG, LightRAG, RAPTOR, and so on—is that they compare favorably against simpler baselines, like small dense retrievers such as ColBERT v2 or Contriever. But recently there have been many large embedding models, like GTE and other MTEB leaders, and if you actually compare these structure-augmented RAG methods with them, you will see that they are much worse. Including HippoRAG—they work okay on these multi-hop QA tasks, because they were mostly designed for that kind of task—but if you look at other scenarios, like very simple QA or discourse understanding and so on, they don't work that well.

00:40:35Yu Su

So as a result, it's very hard for this method to become just a drop-in replacement of embedding-based methods. Then in HippoRAG V2, we did a bunch of clever upgrades to HippoRAG V1, and the result is that now V2 is comparable or better than the best large embedding models across the board. So then it's much more plausible to just use HippoRAG V2 as a drop-in replacement for your RAG system. We hope to release this very soon.

00:41:15Yu Su

Right, so for the memory part, the takeaways: memory is really central to human learning, and long-term memory through parametric continual learning is very hard for LLMs. Fortunately, non-parametric memory like RAG could be a promising solution. And the recent trend in RAG is to add more structure to embeddings—like HippoRAG or GraphRAG—to enhance the sense-making capability of these models (the ability to integrate larger, more complex, and uncertain contexts) and the associativity of this memory (the capacity to draw connections between disparate pieces of information).

00:42:03Yu Su

Okay, so that's memory. I think we're still quite far away from developing a very sophisticated memory system, but we are well on our way to getting there. There are still many gaps, like how to handle episodic memory—the spatio-temporal aspects of things, which are central to human memory. We don't have a good solution for that yet.

00:42:31Yu Su

Okay, so memory is the most fundamental aspect in my opinion. Now let's get to another very fundamental aspect, which can be built on top of memory: reasoning. Our discussion today will be mainly based on the paper "Grokked Transformers are Implicit Reasoners." This is by Boshi, a student from our group, and it was published at NeurIPS last year.

00:43:02Yu Su

Right, so we will be talking about implicit reasoning. In implicit reasoning, we don't do Chain of Thoughts. So there is no verbalized Chain of Thoughts explicitly. We ask the LLM to directly predict the answer, like in this example of compositional reasoning. Let's say the language model has memorized or knows these two atomic facts: that Barack's wife is Michelle, and Michelle was born in 1964. Then the model will be given this input like "Barack wife born in" and is asked to directly predict the answer "1964" in this compositional reasoning fashion.

00:43:51Yu Su

Right, so if you do Chain of Thoughts, of course you can try to recall, "Hey, Barack's wife is Michelle, Michelle was born in this, so therefore..." But we want to push the model to implicitly, just using its parameters in a single forward pass, to do this reasoning.

00:44:09Yu Su

Why? CoT—especially long CoT—is all the rage right now, as in R1 or o1. So why does implicit reasoning matter? This is also what I meant in the beginning: the reasoning mechanism of these LLMs is mostly just token generation, but implicit reasoning is still part of the reasoning repertoire.

00:44:38Yu Su

So why does implicit reasoning matter? First of all, this is the default mode of training or pre-training. Because during training time, when the model is asked to predict the next token with cross-entropy loss, there's no CoT, right? At least not for now. So in training time, the model has to compress the data, has to do implicit reasoning in order to reasonably predict the next token. So no matter what, I think it's important to understand the implicit reasoning capability of language models.

00:45:22Yu Su

Also, implicit reasoning fundamentally determines how well these models acquire structured representations of the facts and rules in their training data. And finally, there's a lot of speculation about how this o1- or R1-style long CoT emerges. One possible hypothesis, at least in my mind, goes like this: you start with a capable base model—if the base model is not capable enough, then it won't do much. So you start with a capable base model like Llama 3 or similar, and it probably has already learned all of these basic constructs or strategies for reasoning—like self-reflection, analogical reasoning, and so on—maybe in an implicit reasoning fashion, as some kind of reasoning circuits in the parameters.

00:46:21Yu Su

Then reinforcement learning with these verifiable rewards is just to incentivize the model to learn to use the right combination of strategies. But it's not learning new reasoning strategies through RL. And then it also encourages the model to keep trying, don't be lazy. Right, so if this hypothesis is true, then understanding how these different reasoning strategies work within the model becomes even more important.

00:46:57Yu Su

Right, so back to implicit reasoning. Before our work, there was a bunch of great work showing that LLMs truly struggle with implicit reasoning. Some people showed that they struggle with compositional reasoning like this; there is the famous compositionality gap. And others showed that LLMs, even GPT-4 at the time, struggle with simple things like comparative reasoning, or comparison—like "Trump is 78, Biden is 82; is Trump younger or older than Biden?"

00:47:39Yu Su

So those were the previous conclusions, but we had a different opinion. It seems to us these all contribute to the "autoregressive LLMs can never truly reason or plan" narrative. We have more faith, more confidence in language models and Transformers, so we wanted to study this problem more carefully and see whether we could get some new insights.

00:48:19Yu Su

So we started with these research questions: First, can Transformers learn to reason implicitly, or are there fundamental limitations that prohibit robust acquisition of this skill? And second, what factors—the scale of the training data, the distribution of the training data, or the model architecture—control the acquisition of implicit reasoning?

00:48:43Yu Su

Okay, so with these two questions, let's discuss the setup. The model and optimization are pretty standard—we don't want to introduce extra questions here—so it's just a standard decoder-only Transformer with the GPT-2 architecture. We also show that the results are pretty robust to different model scales, so you can have a deeper model but the conclusions are the same. The optimization is also pretty standard.

00:49:17Yu Su

Then for the data, we will be using synthetic data, because we want to carefully control all the factors in this investigation so that we can isolate the problem we want to study. Let's just consider composition—we study both composition and comparison, but let's focus on composition for now. The data looks like a random knowledge graph: it has a set of entities and R relation types (we set R to 200). So it's a bunch of nodes connected by edges of different relation types.

00:49:58Yu Su

These are the atomic facts, like "the singer of Superstition is Stevie." Then we can use these atomic facts to get inferred facts—two-hop compositions following the composition rule: if (head entity, R1, bridge entity) and (bridge entity, R2, tail entity), then we have the compositional fact (head entity, R1, R2, tail entity). So it's like "Barack wife born in 1964." That's composition, and the data for comparison is constructed similarly.
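
Here is a minimal sketch of how synthetic data of this kind can be generated—a random knowledge graph over abstract entity and relation tokens, plus two-hop facts derived with the composition rule. The sizes and serialization are illustrative assumptions, not the paper's exact configuration.

```python
# A minimal sketch of synthetic composition data from a random knowledge graph.
import random

NUM_ENTITIES, NUM_RELATIONS, NUM_ATOMIC = 2000, 200, 20000
entities = [f"e{i}" for i in range(NUM_ENTITIES)]
relations = [f"r{i}" for i in range(NUM_RELATIONS)]

# Atomic facts: (head, relation) -> tail, i.e. edges of a random knowledge graph.
atomic = {}
while len(atomic) < NUM_ATOMIC:
    h, r, t = random.choice(entities), random.choice(relations), random.choice(entities)
    atomic[(h, r)] = t

def compose(facts: dict) -> list[tuple]:
    """Composition rule: (h, r1, b) and (b, r2, t)  =>  (h, r1, r2, t)."""
    by_head = {}
    for (h, r), t in facts.items():
        by_head.setdefault(h, []).append((r, t))
    return [(h, r1, r2, t)
            for (h, r1), b in facts.items()
            for r2, t in by_head.get(b, [])]

inferred = compose(atomic)
# Each example is serialized as a token sequence, e.g. "e12 r3 r57 e904",
# and the model is trained to predict the final (tail) entity token.
print(len(atomic), len(inferred), inferred[:1])
```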

00:50:48Yu Su

Okay, then the important setup here is that we want to study inductive learning of deduction rules. A fancy phrase, but let me just decompose this for you. First of all, we want the model—which is the decoder-only Transformer—to learn inductively. So just give a bunch of training examples, you want it to learn these rules. And particularly, these are deduction rules. So this is a typical deduction rule, right? You start from some premises and then you can deduce a new fact. Right, so this is inductive learning of deduction rules.

00:51:28Yu Su

Then there are two generalization settings: what I would call the in-distribution (ID) setting and the out-of-distribution (OOD) setting; the latter can also be called systematic generalization. It goes like this: we split the atomic facts—the edges in the knowledge graph—into two sets, the ID set and the OOD set. Then both go through the same rule, like the composition rule, to get the corresponding inferred facts.

00:52:05Yu Su

Then, for the set of inferred facts derived from the ID atomic facts, we split them into two sets: the training set and the ID test set. In the ID test set, even though we have not seen the exact inferred fact in our training data—it's an unseen inferred fact—we have seen all of its constituents, that is, those atomic facts being composed with other atomic facts in the training set.

00:52:48Yu Su

Right, I know it's a bit of a mouthful, but just bear with me. So for any of these inferred facts in the ID test set—like "Barack wife born in 1964"—maybe you have not seen that exact inferred fact in your training data, but you have seen, like for example the relation "Barack wife Michelle," you have seen this atomic fact being composed with other relations also. Similarly, you have seen the other atomic facts like "Michelle born in 1964" being composed with other atomic facts. You just haven't happened to see that exact inferred fact like "Barack wife born in 1964." So that you still need some generalization to capture them.

00:53:42Yu Su

But for the OOD test set, you have still seen all of the atomic facts—otherwise the model wouldn't know those facts exist—but you have never seen any of those atomic facts being composed with other relations. So in that sense it's a stronger generalization setting; that's why it's called systematic generalization. If the model can do this, it essentially means the model has truly learned the deduction rule, so it can apply it to arbitrary new facts. And that's really the goal of this learning of deduction rules.
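
Continuing the toy setup from the previous sketch, here is one way the ID/OOD split described above could be implemented; the split ratios are arbitrary assumptions and `atomic` and `compose` come from that earlier snippet.

```python
# A minimal sketch of the ID/OOD split over atomic and inferred facts.
import random

keys = list(atomic.keys())
random.shuffle(keys)
cut = int(0.95 * len(keys))
id_atomic = {k: atomic[k] for k in keys[:cut]}
ood_atomic = {k: atomic[k] for k in keys[cut:]}

id_inferred = compose(id_atomic)      # compositions among ID atomic facts only
ood_test = compose(ood_atomic)        # OOD: these compositions are never trained on

random.shuffle(id_inferred)
split = int(0.9 * len(id_inferred))
train_inferred, id_test = id_inferred[:split], id_inferred[split:]

# Training data contains ALL atomic facts (ID and OOD) plus the training slice of
# ID inferred facts. ID test facts are unseen compositions whose constituents were
# composed elsewhere in training; OOD test facts compose atomic facts that were
# only ever seen individually.
train = [(h, r, t) for (h, r), t in atomic.items()] + train_inferred
```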

00:54:33Yu Su

Now with all this lengthy setup in mind, let's look at some fun parts, look at some interesting takeaways from our investigation. The first surprising conclusion we found is that Transformers can learn to reason implicitly, but only through a process called Grokking.

00:54:56Yu Su

What is Grokking? Let's look at this figure. This curve is the training accuracy; you can see it quickly goes to 100%, which means the model has overfit the training data at this point. But if you look at the ID test curve, when the model first overfits, the test accuracy is still very low. And if you keep training way beyond saturation or overfitting—the overfitting happens around 10,000 steps—and train for about 20 times more optimization steps, to these points (keep in mind this is a log scale), then all of a sudden generalization happens, at least ID generalization. The model gets to 100% test accuracy on composition in the ID split.

00:56:02Yu Su

Similarly for comparison, the overfitting happens very early on, but it takes about 20 more times of optimization steps for generalization to happen. And interestingly for comparison, the OOD test accuracy also gets to 100%. So that's a problem we're looking into later.

00:56:29Yu Su

Right, so that's quite interesting. This Transformer truly can learn to reason implicitly, but only through Grokking. And it's one of the first studies that establishes this connection between Grokking and implicit reasoning, I believe. And we will investigate why that happens in a second.

00:56:51Yu Su

Another immediate takeaway, as we just said, is that the level of systematicity of generalization varies by reasoning type. For compositional reasoning, the model never managed to generalize OOD, whereas for comparative reasoning, OOD generalization did happen. So we want to understand why there is such a difference as well.

00:57:24Yu Su

Then another interesting takeaway here is that... so before this study, there were already some studies that look at Grokking and try to understand why Grokking happens and under what conditions would Grokking happen. Then one common belief in the literature was that there is a critical data size. Like once your total amount of data surpasses a certain threshold, then Grokking happens; otherwise it won't happen.

00:57:57Yu Su

But in our study, we examined this hypothesis very carefully, and we found that it's actually not the data size but the data distribution that matters. In these experiments there are two variables: the total data size, which is controlled by the number of entities and is proportional to the total number of training examples; and phi, which is the ratio of inferred to atomic facts. If this number is larger, it means you have more inferred facts relative to atomic facts.

00:58:44Yu Su

Right, so on the right we keep the ratio fixed and increase the number of entities, or the total data size. Then you will see the speed of generalization is roughly the same when you increase the total data size. Right, it's more or less the same. However, when you keep the data size fixed—so it's the same number of entities—you just change the ratio Phi from 3.6 to 18, then you see the speed of generalization strongly positively correlates with this ratio. If you have a higher ratio, then the generalization will happen faster. And at some points it will be as fast as the overfitting of the training data. Right, so this really shows that it's the critical data distribution that matters, not the sheer size of your training data.

00:59:50Yu Su

With all those interesting takeaways, our job is not done here yet, because there are still some very important questions we have not answered. Right, so why does Grokking happen? What exactly happens during this long Grokking process? What's going on within the model? And why does the level of systematicity in generalization vary by the reasoning type? So all of these questions require a deeper look inside the model. That's the mechanistic interpretation part of this study.

01:00:31Yu Su

So we will use some popular and now very standard mechanistic interpretability tools. One is the Logit Lens: we take the internal state at some position within the Transformer and multiply it with the output embedding matrix, so that we get a distribution over the output vocabulary. That way we get a peek into what this internal representation is about.
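
Here is a minimal Logit Lens sketch, assuming a Hugging Face GPT-2-style model whose output embedding is tied to the input embedding; the prompt, layer, and position are just illustrative choices.

```python
# A minimal Logit Lens sketch: decode an intermediate hidden state into vocabulary space.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

inputs = tok("The Eiffel Tower is located in the city of", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

layer, position = 6, -1                                   # which block's output, which token
hidden = out.hidden_states[layer][0, position]            # (hidden_dim,)
hidden = model.transformer.ln_f(hidden)                   # apply the final layer norm first
logits = hidden @ model.transformer.wte.weight.T          # project onto the vocabulary

top = torch.topk(torch.softmax(logits, dim=-1), k=5)
print([(tok.decode(int(i)), round(p.item(), 3)) for p, i in zip(top.values, top.indices)])
```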

01:01:08Yu Su

Then we will also use so-called Causal Tracing. In a nutshell, this technique allows you to quantitatively measure the impact of a certain internal state on the output. It ranges from zero to one; if it's closer to one, it means this position, this internal state, has a larger impact on the final decision. I won't get into the details because that would take too much time.
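
Causal tracing comes in several flavors; below is a minimal activation-patching sketch in that spirit, reusing the `model` and `tok` from the Logit Lens snippet above. The prompts, the patched layer, and the position are illustrative assumptions rather than the paper's setup.

```python
# A minimal activation-patching sketch: patch one clean hidden state into a corrupted
# run and see how much probability the correct answer recovers.
import torch

clean = tok("The Eiffel Tower is located in the city of", return_tensors="pt")
corrupt = tok("The Colosseum is located in the city of", return_tensors="pt")
answer_id = tok(" Paris")["input_ids"][0]

with torch.no_grad():
    clean_hidden = model(**clean, output_hidden_states=True).hidden_states

def answer_prob_with_patch(layer: int, position: int = -1) -> float:
    """P(' Paris') on the corrupted prompt when one clean hidden state is patched in."""
    def hook(_module, _inputs, output):
        # output[0] is this block's hidden states; overwrite one position in place.
        output[0][:, position, :] = clean_hidden[layer + 1][:, position, :]
    handle = model.transformer.h[layer].register_forward_hook(hook)
    try:
        with torch.no_grad():
            logits = model(**corrupt).logits[0, -1]
    finally:
        handle.remove()
    return torch.softmax(logits, dim=-1)[answer_id].item()

# Sweeping layers and positions gives a map of which internal states causally matter.
print(answer_prob_with_patch(layer=6))
```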

01:01:54Yu Su

So with these mechanistic interpretability techniques, we can discover the corresponding generalizing circuits that form for the different reasoning types. And this is what we found—very, very beautiful. We find that for composition, after Grokking, the Transformer learns a generalizing circuit like this. It's a kind of staged circuit.

01:02:25Yu Su

This is the input to the Transformer: you have the head entity, R1, and R2. And this is layer zero, and this is layer eight, the final layer. The first few layers of the Transformer essentially do two things. For this part, they process the first hop: look at H and R1 and find the bridge entity B (we're using the Logit Lens here). So the model has memorized this atomic fact in the first few layers, to the extent that it can reliably predict B at layer five.

01:03:09Yu Su

The other thing that needs to happen is that the model needs to defer the processing of R2. It cannot forget about R2, because R2 needs to be used later: it doesn't need R2 in the earlier layers, but it does need R2 once it has found B, the bridge entity. Once the bridge entity is identified in the internal states, it can be combined with R2—the model has memorized the atomic fact that B with R2 gives T—so it can combine the bridge entity B and the deferred R2 to predict T.

01:03:55Yu Su

Right, so that's the generalizing circuit for composition. It has two clear stages and that will also determine its generalization behavior that we'll analyze later.

01:04:08Yu Su

Comparative reasoning, on the other hand, has a different circuit, what we call a parallel circuit. Remember, for comparison you have two entities, E1 and E2, and you know the value of some attribute of each—say this is Trump and this is Biden, and Trump is 78 and Biden is 82. Then the prediction is who is older: is Trump younger than, older than, or the same age as Biden?

01:04:47Yu Su

Then the circuit here is more of a parallel circuit. Okay, so the first few layers of the Transformer will learn to in parallel retrieve the attribute values like 78 and 82. Then the upper layers will use these retrieved values to do the comparison and then to predict the final answer, whether is smaller than, equal, or larger.

01:05:14Yu Su

So through mechanistic interpretation, we can find the generalizing circuit configurations of the different reasoning types, and they are indeed different. And these different configurations determine their level of systematicity in generalization.

01:05:37Yu Su

But before that, let me share like a simple way to fix or to improve this systematic generalization, especially for composition. Right, so as we showed earlier for composition, OOD generalization never happened. Why? If you look at this circuit and think about it, that becomes kind of obvious.

01:06:08Yu Su

So for the OOD generalization to happen, the model needs to do a few things. It needs to first memorize a copy of the first hop atomic facts like H R1 B in the lower layers of the Transformer, right? So that you can find the bridge entity at layer five. Then it also needs to store a copy of the second hop atomic fact, but in the upper layers. Right, because the second hop has a delayed processing, so it needs to somehow store the atomic fact B R2 T here, not in the lower layer but in the upper layer.

01:06:56Yu Su

However, for OOD generalization—remember, our definition of OOD is that none of these atomic facts has been seen composed with other facts during training—the model has no incentive to store the second-hop atomic facts in the upper layers. It has only seen these atomic facts individually, so it can easily memorize them in the lower layers, but it has no incentive to store another copy in the upper layers, because that requires actual effort. That's why OOD generalization of composition never happens.

01:07:40Yu Su

Well for comparison, you don't have this staged processing. So you only need to store one copy of the atomic facts and you can retrieve that value in the lower layers here for the comparison. So you don't have that issue. And that's why OOD generalization of comparison does happen.

01:08:00Yu Su

Okay, and to further validate this hypothesis, we did an intervention here. So if that is indeed the case—that is because the model doesn't have incentive to store this atomic fact in upper layers—then we just need to do some parameter sharing, some cross-layer parameter sharing. So you kind of tie the weights of the lower layers with the weights of the upper layer, so it's a parameter sharing. And then if the lower layer has memorized an atomic fact, then the upper layer will get a copy of it as well. So we did that intervention and turns out—boom—OOD generalization starts to happen for composition. Right, so this further validates the hypothesis. It's a bit dense, but I think that this is probably one of the most interesting slides of this analysis of this work.
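
Here is a minimal sketch of what such a cross-layer parameter-sharing intervention can look like on a small GPT-2-style model; sharing whole blocks between the lower and upper halves is an illustrative simplification, not necessarily the exact recipe used in the paper.

```python
# A minimal sketch of cross-layer parameter sharing: the upper half of the blocks
# reuses the modules of the lower half, so a fact memorized in a lower block is
# automatically available at the corresponding upper depth as well.
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(n_layer=8, n_embd=256, n_head=4, vocab_size=2400)
model = GPT2LMHeadModel(config)

half = config.n_layer // 2
for i in range(half, config.n_layer):
    # Replace block i with the very same module object as block i - half:
    # their weights are now shared and receive gradients from both depths.
    model.transformer.h[i] = model.transformer.h[i - half]

print(sum(p.numel() for p in model.parameters()))  # unique parameter count drops accordingly
```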

01:08:59Yu Su

So remember, we are trying to understand what exactly is going on during Grokking. We said that the Grokking process is when this generalizing circuit starts to form, but in exactly what way? That's still unclear. Through causal tracing, we can actually identify what is going on during Grokking.

01:09:40Yu Su

Let me first say this: we believe Grokking is the phase transition from rote learning to generalization. Grokking starts when the model has overfit the training data. At that point the model has only done rote learning—it has just brute-force memorized all the training data, including the inferred facts, but not in a generalizable way. Because it has memorized all of them, the training loss is zero, but that doesn't mean it has acquired the generalizing circuit. The Grokking process is essentially the phase transition from this initial rote learning to generalization; it's where the generalizing circuit gets formed.

01:10:49Yu Su

And now let's analyze how exactly it gets formed. The first thing we will look at is this figure. Right, we look at two things. So S5 R1 is this position. Right, so this is the state of layer five at the position of R1. So it's this one: S5 R1. Then we can use Logit Lens to decode what's captured in that state representation.

01:11:32Yu Su

You can see the Grokking process in that state, and similarly we can do the same for S5 R2, at this position. The metric here is MRR, the mean reciprocal rank; it basically tells you the ranking of the bridge entity B in the Logit Lens decoding. If it is one, the state always predicts the bridge entity B, which means it has encoded that information. And once Grokking starts, B is already there: the MRR equals one, so this state always predicts the bridge entity B, which is what we want. We want the first hop to get us the bridge entity. That is great.
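
For reference, mean reciprocal rank is just the average of one over the rank of the target across examples; a minimal sketch (assuming the target appears in every ranked list):

```python
# Mean reciprocal rank (MRR) of the bridge entity under the Logit Lens:
# average 1/rank of the target over examples. MRR = 1 means the target is
# always the top prediction at that state.

def mean_reciprocal_rank(ranked_lists, target):
    reciprocal_ranks = [1.0 / (ranking.index(target) + 1) for ranking in ranked_lists]
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

print(mean_reciprocal_rank([["B", "X"], ["X", "B"]], "B"))  # (1 + 0.5) / 2 = 0.75
```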

01:12:36Yu Su

But remember, we also need the model to do the delayed processing of R2, so that at this point it can combine B and R2 to predict the tail entity T. If you look at the MRR of R2 at this position when Grokking starts, R2 is not there: the state has B here but it doesn't have R2 here, even though the model (this is on the training set) has a training loss of zero. So the model can perfectly predict the tail entity T despite not using R2 here to combine with B. That means the model is just doing rote learning: it has memorized, "Hey, whenever I see this head together with R1 and R2, I predict T." It is not actually doing the staged thing of getting the bridge entity and then combining it with R2 to use another atomic fact and do the reasoning.

01:13:49Yu Su

But through the Grokking process, you can see the MRR of R2 gradually increase, so by the end of Grokking this state always predicts R2. Then you have B here and R2 here, and you can actually recall the second-hop atomic fact to predict T.

01:14:13Yu Su

This is further corroborated by the causal tracing results here. Remember, through causal tracing we can quantify the causal strength of each state on the final prediction. This is the causal strength of each state at the beginning of Grokking, and this is at the end of Grokking; then we can calculate the difference between the two. You can see the difference mainly appears here, at S5 R1. That means at the beginning of Grokking, even though the model has the bridge entity B here, it does not use it: it just did rote learning, so it directly predicts the tail entity.

01:15:08Yu Su

The Grokking process is the process of the model learning to use this bridge entity properly, giving it a stronger causal strength on the final prediction. Combined with what we just showed through the Logit Lens, that the Grokking process is where R2 starts to emerge, this validates the hypothesis that the Grokking process really is the process of forming this generalizing circuit, the staged circuit we just showed for composition.

01:15:48Yu Su

And we can roughly explain (even though we didn't do any proof of any sort) why this kind of Grokking behavior happens, through circuit efficiency and regularization. The generalizing circuit is much more efficient than the memorizing circuit, the circuit for rote learning. And because you also have regularization, the L2 regularization term for example, if you just keep training even after overfitting, the regularization term keeps decreasing and gradually favors the circuit with higher efficiency, which is the generalizing circuit. That's why you find it still beneficial to keep training way beyond overfitting: that's when the regularization starts to kick in and gets you the more efficient circuit.
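
In symbols, a simplified view (not a formal result from the talk): once the data term is saturated at zero, the weight-decay term is what keeps driving the optimization, and it prefers whichever circuit reaches the same training loss with smaller weights, i.e., the more efficient generalizing circuit.

```latex
\mathcal{L}(\theta)
  = \underbrace{\mathcal{L}_{\text{data}}(\theta)}_{\approx\,0\ \text{after overfitting}}
  + \lambda \underbrace{\lVert \theta \rVert_2^2}_{\text{still being minimized}}
```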

01:16:44Yu Su

But let me at least give you a definition of planning, because that in itself is a very confusing thing, for various reasons. In the context of language agents, we will work with this simplified definition of planning: given a goal G, decide on a sequence of actions A0 to An that will lead to a state that passes the goal test. Of course this is oversimplified, because we don't talk about the state space or the observation space, and the actions are not just atomic actions; they have preconditions that have to be met before they can be taken, and so on. But for our purposes, I think this definition is enough.
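
As a concrete, deliberately toy reading of that definition, here is a minimal sketch in which planning is a search for an action sequence whose resulting state passes the goal test; all the callables are placeholders, and real settings would add observations, preconditions, and costs.

```python
from collections import deque

# Toy sketch of the simplified definition: given a goal test, search for a
# sequence of actions A_0 ... A_n whose resulting state passes it. States,
# actions, and transitions are abstract placeholders.

def plan(initial_state, goal_test, actions, transition, max_depth=5):
    frontier = deque([(initial_state, [])])
    while frontier:
        state, action_seq = frontier.popleft()
        if goal_test(state):
            return action_seq                  # A_0 ... A_n leading to a goal state
        if len(action_seq) < max_depth:
            for action in actions(state):
                frontier.append((transition(state, action), action_seq + [action]))
    return None                                # no plan found within the depth budget
```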

01:17:35Yu Su

With this definition, we can analyze the general trends in planning settings for language agents compared with the classic planning setting, which has been studied for decades. Generally, I think the expressiveness of goal specification is drastically increasing: we can now express our goal in natural language, as opposed to some formal language, which is usually much more limited in expressiveness. We also have a substantially expanded, or open-ended, action space. Instead of a constrained action space ("hey, you have a robot, you can move forward, you can turn left," and so on), you now have an open-ended action space, and we will see some examples shortly.

01:18:27Yu Su

And because of those, it becomes increasingly hard to do automated goal tests. Imagine you have a web agent that is doing things on the web; a lot of the time you simply cannot write a goal test beforehand that specifies what the goal state will look like. But that's fine, because fuzziness is really an inherent part of this world.

01:18:55Yu Su

I'll skip this one. Let me just give some examples for web agents, from our Mind2Web. In terms of the task, the goal specification, the user can ask for essentially anything on an open set of websites, so it's very broad and in natural language. In terms of the action space, you have some broad action types (you can type, click, drag, or hover over elements), but the actual actions, like which element you click on, are dynamically populated on each web page. If you go to a different web page, your action space will be different. So it's a very big action space that you have to discover on the fly. And of course, the goal test is also very hard.
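
A small, hypothetical sketch of what "dynamically populated" means here: the operation types are fixed, but the concrete actions are rebuilt on every page from whatever interactive elements that page exposes.

```python
# Hypothetical sketch: the concrete action space of a web agent is the
# cross product of fixed operation types with the interactive elements
# found on the current page, so it changes as the agent navigates.

ACTION_TYPES = ("CLICK", "TYPE", "HOVER")

def action_space(interactive_elements):
    """interactive_elements: e.g. element ids scraped from the current page."""
    return [(op, element) for element in interactive_elements for op in ACTION_TYPES]

print(action_space(["search_box", "add_to_cart_button"]))  # 6 candidate actions on this page
```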

01:19:55Yu Su

Then there is another example, travel planning, with similar characteristics, but we will skip it here. So I think these are some of the general trends for language agent planning, and I think they are good trends. Yes, they make things more challenging, but for the better, because now we can support much more realistic and useful application scenarios with these language agents.

01:20:26Yu Su

Let me talk about this maybe very quickly. For people who are interested in web agents or computer-use agents, I encourage you to look at this series of work: Mind2Web, the first dataset for generalist web agents; then SeeAct, where we first introduced visual perception into web agents; and then OmniAct, where for the first time we built a human-like embodiment for computer-use agents, so the agent only perceives the environment visually (there is no HTML or anything like that) and directly performs pixel-level operations on the screen. This minimal design actually works the best across the board. Then we will talk about WebDreamer, which is model-based planning for web agents.

01:21:19Yu Su

Okay, so let's consider the different planning paradigms for language agents. The most common one is reactive planning, or ReAct. Imagine each node here is a web page. At each state you have several candidate actions, each of which results in a different state. For reactive planning, at each state you observe the environment, reason about it, make a decision, commit to it, and that gets you to another state; then you keep doing this. It is fast and easy to implement, but the downside is that it is greedy and short-sighted, so you often find yourself stuck in some bad state with no way out.
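
A hedged sketch of such a reactive loop; env and llm_decide are placeholders, not a real API.

```python
# Reactive (ReAct-style) agent loop: at each step, observe, reason, commit
# to a single action, and move on. There is no lookahead and no backtracking.

def reactive_agent(env, llm_decide, max_steps=20):
    observation = env.reset()
    for _ in range(max_steps):
        action = llm_decide(observation)       # reason about the current state, pick one action
        observation, done = env.step(action)   # commit immediately
        if done:
            break
    return observation
```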

01:22:04Yu Su

Naturally, when we talk about planning, we think about search, or tree search. Compared with reactive planning, tree search lets you backtrack. You maintain a value assignment for the states on your search frontier and explore the most promising branch. At some point, if you find that branch is not promising, you can backtrack and explore another one. That gives you more systematic exploration.
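
A hedged sketch of this kind of value-guided tree search; note the hidden assumption that earlier states can be revisited at will, which is exactly what breaks on real websites.

```python
import heapq
import itertools

# Best-first tree search over environment states: keep a frontier scored by a
# value estimate, expand the most promising node, and "backtrack" by popping a
# different branch later. All callables are placeholders.

def tree_search(start, actions, transition, value, is_goal, budget=50):
    tie = itertools.count()                    # tie-breaker so the heap never compares states
    frontier = [(-value(start), next(tie), start, [])]
    while frontier and budget > 0:
        _, _, state, path = heapq.heappop(frontier)
        budget -= 1
        if is_goal(state):
            return path
        for action in actions(state):
            nxt = transition(state, action)    # assumes states can be freely re-entered
            heapq.heappush(frontier, (-value(nxt), next(tie), nxt, path + [action]))
    return None
```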

01:22:36Yu Su

But the downside here is that in real-world environments like the internet, there are a lot of irreversible actions that make backtracking impossible or very hard, and a lot of this exploration could also be unsafe or slow. To show you how pervasive these state-changing and irreversible actions are in real-world environments, consider just amazon.com. On this single website there are dozens, if not hundreds, of state-changing actions: you can place an order, make a return, create an account where you agree to the terms of use (which has legal implications), change your privacy settings, and so on.

01:23:27Yu Su

There is no universal, magical undo button that lets you try a bunch of things, magically undo them, and go back to the initial state. That makes tree search in real-world environments hard, and there are also the safety and cost issues.

01:23:50Yu Su

So ideally we want to do model-based planning. Imagine you have a world model: at each state, you can trigger the world model to simulate the outcome of each candidate action. That gives you a chance to evaluate the long-term value and safety of each candidate action before you have to commit to it. You find the most promising candidate action through simulation, commit to it, take it (assuming it is safe), and you get to another state; then you do this all over again. It is faster and safer compared with tree search, and you can still do systematic exploration. But the downside is really how to get this magical world model.
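
A hedged sketch of one decision step of model-based planning: simulate every candidate with a world model, score the imagined outcomes, and execute only the best safe action. The world_model, value, and is_safe callables are placeholders.

```python
# One step of model-based planning: nothing is executed in the real
# environment until simulation has picked the most promising safe action.

def model_based_step(state, candidate_actions, world_model, value, is_safe):
    scored = []
    for action in candidate_actions:
        imagined_state = world_model(state, action)   # simulation only, no real side effects
        scored.append((value(imagined_state), action))
    safe = [(v, a) for v, a in scored if is_safe(a)]
    best_value, best_action = max(safe, key=lambda va: va[0])
    return best_action                                # the single action actually taken
```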

01:24:47Yu Su

Let's start with what a world model is, because this is another overloaded term. For our purpose here, we will take this definition: a world model is a computational model of environment transition dynamics. Basically: if I do this right now, what would happen next? Very simple. Then, if it is that simple and that useful, why hasn't it been done for language agents yet?

01:25:15Yu Su

Well, the issue is that world models in the classic deep learning literature were usually studied in reinforcement learning, where you have simple simulated environments in which you can run millions of trials and use them to learn a world model for that simple environment. What we are dealing with here is much more complicated. Even for a single website, there could be hundreds of different web pages; on a single web page, there could be hundreds of different actions, and they can be constantly changing because the backend database is changing. And this complexity quickly compounds once you consider that there are billions of other websites out there. In that sense, we need a kind of generalist world model for the internet. That seems very hard. Fortunately, we find that LLMs can actually reasonably predict these state transitions.

01:26:16Yu Su

In this example, if you ask an LLM, "If I click this icon, what would happen?", it can recognize that, hey, this is a shirt, so this is probably a product; the next page will probably show some product details, and because it's a shirt, there will be sizing options and other things. Because LLMs have this world knowledge and commonsense knowledge from being trained on the internet, they can do a reasonable job. Far from perfect, but reasonable.
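
A hedged sketch of what such a query to an LLM used as a world model might look like; the prompt wording and the llm callable are illustrative placeholders, not WebDreamer's actual prompts or API.

```python
# Hypothetical query to an LLM acting as a world model for the web: describe
# the current page and a candidate action, and ask it to imagine the next page.

def imagine_next_page(llm, page_description, action):
    prompt = (
        "You are simulating a web browser.\n"
        f"Current page: {page_description}\n"
        f"The user performs this action: {action}\n"
        "Describe the page that most likely appears next."
    )
    return llm(prompt)

# Usage with some hypothetical llm() chat-completion wrapper:
# imagine_next_page(llm, "Search results for 'shirts' on a shopping site",
#                   "Click the first product thumbnail")
```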

01:26:48Yu Su

So WebDreamer leverages exactly that. We simulate a world model for the internet using an LLM, in this case GPT-4. Then when you're at a state and you have a few candidate actions, you can use the world model to simulate the outcome of each candidate action. You can even do multi-step simulation. You have another value function, which is also simulated by an LLM, to tell you how much progress you would have made going down that path. And then you can choose the highest-valued action, take that action, get to the next state, and then do this all over again. Right? So that's model-based planning for web agents. It's far from perfect, but it's a reasonable starting point.

01:27:36Yu Su

In terms of results, we evaluated on WebArena. You saw that model-based planning is more accurate than reactive planning and slightly trails tree search. But remember, tree search is only possible because we are working in a sandbox environment, which is WebArena. If it's on the real websites, then backtracking in tree search will become hard. And also, the model-based planning is much cheaper and much faster.

01:28:06Yu Su

Okay, so the takeaways for planning. I think language agents are expanding into new planning scenarios, characterized by expressive but fuzzy goal specifications, open-ended action spaces, and more difficult goal tests. But using language for reasoning enables new planning capabilities, like world models and model-based planning; and, although we didn't talk about them today, hierarchical planning and dynamic replanning are also very important.

01:28:38Yu Su

And the best planning strategy depends on the LLM. If you have a stronger base LLM, it may require less scaffolding, so your planning strategy could be more reactive. But generally, how to improve planning in LLMs is still largely an open question. Many people are trying the recipe behind o1-style reasoning to see whether it works for planning, but that is still a big open question.

01:29:11Yu Su

Okay, so I think we're really just standing at the dawn of a long journey. We talked about planning, reasoning, world models, and memory, but there are a lot of other things we didn't talk about and there's a lot to be done.

01:29:24Yu Su

Just a few immediate future directions I find interesting. I think memory, personalization, and continual learning—we're really just scratching the surface. There is a whole lot to be done on how to enable agents to continue to learn from use and exploration.

01:29:42Yu Su

Then reasoning—like how to make o1 or R1 style reasoning work for language agents where you need to deal with this fuzzy world without reliable rewards, and how to integrate these external actions and environmental states. That's a big open question, and I expect there will be a lot of study on this in 2025.

01:30:05Yu Su

Then planning: how to build better world models instead of just simulating one, and how to balance reactive and model-based planning. You don't really want to do simulation at every single step; that's very costly. Even though humans can do simulation, we don't do it for every single decision, only for the difficult ones. How to balance reactive and model-based planning is still an open question, as is how to sustain planning over a long horizon.

01:30:37Yu Su

Then there is safety, which I think is a very pressing issue that really keeps me up at night. The attack surface of language agents is scarily broad: for web agents, the attack surface is essentially the entire internet. Someone can embed something on a seemingly benign website, and your agent (for example, OpenAI's deep research agent) can visit it, get tricked by it, and then maybe reveal your private information, and things like that.

01:31:09Yu Su

So there are two general types of safety issues. There are endogenous risks: safety risks originating from within, from the agent itself, usually because of the incompetence of the agent, such as mistakenly taking some irreversible action that does harm. And then there are exogenous risks: risks coming from the external environment.

01:31:37Yu Su

Okay, but there are also a lot of exciting applications. Probably the one with the clearest business case is agentic search, or deep research. If you have not used Perplexity Pro or Google's or OpenAI's deep research, I highly encourage you to try them. I think something big is being baked here, and I think this will become a huge thing in 2025.

01:32:04Yu Su

Then there is also workflow automation. And personally, I'm very excited about developing agents for science. These are all very exciting directions.

01:32:18Yu Su

For more comprehensive coverage of language agents, I encourage you to check out our tutorial with De that we did at EMNLP about two months ago. All the materials, like slides and videos, are available on our website. I thank all of my sponsors, and I'm happy to take any questions.
