
LLM Agents MOOC | UC Berkeley CS294-196 Fall 2024 | LLM Agents: History & Overview by Shunyu Yao

Berkeley RDI Center on Decentralization & AI
Hosts: Shunyu Yao
📅September 18, 2024
⏱️01:08:43
🌐English

Disclaimer: The transcript on this page is for the YouTube video titled "LLM Agents MOOC | UC Berkeley CS294-196 Fall 2024 | LLM Agents: History & Overview by Shunyu Yao" from "Berkeley RDI Center on Decentralization & AI". All rights to the original content belong to their respective owners. This transcript is provided for educational, research, and informational purposes only. This website is not affiliated with or endorsed by the original content creators or platforms.

Watch the original video here: https://www.youtube.com/watch?v=RM6ZArd2nVc

00:00:00Shunyu Yao

Okay, cool. Hi, my name is Shunyu. Super glad to be here and talk to you about LLM agents: a brief history and overview.

00:00:10Shunyu Yao

So today's plan is very straightforward. I want to talk about three things. So first, what is an LLM agent to start with? And then second, I want to talk about a brief history of LLM agents, both in the context of LLMs and in the context of agents. And lastly, I want to share some ideas on some future directions of agents.

00:00:30Shunyu Yao

As you know, this field is a moving piece, and it's very big and messy, so it's impossible to cover everything in agents. So I'll just try to do the best I can. And you can see there's a QR code, and you can scan it and give me feedback, and I can improve the talk accordingly.

00:00:51Shunyu Yao

Okay, so let's get started. First, what is an LLM agent? Does anyone know? Does anyone think they know the answer? If so, raise a hand. Like, do you have a definition for what an LLM agent is?

00:01:08Shunyu Yao

Okay, there are like maybe three people. So that means this field is really a moving piece. So I think if we want to define what is an LLM agent, we want to first define the two components: what is an LLM and what is an agent?

00:01:25Shunyu Yao

And does everyone know what an LLM is? Okay, so what's left is we need to define what's an agent. And if you search Google Images, this is an agent, right? But in the context of AI, obviously, we know it's a notoriously broad term. It can refer to a lot of different things, right? From autonomous cars to Go-playing systems to video game agents to chatbots.

00:01:58Shunyu Yao

So first, what exactly is an agent? So, my definition is that it is an intelligent system that can interact with some environment. And depending on different environments, you can have different agents, right? You can have agents in physical environments, such as robots, autonomous cars, and so on. And you can have agents that interact with digital environments, such as video games or iPhones. And if you count humans as an environment, then a chatbot is also some kind of agent.

00:02:27Shunyu Yao

And if you want to define an agent, you really need to define what is intelligent and what is the environment. And what's really interesting is that throughout the history of AI, the definition of what is intelligent often changes over time, right? So like 60 years ago, if you had a very basic chatbot using like three lines of rules, then it could be seen as intelligent. But right now, even ChatGPT is not surprising anymore. So I think a good question for you all is: how do you even define intelligence?

00:02:56Shunyu Yao

Okay, so let's say you have some definition of agent. Then what is an LLM agent? So I really think there are three categories or three concepts. So I think the first level of concept is what is a text agent. And a text agent is an agent interacting with an environment where both the observations and the actions are in language, in text.

00:03:27Shunyu Yao

Obviously, you can have text agents that are not using LLMs. And in fact, we have had text agents from the beginning of AI, like several decades ago. And I think the second level of definition is LLM agent, which is a text agent that uses LLMs to act, right? And I think the last level is what I call a reasoning agent. And the idea is that those agents use LLMs to reason in order to act. And so right now, you might be confused about what the difference is between the second level and the third level, which I will explain later.

00:04:07Shunyu Yao

So like I said, people have been developing text agents from the beginning of AI, right? So for example, back in the 1960s, there were already chatbots. Eliza is one of the earliest chatbots. And the idea is really simple. You just have a bunch of rules. And what's really interesting is that using just a bunch of rules, you can already make a chatbot that seems quite human, right? And what it does is it keeps asking you questions or repeating what you said, and people find it very human.

00:04:42Shunyu Yao

But obviously, there are limitations to those kind of rule-based agents. As you can see, like if you want to design rules, then it often is very task-specific. And for each new domain, you need to develop some new rules, right? And lastly, those rules don't really work beyond a simple domain, right? Suppose you write many rules to build a chatbot, but then you need to write many rules for a video game agent, and so on and so forth.

00:05:15Shunyu Yao

Before LLMs, there's another very popular paradigm, which is to use RL to build text agents. And the idea is, I'm sure everybody has seen video games, right? So you can imagine text games where instead of pixels and keyboard, you're using text as observation and action. And you similarly have rewards. You can similarly use reinforcement learning to optimize the reward. And the idea is you can just optimize the reward and you exhibit some kind of language intelligence.

00:05:46Shunyu Yao

But again, this kind of method is pretty domain-specific, right? For each new domain, you need to train a new agent. And it really requires you to have a scalar reward signal for the task at hand, which many tasks don't have. And lastly, it takes extensive training, which is a feature of RL.

00:06:08Shunyu Yao

So really, think about the promise of LLMs to revolutionize text agents, right? These LLMs are really just trained on next-token prediction over massive text corpora, yet during inference time, they can be prompted to solve various new tasks. So this kind of generality and few-shot learning is really exciting for building agents.

00:06:37Shunyu Yao

So next, I want to give a brief overview of LLM agents, and it's also like a historical view, and it's obviously very simplified. So what's happening is, first we have something like LLMs in 2020. I think the beginning of LLMs is GPT-3. And then people start to explore that across different tasks. And some tasks happen to be reasoning tasks, such as question answering or symbolic reasoning and so on. And some tasks happen to be what I call acting tasks. You can think of games or robotics, and so on and so forth.

00:07:12Shunyu Yao

And then we find that this paradigm of reasoning and paradigm of acting are starting to converge, and we start to build what I call reasoning agents that are actually quite different from all the previous agents. And from reasoning agents, we start to explore on one hand more interesting applications and tasks and domains, such as web interaction or software engineering or even scientific discovery and so on. And on the other hand, we start to explore new methods such as memory or learning or planning or multimodality and so on.

00:07:52Shunyu Yao

So first, I want to introduce, you know, what I mean by the paradigm of reasoning and what I mean by the paradigm of acting and how they converge and what is this paradigm of reasoning agents. And history is always messy, so for now, let's just assume, let's focus on one task, which is question answering, which can simplify our history discussion a little bit, and then we'll come to more tasks.

00:08:20Shunyu Yao

So question answering is a very intuitive task, right? So if you ask a language model, what is one plus two, it will tell you three, right? That's question answering. It's very intuitive. So it also happens to be one of the most useful tasks in NLP, right? So obviously people try to use language models to do question answering, and then people find a lot of questions, a lot of problems when you try to answer questions.

00:08:50Shunyu Yao

Okay, so if you have some question like this, it will be very hard for the Transformer language model to just output the answer directly, right? So it turns out you need some reasoning. And as covered in the last talk, like Chain of Thought reasoning and so on and so forth, there has been a lot of people investigating how to do better reasoning with language models.

00:09:12Shunyu Yao

You can also imagine a language model trying to answer something like this, and it will probably get the answer wrong because, for example, if a language model was trained before 2024, and the prime minister of the UK changes often, as you know, it might get the answer wrong, right? So in that case, you need new knowledge, and people are working on that.

00:09:35Shunyu Yao

And for another example, like you can ask something that's really mathematical and really hard. And in that case, you cannot really expect a Transformer to give the answer right. So in some sense, you need some way of doing computation beyond the naive auto-regression of a Transformer.

00:09:55Shunyu Yao

So as you can see, there are many types of question answering tasks, and people find many problems when using language models to answer those questions, and then people come up with various solutions. So for example, if you're trying to solve the problem of computation, what you can do is you can first use the language model to generate a program, and then this program will run and give you a result. That's the way you can answer, you know, a question about prime factorization or what's the 50th Fibonacci number.
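To make this concrete, here is a minimal Python sketch of the "let the model write a program, then run the program" pattern. The `llm` function is a hypothetical stand-in for any model call, hard-coded here so the snippet runs on its own; it is not a specific API.

```python
# Minimal sketch: the LLM generates code, and the code (not the transformer)
# does the computation. `llm` is a hypothetical stand-in, hard-coded here.
def llm(prompt: str) -> str:
    # A real model would generate this program; we hard-code a plausible output.
    return (
        "def solve():\n"
        "    # 50th Fibonacci number\n"
        "    a, b = 0, 1\n"
        "    for _ in range(50):\n"
        "        a, b = b, a + b\n"
        "    return a\n"
    )

def answer_with_program(question: str) -> str:
    code = llm(f"Write a Python function solve() that answers: {question}")
    scope: dict = {}
    exec(code, scope)              # run the generated program (sandboxed in practice)
    return str(scope["solve"]())   # the program does the arithmetic exactly

print(answer_with_program("What is the 50th Fibonacci number?"))  # -> 12586269025
```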

00:10:30Shunyu Yao

So for the problem of knowledge, there's this paradigm of retrieval-augmented generation, right? And the idea is very simple, right? You assume you have some extra corpora, for example, Wikipedia or the corpus of this company, for example. And then you have a retriever, whether it's a BM25 or DPR or so on and so forth. You can think of a retriever as kind of a search engine, right?

00:10:57Shunyu Yao

So what it does is, given a question, this retriever will actually just pull the relevant information from the corpus and then append that to the context of the language model so that it's much easier for the language model to answer the question.
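As an illustration of that retrieve-then-read loop, here is a toy sketch where the "retriever" is just word-overlap scoring standing in for BM25 or DPR, `llm` is again a hypothetical model call, and the three-document corpus is made up for the example.

```python
# Toy retrieval-augmented generation sketch: score documents, pull the best
# one into the prompt, and let the model answer from the retrieved context.
corpus = [
    "Rishi Sunak became prime minister of the United Kingdom in October 2022.",
    "The Fibonacci sequence starts 0, 1, 1, 2, 3, 5, 8, ...",
    "Apple, Microsoft, and Nvidia are among the largest US companies by market cap.",
]

def retrieve(question: str, k: int = 1) -> list[str]:
    q = set(question.lower().split())
    scored = sorted(corpus, key=lambda doc: -len(q & set(doc.lower().split())))
    return scored[:k]

def rag_answer(question: str, llm) -> str:
    context = "\n".join(retrieve(question))
    # Retrieved passages are prepended so the model answers from them instead
    # of from possibly stale parametric knowledge.
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return llm(prompt)

print(retrieve("Who is the prime minister of the UK?"))
```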

00:11:15Shunyu Yao

So this is a very good pattern. However, what if there's no corpus for the knowledge or information that you care about, right? For example, if I care about today's weather in San Francisco, it's very hard to expect any existing corpus to have that, right? So people also find this solution called tool use. And the idea is you have this natural form of generation, which is to generate a sentence, but then you can introduce some special tokens so that it could invoke tool calls, right? For example, you have a special token of a calculator or a special token for a Wikipedia search or a special token for calling a weather API.
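A rough sketch of what such special-token tool calls could look like, in the spirit of Toolformer: the `[NAME(args)]` token format and the two toy tools here are illustrative assumptions, not any particular model's actual interface.

```python
import re

# Tool use via special tokens embedded in generated text: find each token,
# run the named tool, and splice the result back into the generation.
TOOLS = {
    "CALC":    lambda expr: str(eval(expr, {"__builtins__": {}})),   # toy calculator
    "WEATHER": lambda city: f"<weather for {city} from some weather API>",
}

def run_tool_calls(generation: str) -> str:
    def replace(match: re.Match) -> str:
        name, args = match.group(1), match.group(2)
        return TOOLS[name](args)
    return re.sub(r"\[([A-Z]+)\((.*?)\)\]", replace, generation)

text = "The product is [CALC(1234 * 5678)], and it may rain: [WEATHER(San Francisco)]."
print(run_tool_calls(text))
# -> The product is 7006652, and it may rain: <weather for San Francisco from some weather API>.
```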

00:11:58Shunyu Yao

This is very powerful. Obviously, you can augment language models with a lot of different knowledge, information, and even computation. But if you look at this, this is not really a very natural format of text, right? There's no blog post or Wikipedia passage on the internet that looks like this. So if you want the language model to generate something like this, you have to fine-tune it in this very specific format. And it turns out to be very hard to make such tool calls more than once across a piece of text.

00:12:36Shunyu Yao

So another natural question comes, right? What if you need both reasoning and knowledge? And people actually came up with a bunch of solutions for different tasks, right? For example, you can imagine interleaving the Chain of Thought and retrieval or generating follow-up questions and so on and so forth. But without needing to get into the details of all the methods, I just want to point out like the situation at the time was a little scattered, right?

00:13:05Shunyu Yao

So you have this single task called QA, but it turns out to be more than a single task. You actually have like tons of different benchmarks, right? And they happen to challenge language models in very different ways, and people come up with solutions for each of the benchmarks. So it feels very piecemeal, at least for me, right? And at least for me at the time, the question is, can we really have a very simple and unifying solution? And I think if we want to do that, we really need abstraction beyond individual tasks or methods. We need like a higher-level abstraction over what's happening.

00:13:50Shunyu Yao

So the abstraction that I found, at least for myself, is the abstraction of reasoning and acting. So what is reasoning? Hopefully you already know that from Denny's talk last time. Chain of Thought, right? It's very intuitive, and it's just a very flexible and general way to augment test-time compute and to think for longer during inference time to solve more complex questions, right? However, if you only do Chain of Thought, you don't have any external knowledge or tools, right? Even the biggest, smartest model in the world does not know the weather in San Francisco today. So if you want to know that, you need an external environment for knowledge and tools.

00:14:33Shunyu Yao

And what I have described as like RAG or retrieval or code or tool use and so on and so forth, in some sense, it's just a paradigm of acting, because you're just assuming you are having an agent and you have various environments, whether it's retrieval, search engine, calculator API, or Python, right? And the benefit of interacting with the external environment is that it's very flexible and a general way to augment knowledge and computation and feedback and so on and so forth. However, it doesn't have reasoning, and we will see later why that's troublesome.

00:15:13Shunyu Yao

So the idea of this work called ReAct is actually very simple, right? So you have these two paradigms: reasoning and acting. And before ReAct, language models were either generating reasoning or generating actions. And for ReAct, the idea is to just generate both. And we will see that it's actually a great way to synergize both, in the sense that reasoning can help acting, and acting can help reasoning. And it's actually quite simple and intuitive. You can argue that's how you or I would solve a task. It's a very human way to solve tasks, and it's very general across domains.

00:15:54Shunyu Yao

So the idea of ReAct is very simple, right? Suppose you want to do a task, and what you do is you write a prompt, and the prompt consists of a trajectory that looks like this. So you give an example task, and as a human, you just write down how you think and what you do to solve the task, along with the observations along the way, right? So if you're trying to answer a question using a Google search engine, you just think about some stuff and do some searches. And then you write that down, and you also write down the results from Google. And you keep doing that until you solve the task.

00:16:27Shunyu Yao

You can give this one example, and then you can give a new task. And given this prompt, the language model will generate a thought and action. And this action is parsed and fed into the external environment. And then that would trigger some observation. And then the thought, action, observation is appended to the context of the language model. And then the language model generates the new thought and new action, and so on and so forth.
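Here is a minimal sketch of that loop in Python. The `llm` callable, the `google`/`finish` action names, and the `Action: name[argument]` format are assumptions for illustration; the point is the generate, parse, execute, append cycle.

```python
import re

# Minimal ReAct-style loop: the model emits a thought and an action, the
# action is parsed and executed in the environment, and the observation is
# appended to the context before the next step.
def react(question: str, llm, tools: dict, max_steps: int = 8) -> str:
    context = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(context)                       # e.g. "Thought: ...\nAction: google[apple market cap]"
        context += step + "\n"
        match = re.search(r"Action:\s*(\w+)\[(.*)\]", step)
        if not match:
            continue
        action, arg = match.group(1), match.group(2)
        if action == "finish":                    # terminal action: return the answer
            return arg
        observation = tools[action](arg)          # e.g. run a Google search
        context += f"Observation: {observation}\n"  # feed the result back as context
    return "no answer within budget"
```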

00:16:53Shunyu Yao

So obviously, you can do that using a single example. That's called one-shot prompting. You can do that with a few examples. That's called few-shot prompting. If you have many, many examples, you can also fine-tune a language model to do that. So it's really a way of using a language model, rather than a particular prompting or fine-tuning method.

00:17:15Shunyu Yao

So as a concrete example, let's say you want to answer a question, right? If I have $7 trillion, can I buy Apple, Nvidia, and Microsoft? I made this slide back in March, and that was a trendy topic at the time. So you can write down a prompt like that. You just say, "Okay, language model, now you're an agent, and you can do two types of actions. You can either Google, or you can finish with the answer. And you just need to write down the thought and action."

00:17:42Shunyu Yao

Okay, so that's very intuitive. And let's see what the language model does, right? So this is what GPT-4 did back in March. So it first generates a thought, right? First, I need to find the market caps of those companies and then add them together so that I can determine if $7 trillion can buy all three companies. And then this triggers an action to search on Google. And this Google search returns a snippet as a result. Fortunately, it just happens to contain all the market caps you need.

00:18:17Shunyu Yao

So the ReAct agent thinks, "Now I have all the market caps. All I need to do is add them together." So it uses the search engine as a calculator, adds them together, and gets a result. And it thinks, "Okay, the total is around $7.2 trillion, so $7 trillion is not enough. You need additional money to buy them." I think if you ran it today, you would need even more money because Nvidia is much higher now. Yeah, so that's how ReAct solves the task. And you can see it's a very intuitive way, very similar to how humans solve the task, right? You think about the situation, you do something to get some more knowledge or information, and then based on that information, you think more.

00:19:02Shunyu Yao

And then I try to be a little more adversarial. So instead of finding all the market caps, I inject this adversarial observation, right? "Nothing is found." And here comes the power of reasoning, right? So reasoning actually finds a way to adjust the plan and guide the action to adapt to the situation, right? Because the search is not... the result is not found, maybe I can search for individual market caps, right? So I can just search for the market cap of Apple.

00:19:29Shunyu Yao

And then I try to be adversarial again. I give the stock price instead of the market cap. And here reasoning helps again, right? Based on common sense, it figures out: this is probably the price, not the market cap. So if you cannot find the market cap, what you can do is find the number of shares, and then you can multiply the number of shares by the stock price to get the market cap. And then you can do that for all three companies, and then you can solve the task.

00:19:56Shunyu Yao

So from the example, you can see that it's not only acting helping reasoning, right? Obviously, acting is helping reasoning to get real-time information or do calculations in this case. But also, reasoning is constantly guiding the acting, planning and replanning based on the situation and on exceptions.

00:20:22Shunyu Yao

So you can imagine something like this to solve various question answering tasks. All you need to do is to provide different examples and provide different tools. So, okay, this is good. We're making progress. But I think what's really cool is that this paradigm goes beyond QA, right? So if you think about it, you can literally use it to solve any task. And to realize this, all you need to realize is that many tasks can be turned into a text game.

00:20:55Shunyu Yao

So imagine you have a video game. What you can do is assume you have an image or video captioning model, and you have some controller that can turn a language action into keyboard actions. And then you can literally turn many of these tasks into a text game, and then you can literally use ReAct to solve them. So it goes well beyond question answering.

00:21:23Shunyu Yao

So after the invention of these large models, obviously another part of the history is that there are people from reinforcement learning, robotics, video games, and so on and so forth trying to apply this technique. And there are many works, and I'm only listing one as an example. And the idea is very intuitive. Like I said, you can try to turn all the observations into text observations, and then you can try to use a language model to generate a text action, and then you turn the text action into the original format of action, and then you solve the task.

00:21:56Shunyu Yao

But what's the issue of this, right? So this is an example from a video game where you're trying to do some household task in a kitchen, right? And the problem really is sometimes it's really hard to directly map observation into the action, because for one, you may have never seen the domain. Second, to process, you know, from the observation to action, you need to think. But if you don't have this thinking paradigm, all you're doing is just trying to imitate the observation to action mapping from the prompt or from the few-shot example.

00:22:36Shunyu Yao

So in this case, in the sink basin, there is no pepper shaker, so nothing happens. But because it doesn't have the capacity to think, it will just keep doing that and keep failing because it's like a language model, it's just trying to imitate it. So it's not really trained to solve the task like an agent.

00:23:00Shunyu Yao

So what we use is actually something very simple, right? You are literally just adding another type of action called "thinking." And thinking is a very interesting action because you can think about anything, right? So in this video game, you might only be able to go somewhere or pick up something. That's the action space defined by the environment. But you can think about anything. And you can see that this thinking action is very useful because it helps you plan the situation, it helps you keep track of the situation, and it helps you plan and replan if something wrong happens.
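A tiny sketch of what "thinking as an extra action" means: environment actions go through the environment's own transition function, while a `think[...]` action only gets appended to the agent's context. The action names and the environment interface here are made up for illustration.

```python
# "Thinking" as an augmented action: it never changes the world, only the
# agent's own context; everything else goes to the environment as usual.
def execute(action: str, env, context: list[str]) -> str:
    if action.startswith("think["):
        context.append(action)            # unconstrained free-form text...
        return "OK"                       # ...with no effect on the environment
    if action in env.valid_actions:       # e.g. "go to sinkbasin 1", "take peppershaker 3"
        return env.step(action)           # the environment-defined action space
    return "Nothing happens."             # what text-game environments typically say otherwise
```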

00:23:35Shunyu Yao

So as you can see, ReAct is a general paradigm that helps across various tasks and is systematically better than doing only reasoning or only acting. This is interesting, right? And I just want to point out why this is interesting from a more theoretical perspective. So again, abstraction, right? So if you think about all the agents out there, everything, right? From video games to AlphaGo to autonomous cars, whatever, one common feature is that you have an action space that's defined by the environment, right?

00:24:21Shunyu Yao

So assume you're solving a video game, say an Atari game, then your action space is left, right, up, down. You can be very good, you can be very bad, but your action space is fixed. And what's really different for a language agent or an LLM agent or a reasoning agent is that you have this augmented action called reasoning.

00:24:47Shunyu Yao

And what's really interesting about this augmented action is that it could be any language, right? You can think about anything. It's an infinite space. You can think about a paragraph, you can think about a sentence, you can think about a word, you can think about 10 million tokens. And it doesn't do anything to the world, right? No matter what you think, it doesn't really change the Earth or the video game you're playing. All it does is it changes your own context, right? It changes your memory, and then based on that, it changes your follow-up actions.

00:25:21Shunyu Yao

So that's why I think this new paradigm of reasoning agents is different. It's different because reasoning is an internal action for agents, and reasoning has a very special property because it's an infinite space of language. Cool.

00:25:45Shunyu Yao

So we've covered the most important part of the talk. I think the history goes on, right? So from now on, we have the paradigm of reasoning agents, and then we have more methods, more tasks, and there's a lot of progress obviously, and I cannot cover everything. So on the methodological side, I just want to cover one thing today, which is long-term memory.

00:26:12Shunyu Yao

So we just talked about what a reasoning agent is. And the idea is you have an external environment, be it a video game or a Google search engine or your car or whatever. And we just talked about how what makes a reasoning agent different is that the agent can also think, right? Another way to think about this is you have an agent that has a short-term memory, which is the context window of the language model.

00:26:47Shunyu Yao

And it's interesting that you can append interesting thoughts and actions and observations to this context. But if you look at this context window of the language model, first, it's append-only, right? So you can only append new tokens to the context. And you have limited context, right? So it could be a thousand tokens two years ago, it could be a million tokens now, it could be 10 million tokens next year, but you have a limited size of context.

00:27:24Shunyu Yao

And even let's say we have a 10 million token window, you might have limited attention, right? So you can have a lot of distracting things if you're doing a long horizon task, right? And lastly, it is a short-term memory because this kind of memory does not persist over time or over new tasks, right?

00:27:54Shunyu Yao

So you can imagine, let's say this agent solved the Riemann hypothesis today, which is really good. But then unfortunately, if you don't fine-tune the language model, right, it doesn't change, right? So next time, it has to solve from scratch again, and there's no guarantee whether it will solve it tomorrow, right? So I think an analog I want to make is it's kind of like a goldfish, right? So folk wisdom is a goldfish only has three seconds of memory, right? So you can solve something remarkable, but if you cannot remember it, then you have to solve it again, and it's really a shame, right?

00:28:38Shunyu Yao

So hope that's motivating enough to introduce this concept of long-term memory, right? So it's just like, as a human, right, you cannot remember every detail in every day, right? But you maybe, you may write a diary, right? That's kind of like a long-term memory. You read and write important stuff for your life, for your future life, like important experience, important knowledge, or important skills. And hopefully that should persist over new experiences, right?

00:29:12Shunyu Yao

So you can also imagine a mathematician, right, writing a paper on how to prove the Riemann hypothesis. That's kind of like a long-term memory, right? Because then you can just read the paper and you can prove it. You don't have to solve it again.

00:29:30Shunyu Yao

So let's look at a very, very, very simple form of long-term memory in this work called Reflexion, which is a very simple follow-up to ReAct. So let's say you're trying to solve a coding task, right? This is a task, and you can imagine you can write some program, you can run the program, you can reason, you can do whatever. But at the end of the day, right, you test it, and let's say it doesn't work, right? Some tests failed.

00:30:04Shunyu Yao

So if you don't have a long-term memory, then you just have to try again, right? But what's different now is if you have a long-term memory, what you can do is you can reflect on your experience, right? So if you wrote a program and it failed some tests, you can think about it, right? It's like, "Oh, I failed this task because I forgot about this corner case. So if I write this program again, I should remember this." And what you can do is you can persist this piece of information over time. Like when you write this program again, you can literally read this long-term memory, and then you can try to be better next time, and hopefully it will improve.
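A minimal sketch of this reflect-and-retry loop for coding, assuming hypothetical `llm` and `run_tests` helpers; the key point is that what persists across attempts is text stored in a memory list, not any weight update.

```python
# Reflexion-style loop: attempt, run the tests, reflect on the failure in
# natural language, store the reflection, and include it in the next prompt.
def reflexion(task: str, llm, run_tests, max_trials: int = 3) -> str:
    memory: list[str] = []                           # long-term memory of verbal reflections
    program = ""
    for _ in range(max_trials):
        prompt = task + "\nLessons from past attempts:\n" + "\n".join(memory)
        program = llm(prompt)
        passed, feedback = run_tests(program)        # e.g. unit test output or compiler errors
        if passed:
            return program
        # Ask the model to explain the failure; persist that text, not gradients.
        memory.append(llm(f"{task}\nAttempt:\n{program}\nTests said:\n{feedback}\n"
                          f"Reflect in one sentence on what to do differently."))
    return program
```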

00:30:53Shunyu Yao

This turns out to be working really well for various tasks, but particularly coding, right? Because for coding, you have great feedback, which is the unit test result. And you can just keep reflecting on your failure or success, and then you can keep track of the experience as a sort of long-term memory, and then you can get better.

00:31:18Shunyu Yao

Another way to think about this is it's really a new way of doing learning, right? So if you think about the traditional form of reinforcement learning, right? So you do something, and then you get a scalar reward. And what you do is essentially trying to backpropagate the reward to update the weights of your policy. And there are like many, many algorithms to do that.

00:31:47Shunyu Yao

If you think about Reflexion, right, it's really a different way of doing learning because first, you're not using a scalar reward, you can use anything, right? You can use a code execution result, you can use a compiler error, you can use the feedback from your teacher, which is in text, and so on and so forth. And it's not doing learning by gradient descent, right? It's learning by updating language, right? By language, I mean a long-term memory of task knowledge. And then you can think of this language as affecting the future behavior of the policy, right?

00:32:30Shunyu Yao

So this is only a very simple form of long-term memory, and I think follow-up work did more complicated stuff. You will hear about Voyager from Jim later, I guess, where you have like a memory of code-based skills, right? And the idea is, for example, you're trying to play Minecraft, and you learn how to build a sword in this kind of API code, then you can try to remember it. The next time if you want to kill a zombie, you can first pull the skill of building a sword. You don't have to try it from scratch, right?

00:33:07Shunyu Yao

And for example, in this work on Generative Agents, the idea is you have like 20 human-like agents in this small town trying to be human. You know, they have jobs, they have lives, they have social interactions, and so on and so forth. You have this episodic form of long-term memory where each agent literally keeps a log of all the events that happened, right? Every hour, right? That's like the most detailed diary you could possibly have. And you can imagine later, if you want to do something, you can look at the log to decide what to work on, right? Because if you dropped your kid off at some place, you want to retrieve that piece of information and then go pick them up.
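A toy sketch of such an episodic log: events are appended with timestamps, and retrieval scores entries by a mix of keyword relevance and recency. The scoring weights are arbitrary illustrative choices, not the paper's actual retrieval function.

```python
from dataclasses import dataclass

# Episodic memory as an append-only event log with simple relevance/recency retrieval.
@dataclass
class Event:
    time: int          # e.g. hour index in the simulation
    text: str

log: list[Event] = []

def remember(time: int, text: str) -> None:
    log.append(Event(time, text))

def recall(query: str, now: int, k: int = 3) -> list[str]:
    q = set(query.lower().split())
    def score(e: Event) -> float:
        relevance = len(q & set(e.text.lower().split()))   # keyword overlap with the query
        recency = 1.0 / (1 + now - e.time)                  # more recent events score higher
        return relevance + recency
    return [e.text for e in sorted(log, key=score, reverse=True)[:k]]

remember(8, "dropped the kid off at school")
remember(9, "worked on the shop inventory")
print(recall("where is the kid", now=17))
```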

00:33:55Shunyu Yao

You can also have this kind of semantic memory where you can look at your diary, right, and you can draw some conclusions about other people and yourself, right? You can realize, you can reflect on that, and you can say, "Okay, Jim is actually a very curious guy, and I actually like video games." And this kind of knowledge can actually affect your behavior later.

00:34:28Shunyu Yao

And I think the final step to finishing this part is to realize that you can actually also think of the language model as a form of long-term memory, right? So you can learn by... learn, I mean improve. You can improve yourself, or you can say you can change yourself by either changing your parameters of the neural network, which is to fine-tune your language model, or you can store some piece of code or language or whatever in your long-term memory, and then you can retrieve from it later, right? So that's just two ways of learning.

00:35:09Shunyu Yao

But if you think of both the neural network and whatever text corpus as both a form of long-term memory, then you have a unified abstraction of learning. And then you have an agent that has this power of reasoning over a special form of short-term memory called the context window of the LLM model. And then you can have various forms of long-term memory. And in fact, you can show that this is almost just sufficient to express any agent.

00:35:43Shunyu Yao

So I have this paper called CoALA, Cognitive Architectures for Language Agents, which I don't have time to cover today, but I encourage you to check it out. The statement is that you can literally express any agent by its memory, which is where the information is stored; its action space, like what the agent can do; and its decision-making procedure, basically, given the space of actions, which action you want to take, right? You can literally express any agent with these three parts. So this is a very clean and sufficient way of thinking about any agent.
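A loose paraphrase of that three-part view in code, under the assumption that "memory, action space, decision procedure" can be read as a simple data structure; this is an illustration, not the paper's reference implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# An agent described by what it remembers, what it can do, and how it decides.
@dataclass
class Agent:
    memory: dict = field(default_factory=dict)             # where information is stored
    actions: list[str] = field(default_factory=list)       # what the agent can do
    decide: Optional[Callable[[dict, list[str]], str]] = None  # how the next action is chosen

    def step(self, observation: str) -> str:
        self.memory.setdefault("episodic", []).append(observation)  # record what just happened
        return self.decide(self.memory, self.actions)                # pick an action from memory
```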

00:36:22Shunyu Yao

And I want to leave two questions for you to think, and I have an answer in this paper that you can try to retrieve. So the first question is, what makes an external environment different from internal memory, right? So imagine if the agent opens up a Google Doc and writes something there, is that a form of long-term memory or is that like some kind of action to change the external environment? Or imagine if the agent has an archive of the internet, right, and it tries to retrieve some knowledge from there. Is that a kind of action or is that a kind of retrieval from a long-term memory?

00:37:07Shunyu Yao

I think this question is interesting because if you think about physical agents like humans or autonomous cars, right, it's very easy to define what is external and what is internal, because for us, what's outside our skin is external, what's inside our skin is internal, right? It's very easy to define. But I want you to think about for digital agents, how can you even define that?

00:37:32Shunyu Yao

And lastly, how do you even define long-term memory versus short-term memory? Like, suppose you have a language model context of 10 million tokens. Is that still a short-term memory, or does it become a kind of long-term memory? Note that those terms are borrowed from human psychology and neuroscience. So think about these two questions.

00:38:01Shunyu Yao

So, okay, so we have covered some brief history of LLM agents. I also want to talk about the history of LLM agents in the broader context of agents, right? We have talked about how we start from LLMs to derive various things and other developments of language agents. But if you look at a more ancient history, how is a reasoning agent different from all the previous paradigms of agents?

00:38:29Shunyu Yao

So here I want to give a very, very minimum history of agents, and it's definitely wrong. So it's just for illustration, right? Don't take that too seriously. But I think if you want to write a very minimal history of agents in one slide, at the beginning of AI, right, the paradigm is called symbolic AI, and you have symbolic AI agents. And the idea is kind of like programming, right? You can program all the rules to interact with all the different kind of environments, and you can have expert systems and stuff.

00:39:04Shunyu Yao

And then you have this period of AI winter, right? And then you have deep learning, and you have this very powerful paradigm of RL agents, and it's usually deep RL agents where you have a lot of amazing miracles from Atari to AlphaGo, and so on and so forth. And only very recently, we have LLM agents, right? So this is obviously wrong, but if I have to put things in one slide, this is kind of the perspective.

00:39:36Shunyu Yao

And remember the examples we looked at at the beginning of the talk, right? Eliza is a very typical example of a symbolic AI agent, and LSTM-DQN is a very typical example of a deep RL agent in the text domain.

00:39:52Shunyu Yao

And I think one way to think about the difference between the three paradigms of agents is the problem is the same, right? So you have some observation from the environment and you want to create an action, right? You want to take some action. And the difference is what kind of representation, what kind of language do you use to process from the observation to the action, right?

00:40:21Shunyu Yao

So if you think about symbolic AI agents, essentially you're first mapping all the observation into some symbolic state, right? And then you're trying to use the symbolic state to derive some action. You can think of if-else rules. Essentially, you're just trying to map all the possible complex observations into a set of logical expressions, right?

00:40:44Shunyu Yao

And if you think about all the deep RL agents, a very abstract way of thinking of this is you have many different possible forms of observations, whether it could be pixels, it could be text, it could be anything. But from a different perspective, it doesn't really matter because it's mapped into some kind of embedding, right? It's processed by a neural network to some vectors or matrices, and then you use that to derive some actions, right?

00:41:16Shunyu Yao

And in some sense, what's different for a language agent or a reasoning agent is that you are literally using language as the intermediate representation to process observations into actions, right? Instead of a neural embedding or a symbolic state, you're literally thinking in language, which is kind of the human way of doing things, right?

00:41:38Shunyu Yao

And the problem with symbolic states or neural embeddings is that, if you think about it, it takes intensive effort to design those kinds of symbolic agents, right? If you think about how Waymo builds an autonomous car, you probably write millions of lines of rules and code, right? And if you think about the deep RL agents, most of them take millions of steps to train, right?

00:42:06Shunyu Yao

And the problem is both are kind of task-specific, right? If you write millions of lines of code for an autonomous car, you cannot really reuse that for playing a video game. Similarly, if you train an agent using millions of steps to play a video game, you cannot use that to drive cars, right?

00:42:30Shunyu Yao

Language is very different because first, you don't have to do too much, right? Because you already have rich priors from LLMs. That's why you can prompt to build LLM agents. It's really convenient. And it's very general, right? You can think about how you drive a car, you can think about how to play a video game, you can think about which house to buy, considering mortgage rates and stuff.

00:42:55Shunyu Yao

And thinking is very different from a symbolic state and deep RL because the symbolic state and the deep RL vector, they usually have a fixed size, but you can think arbitrarily long, right? You can think about a paragraph, you can think about a sentence, and that brings this whole new dimension of inference time scaling. And that's why fundamentally, a reasoning agent is different.

00:43:29Shunyu Yao

Okay, so I have just covered half of the later half of the brief history of LLM agents, right? We talked about long-term memory and why the methodology is fundamentally different from the previous agents. I also want to briefly talk about the new applications and tasks that LLM agents enable.

00:43:49Shunyu Yao

So as you can see in the beginning of my talk, the examples are basically question answering and playing games. And that's pretty much the... if you think about it, that's pretty much the predominant paradigm of NLP and RL, right? But I think what's really cool about language agents is that it really enables much more applications, and in particular, what I call digital automation, right?

00:44:16Shunyu Yao

So what I mean by digital automation is: imagine if you have an assistant that can help you file reimbursement reports, or help you write code and run experiments, or help you find relevant papers, or help you review papers, right? If all of those can be achieved, then everybody can finish undergrad in two years or a PhD in three years or get tenure in three years. Like, everything can be sped up.

00:44:53Shunyu Yao

But if you think about it, before ChatGPT, right, there's literally no progress. If you think about Siri, right, which is the state-of-the-art digital agent before ChatGPT, right, it literally can do nothing, right? And why is that? I think the reason is that you really need to reason over real-world language, right? If you want to write a code, like this paradigm of sequence-to-sequence mapping is not enough. You have to think about what you write and why you write it. And you have to make decisions over open-ended actions over a long horizon, right?

00:45:29Shunyu Yao

But unfortunately, if you think about it, if you look at all the agent benchmarks before the existence of LLMs or LLM agents, they often look something like this, right? So they usually... they are usually very like synthetic tasks, very small scale, and not practical at all. And that's been limiting for the history of LLM agents, because even if you have the best agents in the world, if you don't have the good tasks, like how can you even show progress, right? Because like, let's say we solve this grid game with 100% accuracy, then what does it mean, right?

00:46:12Shunyu Yao

So I think the history of LLM agents on one side is all the methods getting better and better, but an equally, if not more, important side of the history is that we're getting more practical and more scalable tasks. So to give you a flavor, right, this task is called WebShop, and I created it with my co-authors in 2021 and 2022. And the idea is you can imagine LLM agents helping you do online shopping.

00:46:41Shunyu Yao

So you give the agent an instruction to find a particular type of product. It could just browse the web like a human, right? It could click links, it could type search queries, it could check different products and go back and search again. And if it has to search again, it has to explore different items, it has to think about how to reformulate the query, right? You can immediately notice, right, the environment is much more practical and much more open-ended than a grid world.

00:47:13Shunyu Yao

And let's say you find a good product, you can just click all the customization options, and you can click "buy now." And you also get a reward from 0 to 1 indicating how well the shopping task was done. So it's really a very standard reinforcement learning setup, except that the observations and actions are in text. And it turns out to be a very, very practical environment.

00:47:42Shunyu Yao

And WebShop is interesting because it's the first time people built a large-scale complex environment based on large-scale real internet data, right? So at the time, we scraped more than a million Amazon products, and we built this website, and we built an automatic reward system that, given the product you found and the instruction, tells you how well the two match. And you can clearly see it's perhaps harder than the grid world task, because you need to understand not only the images and language in real-world domains, but you also need to make decisions over a long horizon, right? Like you may have to explore 10 different products or try different search queries to find the perfect match.

00:48:44Shunyu Yao

And for example, in this direction of web interaction, follow-up work has made great progress. You know, beyond shopping, you can actually solve various tasks on the web. And you can also try to solve other practical tasks, for example, software engineering, right? So in this example, SWE-bench is a task where you are given a GitHub repo and an issue, right? So you are given a bunch of files in a repository, and you're given an issue: "This thing doesn't work, help me fix it." And you're supposed to output a patch file that can resolve the issue, right? So it's a very clean definition of the task, but it's very hard to solve, right? Because if you want to solve it, you have to interact with the codebase, right? You have to create unit tests, you have to run them, and you have to try various things, just like a software engineer.

00:49:57Shunyu Yao

Another example that I think is really cool is, it's... I think the current progress is well beyond digital automation, right? So in this example from ChemCrow, a work that I really like, the idea is they're using reasoning agents to try to find new chemicals. And what's really cool is that you give the agent a bunch of data about some chemicals, and you give them access to use tools like Python or the internet or whatever, and they could do some analysis and try to propose some kind of new chemical. And also, the action space of the agent is somehow extended into the physical space, because the action or the suggestion from the agent is then synthesized in the wet lab. And then you can imagine you can get feedback from the lab, and then you can use that to improve yourself and stuff like that. So I think it's really exciting that you can think of a language agent not only as operating in the digital domain, but also in the physical domain, not only in solving like tedious tasks like booking a DoorDash, but also more intelligent or creative tasks like software engineering or scientific discovery.

00:51:20Shunyu Yao

Okay, so great. So we have covered this slide finally. So in summary, I have talked about, you know, how we start from LLMs, we have this paradigm of reasoning, we have this paradigm of acting, they converge, and that brings up more diverse tasks and methods. And we have also covered in a more broader time scale the paradigms of agents and why this time is different. And also from a task perspective, right?

00:52:00Shunyu Yao

So the previous paradigm of tasks, if you think about in AI, you can think of games, you can think of simulations, you can think of robotics, but really LLM agents bring up this new dimension of task, which is to automate various things in the digital world.

00:52:25Shunyu Yao

So we have covered a lot of history, and I just want to summarize a little bit in terms of lessons for doing research, right? So I think personally, as you can see, you know, it turns out some of the most important work is sometimes the most simple work, right? You can argue like Chain of Thought is incredibly simple, and ReAct is incredibly simple. And simple is good because simple means general, right? If you have something extremely simple, then you have probably something extremely general, and that probably is the best research.

00:53:02Shunyu Yao

But it's hard. It's hard to be simple, right? So if you want to be simple and general, you need to both have the ability to think in abstraction, right? So you have to jump out of individual tasks or data points, you have to think in a higher level. But you also need to be very familiar with the individual task, the data, the problem you're trying to solve, right?

00:53:35Shunyu Yao

So notice that it could actually be distracting to be very familiar with all the task-specific methods, right? So remember in the history of QA I covered all those, a lot of task-specific methods. Like if you are very focused on this, then you might end up, you know, trying to create an incremental solution after that. But if you're familiar with not only QA but a lot of different tasks and you can think in abstraction, then you can propose something simpler and more general. And in this case, I think really learning the history helps and learning other subjects helps because they provide you some prior for how to build abstraction and they provide ways to think in abstraction.

00:54:26Shunyu Yao

Okay, so this is mostly the talk. I think I will just briefly talk about some thoughts on the future of LLM agents, right? So everything before this slide is history, and everything after this slide is kind of the state-of-the-art or the future. Obviously, the future is very multi-dimensional. There are many directions that are very exciting to work on. I want to talk about five keywords that I think are truly exciting topics that are first, very new, in the sense that if you get to work on this now, there might be a lot of low-hanging fruit, or you might have a chance to create some very fundamental results. And second, is somehow doable in the academia setup, so you don't have to be OpenAI to do this, but it's still good to be OpenAI.

00:55:29Shunyu Yao

So these five topics actually correspond to three recent works that I did, and I'll only cover them briefly. And if you have more interest, you should check out those papers yourself. So the topics are: first, training, how can we train models for agents? Where can we have the data? Second, interface, how can we build an environment for our agents? Third, robustness, right, how can we make sure things actually work in real life? Fourth, human, how can we make sure things actually work in real life with humans, right? And lastly, benchmark, how can we build good benchmarks?

00:56:18Shunyu Yao

So first, training, right? So I think it's interesting to note that like up until this year, language models and agents are kind of disentangled in the sense that the people that are training models and the people building agents are kind of different people, right? And what is the paradigm is that the model building people build some model, right? And then the agent building people build some agents on top of it using some fine-tuning or some prompting, mostly prompting.

00:56:54Shunyu Yao

However, these models are not trained for agents, right? So if you think about the historical root of language models, it's just a model that's trained to do text. Like people could never imagine it's one day used to solve like chemical discovery or software engineering, right? So that brings the issue of discrepancy of the data, right? So it's not trained to do those things, but then it's prompted to do those things. So the performance is not optimal.

00:57:21Shunyu Yao

And one solution to fix this is you should train models targeted for agents. And one thing you can do is you can use those prompted agents to generate a lot of data, and then you can use those data to fine-tune the model to be better as agents. And this is really good because first, you can improve all the agent capabilities not covered in the internet, right? So you can imagine on the internet, which is the predominant source of language model training, there is not a lot of like self-evaluation kind of data, right? People only give you like a well-written blog post, but no one really releases all the thought process and action process of how to write the blog post. But that's actually what matters for agent training, right? So you can actually prompt agents to have those trajectories, and you can train models on those things.
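A minimal sketch of that data flywheel: run a prompted agent over many tasks, keep only the successful trajectories, and write them out as fine-tuning examples. The `run_agent` helper and the JSONL prompt/completion format are assumptions for illustration.

```python
import json

# Collect agent trajectories (thoughts, actions, observations) as training
# data -- the kind of trace that never appears verbatim on the internet.
def collect_trajectories(tasks: list[str], run_agent, out_path: str) -> int:
    kept = 0
    with open(out_path, "w") as f:
        for task in tasks:
            trajectory, success = run_agent(task)   # list of thought/action/observation steps
            if not success:
                continue                            # only imitate what actually worked
            f.write(json.dumps({
                "prompt": task,
                "completion": "\n".join(trajectory),
            }) + "\n")
            kept += 1
    return kept
```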

00:58:24Shunyu Yao

And that's, I think, really one way to fix the data problem, because we all know internet data is running out. And how can we get the next trillion tokens to train models? This is very exciting. And I think a maybe-not-perfect analogy is the synergy between GPUs and deep learning, right? Because GPUs were not first designed for deep learning, right? They were first designed to play games. And then people explored the usage and found, "Oh, they're very good for deep learning."

00:58:58Shunyu Yao

And then what happens is that not only do people use existing GPUs to build better deep learning algorithms, but also the GPU builders build better GPUs to fit the deep learning algorithms, right? You can build a GPU that's specific for transformers or so on and so forth. I think we should also establish the synergy between models and agents.

00:59:18Shunyu Yao

And the second topic is interface. And in fact, human-computer interaction has been a subject for decades, right? It has been a great topic in computer science. And really, the idea is that if you cannot optimize the agent, you can optimize the environment, right? Because if you're trying to write code, right, even if you're the same person, it makes a difference whether you're doing that in a plain text editor or in VS Code, right? You're still the same, you're not any smarter, but if you have a better environment, then you can solve the task better.

00:59:55Shunyu Yao

I think the same thing happens for agents, right? So as a very concrete example, you can imagine how can the agent, like say, search files in an OS, right? So the human interface in the terminal, as we all know, is to use ls and cd and so on and so forth. It works for humans, but it's not the best interface for agents. You can also do something like you can define a new command called "search," and then it will give a result, and then you can use this action called "next" to get the next result. But it's probably still not the best for a language model.

01:00:41Shunyu Yao

So in this research on the agent-computer interface, what we find is that what turns out to be the best way to help agents search files is to have a specific command called "search," and instead of giving one result at a time, you just give 10 results at a time, and then the agent decides what's the best file to look at, right? And you can actually do experiments and show that you can use the same language model and the same agent prompt, but the interface matters for downstream tasks.
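As a rough illustration of the difference, here is what such an agent-facing `search` command could look like, returning up to 10 matches in one observation instead of an interactive one-at-a-time loop. The command name and the 10-result limit follow the description above, but the code itself is a made-up sketch, not the actual system.

```python
import os

# Agent-friendly file search: one call returns up to 10 matching paths at once,
# instead of a human-style ls/cd walk or a "show one match, press next" loop.
def search(root: str, term: str, limit: int = 10) -> str:
    hits = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if term.lower() in name.lower():
                hits.append(os.path.join(dirpath, name))
    if not hits:
        return f'No files matching "{term}".'
    extra = f"\n({len(hits) - limit} more not shown)" if len(hits) > limit else ""
    return "\n".join(hits[:limit]) + extra

# The agent reads all the paths in one observation and decides which file to open.
print(search(".", "agent"))
```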

01:01:12Shunyu Yao

So I think this is a very, very exciting topic, and it's only getting started, and it's a great research topic for academia. You don't need to have a lot of GPUs to do that. And it's interesting because models and humans are different, so the interfaces should be different, right? You cannot expect VS Code, which was built for humans, to be the best coding interface for a language model, right? So there must be something different, and we need to explore that.

01:01:40Shunyu Yao

And in this case, you can think of the difference as being that we humans just have a smaller short-term memory, right? If I give you 10 results at the same time, you can't read them all at once. That's why human interfaces are designed to be iterative: you have a "next" button, and if you do Ctrl-F, you only read one match at a time. But for models it's the opposite, because models have a long context window, right? So a model's version of Ctrl-F should probably just give everything to the model at once.

01:02:11Shunyu Yao

So if you design better interfaces, it could help you solve tasks better with agents. It could also help you understand agents better, right? It could help you understand some of the fundamental differences between humans and models.

01:02:32Shunyu Yao

Lastly, I want to point out this topic of human-in-the-loop and robustness. There is a very big discrepancy between existing benchmarking and what people really care about in the real world, right? Think of a very typical agent or AI task, say coding with unit tests. This is a plot from AlphaCode 2, and basically the idea is that if you sample more times, the chance that you get a correct submission increases, right? That's very intuitive.

01:03:07Shunyu Yao

And if you have unit tests, then you can sample many, many times, obviously. And what you really care about is what we call pass@k: can I solve it at least once out of a thousand, or ten thousand, or a million samples, right? It's kind of like proving the Riemann hypothesis; you just need to do it once. What you care about is, if you sample 10 million times, can you solve it once?
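For reference, pass@k is usually computed with the standard unbiased estimator from the code-generation literature: draw n samples per task, count the c that pass the unit tests, and estimate the probability that at least one of k randomly chosen samples is correct. A minimal version:

```python
from math import comb

def pass_at_k(n, c, k):
    """Probability that at least one of k samples, drawn from n samples of
    which c are correct, passes the unit tests."""
    if n - c < k:
        return 1.0  # not enough incorrect samples to fill k slots: guaranteed hit
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 1000 samples with only 5 correct ones. Sampling more still gets
# you there, which is exactly the "solve it once" framing of pass@k.
print(pass_at_k(1000, 5, 1))     # ~0.005
print(pass_at_k(1000, 5, 100))   # ~0.41
print(pass_at_k(1000, 5, 1000))  # 1.0
```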

01:03:33Shunyu Yao

But if you think about most jobs in the real world, it's more about robustness, right? Suppose you're doing customer service. LLM agents are already deployed for customer service, but sometimes they fail, and there are consequences: if the agent does something wrong, the company might have to pay compensation and so on. And arguably, customer service is much easier than coding or proving the Riemann hypothesis, at least for humans, right?

01:04:10Shunyu Yao

But it presents a different challenge, because what you care about is not "can you solve it one time out of a thousand tries?" What you care about is "can you solve it a thousand times out of a thousand tries?" You care about failing even one time out of a thousand, because if you fail once, you might lose a customer, right? So it's more about getting simple things done reliably.

01:04:34Shunyu Yao

So I think that really calls for a different way of doing benchmarking. We have this recent work called ToolTalk, and the idea is, first, you have a very practical task, which is customer service. Second, the agent is not only interacting with some kind of environment, a digital environment; it's also interacting with a human, but a simulated human, right?

01:04:59Shunyu Yao

So the idea is that the customer service agent, just like a human customer service agent, needs to interact with both the company's backend API and some kind of user, and it really needs both to solve the task. The trajectory might look something like this: the human might not give you all the information at the beginning, which is different from the predominant paradigm of today's tasks, like software engineering benchmarks, where everything is specified up front. Imagine the user just says "change flight"; then you need to prompt the user, "Can you tell me which flight you are changing?" You need to interact with the user over multiple turns to figure out what they need and help them. And this is very different.
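A hedged sketch of this dual-interaction loop is below; the agent, simulated-user, and backend objects, along with their method names and the tool-call format, are hypothetical placeholders rather than the benchmark's actual API.

```python
def run_episode(agent, simulated_user, backend_api, max_turns=10):
    """Alternate between the simulated user and the agent until the task is
    done or the turn budget runs out."""
    history = [{"role": "user", "content": simulated_user.first_message()}]
    for _ in range(max_turns):
        # The agent either says something to the user or calls a backend tool.
        step = agent.act(history)
        if step["type"] == "message":
            # e.g. a clarifying question: "Which flight are you changing?"
            history.append({"role": "assistant", "content": step["content"]})
            reply = simulated_user.respond(step["content"])
            history.append({"role": "user", "content": reply})
        elif step["type"] == "tool_call":
            # e.g. change_flight(reservation_id=..., new_date=...)
            result = backend_api.call(step["name"], **step["arguments"])
            history.append({"role": "tool", "content": str(result)})
        if step.get("done"):
            break
    return history
```

Because the same user simulation can be replayed, the whole episode can be sampled many times for the same task, which is what makes the reliability metric below measurable.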

01:05:46Shunyu Yao

It also changes the metric you care about, right? For the same task, you can sample the trajectory multiple times with the same user simulation. The dashed line is pass@k, which measures: if you sample k times, can you solve the task at least once? And obviously, as you sample more, the chance that you solve it at least once increases, right?

01:06:18Shunyu Yao

But here, you don't care about whether you can solve it one time out of 10 tries. You care about whether you can solve it 10 times out of 10 tries, because otherwise you might lose a customer, right? So the solid line measures: as you sample more, what's the chance that you solve the task every single time, averaged across all the tasks?
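A toy calculation shows why the two curves move in opposite directions. Assume, purely for illustration, that each attempt at a given task succeeds independently with probability p; then pass@k grows with k while the all-successes metric shrinks:

```python
p = 0.9  # an agent that succeeds 90% of the time on a given task
for k in (1, 2, 4, 8):
    pass_at_k = 1 - (1 - p) ** k   # solve it at least once in k tries
    pass_all_k = p ** k            # solve it every single one of k tries
    print(f"k={k}:  pass@k={pass_at_k:.3f}  pass^k={pass_all_k:.3f}")

# pass@k climbs toward 1.0 as k grows, while pass^k decays toward 0, which is
# the decreasing solid line: sampling more exposes how unreliable the agent is,
# even on tasks it can solve "sometimes".
```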

01:06:38Shunyu Yao

And what we see with today's language models is that they obviously have different starting points, meaning they have different capabilities. But what's really concerning is that they all show this decreasing trend: as you sample more, the robustness always goes down, from small models to big models alike. The ideal trend would be something much flatter, right? If you can solve something once, you should be able to solve the same thing reliably every time.

01:07:09Shunyu Yao

So I just want to point out that we also need more effort on bringing real-world elements into benchmarking, and that requires new settings and new metrics. We have a blog post with some thoughts on the future of language agents, and one way to think about it is to ask what kinds of jobs agents could take over, right? Maybe the first type of task is not that "intelligent" but really requires robustness: think of simple debugging, customer service, or acting as a simple assistant, over and over again.

01:07:56Shunyu Yao

And second, you need to collaborate with humans. And third, you might need to do very hard tasks: you might need to write a survey from scratch or discover a new chemical, and that requires new ways for the agent to explore on its own. But I think it's generally very useful to think about what jobs agents could replace, why they're not replacing those human jobs yet, what the missing pieces are, and how we can improve them.

01:08:26Shunyu Yao

Lastly, since we have limited time here: we're going to give an EMNLP tutorial on language agents in November, and it will be three hours, so hopefully it will be more comprehensive than this talk.
