CS 194/294-196 (LLM Agents) - Lecture 1, Denny Zhou
Watch the original video here: https://www.youtube.com/watch?v=QL-FS_Zcmyo
Okay, thank you. Um, okay, so I will first start with some introduction, and then we'll get the actual content of this class started.
Okay, so first, my name is Dawn. I'm a professor in computer science here at UC Berkeley and also a co-director of a campus-wide center called the Center on Responsible, Decentralized Intelligence. So I'm the instructor for this class, and we also have our guest co-instructor, Xinyun, from Google, who is also an alum; she was my former student here. We'll be teaching this class together.
And also we have our great TAs, Alex and Cun, and our great readers, Tara and Ash. Okay, so this is the teaching staff who will be working together with you this semester.
Okay, great. So everyone's here. Everyone has been seeing the exciting growth of large language models. The speed of advancement is just astonishing. However, these large language models, they operate in a fairly simple manner: they take text input and produce text output. So what we will cover in this semester in this class is the next frontier: large language model agents.
So instead of just taking text as input and producing text as output, here we use a large language model as the key brain for reasoning and planning, and we enable the agents to interact with external environments: to observe the environments and take actions in them. The agents will also use external tools, databases, and knowledge bases for retrieval to help them perform these tasks.
And the rich capabilities of these large language models make LLM agents very flexible; they can easily operate in diverse environments without much task-specific training. These LLM agents can interact with different types of environments, for example surfing the web through different online APIs, and they can even be embodied in a robot operating in the physical world.
And they can sense the environments through different types of inputs, even in multi-modal settings with various sensory inputs, and take actions in these diverse environments. Through this interaction with complex and diverse environments, they can update their memory, they can learn, they can interact with humans, and they obtain grounding through these interactions as well.
And these agents not only interact with the environments; they can also interact with other agents, including humans, through multi-agent interaction and collaboration. This multi-agent collaboration can help agents solve even more complex tasks together.
So why are LLM agents the next frontier? Why do we need to empower LLMs with the agent framework? For a number of reasons. Solving real-world tasks is never just in one go with text inputs producing text outputs. Oftentimes, it involves a trial-and-error process. And leveraging external tools and retrieval from external knowledge can help expand LLM's capabilities.
And more importantly, this dynamic agentic flow, this agent workflow, can facilitate solving complex tasks through enabling task decomposition, allocation of sub-tasks to specialized modules, division of labor for project collaboration. And throughout the course, we'll also see that multi-agent generation can help inspire better responses.
Even though LLM agents have been a fairly recent development, we have already seen agents helping transform different application domains, ranging from education, law, finance, and health care to cybersecurity, you name it. And the development is really exciting and fast-moving. There are many leaderboards for different agent benchmarks that you can see online, and you can see the really fast improvements across all these different agent frameworks.
So overall, to better enable agent deployment, there are a number of key challenges that we still need to address. So first, we need to improve the reasoning and planning capabilities of agents. Agents tend to make mistakes when performing complex tasks end-to-end, and it's important to improve the reasoning and planning capabilities.
And also, we need to improve embodiment and learning from environment feedback for these LLM agents. LLM agents are still not efficient at recovering from mistakes for long-horizon tasks. We need to further develop methods and capabilities for continuous learning and self-improvement for these LLM agents, and also improve multimodal understanding, grounding, and real-world capabilities of these agents.
And also, as I mentioned, multi-agent collaboration can really help agents provide better solutions for tasks, and developing theory of mind helps multi-agent systems develop better as well. Safety and privacy are also very important issues for LLM agents: LLMs are susceptible to adversarial attacks and can emit harmful messages or leak private data, and so on. Solving these challenges is also really important for deploying agents safely in the real world.
And also human-agent interaction and ethics: how to effectively control LLM agent behaviors and design interaction modes between humans and agents so that agents best serve human needs is also really important.
So to help students learn and better develop methods to address these challenges, the course has been designed to cover a broad spectrum of topics, actually throughout the different layers of the agent framework and also the domains.
So first in the class, we'll cover key model capabilities, including reasoning, planning, and multi-modal understanding. We will also cover popular real-world agent frameworks to enable students to learn how to better design agent applications and use various agentic flows easily. This will also help students learn to use LLM agent frameworks for workflow design, retrieval-augmented generation (RAG), and multi-agent systems.
And we'll also cover a number of exciting application domains for these LLM agents, including software code development, workflow automation, multi-modal applications, and enterprise applications. And finally, we'll cover important topics on agent safety and ethics.
To cover these wide-ranging topics, we have assembled an amazing team of guest speakers and researchers. So the class will be led by me and Xinyun, and we have this amazing crew of guest speakers to help cover these important topics in class.
Before the talk, I want to ask everyone one question: what do you expect from AGI? Take a second to think about it.
So I can imagine many different answers, like solve the hardest math problems that humans cannot solve, for example, or discover new scientific theories, or even solve AGI.
My background is machine learning. I don't know whether many people still study machine learning these days, because it's like Transformers are all you need, right? As a machine learning person, I have an intuition about AGI: AGI should be able to perform well from just a few examples, like humans usually do.
In the past decades, the machine learning community spent great effort developing data-efficient methods like semi-supervised learning, active learning, and meta-learning. And if you look at the papers from the past decade, people were always excited about one- or two-point gains over the SOTA. But in practice, I never saw those data-efficient approaches really take off; I would say that line of work missed and failed. You know, don't feel bad about that; I worked on some of them myself back then. That moved me to think about a different problem: what's missing in machine learning? I thought about it for years, and finally I found the answer: reasoning. Reasoning was missing. It may seem so obvious today, in particular for people in this course, right? This lecture is about reasoning.
Humans can learn from just a few examples because humans can reason, not because of data statistics.
Let's start from a toy problem. In my research, I usually prefer a very simple problem that still contains all the challenging aspects. This problem is called last-letter concatenation. If you are familiar with the neuro-symbolic literature, you'll find similar problems there. For this problem, given a person's name as input, the output should be the concatenation of the last letter of the first name and the last letter of the last name. For example, take Elon Musk: the last letter of "Elon" is "N" and the last letter of "Musk" is "K," so the output is "NK." It's that simple.
And if you had seen this problem a few years ago, you probably would have tried to solve it with a machine learning model, for example a transformer with an encoder and a decoder. And then you would find that you need a large number of labeled examples to train the model, and finally you get an accuracy of 85% or 90% or something. Now here's the interesting part: for such a simple task — simple for humans, okay — if a method requires a vast amount of labeled data to learn it, would you call it AI or not? AI means artificial intelligence, right? I'd expect an intelligent model to be able to learn this task from just one or two examples.
Now let's see how this problem can be solved using large language models. I suppose most people know what large language models are, but Professor Song asked me to explain. Okay: an LLM is a transformer model trained to predict the next word. For example, take the text "AI is the future" — say Musk said "AI is the future." We give just "AI is the" as the input, and the model predicts what the next word will be. If the prediction is not the word "future," we adjust the parameters to make it produce the correct one; that's done with backpropagation.
Of course, you can train the model with many sentences; for example, you can use all the text from the internet. If you don't want to go into the details, you can simply think of training LLMs as training parrots to mimic human language. Actually, after I came up with this sentence, one guy told me he is very famous for training parrots, and he is looking for a job.
Once the model is trained, getting an answer from it just mimics the training process. Training is about predicting the next token, so we can use whatever we like as the input and see what the output is: the model predicts the next token, we append that generated token to the input, and predict again. That's how you get an answer from an LLM.
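To make that loop concrete, here is a minimal sketch of greedy next-token generation. It assumes the Hugging Face transformers library and the small gpt2 checkpoint, which are just illustrative choices, not what any particular production LLM uses:

```python
# Minimal greedy next-token generation loop (illustrative; assumes the
# Hugging Face `transformers` package and the small `gpt2` checkpoint).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("AI is the", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(5):  # generate 5 tokens, one at a time
        logits = model(input_ids).logits                          # (1, seq_len, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # greedy pick of next token
        input_ids = torch.cat([input_ids, next_id], dim=-1)       # append and repeat

print(tokenizer.decode(input_ids[0]))
```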
For this problem, we can simply concatenate all the examples we have as the input, together with the test example, Barack Obama here. We can try this with any LLM and see what happens. And probably you'll see that you get a wrong answer. The correct output should be "KA," because "K" is the last letter of Barack and "A" is the last letter of Obama, but the model fails to produce it. This is called few-shot prompting. It just mimics the machine learning process: instead of training the model, we use the examples as input. That's the only difference.
These days, we know how to fix this prompting approach: we just need to add a reasoning process before the answer. We add the explicit process here: the last letter of "Elon" is "N," the last letter of "Musk" is "K," so the answer is "NK." That's called a reasoning process, and similarly for Jeff Bezos. Now we use this as the new input, and you will see that we get a perfect response from the large language model. So, just like for a human, one demonstration is enough to reach 100% accuracy. That's exactly what I was looking for. I cannot imagine any classical machine learning method achieving this kind of perfect generalization from a single example. There's no way.
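As a concrete illustration of the two prompt styles for the last-letter task, here is a small sketch. The complete() call is a hypothetical placeholder for whatever LLM completion API you use, not a real function, and the behavior noted in the comments is the typical outcome described above, not a guarantee:

```python
# Sketch: few-shot prompting vs. chain-of-thought prompting for the
# last-letter-concatenation task. `complete(prompt)` is a placeholder
# for your LLM completion call, not a real API.

few_shot_prompt = """\
Q: Elon Musk
A: NK
Q: Jeff Bezos
A: FS
Q: Barack Obama
A:"""

cot_prompt = """\
Q: Elon Musk
A: The last letter of "Elon" is "N". The last letter of "Musk" is "K". The answer is NK.
Q: Barack Obama
A:"""

# print(complete(few_shot_prompt))  # often wrong: the model only mimics the output format
# print(complete(cot_prompt))       # typically right (KA), via the explicit reasoning steps
```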
By the way, don't over-read what I said about machine learning. Machine learning is still so useful and important for doing research these days. I see many naive mistakes on social media, in the news, and even in conference papers, and those naive mistakes mostly come from people who have no background in machine learning; they just randomly try different ideas.
It's interesting that this idea of adding intermediate steps was proposed many years ago in the literature. This is an amazing paper by researchers at DeepMind, published in 2017. In their paper, they use natural language reasoning to solve math problems; they even wrote that the goal is to "derive the final answer through a series of small steps." And then they trained a sequence-to-sequence model from scratch. If you know the Chain-of-Thought work, you'll be so surprised by this paper — the authors are just like time travelers.
In 2021, a team at OpenAI published an amazing dataset called GSM8K, following the idea in the 2017 paper. In this dataset, every problem is paired with intermediate steps as a solution, along with the final answer. They used this dataset to fine-tune GPT-3 models, greatly scaling up the 2017 DeepMind work.
In the same year, 2021, a group of researchers at Google Brain, now part of Google DeepMind, published a work called "Show Your Work: Scratchpads for Intermediate Computation with Language Models." They discovered a similar idea independently, but in the domain of program synthesis; that's why they used abstract symbols instead of natural language.
And of course, probably many people know our work on Chain-of-Thought prompting. Actually, "chain of thought" is not a term we invented; it's just a common English phrase meaning multi-step reasoning. In this work, we extensively evaluated intermediate steps and showed amazing results on almost every NLP task.
So let's put all these papers together. In 2017, DeepMind published a paper on training with intermediate steps. In 2021, fine-tuning LLMs with intermediate steps. In 2021-2022, prompting with intermediate steps. You may ask which part is more important. You can see that it actually doesn't matter whether you are training, fine-tuning, or prompting the model. What really matters is what appears in all of them: intermediate steps. That's the key.
So let me summarize here. Regardless of training, fine-tuning, or prompting, when provided with examples that include intermediate steps, LLMs will generate responses that also include intermediate steps.
Okay, given the importance of intermediate steps, one can ask the question: is it helpful to introduce reasoning strategies into those examples? When humans solve a problem, they could have a strategy for solving it.
So this is work from our team called "Least-to-Most Prompting." In this work, we enable easy-to-hard generalization by decomposition. Probably many people have seen this famous book, "How to Solve It" by Pólya, a classic book for math education. There's a part about decomposition, where he warns that if you go into details too soon, you may lose yourself in details.
Now let's see what difference decomposition makes. Here's a math problem. By the way, in this talk the math is kept at an elementary level; every time before I give a talk, my daughter checks that she can understand it. She's in fifth grade now. So: Alice has three apples, Anna has two more apples than Alice, how many apples do they have together? The difference is that we first show the language model how to break the problem down into subproblems and then solve them one by one. That's why it's called "least to most": from the least complex to the most complex problem. It's a very simple idea, but surprisingly powerful: we show the model how to decompose a complex task into simpler tasks.
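Here is a minimal sketch of how least-to-most prompting could be wired up for the apples problem. The complete() function is a hypothetical placeholder for an LLM call, the prompt wording is paraphrased rather than taken from the paper, and the decomposition step (which the paper also obtains from the model) is hard-coded here for brevity:

```python
# Sketch of least-to-most prompting: solve an easier subproblem first, then
# answer the original question with the subproblem's answer in context.
# `complete(prompt)` is a hypothetical placeholder for an LLM completion call.

def complete(prompt: str) -> str:
    """Placeholder: plug in your own LLM completion call here."""
    raise NotImplementedError

def least_to_most(problem: str, subquestion: str, final_question: str) -> str:
    # Stage 1: solve the easier subproblem first.
    sub_answer = complete(f"{problem}\nQ: {subquestion}\nA:")
    # Stage 2: answer the original question with the subproblem's answer in context.
    return complete(f"{problem}\nQ: {subquestion}\nA: {sub_answer}\n"
                    f"Q: {final_question}\nA:")

# Example (not executed here):
# least_to_most(
#     "Alice has 3 apples. Anna has 2 more apples than Alice.",
#     "How many apples does Anna have?",
#     "How many apples do Alice and Anna have together?",
# )
```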
So this is the SCAN task for compositional generalization. You can look at the examples here: given a natural language command, we need to translate it into a sequence of actions that could be executed by a robot, something like that. Using least-to-most prompting, we get an accuracy of 99.7% while using just 0.1% of the demonstration examples. You may wonder why I chose this task. Actually, I learned about this task from Xinyun — she's here today — and she invented a beautiful approach to solve it many years ago. When I looked at this task, I was really surprised: it looks so straightforward for humans, so why could it be so difficult for machines? Finally, we could make it work with LLMs.
And here is another task, text-to-code, again a compositional generalization benchmark. I don't know if everyone knows the concept of compositional generalization; roughly speaking, the test examples are more difficult than the training or prompting examples. For example, in these text-to-code problems, the test problems need longer code snippets. Our approach changes a little bit here; it's called dynamic least-to-most prompting. We used just 1% of the data and achieved great results, way better than the SOTA results in the literature, and those SOTA results relied on specialized architecture design and training, using, of course, all of the training data.
Yeah. So far, any question here? Otherwise, I'll go to the next section. Yeah, okay, I suppose this part is quite familiar for everyone.
I have two kids. My daughter is 10 years old, and my son is 7 years old. Actually, when the Chain-of-Thought prompting paper came out, I heard a very interesting conversation between them. My daughter asked her little brother, "What's 17 times 3?" The little brother said, "I don't know." Then she asked, "What's 10 times 3?" "30." "What's 7 times 3?" "21." "So what's 17 times 3?" "Oh yeah, I know — 51!" And the funny thing is, my daughter shouted to me, "Daddy, Chain-of-Thought prompting also works for my little brother's brain!"
Okay, now one may ask: why are intermediate steps helpful? One may say that's just so natural for humans. But if we are doing research, we have to think about it more deeply — that's just an analogy. LLMs are just models; we want to understand what actually happens.
And this year we have a work published at ICLR 2024, in collaboration with Percy Liang's group at Stanford, where we give a rigorous mathematical analysis. Here are the key results: a transformer generating intermediate steps can solve any inherently serial problem as long as its depth exceeds a constant threshold — and I have to emphasize, a constant, meaning independent of the input length. However, a transformer generating direct answers either requires huge depth to solve such a problem or cannot solve it at all. Please take a moment to read these statements before I move to the next slide. You can probably see tons of practical implications of this theory: if you cannot solve a problem, you may think about generating more intermediate steps, and you could also call external tools, like search, to help with the intermediate steps. In this LLM agents course, many speakers will talk about how to use external tools, and this gives you a way to think about LLMs' capabilities and limitations.
One of my big sources of fun is finding problems that my daughter can solve in seconds but where LLMs fail.
Yeah. Okay. So far, we have talked about how to use examples to trigger LLMs to generate step-by-step reasoning. Now, is it possible to trigger it without using examples?
This is an amazing work. Actually, when this paper came out, I thought it was a joke; it turned out not to be, and I was inspired a lot by it. It's called "Let's think step by step." Given a question, we don't need any examples; we just need to append "Let's think step by step," and the model can generate the reasoning steps. It's really cool. But usually the zero-shot approach — zero-shot means there are no demonstration examples — is worse than few-shot. So we wondered whether we could have an approach that is still zero-shot but works much better.
That leads to another work of ours, called "LLMs as Analogical Reasoners." Again, back to this beautiful book, "How to Solve It" by Pólya. In the book, he describes how to use analogical reasoning to solve math problems: when you see a new problem, you first ask yourself, "Do you know a related problem, or a related method or strategy?" And I also really like this quote from Banach — if you studied functional analysis, you will know Banach spaces — and I was really amazed by the last sentence: "The ultimate mathematician is one who can see analogies between analogies." Of course, I show this here to let you know how far I am from being a mathematician.
So given this simple problem, of course you can say, "Let's think step by step." But now we can do it differently: we first ask the model to recall a related problem, solve that one, and then solve this one. You can see that the model indeed recalls relevant examples and knowledge here. Those recalled problems are not exactly the same problem, but they are useful for it. That's amazing. We tried two benchmarks and saw that it works really well. The last row is from the analogical reasoning prompt; of course, you can optimize the prompt yourself to get better results. The most important thing to see is that it's much better than just saying "Let's think step by step" — that is, zero-shot CoT — and this approach even outperforms manual Chain-of-Thought prompting. The main reason is that with this approach, the model automatically generates related questions tailored to each different problem.
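A rough sketch of what such an analogical-reasoning prompt could look like follows. The instruction wording is paraphrased, not the exact prompt from the paper, and complete() is again a hypothetical placeholder for the LLM call:

```python
# Sketch of analogical prompting: ask the model to first recall and solve
# relevant problems, then solve the target problem step by step.
# `complete(prompt)` is a hypothetical placeholder for an LLM completion call.

def complete(prompt: str) -> str:
    """Placeholder: plug in your own LLM completion call here."""
    raise NotImplementedError

def analogical_prompt(problem: str) -> str:
    return (
        f"Problem: {problem}\n\n"
        "Instructions:\n"
        "1. Recall three relevant problems and solve each of them.\n"
        "2. Using what you learned from them, solve the original problem step by step.\n"
    )

# Example (not executed here):
# answer = complete(analogical_prompt(
#     "A train travels 60 miles per hour for 2.5 hours. How far does it travel?"))
```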
These are the results on BIG-Bench, with great performance, and the results on Codeforces competitive programming. If you are interested in competitive programming, you could try this approach. One thing we didn't do here is scale it up: you could search the web for related problems and knowledge for the problem you want to solve.
So the key idea here, you know, is that we dynamically generate relevant examples and knowledge for each given problem, instead of using a fixed set of examples as in manual Chain of Thought prompting.
Okay, so now we've seen that we can use few-shot examples to show the model how to do step-by-step reasoning, and we can also do it zero-shot, without any examples, by just saying "Let's think step by step." Now I could ask another question: is it possible to trigger step-by-step reasoning even without any prompt like "Let's think step by step"? You could say, "All the models on the market already do that, right?" You're right, but they did something for it: they already put many such examples into the data mixture for training or fine-tuning.
We found the answer is yes. This is our work on chain-of-thought reasoning without prompting. Without prompting means without saying anything special: we just give the problem to the model, even a pre-trained language model that has not been fine-tuned.
Let's look at an example: "I have three apples. My dad has two more apples than me. How many apples do we have together?" The approach is actually very simple. At decoding time, at the first step, we look at the possible first tokens — say the top five tokens here — and from each of them we continue with greedy decoding. For the first one, the generation starts with the token "five" and the next word is "apples," so it just says "five apples." If the first token is "I," then the full generation is something like, "I have three apples, my dad has two more apples than me, so he has five apples..." and it continues on to the correct answer. That's very interesting, right? We didn't say anything about reasoning, but when started from different tokens, the model can do some reasoning on its own.
Here is another example: "Was Nicolas Cage born in an even or odd year?" The first continuation says, "Cage was born in an odd year" — "odd" was the first token. The second one just says "even" and then a period; the third is just "odd" and a period. Now you might ask: if the model could have a chain of thought in its response, how do we find it? One signal is length — a longer response suggests the model has done some reasoning steps. But the really surprising thing is the probability of the answer token. In the first row, "Nicolas Cage was born in an odd year," the probability of "odd" is quite low. However, when there is a reasoning path, like the last one, "Cage was born in 1964, an even year," the probability of the final answer jumps to 98%. That's amazing, right? It seems the model is remarkably well calibrated. I was really surprised when I saw those probabilities.
You can see that for responses like the second and third, which just say "even" or "odd" directly, the probabilities are really low.
So, the key observation and principle: generations started from the top-k first tokens often already contain step-by-step reasoning — no prompt is needed — and when a step-by-step reasoning path is present, the model decodes the final answer with much higher confidence. Here is a comparison between greedy decoding and chain-of-thought decoding, and we see that the decoding performance is much better.
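Here is a rough sketch of the branching idea, assuming a Hugging Face causal LM (gpt2 is just an illustrative checkpoint). Note that the paper's confidence-based selection over answer tokens is omitted, so this only shows how the top-k first-token branches are generated:

```python
# Rough sketch of chain-of-thought decoding: instead of a single greedy pass,
# branch on the top-k first tokens, then greedily continue each branch.
# (The selection among branches by answer-token confidence is omitted here.)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # illustrative small checkpoint
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def cot_decoding_branches(question: str, k: int = 5, max_new_tokens: int = 40):
    input_ids = tokenizer(question, return_tensors="pt").input_ids
    branches = []
    with torch.no_grad():
        first_logits = model(input_ids).logits[:, -1, :]
        top_k_ids = first_logits.topk(k, dim=-1).indices[0]       # k candidate first tokens
        for first_id in top_k_ids:
            ids = torch.cat([input_ids, first_id.view(1, 1)], dim=-1)
            for _ in range(max_new_tokens):                        # greedy continuation
                next_id = model(ids).logits[:, -1, :].argmax(dim=-1, keepdim=True)
                ids = torch.cat([ids, next_id], dim=-1)
            branches.append(tokenizer.decode(ids[0]))
    return branches

for text in cot_decoding_branches("Was Nicolas Cage born in an even or odd year?"):
    print(text)
```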
So far, any questions? Now, let's move to the next topic. Generating intermediate steps is helpful — really helpful. But are there any concerns about generating steps instead of direct answers? Any concerns?
Probably you'd say it depends on your problem and your needs. But actually, these days we always need to keep in mind that LLMs are probabilistic models that generate the next token. They are not humans, no matter how human-like their responses look. Keep that in mind: it's a probabilistic model.
So let's see what an LLM does in decoding. It actually computes the argmax of the probability of (reasoning path, final answer) given the problem. However, what we want is the argmax of the probability of the final answer given the problem, right? That's what we learned in machine learning. This doesn't mean the reasoning path is not important; I'm just saying we first have to make sure the final answer is correct, and then look at the reasoning path. The two objectives are not aligned.
Now let's look one step further. To compute the probability of the final answer given the problem, we should sum over all possible reasoning paths — that's how it's computed, as we learned in probability, right? Given a math problem, you could find different solutions that lead to the same answer, and we need to do that summation. Then, how do we compute it? If you've studied machine learning, you know the answer: sampling.
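To write the two objectives down explicitly (the notation is mine, not from the slides): standard decoding picks the single most likely (reasoning path, answer) pair, while what we actually want marginalizes the reasoning path out:

```latex
% Standard decoding optimizes the joint over reasoning path r and answer a:
(\hat{r}, \hat{a}) \;=\; \arg\max_{r,\,a} \; P(r, a \mid \text{problem})
% What we actually want is the marginal over reasoning paths:
\hat{a} \;=\; \arg\max_{a} \; P(a \mid \text{problem})
        \;=\; \arg\max_{a} \; \sum_{r} P(r, a \mid \text{problem})
```

Sampling reasoning paths and taking the most frequent final answer, as described next, is a simple Monte Carlo approximation of that sum.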
This leads to our work on self-consistency. Probably many people know self-consistency, but here I really want you to see the underlying motivation — how we approached this problem from the first principles of machine learning.
So look at the question here. Given this math problem, you sample the answer multiple times, again and again, and finally you see that the most frequent answer is 18. Note that what we pick is not the most frequent reasoning path; we choose the most frequent final answer. That's a huge difference — the reasoning path is a latent variable here. The idea is so simple, and by using self-consistency we simply crushed the SOTA results in the literature at that time. And you can see that in doing research, it's really just about the idea; you don't have to know a lot of things.
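A minimal sketch of the self-consistency procedure follows. It assumes a hypothetical sample() call that returns one temperature-sampled chain-of-thought response, and it assumes the final answer is the last number in the response, which is a simplification of the answer extraction actually used:

```python
# Minimal sketch of self-consistency: sample several chain-of-thought
# responses at a nonzero temperature, extract each final answer, and return
# the most frequent one. `sample(prompt)` is a hypothetical placeholder.
import re
from collections import Counter

def sample(prompt: str) -> str:
    """Placeholder for a temperature-sampled LLM completion call."""
    raise NotImplementedError

def extract_answer(response: str) -> str:
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    return numbers[-1] if numbers else ""        # assume the answer is the last number

def self_consistency(prompt: str, n: int = 40) -> str:
    answers = [extract_answer(sample(prompt)) for _ in range(n)]
    answers = [a for a in answers if a]          # drop responses with no parsable answer
    return Counter(answers).most_common(1)[0][0] # majority vote over final answers
```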
And of course, given that our explanation of self-consistency is about probability and sampling, you'd imagine that more consistent results are more likely to be correct. Looking at the curves here, if the consistency is more than 80%, then the accuracy is nearly 100%.
Here's a question: when the output is a direct answer without intermediate steps, does it help to sample several times and then choose the most common answer? Anyone like to answer? Okay, great. Yeah — it doesn't help: for a one-token answer, the most frequent sample is just the token with the maximum probability, which greedy decoding already gives you.
And the second question: can we change self-consistency by letting the LLM generate multiple responses in one pass, instead of sampling multiple times, and then choosing the most common answer? Does that work? No — great, yeah, the answer is no. For both questions, we just need to follow this principle: argmax of the probability of the final answer given the problem. That's all you need to understand self-consistency. It's a very, very simple principle that follows from first principles in machine learning; if you know more, you'll recognize it as maximum marginal likelihood inference.
Okay, so what about free-form answers? That's Universal Self-Consistency. The idea is a little bit different, but related, so I put it here. Given this question, "Where do people drink less coffee than they do in Mexico?", each sampled answer is worded differently from the others, but the most common response here is Japan, China, and India.
Any question? Otherwise, we could move to the next section. Okay. Yeah, self-consistency. Sample your answer multiple times and then choose the most frequent answer as the final answer.
Next, I'm going to talk about limitations. The first one is that LLMs can be easily distracted by irrelevant context. From psychology studies, we know that irrelevant information may significantly decrease some children's, and even adults', problem-solving accuracy. So we wanted to check whether this observation holds for LLMs.
Here is a simple problem. The highlighted text — a sentence about Mario's mom and $10 — was manually added and is irrelevant to the original problem. But after adding it, the model produces a wrong solution. Interestingly, if we add an instruction like "ignore irrelevant context," the model immediately notices the distraction and gets it right. But that doesn't fully fix the problem: if we simply keep adding irrelevant sentences like "the sky is blue" and "the grass is green" to make the input long, you will see a significant performance drop across all LLMs.
The next limitation: LLMs cannot reliably self-correct their reasoning yet. Let's start with a math problem again — this one is actually a little bit tricky. The model gave a wrong answer, and then we prompted it with, "Review your previous answer and find problems with your answer." Interestingly, after reviewing, the model recognized the mistake. This looks promising, right? Then we prompt again, "Based on the problems you found, improve your answer," and the final answer here is correct.
But if the original answer is correct and we use the same prompts, the model may change it into a mistake. That's the problem. So overall, while letting an LLM review its generated response can help correct inaccurate answers, it also risks changing correct answers into incorrect ones. We ran extensive studies on benchmarks like GSM8K, CommonSenseQA, and TimeQA, and we didn't see any improvement from self-correction; it just made things worse.
Probably you've seen some reported improvements in the literature, where they said they saw improvement in reasoning from self-correction; actually, they used oracle answers. "Oracle" means you only prompt the LLM to correct the answer when the answer is wrong. The problem is, in reality the model doesn't know whether its answer is correct or wrong.
This also relates to multi-agent debate: using multiple agents that debate each other to reach agreement or consensus. We tried this approach too, and we found the trick is really about how many responses are generated in total. For example, with three agents each generating a response, that's three responses; if we then let them debate and respond to each other, that's nine responses in total. So what if we just run self-consistency with nine responses and see what happens? We found that these debate approaches cannot outperform self-consistency, which is much simpler: sample N times and take the most frequent answer as the final prediction.
So the lesson we learned is that oracle feedback is needed for an LLM to self-correct. This led to our work on self-debugging, which naturally leverages unit tests as the oracle: in coding problems, you naturally have unit tests. Actually, we started this work quite early, and it eventually converged to this approach.
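A sketch of that self-debugging loop follows, under these assumptions: complete() and run_unit_tests() are hypothetical placeholders for the LLM call and the test harness, and the prompt wording is paraphrased, not taken from the paper:

```python
# Sketch of self-debugging: generate code, run the unit tests as an oracle,
# and if they fail, feed the error back to the model and ask it to fix the code.
# `complete` and `run_unit_tests` are hypothetical placeholders.

def complete(prompt: str) -> str:
    """Placeholder for an LLM completion call."""
    raise NotImplementedError

def run_unit_tests(code: str) -> tuple[bool, str]:
    """Placeholder: returns (passed, error_message) for the generated code."""
    raise NotImplementedError

def self_debug(task: str, max_rounds: int = 3) -> str:
    code = complete(f"Write a Python function for this task:\n{task}")
    for _ in range(max_rounds):
        passed, error = run_unit_tests(code)     # unit tests act as the oracle
        if passed:
            break
        code = complete(
            f"Task:\n{task}\n\nYour code:\n{code}\n\n"
            f"It fails the unit tests with:\n{error}\n"
            "Explain the bug and return the corrected code."
        )
    return code
```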
The last limitation I want to talk about: premise order matters in reasoning. These days, every time a technical report shows up on arXiv or somewhere, people report great results — for example, a model recently reported excellent results on GSM8K. But should we trust those numbers? Models these days are trained with data from all over the internet, so there can already be contamination problems.
One line of work on my team is to generate different evaluation tasks to test the models. Here we did a simple check: given an original GSM8K problem, we reorder its sentences a little and see whether the model can still solve it. For example, the original problem says, "He loses 10 beers while getting home"; we can just move this sentence to the end and see what happens. We made such changes to a set of GSM8K problems and observed a drop of about 10 points in solve rate across all foundation models. Here you can compare the responses to the original problem and to the reordered problem: you can see that the model only knows how to solve the problem by consuming the premises in order; it can't go back and forth.
One could say that this may be related to semantic understanding rather than reasoning, so we designed another task: logical inference. It's purer than the math problems — just if-then, if-then, if-then rules — and we don't even use real words, just random tokens. Given the rules and the facts, the model does logical inference to answer the query. In the original problems, the rules are ordered according to the order in which they are used in the inference process (though not all rules are necessary for the query). Then we randomly reorder the rules — only the rules relevant to the query; the irrelevant ones keep their positions. Surprisingly, we then saw a performance drop of more than 30 points across all models.
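As a small illustration of this kind of perturbation, here is a sketch that shuffles only the query-relevant rules while leaving the others in place. This is my own illustrative code, not the paper's released evaluation script:

```python
# Illustrative sketch of the premise-reordering perturbation: permute only the
# rules relevant to the query, keep the irrelevant rules in their positions,
# then check whether the model still solves the reordered problem.
import random

def reorder_relevant_rules(rules: list[str], relevant: set[int], seed: int = 0) -> list[str]:
    rng = random.Random(seed)
    relevant_positions = [i for i in range(len(rules)) if i in relevant]
    shuffled = [rules[i] for i in relevant_positions]
    rng.shuffle(shuffled)                          # permute only the relevant rules
    out = list(rules)
    for pos, rule in zip(relevant_positions, shuffled):
        out[pos] = rule                            # irrelevant rules keep their slots
    return out
```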
From my personal experience, I think it's really important to design your own evaluation experiments when doing research. It's just like debugging.
Okay, now let me summarize the talk. The first thing I talked about is that generating intermediate steps improves performance a lot. You can get those steps through training, fine-tuning, or prompting with intermediate steps, but also zero-shot, through analogical reasoning, or through special decoding like the chain-of-thought decoding I presented today.
Also, self-consistency greatly improves step-by-step reasoning, whether the responses come from prompting or from a fine-tuned model. And I listed a number of limitations — irrelevant context, self-correction, and premise order — all of which matter for reasoning performance.
So what's the next problem? I think the most important thing is this: we say we work on AI, but that by itself is not the point. The point is to define the right problem to work on and to solve it from first principles. That is super important.
And actually, I'm organizing a conference called the Conference on Language Modeling (COLM) with a lot of amazing people. It's the first-ever conference dedicated to language modeling — you're welcome to join. Yeah, that's it. Thanks.