Stanford CS230 | Autumn 2025 | Lecture 8: Agents, Prompts, and RAG
Watch the original video here: https://www.youtube.com/watch?v=k1njvbBmfsw
Hi everyone, welcome to another lecture for CS230 Deep Learning. Today we're going to talk about enhancing large language model applications, and I call this lecture "Beyond LLM." It has a lot of newer content. The idea behind this lecture is that we started to learn about neurons, then we learned about layers, then we learned about deep neural networks, and then we learned a little bit about how to structure projects in C3. Now we're going one level beyond into what it would look like if you were building agentic AI systems at work, in a startup, or in a company.
It's probably one of the more practical lectures. Again, the goal is not to build a product end-to-end in the next hour or so, but rather to tell you all the techniques that AI engineers have cracked, figured out, or are exploring, so that after the class you have a breadth of view across different prompting techniques, agentic workflows, multi-agent systems, and evals. When you want to dive deeper, you have the background to dive deeper and learn faster.
Okay, let's try to make it as interactive as possible as usual. When we look at the agenda, the agenda is going to start with the core idea behind challenges and opportunities for augmenting LLMs. So, we start from a base model. How do we maximize the performance of that base model? Then we'll dive deep into the first line of optimization, which is prompting methods, and we'll see a variety of them. Then we'll go slightly deeper. If we were to get our hands under the hood and do some fine-tuning, what would it look like? I'm not a fan of fine-tuning, and I talk a lot about that, but I'll explain why I try to avoid fine-tuning as much as possible.
Then we'll do a section four on Retrieval Augmented Generation, or RAG, which you've probably heard of in the news. Maybe some of you have played with RAGs. We're going to unpack what a RAG is and how it works, and then the different methods within RAGs. Then we'll talk about agentic AI workflows. Andrew Ng was one of the first to name this trend agentic AI workflows, so we'll look at the definition that Andrew gives to agentic workflows and then start seeing examples.
Section six is very practical. It's a case study where we will think about an agentic workflow and I ask you to measure if the agent actually works, and we brainstorm how we can measure if an agentic workflow is working the way you want it to work. There's plenty of methods called "evals" that solve that problem. And then we'll look briefly at multi-agent workflow and then we can have a sort of open-ended discussion where I'll share some thoughts on what's next in AI. And I'm looking forward to hearing from you all as well on that one.
Okay, so let's get started with the problem of augmenting LLMs. Open-ended question for you. You are all familiar with pre-trained models like GPT-3.5 Turbo or GPT-4o. What's the limitation of using just a base model? What are the typical issues that might arise as you're using a vanilla pre-trained model?
Lacks some domain knowledge.
You're perfectly right. You know, we had a group of students a few years ago—it was not LLM related—but they were building an autonomous farming device or vehicle that had a camera underneath taking pictures of crops to determine if the crop is sick or not, if it should be thrown away, or if it should be used. And that data set is not a data set you find out there. And the base model or a pre-trained computer vision model would lack that knowledge, of course. What else?
Maybe the... [inaudible] ...high quality data...
So just to repeat for people online, you're saying the model might have been trained on high quality data but the data in the wild is actually not that high quality. And in fact, yes, the distribution of the real world might differ—as we've seen with GANs—from the training set, and that might create an issue with pre-trained models. Although pre-trained LLMs are getting better at handling all sorts of data inputs. Yes.
Lacks current information.
The LLM is not up to date. And in fact, you're right. Imagine you had to retrain your LLM from scratch every couple of months. One story that I found funny, from probably five years ago or more: during his first presidency, President Trump one day tweeted "Covfefe." You remember that tweet? Just "Covfefe." It was probably a typo, or the phone was in his pocket, I don't know. But that word did not exist, and the language models that Twitter was running at the time could not recognize it.
And so the recommender system sort of went wild, because suddenly everybody was making fun of that tweet using the word "Covfefe" and the LLM was so confused about what it meant, where it should be shown, and to whom. It's an example of how, nowadays, especially on social media, there are so many new trends that it's very hard to retrain an LLM to keep up and understand the new words out there. I mean, you oftentimes hear Gen Z words like "rizz" or "mid" or whatever. I don't know all of them, but you probably want to find a way that allows the LLM to understand those trends without retraining it from scratch. Yeah. What else?
It's trained to have a breadth of knowledge and if you wanted to do something specialized that might...
Yeah. It might be trained on a breadth of knowledge but it might fail or not perform adequately on a narrow task that is very well defined. Think about enterprise applications. Yeah, enterprise application, you need high precision, high fidelity, low latency, and maybe the model is not great at that specific thing. It might do fine but just not good enough and you might want to augment it in a certain way. Yeah.
So it makes the model a lot heavier, a lot slower.
So maybe it has a lot of broad domain knowledge that might not be needed for your application and so you're using a massive heavy model when you actually are only using 2% of the model capability. You're perfectly right. You might not need all of it. So you might find ways to prune, quantize the model, modify it. All of these are good points.
I'm going to add a few more as well. LLMs are very difficult to control. Your last point is actually an example of that: you want to control the LLM to use a part of its knowledge, but it doesn't; it in fact gets confused. We've seen that in history. In 2016, Microsoft created a notorious Twitter bot that learned from users and quickly became a racist jerk. Microsoft ended up removing the bot 16 hours after launching it. The community was really fast at determining that this was a racist bot. And you know, you can empathize with Microsoft in the sense that it is actually hard to control an LLM. They might have done a better job qualifying it before launching, but it is really hard to control.
And even more recently, this is a tweet from Sam Altman last November where there was this debate between Elon Musk and Sam Altman on whose LLM is the "left-wing propaganda machine" or the "right-wing propaganda machine" and they were hating on each other's LLMs. But that tells you at the end of the day that even those two teams, Grok and OpenAI, which are probably the best funded teams with a lot of talent, are not doing a great job at controlling their LLMs. And from time to time, if you hang out on X, you might see screenshots of users interacting with LLMs and the LLM saying something really controversial or racist or something that would not be considered great by social standards, I guess. And that tells you that the model is really hard to control.
The second aspect of it is something that you've mentioned earlier: LLMs may underperform on your task. That might include specific knowledge gaps, such as medical diagnosis. If you're doing medical diagnosis, you would rather have an LLM that is specialized for that and is great at it. And in fact, something that we haven't mentioned as a group is sources. In many fields you want the answer to be specifically sourced: you have a hard time believing something unless you have the actual source of the research that backs it up.
Inconsistencies in style and format. So, imagine you're building a legal AI agentic workflow. Legal has a very specific way to write and read where every word counts. You know, if you're negotiating a large contract, every word on that contract might mean something else when it comes to the court. And so it's very important that you use an LLM that is very good at it. The precision matters. And then you know, task specific understanding such as doing a classification on a niche field.
Here I pulled an example where let's say a biotech product is trying to use an LLM to categorize user reviews into positive, neutral or negative. Maybe for that company something that would be considered a negative review typically is actually considered a neutral review because the NPS of that industry tends to be way lower than other industries, let's say. That's a task specific understanding and the LLM needs to be aligned to what the company believes is the categorization that it wants. We will see an example of how to solve that problem in a second.
And then limited context handling. A lot of AI applications, especially in the enterprise, require data that has a lot of context. Just to give you a simple example, knowledge management is an important space that enterprises buy a lot of knowledge management tools. When you go on your drive and you have all your documents, ideally you could have an LLM running on top of that drive. You can ask any question and it will read immediately thousands of documents and answer "what was our Q4 performance in sales?" It was X dollars. It finds it super quickly. In practice, because LLMs do not have a large enough context, you cannot use a standalone vanilla pre-trained LLM to solve that problem. You will have to augment it. Does that make sense?
The other aspect around context windows is they are in fact limited. If you look at the context windows of the models from the last 5 years, even the best models today will range in context window or number of tokens it can take as input somewhere in the hundreds of thousands of tokens max. Just to give you a sense, 200,000 tokens is roughly two books. So that's how much you can upload and it can read pretty much. And you can imagine that when you're dealing with video understanding or heavier data files that is of course an issue. So you might have to chunk it, you might have to embed it, you might have to find other ways to get the LLM to handle larger contexts.
The attention mechanism is also powerful but problematic because it does not do a great job at attending in very large contexts. There is actually an interesting problem called "needle in a haystack." It's an AI problem, or call it a benchmark, where in order to test if your LLM is good at attending—at putting attention on a very specific fact within a large corpus—researchers might randomly insert in a book one sentence that outlines a certain fact. Such as "Arun and Max are having coffee at Blue Bottle" in the middle of the Bible, let's say, or some very long text. And then you ask the LLM, "What were Arun and Max having at Blue Bottle?" And you see if it remembers that it was coffee. It's actually a complex problem not because the question is complex but because you're asking the model to find a fact within a very large corpus and that's complicated.
So again, this is a limiting factor for LLMs. We'll talk about RAG in a second but I want to preview the debates around whether RAG is the right long-term approach for AI systems. So as a high level idea, a RAG is a mechanism, if you will, that embeds documents that an LLM can retrieve and then add as context to its initial prompt and answer a question. It has lots of application; knowledge management is an example. So imagine you have your drive again but every document is sort of compressed in representation and the LLM has access to that lower dimensional representation.
The debate that this tweet from Yu outlines is: in theory, if we have infinite compute, then RAG is useless because you can just read a massive corpus immediately and answer your question. But even in that case, latency might be an issue. Imagine the time it takes for an AI to read your entire drive every single time you ask a question. It doesn't make sense. So RAG has other advantages beyond accuracy. On top of that, the sourcing matters as well; RAG allows you to source. We'll talk about all that later. But there is always this debate in the community about whether a certain method is actually future-proof, because in practice, as compute power doubles every year, let's say, some of the methods we're learning right now might not be relevant three years from now. We don't know, essentially.
And the analogy that he makes on context windows, and why RAG approaches might be relevant even a long time from now, is search. When you search on a search engine you still get sources of information, and in the background there are very detailed traversal and ranking algorithms that find the specific links that might be the best to present to you. Whereas if you had to read the entire web every single time you ran a search query, without being able to narrow down to a certain portion of the space, that again would not be reasonable.
Okay, when we're thinking of improving LLMs, the easiest way we think of it is two dimensions. One dimension is we are going to improve the foundation model itself. So for example, we move from GPT-3.5 Turbo to GPT-4 to GPT-4o to GPT-5. Each of that is supposed to improve the base model. GPT-5 is another debate because it's sort of packaging other models within itself. But if you're thinking about 3.5, 4, and 4o, that's really what it is. The pre-trained model improves and so you should see your performance improve on your tasks.
But the other dimension is that we can actually engineer around the LLM in a way that makes it better. You can simply prompt GPT-4o. You can chain some prompts and improve the prompt, and that has been shown to improve performance. You can put a RAG around it. You can put an agentic workflow around it. You can even put a multi-agent system around it. And that is another dimension for you to improve performance. So that's how I want you to think about it: which LLM am I using, and then how can I maximize the performance of that LLM. This lecture is about the vertical axis. Those are the methods that we will see together.
Sounds good for the introduction. So let's move to prompt engineering. I'm going to start with an interesting study just to motivate why prompt engineering matters. There is a study from Harvard Business School and Wharton at UPenn, among others, that took a subset of BCG consultants, individual contributors, and split them into three groups. One group had no access to AI. One group had access to, I think it was, GPT-4. And one group had access to the LLM but also a training on how to prompt better. And then they observed the performance of these consultants across a wide variety of tasks.
There are a few things they noticed that I thought were interesting. One is something they call the "Jagged Frontier," meaning that certain tasks consultants do fall beyond the jagged frontier: AI is not good enough, it's not improving human performance, and in fact it's making it worse. And some tasks are within the frontier, meaning that AI is significantly improving the performance, the speed, and the quality of the consultant. Many tasks fall within and many fall outside, and they shared their insights, but the TL;DR is that there is a frontier within which AI is absolutely helping, and outside of it they call out this behavior of "falling asleep at the wheel," where people relied on AI for a task that was beyond the frontier and it ended up going worse because the human was not reviewing the outputs carefully enough.
They did note that the group that was trained on prompt engineering did better than the group that was not, which also motivates why this lecture matters, so that you're within that group afterwards. One other insight was the Centaurs and the Cyborgs. They noticed that consultants had a tendency to work with AI in one of two ways, and you might yourself be part of one of these groups. Centaurs are mythical creatures that are half human, half horse. Those were individuals who would divide and delegate. They might give a pretty big task to the AI. So imagine you're working on a PowerPoint, which consultants are known to do. You might write a very long prompt on how you want the PowerPoint done, let the AI work for some time, then come back and it's done.
Others would act as cyborgs. Cyborgs are fully blended: a human augmented with robotic parts. Those individuals would not fully delegate a task. They would work super quickly with the model, back and forth. I find that a lot of students actually work more like cyborgs than centaurs, whereas in the enterprise, when you're trying to automate a workflow, you're thinking more like a centaur. That's just something good to keep in mind. Also, a lot of companies will tell you, "Oh, we're hiring prompt engineers," etc. It's a career. I don't buy that. I think it's just a skill that everybody should have. You're not going to make a career out of prompt engineering, but you're probably going to use it as a very powerful skill in your career.
So let's talk about basic prompt design principles. I'm giving you a very simple prompt here: "Summarize this document" and then the document is uploaded alongside it. And the model has not much context around what should be the summary, how long should be the summary, what should it talk about, etc. You can actually improve these prompts by doing something like "Summarize this 10-page scientific paper on renewable energy in five bullet points focusing on key findings and implications for policy makers." That's already better, right? You're sharing the audience and it's going to tailor it to the audience. You're saying that you want five bullet points and you want focus only on key findings. You know, that's a better prompt, you would argue. How could you even make this prompt better? What are other techniques that you've heard of or tried yourself that could make this one-shot prompt better?
Example.
Example. So, you mean here is an example of a great summary. Yeah, you're right. That's a good idea.
Act like you are...
Very popular technique: "Act like a renewable energy expert giving a conference at Davos," let's say. Yeah, that's great.
Sounds like you're really good at it, like you are the best in the world at this...
"You are the best in the world at this, explain..." [Laughter] Yeah actually, I mean these things work, it's funny but it does work to say "Act like XYZ." It's a very popular prompt template. But we'll see a few examples. What else could you do?
Critique your own project.
Critique your own project. So you're using reflection. So you might actually do one output and then ask it to critique it and then give it back. Yeah, we'll see that. That's a great one. That's the one that probably works best within those typically, but we'll see some examples. What else? Yeah.
Breaks.
Okay. Break the task down into steps. Do you know what that's called? Chain of Thought. This is actually a popular method that research has shown improves performance. You give a clear instruction and also encourage the model to think step by step: approach the task step by step and do not skip any step. And then you give it some steps, such as: Step one, identify the three most important findings. Step two, explain how each finding impacts renewable energy policy. Step three, write the five-bullet summary with each point addressing a finding. So, chain of thought—I linked the paper from 2023 that popularized chain of thought. Chain of thought is very, very popular right now, especially in AI startups that are trying to control their LLMs.
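As a rough illustration of what that step-by-step prompt could look like in code (a sketch, not the lecture's code; `call_llm` is a placeholder for whatever chat-completion API you use):

```python
# Minimal sketch of a chain-of-thought style prompt.
def call_llm(prompt: str) -> str:
    """Placeholder: wire this up to your LLM provider of choice."""
    raise NotImplementedError

def build_cot_summary_prompt(paper_text: str) -> str:
    return (
        "Summarize this 10-page scientific paper on renewable energy in five bullet points, "
        "focusing on key findings and implications for policymakers.\n"
        "Approach the task step by step and do not skip any step:\n"
        "Step 1: Identify the three most important findings.\n"
        "Step 2: Explain how each finding impacts renewable energy policy.\n"
        "Step 3: Write the five-bullet summary, with each point addressing a finding.\n\n"
        f"Paper:\n{paper_text}"
    )

# summary = call_llm(build_cot_summary_prompt(paper_text))
```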
Okay, to go back to your examples about "act like XYZ," what I like to do—Andrew also talks about that—is to look at other people's prompts. And in fact, online you have a lot of prompt repositories for free on GitHub. In fact, I linked the "Awesome Prompt Template" repo on GitHub where you have so many examples of great prompts that engineers have built. They said it works great for us and they published it online. And a lot of them start with act as, you know, "act as a Linux terminal," "act as an English translator," "act like a position interviewer," etc.
The advantage of a prompt template is that you can put it in your code and scale it across many user requests. Let me give you an example from Workera. Workera evaluates skills—some of you have taken the assessments already—and tries to personalize them to the user. In an enterprise HR system you might have "Jane is a product manager, level three, she is in the US, and her preferred language is English." That metadata can be inserted into a prompt template that we personalize for Jane. And similarly for Joe, whose preferred language is Spanish: it will tailor it to Joe. That's called a prompt template.
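A minimal sketch of such a prompt template in Python; the field names and wording are made up for illustration, not any product's actual template:

```python
# A prompt template is just a parameterized string filled in per user request.
PROMPT_TEMPLATE = (
    "Act like an expert career mentor.\n"
    "The user is {name}, a {role} (level {level}) based in {country}.\n"
    "Respond in {language}.\n\n"
    "User request: {user_request}"
)

def personalize(user_record: dict, user_request: str) -> str:
    return PROMPT_TEMPLATE.format(**user_record, user_request=user_request)

jane = {"name": "Jane", "role": "product manager", "level": 3,
        "country": "US", "language": "English"}
print(personalize(jane, "Help me plan my learning goals for this quarter."))
```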
Do the foundation models use something you have to...?
So the question is: do the foundation models use prompt templates, or do you have to integrate one yourself? The foundation models probably use a system prompt that you don't see. When you type on ChatGPT, it is possible—it's not public—that OpenAI behind the scenes has something like "Act like a very helpful assistant for this user, and by the way here are your memories about the user that we kept in a database; you can check your memories," and then your prompt goes underneath and the generation starts. So they're probably using something like that, but it doesn't mean you can't add one yourself. In fact, if you think about a prompt template for the Workera example I was showing, maybe when you call OpenAI it starts with their hidden "Act like a helpful assistant," and then underneath it's like "Act like a great AI mentor that helps people in their career," and OpenAI's prompt template also has "Follow the instructions from the creator" or something like that. It's possible.
Questions about prompt templates? Again, I would encourage you to go and read examples of prompts. Some of them are quite thoughtful. Let's talk about zero-shot versus few-shot prompting. It came up earlier. Here's an example. Again, going back to the categorization of product reviews. Let's say that we're working on a task where the prompt is "Classify the tone of this sentence as positive, negative, or neutral." And then you paste the review which is "The product is fine but I was expecting more."
If I were to survey the room I would bet that some of you would say it's negative, some of you would say it's neutral, because you actually have a first part that is relatively positive—"It's fine"—and then the second part "I was expecting more" which is relatively negative. So where do you land? This can be a subjective question and maybe in one industry this would be considered amazing and another one it would be considered really bad because people are used to really flourishing reviews. And so the way you can actually align the model to your task is by converting that zero-shot prompt—zero-shot refers to the fact that it's not being given any example—into a few-shot prompts where the model is given in the prompt a set of examples to align it to what you want it to do.
So the example here is you paste the same prompt as before with the user review, and then you add "Here are examples of tone classifications": "This exceeded my expectations completely." -> Positive. "It's okay but I wish it had more features." -> Negative. "The service was adequate. Neither good nor bad." -> Neutral. "Now classify the tone of this sentence..." after those examples. And the model then says "Negative." The reason it says negative, of course, is likely the second example, "It's okay but I wish it had more features," which we told the model was negative; the model is now aligned with your expectations.
Few-shot prompts are very popular. And in fact, at AI startups that are slightly more sophisticated, you might see them keep a prompt up to date: whenever a user says something, they might have a human label it and then add it as a few-shot example in the relevant prompt in their codebase. You can think of that as almost building a dataset, but instead of building a separate dataset like we've seen with supervised fine-tuning and then fine-tuning the model on it, you're putting it directly in the prompt. And it turns out it's probably faster to do that if you want to experiment quickly, because you don't touch the model parameters; you just update your prompts. And if they are text examples, you can concatenate many examples in a single prompt. At some point it will be too long and you will not have the necessary context window, but it's a pretty strong approach to quickly align an LLM.
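Here is a minimal sketch of that pattern: a running list of human-labeled examples concatenated into a few-shot prompt. The examples mirror the ones above; everything else is illustrative.

```python
# Few-shot prompt assembled from a running list of human-labeled examples.
# As new edge cases get labeled, you append them here instead of retraining anything.
LABELED_EXAMPLES = [
    ("This exceeded my expectations completely.", "Positive"),
    ("It's okay but I wish it had more features.", "Negative"),
    ("The service was adequate. Neither good nor bad.", "Neutral"),
]

def build_few_shot_prompt(review: str) -> str:
    lines = ["Classify the tone of the sentence as Positive, Negative, or Neutral.",
             "Here are examples of tone classifications:"]
    for text, label in LABELED_EXAMPLES:
        lines.append(f'"{text}" -> {label}')
    lines.append(f'Now classify the tone of this sentence: "{review}"')
    return "\n".join(lines)

print(build_few_shot_prompt("The product is fine but I was expecting more."))
```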
Research on how long can be until it starts with...
So the question was: is there any research on how long the prompt can be before the model essentially loses itself or stops following instructions? There is. The problem is that the research is outdated every few months because models get better, so I don't know where the state of the art is. You can probably find it online on benchmarks. On the Workera product, you have a voice conversation—some of you have tried it—where you're asked "Explain what a prompt is," you explain, and there's a scoring algorithm behind it. We know that after eight turns the model loses itself: because you always paste the previous user responses, it just starts going wild. So the technique we use in the background is to create chapters of the conversation. Maybe one chapter is the first eight turns, and then you start over from another prompt: you summarize the first part of the conversation, insert the summary, and keep going. Those are engineering hacks that engineers might have figured out in the background. Yeah. Because eight turns makes a prompt quite long actually.
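A rough sketch of that chaptering hack, assuming an eight-turn budget and a placeholder `call_llm` helper:

```python
# Sketch of "chaptering": once a conversation exceeds N turns, compress the older
# turns into a summary and continue from that summary plus the most recent turns.
MAX_TURNS = 8

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder for your chat-completion API

def compact_history(turns: list[str]) -> list[str]:
    if len(turns) <= MAX_TURNS:
        return turns
    summary = call_llm(
        "Summarize the key facts and decisions from this conversation so far:\n"
        + "\n".join(turns[:-2])  # keep the two most recent turns verbatim
    )
    return [f"Summary of earlier conversation: {summary}"] + turns[-2:]
```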
Let's move on to chaining. Chaining is the most popular technique out of everything we've seen so far in prompt engineering. It's not Chain of Thought. Chain of Thought, we've seen, is "think step by step, step one, step two, step three, do not skip any step." This is different. This is chaining complex prompts to improve performance. And this is what it looks like. You take a single-step prompt such as "Read this customer review and write a professional response that acknowledges their concern, explains the issue, and offers a resolution," and then you paste the customer review, which is, "I ordered a laptop, it arrived 3 days late, the packaging was damaged, very disappointing. I needed it urgently for work." And then the output is an email that is immediately given to you by the LLM after it reads the prompt.
So, this might work, but it might be hard to control. Think about it: there are multiple steps listed and everything is embedded in the same prompt. If you wanted to debug step by step and know which step is weaker, you couldn't; you would have everything mixed together. So one advantage of chaining is that you separate the prompts so you can debug them separately, and it also makes it easier to improve your workflow.
Let's say the first prompt is "Extract the key issues. Identify the key concerns mentioned in this customer review," and you paste the customer review. Second prompt: "Using these issues"—so you paste back the issues—"draft an outline for a professional response that acknowledges concerns, explains possible reasons, and offers a resolution." Then prompt number three: write the full response. "Using the outline, write the professional response." And then you get your final output.
Now, up front, you can't tell me "Oh, the second approach is better than the first one." But what you can notice is that we can test those three prompts separately from each other and determine whether we will get the most gains out of engineering the first prompt, the second one, or the third one. We now have three prompts that are independent from each other. And maybe if the outline were better, the performance of the email—the open rate, or the user satisfaction with the response—would actually be higher. So chaining improves performance, but most importantly it helps you control your workflow and debug it more seamlessly.
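As a sketch, the three-prompt chain might look like this in code; `call_llm` is again a placeholder, and returning the intermediate outputs is what makes the per-step debugging possible:

```python
# Sketch of chaining: each step is its own prompt, so each intermediate output
# can be logged, inspected, and improved independently.
def call_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder for your chat-completion API

def respond_to_review(review: str) -> dict:
    issues = call_llm(
        "Identify the key concerns mentioned in this customer review:\n" + review
    )
    outline = call_llm(
        "Using these issues, draft an outline for a professional response that "
        "acknowledges concerns, explains possible reasons, and offers a resolution:\n" + issues
    )
    email = call_llm(
        "Using this outline, write the professional response email:\n" + outline
    )
    # Returning the intermediate steps makes debugging and evals much easier.
    return {"issues": issues, "outline": outline, "email": email}
```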
So if we know that the three prompts independently work very well, if we combine them into one prompt and we highlight that step-by-step thinking process, does on average we get the same output or we still have to do that breakdown?
So let me try to rephrase. You say let's say we look at the first prompt which has all three tasks built in that prompt... why do we need the three steps?
Yeah.
Yeah. I mean, think about it. The intermediate output is what you want to see. Like if I'm debugging the first approach, the way I would do it is I would capture user insights. Like, here's the email, how good was the response? Thumbs up, thumbs down. Was your issue resolved? Thumbs up, thumbs down. Those would tell me "how good is my prompt?" And I can engineer that prompt, optimize it, and I would probably drive some gains. But I will not be able easily to trace back to what the problem was.
Whereas in the second approach, not only can I use the end-to-end metrics to improve my process, I can also use the intermediate steps. For example, if I look at prompt two and I see the outline is actually "meh," not great, then I think I can get a lot of gains out of the outline. Or the outline is really good, but the last prompt doesn't do a good job of translating it into an email: the outline is exactly what I want the LLM to do, but the translation into a customer-facing email is not good—it doesn't follow our internal vocabulary. Then I know the third prompt is where I would get the most gains. So that's what chaining allows me to do: have intermediate steps to review.
Are there any latency...?
We'll talk about it. Are there any latency concerns? Yes. In certain applications you don't want to use a chain or you don't want to use a long chain because it adds latency. We'll talk about that later. Good point. So practically this is what chaining complex prompts look like. You have your first prompt with your first task. It outputs... the output is pasted in the second prompt with the second task being defined. The output is then pasted into the third prompt with the third task being defined and so on. That's what it looks like in practice.
Super. We'll talk more later about testing your prompts, but there are methods now to do it and we'll see later in this lecture with our case study how we can test our prompts. But here is an example of how you might do it. You might have a summarization workflow prompt that is the baseline. It's a single prompt. You might have a refined summarization which is a modified prompt of this or a workflow with a chain, you know. And then you have your test case, which is the input that you want to summarize, let's say, and then you have the generated output, and you can have humans go and rate these outputs.
And you would notice that the baseline is better or worse than the refined prompt. Of course, this manual approach takes time. Um, but it's a good way to start and usually the advice is get hands-on at the beginning because you would quickly notice some issues and it will give you better intuition on what tweaks can lead to better performance. However, if you wanted to scale that system across many products, many parts of your codebase, you might want to find a way to do that automatically without asking humans to review and grade summaries, right?
One approach is to use platforms like... at Workera, our team uses a platform called Promptfoo that allows you to actually automate part of this testing. In a nutshell, what it does is it can allow you to run the same prompt with five different LLMs immediately, put everything in a table that makes it super easy for a human to grade, let's say. Or alternatively, it might allow you to define LLM judges. LLM judges can come in different flavors. For example, I can have an LLM judge that does a pair-wise comparison. So, what the LLM is asked to do is: "Here are two summaries. Just tell me which one is better than the other one."
That's what the LLM does, and that can be used as a proxy for how good the baseline summarization is versus the refined version. Another flavor of LLM judge is single-answer grading: "Here's a summary, grade it from one to five." You can go even deeper and do a reference-guided pairwise comparison, or you add a rubric. You say, "A five is when a summary is below 100 characters, mentions at least three distinct key points, and starts with a first sentence that gives the overview before going into detail—that's a great summary, five out of five. Zero is the LLM failed to summarize and was very verbose," let's say. So you put a rubric behind it and you have an LLM judge apply the rubric. Of course, you can now pair different techniques: you can do few-shot for the rubric, giving examples of five-out-of-five, four-out-of-five, and three-out-of-five answers, because now you know multiple techniques.
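Here is a rough sketch of both judge flavors—pairwise comparison and rubric-based single-answer grading. The rubric text and the `call_llm` helper are illustrative placeholders:

```python
# Sketch of two LLM-judge flavors: pairwise comparison and rubric-based grading.
def call_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder for your chat-completion API

def pairwise_judge(source: str, summary_a: str, summary_b: str) -> str:
    return call_llm(
        "Here are two summaries of the same document. Answer only 'A' or 'B': "
        "which one is better?\n\n"
        f"Document:\n{source}\n\nSummary A:\n{summary_a}\n\nSummary B:\n{summary_b}"
    )

RUBRIC = (
    "5 = concise, at least three distinct key points, opens with an overview sentence.\n"
    "3 = covers the main points but is verbose or poorly structured.\n"
    "0 = fails to summarize, or is extremely verbose."
)

def rubric_judge(source: str, summary: str) -> str:
    return call_llm(
        "Grade this summary from 0 to 5 using the rubric below. Answer with a single digit.\n"
        f"Rubric:\n{RUBRIC}\n\nDocument:\n{source}\n\nSummary:\n{summary}"
    )
```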
Okay. Does that make sense? Yeah. Okay. So that was the second section on prompt engineering, or the first line of optimization. Now, let's say you've exhausted all your options for prompt engineering and you're thinking about actually touching the model, modifying its weights, or fine-tuning it. As I was telling you, I'm not a fan of fine-tuning. There are a few reasons why.
One, fine-tuning typically requires substantial labeled data, although there are now fine-tuning approaches that look more like few-shot prompting; the two are sort of merging, although one modifies the weights and the other doesn't. Fine-tuned models may also overfit to specific data, losing their general-purpose utility—we're going to see a funny example of that. You might fine-tune a model and then, when someone asks a pretty generic question, it doesn't do well anymore, even if it does well on your task. So it might be relevant or not.
And then it's time and cost intensive. That's my main problem. And you know, at Workera, we steer away from fine-tuning as much as possible. Because by the time you're done fine-tuning your model, the next model is out and it's actually beating your fine-tuned version of the previous model. So I would steer away from fine-tuning as much as you can. The advantage of the prompt engineering methods we've seen is you can put the next best pre-trained model directly in your code. It will update everything immediately. Fine-tuning doesn't work like that.
There are advantages though where it still makes sense. If the task requires repeated high precision output such as legal, scientific explanation, and if the general purpose LLM struggles with domain specific language. So, let's look at a quick example together, which is an example from Ross Lazerovitz, I think it was a couple of years ago, September '23, where Ross tried to do Slack fine-tuning. So, he looked at a lot of Slack messages within his company and he was like, "I'm going to fine-tune a model that speaks like us or operates like us because this is how we work, right? This is the data that represents how people work at the company."
And so he actually went ahead and fine-tuned the model. Gave it a prompt like "Hey write a 500-word blog post on prompt engineering." And the model responded, "I shall work on that in the morning." And then he tries to push the model a little further and say "It's morning now." And the model said, "I'm writing right now. It's 6:30 a.m. here. Write it now." Okay, please. [Laughter] "Okay, I shall write it now. I actually don't know what you would like me to say about prompt engineering. I can only describe the process. The only thing that comes to mind for a headline is how do we build prompt?"
It's kind of a funny example of fine-tuning gone wrong. He wanted the model to write the way people at his company write, and instead it ended up acting like them rather than following instructions. So that's one example of why I would steer away from fine-tuning.
Super. Let's talk about RAG. RAG is important to know, at least the basics. It's a very common interview question, by the way: if you go interview for a job, they might ask you to explain a RAG in a nutshell to a five-year-old, and hopefully after this you'll be able to do it. So, we've seen some of the challenges with standalone LLMs. Those challenges include the context window being small, the fact that it's hard to remember details within a large context window, knowledge gaps, and the knowledge cut-off date you mentioned earlier: the model might be trained up to a date and then it cannot follow the trends or be up to date.
Hallucinations. There are some fields—think about medical diagnosis—where hallucinations are very costly. You can't afford a hallucination. Even in education, imagine deploying a model for the US youth education and it hallucinates and it teaches millions of people something completely wrong. It's a problem. And then lack of sources. A lot of fields love sources. Research fields love sources. Education loves sources. Legal loves sources as well. And so the pre-trained LLM doesn't do a good job to source. And in fact, if you have tried to find sources on a plain LLM, it actually hallucinates a lot. It makes up research papers. It just lists like completely fake stuff.
Um so how do we solve that? With a RAG. RAG integrates with external knowledge sources, databases, documents, APIs. It ensures that answers are more accurate, up-to-date and grounded because you can actually update your document. Your drive is always up to date. I mean ideally you're always pushing new documents to it. And when you query "what is our Q4 performance in sales," hopefully there is the last board deck in the drive and it can read the last board deck. And more developer control. We'll see why RAGs allow for targeted customization without actually requiring the retraining of the model. In fact, you don't touch the model with RAGs. It's really a technique that is put on top of the model.
So to see an example of a RAG, this is a question answering application where we're in the medical field and a user is asking a query: "What are the side effects of drug X?" This is an important question. You can't hallucinate. You need to source. You need to be up to date. Maybe there is a new update to that drug that is now in the database and you need to read that. So you have to... a RAG is a great example of what you would want to use here.
The way it works is you have your knowledge base of a bunch of documents. You use an embedding model to embed those documents into lower-dimensional representations. For example, if the document is a long PDF, you might read the PDF, understand it, and then embed it. We've seen plenty of embedding approaches together—triplet loss, etc.—you remember. So imagine one of them here for LLMs, embedding those documents into lower-dimensional representations. If the representation is too small, you will lose information. If it's too big, you will add latency. It's a trade-off.
You will typically store those representations in a database called a vector database. There are a lot of vector database providers out there. The vector database essentially stores those vectors in a very efficient manner, allowing fast retrieval with a certain distance metric. Then you also embed the user prompt, usually with the same algorithm, and you run a retrieval process, which essentially says: based on the embedding of the user query and the vector database, find the relevant documents based on the distance between those embeddings.
Once you found the relevant documents, you pull them and then you add them to the user query with a system prompt or a prompt template on top. So the prompt template can be "Answer user query based on list of documents. If answer not in the document say I don't know." That's your prompt template where the user query is pasted, the documents are pasted and then your output should be what you want because it's now grounded in the document. You can also add to this prompt templates "tell me the exact page, chapter, line of the document that was relevant" and in fact link it as well just to be more precise. Any question on RAGs? There's a simple vanilla RAG. Yeah.
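To make that vanilla RAG concrete, here is a toy sketch; `embed` and `call_llm` stand in for a real embedding model and chat API, and the cosine similarity plays the role of the vector database's distance metric:

```python
# Toy sketch of a vanilla RAG. `embed` stands in for a real embedding model;
# in practice the vectors would live in a vector database, not a Python list.
import math

def embed(text: str) -> list[float]:
    raise NotImplementedError  # placeholder for a real embedding model

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder for your chat-completion API

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def answer(query: str, documents: list[str], k: int = 3) -> str:
    doc_vectors = [(doc, embed(doc)) for doc in documents]  # done offline in practice
    q_vec = embed(query)
    top_docs = sorted(doc_vectors, key=lambda dv: cosine(q_vec, dv[1]), reverse=True)[:k]
    context = "\n\n".join(doc for doc, _ in top_docs)
    return call_llm(
        "Answer the user query based only on the documents below. "
        "If the answer is not in the documents, say 'I don't know'.\n\n"
        f"Documents:\n{context}\n\nUser query: {query}"
    )
```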
Document embeddings still retain information about what's down on what page and what paragraph?
Question is, do the document embeddings still retain the information of the location of the information within that document, especially in big documents? Great question. We'll get to it in a second because you're right that the vanilla RAG might not do a good job with very large documents. So let's say, you know, when you open a medication box and you have this gigantic white paper with all the information and it's very long, maybe a vanilla RAG would not cut it.
So what people have figured out is a bunch of techniques to improve RAGs and in fact chunking is a great technique that is very popular. So you might actually store in the vector database the embedding of the full document and on top of that you will also store a chapter level vector, you know, and when you retrieve you retrieve the document, you retrieve the chapter, and that allows you to be more precise with the sourcing. It's one example.
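A rough sketch of that hierarchical idea, storing one vector for the whole document plus one per chunk; the chunking-by-character-count is just for illustration:

```python
# Sketch of hierarchical chunking: store an embedding for the whole document
# and one per chunk, so retrieval can point back to the precise section.
def embed(text: str) -> list[float]:
    raise NotImplementedError  # placeholder for a real embedding model

def index_document(doc_id: str, text: str, chunk_size: int = 1000) -> list[dict]:
    records = [{"doc_id": doc_id, "chunk_id": None, "text": text, "vector": embed(text)}]
    for i in range(0, len(text), chunk_size):
        chunk = text[i:i + chunk_size]
        records.append({"doc_id": doc_id, "chunk_id": i // chunk_size,
                        "text": chunk, "vector": embed(chunk)})
    return records  # in practice these rows go into a vector database
```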
Another technique that's popular is HyDE—Hypothetical Document Embeddings—where a group of researchers published a paper showing that when you get your user query, one of the main problems is that the query does not look like your documents. For example, the user query might be "What are the side effects of drug X?" while the vectors in the vector database represent very long documents. So how do you guarantee that the query embedding is going to be close to the document embedding?
What they do is use the user query to generate a fake, hallucinated document. They embed that document and then compare it to the vectors in the vector database. Does that make sense? So for example, the user says, "What are the side effects of drug X?" That query is given to another prompt that says, "Based on this user query, generate a five-page report answering it." It generates a potentially completely fake answer. You embed that, and it will likely be closer to the document you're looking for. That's one example of a RAG approach. Again, the purpose of this lecture is not to go through all of these branches and explain every single method that has been discovered for RAGs, but I just wanted to show you how much research has been done between 2020 and 2025 on RAGs and how many branches of research you now have to learn from. The survey paper is linked in the slides, by the way, and I'll share them after the lecture.
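A minimal sketch of the HyDE idea, with placeholder `call_llm` and `embed` helpers:

```python
# Sketch of HyDE: generate a hypothetical (possibly hallucinated) answer document
# from the query, embed that instead of the raw query, then retrieve as usual.
def call_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder for your chat-completion API

def embed(text: str) -> list[float]:
    raise NotImplementedError  # placeholder for a real embedding model

def hyde_query_vector(user_query: str) -> list[float]:
    fake_document = call_llm(
        "Based on this user query, write a detailed report that would answer it. "
        "It does not need to be factually correct; it only needs to look like a "
        f"real document on the topic.\n\nQuery: {user_query}"
    )
    return embed(fake_document)  # compare this vector against the vector database
```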
Super. So, we've made some progress. Hopefully now you feel like, if you were to start an LLM application, you know how to write better prompts, how to do chains, how to do fine-tuning, and how to do retrieval, and you have the background of techniques that you can go read about, find a codebase, pull the code, vibe-code it—you have the breadth. Now the next set of topics is around the question of how we could extend the capabilities of LLMs from performing single tasks, enhanced with external knowledge, to handling multi-step autonomous workflows. And this is where we get into proper agentic AI.
So let's talk about agentic AI workflows towards autonomous and specialized systems. Then we'll talk about evals. Then we'll see multi-agent systems. And we'll end with a little thoughts on what's next in AI.
So Andrew Ng actually coined the term agentic AI workflows. His reason was that a lot of companies say agents, agents, agents everywhere, but if you go and work at these companies, you notice that they mean very different things by "agent." Some people have a single prompt and call it an agent; other people have a very complex multi-agent system and call it an agent. Calling everything an agent doesn't do it justice. So Andrew says, let's call them agentic workflows, because in practice it's a bunch of prompts with tools, additional resources, and API calls that are ultimately put into a workflow, and you can call that workflow agentic. It's all about the multi-step process to complete the task.
Also, calling it an agentic workflow keeps us from mixing it up with what I called an agent last lecture in reinforcement learning, because in RL an agent has a very specific definition: it interacts with an environment, passes from one state to another, and gets a reward and an observation. You remember that chart, right?
So, here's an example of how we move from a one-step prompt to a multi-step agentic workflow. Let's say a user asks a product, "What is your refund policy?" on a chatbot. The response, using a RAG, says "Refunds are available within 30 days of purchase," and maybe the RAG can even link to the policy document. That's what we learned so far. Instead, an agentic workflow can function like this. The user says, "Can I get a refund for my order?" The agent retrieves the refund policy using a RAG. The agent then follows up with the user: "Can you provide your order number?" Then the agent queries an API to check the order details, and finally it comes back to the user and confirms, "Your order qualifies for a refund. The amount will be processed in 3 to 5 business days." This is much more thoughtful than the first version, which is sort of vanilla, right? So that's what we're going to talk about in the next couple of slides: how do we get from the first one to the second one.
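As a sketch, the hardcoded version of that refund workflow might look like this; `retrieve_policy`, `ask_user`, and `get_order` are hypothetical helpers standing in for a RAG lookup, a chat turn with the user, and an internal orders API:

```python
# Rough sketch of the refund workflow above, with the steps hardcoded.
def retrieve_policy(query: str) -> str:
    raise NotImplementedError  # RAG lookup over policy documents
def ask_user(question: str) -> str:
    raise NotImplementedError  # send a message to the user and wait for the reply
def get_order(order_id: str) -> dict:
    raise NotImplementedError  # internal orders API (fields here are hypothetical)

def handle_refund_request() -> str:
    policy = retrieve_policy("refund policy")                  # step 1: RAG lookup
    order_id = ask_user("Can you provide your order number?")  # step 2: follow-up question
    order = get_order(order_id)                                # step 3: API call
    if order["days_since_purchase"] <= 30:                     # step 4: apply the policy
        return ("Your order qualifies for a refund. "
                "The amount will be processed in 3 to 5 business days.")
    return f"Sorry, this order is outside the refund window. Policy: {policy}"
```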
There are plenty of specialized agentic workflows out there. If you hang out in SF, you probably see a bunch of billboards: the AI software engineer, the AI skills mentor you've interacted with in this class through Workera, the AI SDR, AI lawyers, the AI specialized cloud engineer. It would be a stretch to say that everything works, but there's work being done towards that. I'm not personally a fan of putting a face behind those things. I think it's gimmicky, and I think a few years from now very few products will have a human face behind them. It might be a marketing tactic from some startups, but it's more scary than it is engaging, frankly.
Um okay, I want to talk about the paradigm shift. That's especially useful. Let's say you're a software engineer or you're planning to be a software engineer because software engineering as a discipline is sort of shifting or at least the best engineers I've worked with are able to move from a deterministic mindset to a fuzzy mindset and balance between the two whenever they need to get something done. So here's the paradigm shift between traditional software and agentic AI software.
The first one is the way you handle data. Traditional software deals with structured data: you have JSONs, you have databases. Data is passed in a very structured manner through a data engineering pipeline and then displayed on a certain interface. The user might fill out a form that is then retrieved and written to the database. All of that has historically been structured data. Now more and more companies are handling free-form text and images, and all of that requires dynamic interpretation to transform an input into an output.
The software itself used to be deterministic. Now you have a lot of software that is fuzzy, and fuzzy software creates so many issues. Imagine if you let your users ask anything on your website: the chances that it breaks are tremendous, the chances that you're attacked are tremendous. It's really complicated—more complicated than people make it seem on Twitter. Fuzzy engineering is truly hard. You might get hate as a company because one user did something you authorized them to do that ended up breaking the database; we've seen that with many companies in the last couple of years. So it takes a very specialized engineering mindset to do fuzzy engineering, but also to know when you need to be deterministic.
The other thing is that with agentic AI software, you sort of want to think about your software the way a manager thinks about a team. You're familiar with the monolith and microservices approaches in software, where you structure your software in different boxes that can talk to each other, which allows teams to debug one section at a time. The equivalent with agentic AI is that you think as a manager: "Okay, if I were to delegate my product to a group of humans, what would those roles be?" Would I have a graphic designer that puts together a chart and sends it to a marketing manager, who converts it into a nice blog post, who hands it to the performance marketing expert, who publishes the blog post and then optimizes and A/B tests, then to a data scientist who analyzes the data, forms hypotheses, and validates or invalidates them?
That's how you would typically think if you're building an agentic AI software, when actually the equivalent of that in traditional software might be completely different. It might be "We have a data engineer box right here that handles all our data engineering. And then here we have the UI/UX stuff. Everything UI/UX related goes here. And you know companies might structure it in very different ways. And here's the business logic that we want to care about. And there's five engineers working on the business logic," let's say.
Testing and debugging are also very different, and we'll talk about that in the next section. The other thing that I feel matters is that with AI in engineering, the cost of experimentation is going down drastically, so people should be more comfortable throwing away code. In traditional software engineering, you probably don't throw away code a ton: you build code that is solid and bulletproof and then update it over time. Whereas we've seen AI companies be more comfortable throwing away code, which has advantages in terms of the speed at which you move, but also disadvantages in terms of the quality of your software—it can break more.
Okay. So anyway, I just wanted to do an aside on the paradigm shift from deterministic to fuzzy engineering. Actually, I can give you an example from Workera that we learned over roughly the last 12 months. If you've used Workera, you might have seen that the interface sometimes asks you multiple-choice questions, sometimes multiple-select, and sometimes drag-and-drop ordering or matching, right? Those are examples of deterministic item types, meaning you answer a multiple-choice question, there's one correct answer, it's fully deterministic.
On the other hand you sometimes have voice questions where you go through a role play or you have voice plus coding questions where your code is being read by the interface or whatever. Those are fuzzy, meaning the scoring algorithm might actually make mistakes and those mistakes might be costly. And so companies have to figure out a human in the loop system, which you might have seen with the appeal feature at the end. So at the end of the assessment, you have an appeal feature where it allows you to say, "I want to appeal the agent because I want to challenge what the agent said on my answer because I thought I was better than what the agent thought." And then you bring a human in the loop that then can fix the agent, can tell the agent, "Actually you were too harsh on the answer of this person." And you know, that's an example of a fuzzy engineered system that then adds a human in the loop to make it more aligned. And so if you're building a company, I would encourage you to think about what can I get done with determinism and let's get that done. And then the fuzzy stuff, I want to do fuzzy because it allows more interaction. It allows more back and forth, but I need to put guard rails around it. And how am I going to design those guard rails pretty much?
All right, here's another example of enterprise workflows that are likely to change due to agentic AI. This is a paper from McKinsey, I believe from last year, where they looked at a financial institution and observed that it often spends one to four weeks to create a credit risk memo. Here is the process: a relationship manager gathers data from 15 or more sources on the borrower, the loan type, and other factors. Then the relationship manager and the credit analyst collaboratively analyze the data from these sources. Then the credit analyst typically spends 20 hours or more writing a memo and goes back to the relationship manager. They give feedback, they go through this loop again and again, and it takes a long time to get a credit memo out.
They then ran a research study where they changed the process. They said GenAI agents could actually cut the time spent on credit risk memos by 20 to 60%. The process changes so that the relationship manager works directly with the GenAI agent system and provides the relevant materials needed to produce the memo. The agent subdivides the project into tasks that are assigned to specialist agents, gathers and analyzes the data from multiple sources, and drafts a memo. Then the relationship manager and the credit analyst sit down together, review the memo, give feedback to the agent, and are done in 20 to 60% less time.
And so this is an example where you're not changing the human stakeholders; you're just changing the process and adding GenAI to reduce the time it takes to get a credit memo out. Now imagine you're an enterprise with 100,000 employees—and there are a lot of enterprises with 100,000 employees out there—you are currently under pressure to redesign your workflows. It turns out that if you pull the job descriptions from the HR system and interpret them, and you also pull the business process workflows that you have encoded in your drive, you can find gains in multiple places. In the next few years you're probably going to see workflows being optimized to add GenAI.
Even if that happens, the hardest part is changing people. This is great in theory, but now try to roll out that second workflow to 10,000 credit risk analysts and relationship managers. My guess is it will take years—10, 20 years—to get this actually done at scale within an organization, because change is so hard. It's hard to rewire business workflows and job descriptions, to incentivize people to do things differently, and to train them. So this is where the world is going, but it's going to take a long time, I think.
Um okay then I want to talk about how the agent actually works and what are the core components of an agent? Imagine a travel booking AI agent. That's an easy example you've all thought about. I still haven't been able to get an agent to book a trip for me—or I was scared because it was going to book a very expensive or long trip. But in theory, you can have a travel booking agent that has prompts. So the prompts we've seen, we know the methods to optimize those prompts. That travel agent also has a context management system which is essentially the memory of what it knows about the user. That context management system might include a core memory or working memory and an archival memory.
Okay. The difference within memory is that not every memory needs to be fast to access. Think about it: you sign up for a product and the first question is "Hi, what's your name?" and I say "My name is Keon." That's probably going to sit in the working memory, because every time the agent talks to me it's going to want to use my name, right? But then maybe the second question is "Keon, what's your birthday?" and I give it my birthday. Does it need my birthday every day? Probably not. So it's probably going to park it in the long-term, or archival, memory, and those memories are slower to access; they're farther down the stack. That structure lets the agent determine what belongs in working memory and what belongs in long-term memory, and that makes it easier for the agent to retrieve things super fast.
Because think about it: when you interact with GPT, you feel that it's very personal at times, right? You feel like it understands you. Now imagine that every time you call it, it has to read through all of its memories. That's a very burdensome cost, because it happens every time you talk to it. So you want the working memory to be highly optimized. If it takes 3 seconds to look in the memory, then every time you talk to your LLM it's going to take 3 seconds, which you don't want. And then you have the tools. The tools can include APIs like a flight search API, hotel booking API, car rental API, weather API, and then a payment processing API.
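Before we go deeper on tools, here is a minimal sketch of that memory split. The class and field names are made up for illustration; the point is that working memory is small and read on every turn, while archival memory is only searched on demand.

```python
# Minimal sketch of tiered agent memory (hypothetical class and field names).
# Working memory is small and read on every turn; archival memory is only
# searched when a query seems to need it, so the per-turn cost stays low.

class AgentMemory:
    def __init__(self, working_capacity=5):
        self.working = {}        # small dict prepended to every prompt (e.g. user's name)
        self.archival = {}       # larger store, searched only on demand
        self.working_capacity = working_capacity

    def remember(self, key, value, important=False):
        # Keep frequently needed facts in working memory, park the rest.
        if important and len(self.working) < self.working_capacity:
            self.working[key] = value
        else:
            self.archival[key] = value

    def build_context(self, query=None):
        # Working memory always goes into the prompt; archival facts are
        # pulled in only if the current query mentions them.
        context = dict(self.working)
        if query:
            for key, value in self.archival.items():
                if key.lower() in query.lower():
                    context[key] = value
        return context

memory = AgentMemory()
memory.remember("name", "Keon", important=True)   # used on every turn
memory.remember("birthday", "June 3")              # rarely needed, goes to archival
print(memory.build_context("When is my birthday?"))
```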
And typically, you would want to tell your agent how that API works. It turns out that agents—or LLMs I should say—are very good at reading API documentation. So you give it the API documentation and it reads the JSON and it reads what does a GET request look like and this is the format that I need to push and then it pushes it in that format let's say and then it retrieves something. Does that make sense? Those different components.
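As an illustration, here is a rough sketch of handing an API to the model as a tool. The tool spec, the flight_search name, and the example URL are all hypothetical; the idea is that the model only ever sees the documentation, and your code executes whatever structured call it decides to emit.

```python
# Sketch of giving an API "as a tool" to an LLM. Everything here is made up;
# the model is shown FLIGHT_SEARCH_TOOL as documentation, and your code
# translates its structured call into an actual HTTP request.
import json
import urllib.parse

FLIGHT_SEARCH_TOOL = {
    "name": "flight_search",
    "description": "Search flights. Returns a JSON list of options.",
    "parameters": {
        "origin": "IATA code, e.g. SFO",
        "destination": "IATA code, e.g. CDG",
        "date": "YYYY-MM-DD",
    },
}

def execute_tool_call(tool_call: dict) -> str:
    # In a real system you'd send an HTTP GET here; we just build the request
    # string to show the format the model's arguments get translated into.
    if tool_call["name"] == "flight_search":
        query = urllib.parse.urlencode(tool_call["arguments"])
        return f"GET https://api.example.com/flights?{query}"
    raise ValueError(f"Unknown tool: {tool_call['name']}")

# Pretend the LLM read the spec above and emitted this structured call:
llm_tool_call = {
    "name": "flight_search",
    "arguments": {"origin": "SFO", "destination": "CDG", "date": "2025-12-15"},
}
print(execute_tool_call(llm_tool_call))
print(json.dumps(FLIGHT_SEARCH_TOOL, indent=2))  # this is what the model is shown
```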
You know Anthropic also talks about resources. Resources is data that is sitting somewhere that you might let your agent read. For example, if you're building your startups, you have a CRM. A CRM has data in it and you want to use lookups in that data. You will probably give a lookup tool and you will give access to the resource and it will do lookups whenever you want. Super fast. This type of architecture can be built with different degrees of autonomy from the least autonomous to the most autonomous and I'll give you a few examples.
Less autonomous would be you've hardcoded the steps. So let's say I tell the travel agent: "First identify the intent, then look up in the database the history of this customer with us and their preferences, then go to the write API blah blah blah, then go to the..." I would hardcode the steps. Okay, that's the least autonomous. The semi-autonomous is I might hardcode the tools but I'm not going to hardcode the steps. So, I'm going to tell the agent "Act like a travel agent and your task is to help the person book a travel and these are the tools that you have accessible to yourself." And so I'm not hard coding the steps. I'm just hard coding the tools that you have access to for yourself.
The most autonomous is where the agent decides the steps and can create its own tools. That's where you might actually give the agent access to a code editor, and the agent might be able to ping any API on the web or perform some web search. It might even write code to display data to the user, or perform calculations like, "I'm going to calculate the fastest route from San Francisco to New York and figure out which one is most appropriate for what the user is looking for," or "I want to calculate the distance between the airport and this hotel versus that hotel, and I'm going to write code to do that." So it's fully autonomous from that perspective. Okay. So remember those keywords: memory, prompts, tools, etc.
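Here is a small sketch of that spectrum, where call_llm() is a hypothetical stand-in for whatever model client you use: the first version hardcodes the steps, the second only hardcodes the tools in the prompt.

```python
# Sketch of the autonomy spectrum; call_llm() is a placeholder, not a real client.

def call_llm(prompt: str) -> str:
    return f"<model response to: {prompt[:40]}...>"  # placeholder

# Least autonomous: the steps are hardcoded; the LLM just fills in each one.
def book_trip_hardcoded(user_request: str) -> list[str]:
    steps = [
        "Identify the user's intent",
        "Look up this customer's history and preferences",
        "Call the flight search API",
        "Propose an itinerary",
    ]
    return [call_llm(f"{step}. Request: {user_request}") for step in steps]

# Semi-autonomous: only the tools are hardcoded; the model decides the steps.
SEMI_AUTONOMOUS_PROMPT = """Act like a travel agent. Your task is to help the
user book a trip. You have access to these tools: flight_search, hotel_search,
payment. Decide yourself which tools to call and in what order."""

print(book_trip_hardcoded("Paris in December")[0])
print(call_llm(SEMI_AUTONOMOUS_PROMPT + "\nUser: Paris in December"))
```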
Now I presented the flight API, but it does not have to be an API. You probably have heard the term MCP or Model Context Protocol that was coined by Anthropic. I pasted the seminal article on MCP at the bottom of this slide. But let me explain in a nutshell why those things would differ. In the API case, you would actually teach your LLM to ping an API. So you would say "This is how you ping this API and this is the data that it will send you back" and you would have to do that in a one-off manner. So you would have to build or sort of give the API documentation of your flight API, your booking hotel API, your car rental API and then you would give tools for your model to communicate with those APIs. It doesn't scale very well.
Versus MCP. MCP is really about putting a system in the middle that makes it simpler for your LLM to communicate with that endpoint. So for instance, you might have an MCP server and an MCP client sitting in front of that travel database or flight API, and your agent can just communicate with it and say, "Hey, what do you need in order to give me flight information?" The server responds, "I'd like you to tell me the origin, the destination, and at a high level what you're looking for; these are my requirements." "Okay, here's my request. Oh, you forgot to tell me your budget." "Let me give you my budget," and so on. It's agent-to-agent style communication, which allows more scalability: you don't need to hardcode everything. Companies have published their MCP servers out there, and your agent can communicate with them and figure out how to get the data it needs. Does that make sense?
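To illustrate the difference, here is a toy sketch of the discover-then-call idea. To be clear, this is not the actual MCP protocol or SDK, just made-up classes showing an agent asking a server what it needs before calling it.

```python
# NOT the real MCP wire protocol; a toy illustration of the idea that the
# client first asks the server what it offers and what inputs it needs,
# instead of hardcoding each API's format. All names are made up.

class ToyFlightServer:
    def describe(self) -> dict:
        # The server advertises its capability and required fields.
        return {
            "capability": "flight_search",
            "required": ["origin", "destination", "budget"],
        }

    def call(self, request: dict) -> dict:
        missing = [f for f in self.describe()["required"] if f not in request]
        if missing:
            return {"error": f"missing fields: {missing}"}
        return {"flights": [{"from": request["origin"], "to": request["destination"],
                             "price": min(450, request["budget"])}]}

class ToyAgentClient:
    def __init__(self, server):
        self.server = server

    def get_flights(self, known_info: dict) -> dict:
        spec = self.server.describe()            # "what do you need from me?"
        request = {k: known_info[k] for k in spec["required"] if k in known_info}
        result = self.server.call(request)
        # In a real agent, an error here is where it would go back to the user
        # ("you forgot to tell me your budget") and retry.
        return result

client = ToyAgentClient(ToyFlightServer())
print(client.get_flights({"origin": "SFO", "destination": "CDG"}))                 # missing budget
print(client.get_flights({"origin": "SFO", "destination": "CDG", "budget": 600}))  # succeeds
```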
So if the API changes, doesn't the MCP have to be rewritten as well, rather than the agent just handling it? Isn't it just shifting the problem?
Yes, I think that is ultimately the question: isn't it a shifting issue? Because if an API has to be updated, the MCP has to be updated anyway. So what do I say to that?
Right, yes, that's correct. But at least it allows the agent to go back and forth and figure out what the requirements are. At the end of the day, ideally, if you're a startup, you have some documentation, and you automatically have an agent or an LLM workflow that reads that documentation and updates the code accordingly. But I agree, it's not something that is fully autonomous. Yeah.
What about security?
Which security specifically?
Yeah. So the question is: are there security issues with MCPs?
So think about it this way. MCPs, depending on the data they give access to, might have different requirements, lower stakes or higher stakes. I'm not an expert on the full range, but it wouldn't surprise me that when you expose an MCP... I think a lot of MCP servers have authentication, so you might actually need a credential to talk to them, just like you would with an API key. That's a good question. I'm not an expert on the security of these systems, but we can look into it.
Any other questions on what we've seen with the agentic workflows, APIs, tools, MCPs, memory? All of that is a work in progress. Even memory is not a solved problem by any means; it's actually pretty hard to get right. Yes.
So you don't strictly need MCP to access the API; you could technically engineer your way to achieving the same thing from the API directly?
Exactly. Exactly. So is MCP about efficiency or about accessing more data? It's about efficiency. Let's say you have a coding agent, and it has an MCP client, and there are multiple MCP servers exposed out there. That agent can communicate very efficiently with them and find what it needs. It's a more efficient process than exposing APIs on that side and specifying how to ping them and what the protocol is. But it's not about the data that is being exposed, because ultimately you control the data that is being exposed.
Depending on how the MCP is built, my guess is you probably expose yourself to other risks, because your MCP server can receive pretty much any input from another LLM, so it has to be robust. But yeah, super.
So let's look at an example of a step-by-step uh, workflow for the travel agent. So let's say the user says, "I want to plan a trip to Paris from December 15th to 20th um, with flights, hotels near the Eiffel Tower, and then an itinerary of must-visit places." That's the task to the travel agent.
Step two, the agent plans the steps. So it says, "I'm going to find flights. Use the flight search API uh, to get options for December 15th. Search hotels, generate recommendations for places to visit, validate preferences, um, budget, etc. Book the trip with the payment processing API."
That was just the planning, by the way. Step three: execute the plan. Use your tools, combine the results, and then do proactive user interaction and booking. It might make a first proposal to the user, ask the user to validate or invalidate it, and then repeat that planning and execution process. And finally, it might update the memory. It might say, "I just learned through this interaction that the user only likes direct flights; next time I'll only propose direct flights." Or it notices the user is fine with three-star or four-star hotels and in fact doesn't want to go above budget, something like that.
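Put together, the loop could look something like this minimal sketch, where every function is a hypothetical stand-in for an LLM or tool call; the shape is plan, execute, confirm, update memory.

```python
# Minimal sketch of the plan -> execute -> confirm -> update-memory loop for
# the travel agent. Every function here is a hypothetical stand-in.

def plan(task: str) -> list[str]:
    # In practice the LLM produces this plan; we hardcode it for illustration.
    return ["find_flights", "search_hotels", "draft_itinerary", "confirm_with_user"]

def execute_step(step: str, context: dict) -> str:
    # Each step would call a tool (flight API, hotel API, ...) or the LLM.
    return f"result of {step} for {context['destination']}"

def run_travel_agent(task: str, memory: dict) -> dict:
    context = {"destination": "Paris", "dates": ("2025-12-15", "2025-12-20"), **memory}
    results = {step: execute_step(step, context) for step in plan(task)}
    # Pretend the user gave feedback during confirmation; write it to memory
    # so the next trip starts from better defaults.
    memory["prefers_direct_flights"] = True
    return results

memory = {}
print(run_travel_agent("Plan a trip to Paris, Dec 15-20", memory))
print(memory)  # {'prefers_direct_flights': True}
```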
Um, so that hopefully makes sense by now on you know, how you might do that. My question for you is uh, how would you know if this works? And if you had such a system running in production, how would you improve it?
Yeah.
So that's an example. So let users rate their experience at the end. Uh, that would be an end-to-end test, right?
You're looking at the user experience through the steps and say how good was it from one to five, let's say. Yeah, it's a good way. And then if you learn that a user says one, how do you improve the workflow?
Okay, so you would go down a tree and say, "Okay, you said one. What was your issue?" And then the user says uh, the prices were too high, let's say, and then you would go back and fix that specific uh, tool or prompt or...
Yeah. Okay. Any other ideas?
Yeah, good. So that's a good insight. Separate the LLM related stuff from the non-LLM related stuff. The deterministic stuff. The deterministic stuff you might be able to fix it, you know, more objectively essentially. Yeah, what else?
So, give me an example of an objective issue that you can notice and how you would fix it versus a subjective issue. Yeah.
If it's the same flight but one option is cheaper and it picks the more expensive one... that's...
Okay so let's say you say there's the same flight but one is cheaper than the other. Let's say it's objectively worse and so you can capture that almost automatically. Yeah.
So you could actually build objective evals that are tracked across your users, and you might run an analysis afterwards on that objective stuff.
We might notice that our agentic AI workflow is bad with pricing; it just doesn't read prices well, because it always gives a more expensive option. Yeah, you're perfectly right. How about the subjective stuff?
Yeah.
Like do you choose a direct or indirect flight if the indirect is a little bit cheaper?
Yeah, good one. Do you choose a direct flight or an indirect flight if the indirect is cheaper but the direct is more comfortable? Um yeah, that's a good one actually. Um so how would you capture that information? Let's say this is used by thousands of users.
Could you feed something in about the user?
Could you feed something in? Yeah, you could feed something in about the user preferences. Well, you could build a dataset that has some of that information. So you build 10 prompts where the user is asking specifically for direct flights, saying "I prefer direct flights because I care about my time," let's say. Then you look at the output, you provide an example of a good output, and you're probably able to capture the performance of your agentic workflow on this specific eval: does it prioritize correctly, is it price-conscious, is it comfort-conscious, essentially.
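Here is a rough sketch of what such an eval set could look like, with run_agent() as a hypothetical stand-in for your real workflow and only two cases written out.

```python
# Sketch of a small, hand-built eval set for one behavior: does the workflow
# respect an explicit "I prefer direct flights" preference? run_agent() is a
# hypothetical stand-in for the actual agentic workflow.

EVAL_CASES = [
    {"prompt": "I prefer direct flights because I care about my time. SFO to JFK Dec 15.",
     "expect_direct": True},
    {"prompt": "Cheapest option SFO to JFK Dec 15, I don't mind layovers.",
     "expect_direct": False},
    # ...in practice you'd write 10+ of these covering price vs. comfort trade-offs
]

def run_agent(prompt: str) -> dict:
    # Placeholder output; your real agent returns its chosen flight here.
    return {"direct": "direct" in prompt.lower()}

def run_eval() -> float:
    correct = 0
    for case in EVAL_CASES:
        output = run_agent(case["prompt"])
        if output["direct"] == case["expect_direct"]:
            correct += 1
    return correct / len(EVAL_CASES)

print(f"preference-following score: {run_eval():.0%}")
```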
What about the tone? Let's say the LLM right now is not very friendly. How would you notice that and how would you fix it?
Yeah.
Test user and like run prompts and see if there's something wrong with...
Okay, have a test user run the prompt and see if there's something wrong with that. Tell me about the last step. How would you notice that something is wrong?
So have a couple of... LLMs evaluate the response and see if it's satisfactory.
Yeah, I agree with your approach. Have LLM judges that evaluate the response against a certain rubric of what politeness looks like. So here in this case you could actually start uh with error analysis. So you start, you have a thousand users and you know you can pull up 20 user interactions and read through it and you might notice at first sight the LLM seems to be very rude. You know, it's just super, super short in its answers and it's not very helpful. Um, you notice that with your error analysis manually.
Then you go to the next stage: you actually put an eval behind it. You say, "I'm going to create a set of LLM judges that are going to look at the user interaction and rate how polite it is, and I'm going to give them a rubric." Then I'm going to swap my LLM: instead of using GPT-4, I'll use Grok; instead of Grok, I'll use Llama. Then I run those three LLMs side by side, give the outputs to my LLM judges, and get my subjective score at the end, to say, "Oh, X model was more polite on average."
Yeah, perfectly right. That's an example of an eval that is very specific and allows you to choose between LLMs. You could also run the same eval but fix the LLM and change the prompt instead: instead of saying "Act like a travel agent," you say "Act like a helpful travel agent," and then you see the influence of that one word on your eval with the LLM judges. Does that make sense? Okay.
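As a sketch of that setup, with the judge and the candidate models as hypothetical stand-ins for real API calls; the structure, a rubric plus a judge plus models run side by side, is what matters.

```python
# Sketch of an LLM-as-judge eval for something subjective like politeness.
# judge() and the candidate callables fake the model calls; in practice each
# would hit a real API with the rubric and the reply.

POLITENESS_RUBRIC = """Rate the assistant reply from 1 (rude, curt) to 5
(warm, helpful). Consider greeting, tone, and whether it offers next steps.
Reply with a single integer."""

def judge(rubric: str, user_msg: str, reply: str) -> int:
    # In reality this calls a strong LLM with the rubric; here we fake a score.
    return 5 if "happy to help" in reply.lower() else 2

CANDIDATE_MODELS = {
    "model_a": lambda msg: "Address changed.",
    "model_b": lambda msg: "Happy to help! I've updated your address. Anything else?",
}

TEST_PROMPTS = ["I need to change my shipping address.", "Where is my order?"]

def compare_models() -> dict:
    scores = {}
    for name, model in CANDIDATE_MODELS.items():
        per_prompt = [judge(POLITENESS_RUBRIC, p, model(p)) for p in TEST_PROMPTS]
        scores[name] = sum(per_prompt) / len(per_prompt)
    return scores

print(compare_models())  # e.g. {'model_a': 2.0, 'model_b': 5.0}
```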
Uh, super. So let's move forward and do a case study with eval and then we're almost done um, for today. Uh, let's say your product manager asks you to build an AI agent for customer support. Okay, where do you start? And here is an example of the user prompt: "I need to change my shipping address for order blah blah blah. I moved to a new address." So where do you start if I'm giving you that project? You know.
Yes.
So do some research, see benchmarks and how different models perform at customer support and then pick a model.
That's what you mean. Yeah. You... It's true. You could do that. What else could you do? Yeah.
Okay. Yeah, I like that. Try to decompose the different tasks that it will need and try to guess which ones will be more of a struggle, which ones should be fuzzy, which one should be deterministic.
Yeah, you're right.
Sit down for a day or two with a customer support agent and see how they do the task... probably decompose the task...
Yeah, similar to what you said. That's what I would recommend as well. You say, "I would sit down with a customer support agent for a day or two and I would decompose the task they're going through." I will ask them where do they struggle, how much time it takes. Yes, that's usually where you want to start with task decomposition.
So let's say we've done that work and we have this list—I'm simplifying—but the customer support agent human typically would extract info, then look up in the database to retrieve the customer record, then check the policy you know, "Are we allowed to update the address or is it a fixed data point?" um, and then draft the response email and send the email. Okay, so we've decomposed that task. Once you've decomposed that task, how do you design your agentic workflow?
Yes. For each step, decide which method you're going to use, and for each task, what resources you're going to use...
Exactly. So to repeat, you're going to look at the decomposition of tasks, get an instinct of what's fuzzy, what's deterministic, and then determine which line is going to be an LLM one-shot, which one will require maybe a RAG, which one will require a tool, which one will require memory. So you will start designing that map. Completely right. That's also what I would recommend.
You might actually uh, draft it and say, "Okay, I take the user prompt um, and the first step of my task decomposition was extract information." That seems to be a vanilla LLM. You can guess that the vanilla LLM would probably be good enough at extracting "the user wants to change address and this is the order number and this is the new address." You probably don't need too much technology there other than the LLM.
Um, the next step it feels like you need a tool because you're actually going to have to look up in the database and also update the address. So that might be a tool and you might have to build a custom tool for the LLM to say, "Let me connect you to that database or let me give you access to that resource with an MCP." Yeah.
After that, you probably need an LLM again to draft the email, but you would paste in the confirmation that "Your address has been updated from X to Y," and then the LLM drafts an answer. And of course, not to forget, you might need a tool to send the email: you might actually need to post something for the email to go out, and then you'll get the output. Does that make sense? So, exactly what you described.
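End to end, the whole workflow could look roughly like this sketch. Every function and the order data are hypothetical; the shape is the point: LLM step, tool step, deterministic policy check, LLM step, and then a send-email tool.

```python
# Minimal end-to-end sketch of the customer-support workflow we just decomposed.
# Every function is a hypothetical stand-in for an LLM call or a tool.

DATABASE = {"12345": {"address": "old address", "status": "processing"}}

def extract_info(user_msg: str) -> dict:
    # Step 1 (LLM): a vanilla model is usually enough for extraction.
    return {"order_id": "12345", "new_address": "221B Baker Street"}

def policy_allows_update(order: dict) -> bool:
    # Step 3 (deterministic): e.g. only unshipped orders can be changed.
    return order["status"] == "processing"

def update_address(order_id: str, new_address: str) -> None:
    # Step 2 (tool): database write.
    DATABASE[order_id]["address"] = new_address

def draft_email(order_id: str, new_address: str) -> str:
    # Step 4 (LLM): drafting, given the pasted confirmation.
    return f"Hi! Your address for order {order_id} is now {new_address}."

def handle_request(user_msg: str) -> str:
    info = extract_info(user_msg)
    order = DATABASE[info["order_id"]]
    if not policy_allows_update(order):
        return "Sorry, this order has already shipped."
    update_address(info["order_id"], info["new_address"])
    email = draft_email(info["order_id"], info["new_address"])
    return email  # Step 5 would hand this to a send-email tool

print(handle_request("I need to change my shipping address for order 12345."))
```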
Okay, now moving to the next step. Once we have decomposed our tasks, we have designed an agentic workflow around it. It took us five minutes; in practice it would take you longer if you're building your startup on that. You want to make sure your task decomposition is accurate and that the workflow design is accurate. And then there's a lot of work you can do on every tool to optimize accuracy, latency, and cost.
But let's say, now we want to know if it works, you know. And I'm going to assume that you have LLM traces. LLM traces are very important. Actually, if you're interviewing with an AI startup, I would recommend you in the interview process to ask them, "Do you have LLM traces?" Because if they don't have LLM traces, it is pretty hard to debug an LLM system, you know, because you don't have visibility on the chain of complex prompts that were called and where the bug is. So it's a basic sort of part of an AI startup stack to have LLM traces.
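If you are wondering what a minimal version of traces looks like, it can be as simple as a decorator that records every model or tool call with its inputs, outputs, and latency. This is just a sketch under an assumed schema, not any particular tracing product.

```python
# Minimal sketch of LLM tracing: every model/tool call gets logged with its
# inputs, outputs, and latency so you can see the whole chain when debugging.
import functools
import json
import time

TRACE = []

def traced(step_name):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.time()
            output = fn(*args, **kwargs)
            TRACE.append({
                "step": step_name,
                "input": repr(args),
                "output": repr(output),
                "latency_s": round(time.time() - start, 4),
            })
            return output
        return wrapper
    return decorator

@traced("extract_info")
def extract_info(msg):
    return {"order_id": "12345"}  # placeholder for an LLM call

@traced("db_lookup")
def db_lookup(order_id):
    return {"status": "processing"}  # placeholder for a tool call

extract_info("change my address for order 12345")
db_lookup("12345")
print(json.dumps(TRACE, indent=2))
```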
So let's assume you have traces. How would you know if your system worked? I'm going to summarize some of the things I heard earlier. You gave us an example of an end-to-end metric: you look at the user satisfaction at the end. You can also do a component-based approach where you look at the tool, the database updates, and you manually do an error analysis and see, "Oh, the tool always forgets to update the record; it just fails at writing, and I'm going to fix that." This is deterministic pretty much.
Or, when it tries to send the email and ping the system that is supposed to send it, it doesn't send it in the right format and so it fails at that point. Again, you could fix that. Or the LLM doesn't do a great job drafting the email; it's not very polite. So you can look at it component by component, which is actually easier to debug than looking at it end to end. You'll probably do a mix of both.
Another way to look at it is what is objective versus what is subjective. For example, an objective issue would be the LLM extracting the wrong order ID: the user said "My order ID is X," and when the LLM actually did the lookup in the database, it used the wrong order ID. This is objectively wrong. You can actually write Python code that checks the alignment between the order ID the user mentioned and the one that was actually used for the database lookup.
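Here is a minimal sketch of that check. It assumes a particular order-ID format and particular trace field names, which you would adapt to your own schema.

```python
# Objective check: did the order ID the user mentioned match the one the
# workflow actually used for the database lookup? Runs over traces; the
# "ORD-" format and the trace fields are assumptions.
import re

def extract_order_ids(text: str) -> set[str]:
    # Assume order IDs look like "ORD-" followed by digits (adjust to yours).
    return set(re.findall(r"ORD-\d+", text))

def order_id_matches(user_message: str, lookup_call: str) -> bool:
    mentioned = extract_order_ids(user_message)
    used = extract_order_ids(lookup_call)
    return bool(mentioned) and used <= mentioned  # every ID used was mentioned

traces = [
    {"user": "Change the address for ORD-12345 please", "lookup": "db.get('ORD-12345')"},
    {"user": "Change the address for ORD-12345 please", "lookup": "db.get('ORD-99999')"},
]
accuracy = sum(order_id_matches(t["user"], t["lookup"]) for t in traces) / len(traces)
print(f"order-ID alignment: {accuracy:.0%}")  # 50% on this toy data
```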
You also have the subjective stuff, which we talked about, where you probably want either human rating or LLM-as-judges; that's very relevant for subjective evals. And finally, you will find yourself having quantitative evals and more qualitative evals. Quantitative would be the percentage of successful address updates, or latency: you could track latency per component and see which one is the slowest. Let's say sending the email takes 5 seconds, which is too long; you would notice that per component or on the full workflow, and then you decide where you're going to optimize latency and how.
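Those quantitative numbers can fall straight out of the same traces. Here is a short sketch, again under an assumed trace schema.

```python
# Per-component latency and success rate computed from traces (assumed schema).
from collections import defaultdict
from statistics import mean

traces = [
    {"step": "extract_info", "latency_s": 0.4, "success": True},
    {"step": "db_update",    "latency_s": 0.9, "success": True},
    {"step": "send_email",   "latency_s": 5.1, "success": False},
    {"step": "send_email",   "latency_s": 4.8, "success": True},
]

by_step = defaultdict(list)
for t in traces:
    by_step[t["step"]].append(t)

for step, rows in by_step.items():
    latency = mean(r["latency_s"] for r in rows)
    success = mean(r["success"] for r in rows)
    print(f"{step:12s}  avg latency {latency:.1f}s  success {success:.0%}")
# send_email jumps out as the slowest component, so that's where you'd optimize.
```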
And then finally, qualitative: you might do some error analysis and look at where the hallucinations are, where the tone mismatches are, whether users are confused and what they're confused by. That would be more qualitative, and typically it takes more white-glove approaches to do.
Okay, so here's what it could look like. I gave you some examples, but you would build evals to determine, objectively and subjectively, component-based and end-to-end, quantitatively and qualitatively, where your LLM is failing and where it's doing well. Does that give you a sense of the type of stuff you could do to improve that agentic workflow?
Super. Well, that was our case study on Evals. We're not going to delve deeper into it, but hopefully it gave you a sense of the type of stuff you can do with LLM judges with, you know, objective, subjective, component-based, end-to-end, etc.
Last section on multi-agent workflows. So you might ask uh, "Hey, why do we need a multi-agent workflow when the workflow already has multiple steps, already calls the LLM multiple times, already gives them tools? Why do we need multiple agents?" And so many people are talking about multi-agent systems online. It's not even a new thing frankly. I mean multi-agent systems have been around for a long time. The main advantage of a multi-agent system is going to be parallelism. It's like, is there something that I wish I would run in parallel, sort of independently, but maybe there are some syncs in the middle? But that's where you want to put a multi-agent system—it's when it's parallel.
The other advantage that some companies um have with multi-agent systems is an agent can be reused. So let's say in a company you have an agent that's been built for design. That agent can be used in the marketing team and it can be used in the product team, you know, and so now you're optimizing an agent which has multiple stakeholders that can communicate with it and benefit from its uh, performance.
Um, actually I'm going to ask you a question and take a few uh, maybe a minute to think about it. Let's say you were uh, building smart home automation for your apartment or your home. What agents would you want to build? Yeah, write it down and then I'm going to ask you in a minute to share some of the agents that you will build. Also, think about how you would put a hierarchy between these agents or how you would organize them or who should communicate with who. Okay. Take a minute for that. Be creative also because I'm gonna ask all of your agents and maybe you have an agent that nobody has thought of.
Okay, let's get started. Who wants to give me a set of agents that you would want for your smart home? Yes.
So the first is a set of agents that track my movements in the house and gather information about my house. Another agent receives that information and adjusts the room temperature... and another tracks energy usage...
Okay, so let me repeat. You have roughly four agents, I think. One that tracks biometrics: where you are in the home, where you're moving, how you're moving, things like that; it sort of knows your location. The second one determines the temperature of the rooms and has the ability to change it. The third one tracks energy efficiency and might give feedback on energy usage, and maybe it has control over the temperature as well, or the gas or the water; it might cut your water at some point... And then you have an orchestrator agent. What exactly is the orchestrator doing?
Instructions.
Okay. Passes instructions. So is that the agent that communicates mainly with the user?
Yep.
Okay. So if I have... I'm coming back home and I'm saying "I want the oven to be preheated," I communicate with the orchestrator and then it would funnel to another agent. Okay, sounds good. Yeah, so that's an example of a, I want to say a hierarchical um, multi-agent system. Um what else? Any other ideas? What would you add to that? Yeah.
Take any minimal action you can do, like entering a room, logging into a computer, or just opening something. You have an agent per action... and then depending on who it is and the context you have...
Oh, I like that. That's a really good one. So let me summarize: you have a security agent that determines whether you can enter or not, and when you enter it understands who you are and gives you a certain set of permissions that might differ depending on whether you're a parent or a kid; you might have access to certain cars and not others, or the kid cannot open the fridge, something like that. Yeah, I like that, that's a good one. And it does feel like it's a complex enough problem that you want a specific workflow tied to it. I agree. What else?
Yes. Continuing on the ambient stuff, you can get more complicated. So energy savings based on the weather outside, whether the blinds stay open... and also something that understands what's in your fridge or not and orders from the grocery store...
Well, that's really good actually. So you mentioned two of them. One is maybe an agent that has access to external APIs so it can understand the weather out there, the wind, the sun, and that has control over certain devices at home, temperature, blinds, things like that, and also understands your preferences. That does feel like a good use case, because you could give all of that to the orchestrator but it might get lost because it's doing too much, so you'd probably split it out. And also these problems are tied together: the outdoor temperature from the weather API might influence how you want the temperature inside, etc.
And then the second one which I also like is you might have an agent that looks at your fridge and what's inside and it might actually have access to the camera in the fridge for example. Um, and know your preferences and also has access to the e-commerce API to order Amazon groceries ahead of time. Um, I agree and maybe the orchestrator will be the communication line with the user but it might communicate with that agent um, in order to get it done. Uh yeah, I like those. So those are all uh, really good examples here.
Here is the list I had um, up there. So climate control, lighting, security, energy management, entertainment, notification agent, alerts about the system updates, energy saving, and orchestrator. So all of them you mentioned actually. Um, and then we didn't talk about the different interaction patterns, but you do have different ways to organize a multi-agent system. Flat, hierarchical. It sounds like this would be hierarchical. I agree. And the reason is UI/UX is I would rather have to only talk to the orchestrator rather than have to go to a specialized application to do something. Like, it feels like the orchestrator could be responsible for that. And so I agree I would probably go for a hierarchical setup here.
But maybe you might also add some connections between other agents like in the flat system where it's all to all. For example, uh with climate control and energy, if you want to connect those two, you might actually allow them to speak with each other. When you allow agents to speak with each other it is basically an MCP protocol by the way. So you treat the agent like a tool, exactly like a tool. Here is how you interact with this agent. Here is what it can tell you. Here is what it needs from you essentially.
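Here is a minimal sketch of that hierarchical setup: the user only talks to the orchestrator, which routes to specialist agents. The keyword routing is just for illustration; in practice the orchestrator would be an LLM deciding where to send the request.

```python
# Sketch of a hierarchical multi-agent setup. Routing here is a simple keyword
# match for illustration; a real orchestrator would ask an LLM to route.

class SpecialistAgent:
    def __init__(self, name, keywords):
        self.name, self.keywords = name, keywords

    def handle(self, request: str) -> str:
        return f"[{self.name}] handling: {request}"

class Orchestrator:
    def __init__(self, agents):
        self.agents = agents

    def route(self, request: str) -> str:
        # The user only talks to the orchestrator; it picks the specialist.
        for agent in self.agents:
            if any(k in request.lower() for k in agent.keywords):
                return agent.handle(request)
        return "[orchestrator] I don't have an agent for that yet."

home = Orchestrator([
    SpecialistAgent("climate", ["temperature", "heat", "cool"]),
    SpecialistAgent("security", ["door", "lock", "camera"]),
    SpecialistAgent("kitchen", ["oven", "fridge", "groceries"]),
])
print(home.route("Preheat the oven, I'm coming home"))
print(home.route("Lock the front door"))
```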
Okay, super. And then, without going into the details, there are advantages to multi-agent workflows versus single agents, such as debugging: it's easier to debug a specialized agent than to debug an entire system. Parallelization as well: it's easier to have things run in parallel, and you can save time. So there are some advantages to doing that, and I leave you with this slide if you want to go deeper.
Super. So we've learned so many techniques to optimize LLMs from prompts to chains to fine-tuning retrieval um, and to multi-agent systems as well. And then just to end on um, a couple of trends I want you to watch. Uh, I think next week is Thanksgiving. Is that it? Is Thanksgiving break? No, the week after. Okay. Well, ahead of the Thanksgiving break. So if you're traveling, you can think about these things.
What's next in AI? I wanted to call out a couple of trends. Ilya Sutskever, one of the OGs of LLMs and an OpenAI co-founder, raised the question of whether we are plateauing or not: are we going to see LLMs in the coming years not improve as fast as we've seen in the past? The feeling in the community has probably been that the last version of GPT did not bring the level of performance people were expecting, although it did make things much easier for consumers because you don't need to interact with different models, it's all under the same hood. So it seems that it's progressing, but whether there's a plateau is unclear.
The way I would think about it is, um, the LLM scaling laws tell us that if we continue to improve compute and energy then LLMs should continue to improve, but at some point it's going to plateau. So what's going to take us to the next step? And it's probably architecture search. Still a lot of LLMs, even if we don't understand what's under the hood, are probably Transformer-based today. But we know that the human brain does not operate the same way. There's just certain things that we do that are much more efficient, much faster, we don't need as much data. So theoretically we have so much to learn in terms of architecture search that we haven't figured out.
It's not a surprise that you see those labs hire so many engineers because it is possible that in the next few years you're going to have thousands of engineers trying to figure out the different engineering hacks and tactics and architectural searches that are going to lead to better models. And one of them suddenly will find the next Transformer and it will reduce by 10x the need for compute and the need for energy.
Um, you know, it's sort of if you've read Isaac Asimov's uh, Foundation series, um, individuals can have an amazing impact on the future because of their decisions. You know, whoever discovered Transformers had a tremendous impact on the direction of AI. I think we're going to see more of that in the coming years where some group of researcher that is iterating fast might discover certain things that would suddenly unlock that plateau and take us to the next step and it's going to continue to improve like that. And so it doesn't surprise me that there's so many companies hiring engineers right now to figure out those hacks and those techniques.
The other set of gains we might see is from multimodality. The way to think about it is that LLMs were first text-based, then we added images, and today models are very good at images and very good at text. It turns out that being good at images and being good at text makes the whole model better: the fact that you're good at understanding a cat image makes you better at writing about a cat as well. Now you add another modality like audio or video, and the whole system gets better; you're better at writing about a cat if you know what a cat sounds like and can also look at a cat in an image. Does that make sense? So we see gains that transfer from one modality to another. And that might culminate in robotics, where all these modalities come together and suddenly the robot is better at running away from a cat because it understands what a cat is, what it sounds like, what it looks like, etc. Does that make sense?
Um, the other one is the multiple methods working in harmony. In the Tuesday lectures, we've seen supervised learning, unsupervised learning, self-supervised learning, reinforcement learning, quality engineering, RAGs, etc. If you look at um, how babies learn, um it is probably a mix of those different approaches. Like a baby um, might have some meta-learning, meaning you know it has some survival instinct that is encoded in the DNA most likely. Um, and that's like the baby's pre-training if you will.
On top of that, the mom or the dad um is pointing at stuff and saying "bad, good, bad, good, good"—supervised learning. On top of that, the baby's falling on the ground and getting hurt and that's a reward signal for reinforcement learning. On top of that, the baby's observing other people doing stuff or other babies, you know, doing stuff—unsupervised learning. You see what I mean? It's... we're probably a mix of all these methods. And um, and I think that's where the trend is going is where those methods that you've seen in CS230 come together in order to build an AI system that learns fast, is low latency, is cheap, energy efficient, and makes the most out of all of these methods.
Finally, and this is especially true at Stanford, you have research going on that you would consider human-centric and some that is non-human-centric. By human-centric, I mean approaches that are modeled after the brain, versus approaches that are not modeled after humans, because it turns out the human body is very limiting. If you only do research on what the human brain looks like, you're probably missing out on compute and energy and other things you can optimize even beyond neuronal connections in the brain. But you can still learn a lot from the human brain. That's why there are professors running labs right now trying to understand how backpropagation would work for humans; in fact, it's likely that we don't use backpropagation at all and only do forward propagation, let's say. This type of research is interesting, and I would encourage you to read about it if you're curious about the direction of AI.
And then finally, one thing that's pretty clear, and I call it out all the time, is the velocity at which things are moving. Part of the reason we're giving you breadth in CS230 is that these methods are changing so fast. I don't want to bother teaching you the seventeenth method for optimizing a RAG, because in two years you're not going to need it. I would rather you think about the breadth of things you want to understand, and then when you need something, you sprint and learn the exact thing you need, because the half-life of these skills is so short. You want to come out of the class with good breadth and the ability to go deep whenever you need to after the class. That's how this class is designed as well. Yeah, that's it for today. Thank you for participating.