CS 194/294-196 (LLM Agents) - Lecture 5, Omar Khattab
Disclaimer: The transcript on this page is for the YouTube video titled "CS 194/294-196 (LLM Agents) - Lecture 5, Omar Khattab" from "Berkeley RDI Center on Decentralization & AI". All rights to the original content belong to their respective owners. This transcript is provided for educational, research, and informational purposes only. This website is not affiliated with or endorsed by the original content creators or platforms.
Watch the original video here: https://www.youtube.com/watch?v=JEMYuzrKLUw
This is work with a very large number of people, over 200 contributors, and many of the slides are adapted from a couple of these folks, Krista and Michael. The context for this lecture is that it has never been easier, as we're all aware, to build really impressive AI demos. One of the best examples of this is none other than ChatGPT: one interface where you can ask just about any question that comes to mind, and not only will the system find answers that suit your question, it will synthesize them on the fly, conversationally. We all take that for granted now, but let's just marvel at how amazing that is. What makes it really cool, and a big step forward compared to before, is that it can also help us with tasks. I have here an example of me asking ChatGPT to help with a piece of code: I'm asking it to parallelize a sequential Python loop I had.
And it does exactly as I asked. At this stage, we're all extremely familiar with the primary weakness of these language models, which is in some sense precisely their key strength: they are so fluent that when they make mistakes, those mistakes can be incredibly hard to detect. Stanford was not founded in 1891, although that's close. Maybe using that example at Berkeley is weird, but anyway. And the whole reason I asked for help with this piece of code was that I wanted to avoid data races. So when I was given code that naively parallelized the loop and had a bug in it, I was glad I didn't trust that model and didn't simply plug the code into my code base. The big picture here is that even though it's incredibly easy to build impressive demos, turning monolithic language models into reliable AI systems remains really hard.
That's what the rest of this talk will focus on. These are not just problems you see in your personal interactions with a chatbot; they have real implications all over the place, and that's just one example of this happening in practice. Now, I'm not saying ChatGPT is bad or language models are bad because they make a mistake or two; every AI system, and everything we'll talk about today, will always make mistakes. What I'm saying is that, fundamentally, the monolithic nature of language models makes them particularly hard to control when we're building systems, hard to debug when they do make mistakes, and hard to improve when we want to iterate on the development of our systems.
Increasingly, the way people are tackling this, as we had in the title of the lecture, is by building compound AI systems. If you haven't heard the term before, it just means modular systems or modular programs in which language models are not serving as the user-facing end-to-end system, but are instead playing specialized, modular roles inside a bigger architecture. A very familiar example, which I'm sure prior talks in this class discussed a lot, is retrieval-augmented systems. Instead of building a system that takes a question and just gives it to your monolithic language model, a black-box deep neural network, and hopes it does the right thing, you might break the task down into smaller pieces: a retrieval model consults a massive corpus using the question as a search query, retrieves the top-k most relevant results, those are fed as context to the language model, and the model is prompted to use that information and cite it in its answer. There are lots of reasons that might be attractive. One of them is transparency. The system might still make mistakes, or it might say the right thing, but either way we can always inspect the trace of the system's behavior, see what it retrieved, and see why it generated the information it did.
That helps us see when it's right, when it says something justified based on a citation that's factual. It also helps us see when it's wrong, for example when it made an incorrect inference based on a relevant piece of information, or when it simply retrieved something irrelevant and extrapolated in a weird way. There's another reason this is really attractive, which is that a system like this can be a lot more efficient. It has more steps, for sure, but now the language model does not necessarily need to know everything about all topics, because we've offloaded that knowledge, and a lot of the step-by-step control flow, either to a knowledge base from which we're retrieving or to a program that is executing these steps. So we've gained a lot from this compositional approach.
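To make that retrieve-then-read pattern concrete, here is a minimal sketch in plain Python. The `search` and `llm` helpers are hypothetical stand-ins for whatever retriever and language model client you use; this is not code from the lecture.

```python
# Minimal retrieval-augmented generation sketch (hypothetical `search` and `llm` helpers).
def rag_answer(question: str, k: int = 5) -> str:
    passages = search(question, k=k)  # retrieval model over a large corpus, returns top-k strings
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (
        "Answer the question using only the context below, citing passages like [1].\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return llm(prompt)  # any LM completion call
```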
Another example of a compound AI system beyond RAG, and something slightly more sophisticated, takes this to the next step. One of the biggest powers of language models is that they are not limited to answering questions they've seen exactly before; they can synthesize or compose information, or at least that's the hope. That's exactly what a compound AI system doing multi-hop reasoning, or multi-hop retrieval-augmented generation, would do. We can take questions and, instead of simply giving them to a dumb search engine, ask our language model itself to act as a module that breaks the question down into little pieces, finds information about each piece, and then hands that back to the language model whose job is to take whatever got retrieved and produce answers that synthesize the information in a more holistic way.
What's really great about compound AI systems, continuing that line of thought, is that we have a lot of control as the people building the system. You're not bottlenecked on the next release of a language model, whether you're the one building it, which is unlikely, or not. You actually have a lot of control over the development of your architecture, and that allows you to iterate much faster and to build systems, like the Baleen system I built near the start of my PhD, that do things your individual language models at the time absolutely couldn't do. Another example takes this yet another step further: could we have these systems generate long reports, like articles with citations, like a Wikipedia page? That's where you might say something like, "Write me an article on the ColBERT retrieval model." A system like STORM from Stanford could then use a lot of modular components at a much finer level of granularity. It could brainstorm, generate outlines, revise those outlines, then ask a model to generate questions about each part of the outline, retrieve sources, and synthesize information in a much more systematic way than a simpler architecture would.
And what's really great here is that because we have this ability to compose better-scoped uses of our language models, we can really iterate on the quality we get out of these systems. It's much easier for a language model to take information that's given to it and synthesize something reasonable than to tackle the bigger problem of, say, remembering facts. So by being able to manage all of these little compositions, we can do a lot more in terms of quality. The final advantage of building these types of systems is inference-time scaling. It's pretty clear at this point that if you intelligently spend more compute at test time, say you have a question and you're trying to leverage the language model to search over a large space of potential paths, that can really help. An example of a compound AI system where this really helps is AlphaCodium, a system targeted at generating code. Instead of simply asking a model to produce a piece of code, it has a series of steps in which it will, for example, reflect on the problem, reflect on the public tests that may exist for a given task, generate various solutions, rank them based on their performance, and then iterate from there. It's not unintuitive that this type of decomposition, which reflects the way you might instruct an intern or a friend to approach a task like this, can help quite a bit, and that's what this type of work shows. This keeps popping up in many, many applications, and in many cases also in task-agnostic form, as methods that scale compute independently of a particular task.
So we talked about how amazing compound AI systems are and how they give us these advantages: quality, control, transparency, efficiency, and inference-time scaling. They're awesome, but the problem is, unfortunately, we're working with highly limited modules. At the end of the day, our language models themselves are extremely sensitive to how you ask them to play the roles of the little pieces you're trying to compose. Because of that, under the hood, the really beautiful diagrams we looked at are typically implemented, and we're all guilty of this, myself included, with thousands of tokens of English or some other natural language, trying to coerce a language model to play the role of each of these modules. We see this across all types of tasks, where we're writing tens of kilobytes of strings in JSON files trying to define our language model systems. If we could just do that and succeed in a way that is highly portable and highly general, maybe it's a fine price to pay, but the problem is that it's not; it's not something that will generalize or systematically allow us to compose these systems.
The problem, at a high level, is that each of the prompts implementing each of the modules we looked at is coupling five different roles. The first role it's trying to cover is that of a signature: I want a function playing this module, and the signature is just a specification of "here are your inputs and what they mean, and here's the transformation I want you to make to give me the outputs." The prompt is also specifying the computation that specializes this signature with some kind of inference-time strategy. You might tell the model, "Hey, please think step by step," or you might tell the model, "Hey, you're an agent, you have these tools, you should call them, and when you call them I'll intercept the call and give you back the response." Or you might say something like, "I'll generate 10 different things and explore all of them with some kind of reward model." In all of these cases, we're expressing a lot of that in the prompt and in the code around the prompt.
The prompt is also coupling, with all of that, the computation that formats all of the inputs we want to give to this function or signature, and it encapsulates the logic for how we want to get several outputs back, parse them, cast them to the right types, and maybe retry if they're not formatted properly, and all that. On top of that, it's expressing the objective: you're not simply describing what the input and output behavior is, you're also trying to encode in your prompt a lot of information about dos and don'ts and what you're trying to maximize, things like "be factual," "don't hallucinate," "don't cite pages that don't exist," or whatever. That's essentially a different notion from simply declaring what the module is doing, as we see in the diagrams. And lastly, for any given language model, as we said, these models are super sensitive and these pipelines are really complex. So you see in practice that people work really hard to coax the model into doing the right thing through a lot of trial and error in English or some other natural language, and maybe there's also fine-tuning of the weights around that if you're doing a more advanced application.
The bigger picture across these five roles is that existing compound AI systems are awesome; they have a lot of potential to be really modular and to solve problems we can't solve with neural networks or language models alone. But the problem is that they're too "stringly typed," if you will. They couple the fundamental architecture of the system design that the developer wants to express, which is nicely the thing we see in the diagrams of compound AI systems, with incidental choices that are super specific to the particular pipeline you have or the language model choice you've made, and those are things that change all the time. You have a pipeline with five pieces; you go in and want to slightly change the objective, or introduce a sixth module, or maybe swap the language model for a cheaper one that just came out and is supposed to be as good. Good luck: your entire sequence of thousands of words of prompts is basically irrelevant at that point. This is something that really blocks adoption in practice, as well as portability of a lot of the cool systems we covered, where people in industry, for example, might approach you and say, "We really love that paper, but it's clear the prompts were tuned and developed with a particular dataset or benchmark in mind that's not our use case. We have no visibility or clue how you arrived at those prompts, so we're practically unable to use the stuff you built."
An argument I'll be making today, and relying on in the rest of this lecture, is that we do know how to iteratively and controllably build systems and improve them in a modular way, and that elusive concept is called programming. A lot of the rest of this talk is about this ambition: what if we could build compound AI systems as programs, just computer code, standard Python, but with fuzzy natural-language functions or natural-language modules that can learn their behavior from data? You're basically writing code and then inserting these little pieces that represent the boxes we saw earlier, the parts exhibiting intelligent behavior. Then you specify an objective and have the system learn behavior around the natural-language specifications you've attached to these modules. If we could do that, we could lose a lot of the problems we saw with existing compound AI systems: you wouldn't have to write these prompts, and you'd be building a general design that can be ported from one language model to another.
In this analogy, portability is like working with hardware. If you write a piece of code in C on a certain CPU architecture and then move to a different CPU architecture, a good compiler takes the same high-level C code and compiles it to potentially very different lower-level machine code on each architecture, making sure it works. The ambition here is that we want to write code at a higher level of abstraction, as independent as possible of the low-level details of how to get a language model to do the thing you want, and then build compilers, or optimizers, to take that and produce specific strategies for working on a given infrastructure, which in this case means a language model or other components. These are some of the systems we saw earlier.
The approach we'll take toward this vision is called DSPy. Someone, I think Alex, guessed "data science Python." It has nothing to do with data science, it's a long story, but it stands for Declarative Self-improving Python. So it's Python, but smarter. The way we're going to do this is by defining this thing called a language model program, which is just a Python program, a Python class or function. That function takes inputs in natural language, maybe a question we want to answer, or a report we want to summarize, or a topic we want to generate an email about. The output is also natural language: the answer, or the report, or the email, or whatever it is, and maybe the output has several such pieces. What makes it interesting as a function is that in the normal course of its execution, it's just loops and exceptions and go-to statements and all the things you shouldn't use. But in the course of that execution, it's calling functions that are fuzzy modules: it's making calls like "generate a search query for me," or "turn this natural language into SQL," or whatever it is. It's making high-level natural-language declarations, which look like prompts but are much more structured and much shorter, because they only express what you need to declare, namely the actual behavior, not how the model should accomplish it.
So we have this function which, in the course of its execution, calls these modules, and each module is defined simply by a declaration of what it should do, what inputs it takes, and what outputs it produces, usually a one-liner in many cases. Here's an example from an actual simple DSPy run I did before the talk. You can say, "I want a fact-checking module." Before that there might be loops and exceptions and commented-out code and all sorts of mess, and then you apply a chain-of-thought strategy over a function signature that says, "Take a list of claims and give me a list of booleans that are the verdicts: are these claims true or not?" I get a module back as a function. I can give it a list of claims, it can think, and then it can tell me, "The first one is true and the second one is false." So far this is basically just the interface of a function declared with natural-language types. The field names here are not special, but the language model can understand them because it understands English, and from a name like "verdicts" it can infer, even before I optimize or train anything, that the first claim is a true statement and the second is false, since Python is not a compiled language in standard usage.
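As a sketch of what that fact-checking module might look like in DSPy (assuming a recent DSPy release and whichever language model you have configured; the claims below are illustrative, not the exact ones from the lecture):

```python
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # any supported model works here

# Declare the module: a chain-of-thought strategy over a typed natural-language signature.
fact_check = dspy.ChainOfThought("claims: list[str] -> verdicts: list[bool]")

result = fact_check(claims=[
    "Python was created by Guido van Rossum.",
    "Python is a compiled language.",
])
print(result.reasoning)  # the model's step-by-step thinking
print(result.verdicts)   # e.g. [True, False]
```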
This defines an optimization problem for us under the hood. You're given the set of modules that your program calls, and for every module our goal is to decide how to take that call and actually call a language model, which means coming up with the string that goes into the model, and that string might be much more sophisticated than, or just very different from, how it looked at the specification level we saw in the signature. The other thing we can control in some cases, although we won't cover much of that today, is the weight settings we want to assign to our language model, in the sense of fine-tuning the model to perform better at our task. So the idea is that we have this function with all these modules, and if you can give me a training set of inputs to that function, and maybe some kind of hint, final label, or other metadata, I should be able to search for the settings of the prompts and the weights such that, on average or in expectation, some metric computed on the program with the modules assigned in this way is maximized. That's the optimization problem we'll be dealing with once someone writes a program in an abstraction that can support this type of behavior, which is DSPy.
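Stated a bit more formally (roughly in the notation of the MIPRO paper), if Φ is the program, Θ collects the prompt and weight settings of its modules, X is a small training set of inputs x with optional metadata or labels m, and μ is the metric, the optimizer is searching for:

```latex
\Phi^{*} \;=\; \arg\max_{\Theta} \;\; \frac{1}{|X|} \sum_{(x,\, m) \in X} \mu\big(\Phi_{\Theta}(x),\, m\big)
```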
Now, the problem is that this is really hard. We don't have, and we're not asking anywhere for, gradients across the system. The system might generate code and execute it. It might call external tools. It might run a calculator. So we don't necessarily know how to optimize a system like this, certainly not directly with gradients. And we can't cheat and try to optimize each module on its own, because we are not, in general, asking for labels for every step of the system. If we were to ask for labels at every step, the whole argument about iterative development and modularity would collapse: I added a module, and now you're asking me to go label some data. The ideal development cycle we want is that you're building a system, you notice a need to add a module of some sort, or you want to experiment, you just inject it, you recompile, you see what happens, and you see whether your system is better or not. So we can't assume the existence of labels for each of these modules. That's a problem, because it's not obvious how to optimize this, and it's one of the things we'll talk about in the rest of this lecture.
Let's take a concrete example so the rest of this develops in a way we can follow. Suppose we wanted to build a simple multi-hop retrieval-augmented generation pipeline; I have it in visual form here, although the DSPy code is basically the same. It's a function that takes a question, which is a string, and outputs an answer, which is also a string. In that function, the last step takes the question we got, plus some context, gives it to a language model, and has it generate our answer. But the question is, how do we get that context? The multi-hop part is that we'll have a loop. Maybe the loop could be smarter, it could decide when to stop or something, but for now we'll just hardcode that it runs twice.
In every iteration, we ask our language model to take whatever context we've built up so far, which is empty at the beginning, and generate a search query; we run that query, dump the results into the context, maybe append them, and then loop again so the system generates queries that seek things we haven't found so far. A question that would come up in this setting is something like, "How many floors are in the castle that David Gregory inherited?", an example I like to use. The first query you might ask is, "Who is David Gregory?" Then you might learn that he's a Scottish figure from the 1600s or so and that he inherited a castle called Kinnairdy Castle. Then you might ask, "How many floors are in Kinnairdy Castle?", given that the original question was trying to get at that. That gives you the multi-hop behavior we're seeing here.
That visual function can be written in DSPy code like this. We're not going to look at too much code, but this is just a module; if you're familiar with something like PyTorch, or deep networks in general, we're borrowing some of the syntax. In this module, we have an initialization method that just declares the sub-modules we have in our compound AI system. We have a chain-of-thought strategy expressed over a signature that says: you take some context, you take a question, and you generate a search query. All of these are strings, and we don't annotate them as strings because that's the default. Then there's another sub-module that takes the same types of inputs and generates an answer. The part of the code that is the actual program logic is here, where the loop we saw on the last slide is expressed; the forward method defines the program logic that we have.
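A minimal sketch of that module in DSPy might look like the following. The `search` function is a hypothetical stand-in for your retriever (for example, a `dspy.Retrieve` call or any search API), and the signatures are the one-line declarations described above.

```python
import dspy

def search(query: str, k: int = 3) -> list[str]:
    """Hypothetical retriever; swap in dspy.Retrieve or any search API."""
    ...

class MultiHop(dspy.Module):
    def __init__(self, num_hops: int = 2):
        super().__init__()
        self.num_hops = num_hops
        # Sub-modules: each is a chain-of-thought strategy over a one-line signature.
        self.generate_query = dspy.ChainOfThought("context, question -> search_query")
        self.generate_answer = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question: str):
        context = []
        for _ in range(self.num_hops):  # hardcoded to two hops, as in the lecture
            query = self.generate_query(context=context, question=question).search_query
            context += search(query)    # append retrieved passages to the context
        return self.generate_answer(context=context, question=question)
```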
We talked about signatures: they tell the system what the module should do, as opposed to working really hard with a certain language model on how it should express that behavior. And the modules, as we said, capture the many strategies for spending compute with a language model; they define a general approach for taking a signature and actually expressing it in terms of language model calls, and those strategies help improve quality in many settings. They're generic strategies we can apply, very similar to layers in neural networks. I'll go on a tangent here, and this is not essential to understanding the rest of the talk, but intuitively you can think of it this way: things like Chain of Thought are normally just conceptual, things you can't actually compose, because Chain of Thought in the original paper, which is super cool, is basically just a bunch of prompts the authors wrote for individual tasks. For answering math questions, you write some examples and ask the model to do stuff; to apply it to a different task, you have to write different prompts. What we're saying here is: why can't we borrow from layers in neural networks, where you give a layer dimensions that describe what types of tensors it accepts and what types of tensors it gives you, and then have that behavior be expressed? In a neural network architecture, you can say, "I want attention that takes the following vectors and gives me these other vectors," and you can have that, or an RNN, or a linear layer, or a convolution, or whatever.
Similarly here, we're asking: why can't we take these inference strategies, make them general in a meta-programming sense, define them over these signatures, and have them express that behavior in a general, composable way? Awesome. So the question is: this is the abstraction, this is how you write the program, but what are we supposed to do so this actually works as a system? It looks pretty to me, maybe not to everybody, but to me this looks really elegant. The question is how we take this and actually give you a system that you could deploy, that you're happy with, and that you can iterate on. Iterating on it is easy, because you can come in and add one more module, add an if statement, throw an exception and catch it, and do all kinds of things.
But the question is really: how do we translate these strategies with signatures into actual prompts under the hood? The very first step is pretty trivial, although it's important to do it right: internally we can translate this into a basic prompt, one that will not necessarily work well, with no guarantees about it, through built-in adapters and predictors. A predictor is just this piece here, another module with some logic inside, and what it fundamentally does is add one more output field that asks for the reasoning. The adapter then takes this and says, "Depending on how you implement it, or which one you choose, I have a model here that's maybe a chat model or maybe an instruct model, and I want to format these fields in a certain way just to kickstart the process." It gives us a basic prompt under the hood that says something like, "Given the fields context and question, give me a query, and here is the format you should follow, because I want to be able to parse what you give me." Nothing about this is going to work particularly well, but it gets us started.
Now, the role of optimizers on top of this, which is a different component with many algorithms we'll discuss, is to take this initial prompt and the whole program, in which there are many such modules and many such prompts, treat them as parameters, as a lot of variables we can tinker with, and figure out how to maximize the metric in the objective from a few slides ago. We might start here, and maybe the system performs at 37% accuracy on a given strict metric. The actual quality might be a bit higher, but let's say we're using a metric that's highly precise with somewhat lower recall. We might then say, "I like the MIPROv2 optimizer in DSPy," much the same way that in a neural network setting you might say, "I like Adam," or "I like RMSProp." You give it some data, you give it the program, you give it a metric, you ask it to do a good job, and it spits out a better prompt, maybe fancier instructions, and some examples, and sticks them into the prompts of one or several of the modules such that quality is substantially higher. So on a certain real task, instead of tweaking 2,000 tokens of prompts that look something like this and, at the end of the day, getting an accuracy of around 33% with a certain OpenAI model, we can explore a much larger design space and get higher scores in interesting ways through these compositions.
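As a sketch of what invoking that optimizer looks like in DSPy (argument names vary across versions; `exact_match` and `trainset` are assumed to be your metric and your list of `dspy.Example` inputs, and `MultiHop` is the program sketched earlier):

```python
from dspy.teleprompt import MIPROv2

# Assumed to exist: MultiHop (the program above), exact_match (a metric), trainset (list of dspy.Example).
optimizer = MIPROv2(metric=exact_match, auto="medium")
optimized_program = optimizer.compile(MultiHop(), trainset=trainset)

# The optimized program is called exactly like the original one, just with better prompts inside.
prediction = optimized_program(question="How many floors are in the castle that David Gregory inherited?")
print(prediction.answer)
```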
Let's look at a results table on the multi-hop question answering task we've been using as a running example. The simplest thing you could do in DSPy, or just when you're thinking about a compound AI system, is the trivial system that's not actually compound: you ask the model to take the input and predict the output, and you don't bother optimizing it. If you do that, and we did this a while back, exactly a year ago, you can get GPT-3.5 to perform at a certain level, and a Llama 2 model is a little bit behind. But you could say, "We know that a decomposition like RAG could help, especially since this is a factual question-answering task." So we could build a RAG system, and the interesting thing is that we could also try running one of the optimizers we'll talk about in a minute. What you start seeing is the cool fact that a small model can already start to perform better than a larger model with a simpler architecture, or even with a sophisticated architecture without optimization, although the large model preserves a lot of its advantage here as well. Then you could iterate on your program and build a multi-hop compound AI system, and there you can see that, with optimization, you can boost quality quite a bit with large models as well as with open models. With more recent models, all of these numbers are substantially higher nowadays.
What's really cool, although we won't be able to spend much time on it, is that the very same program we wrote here, which gives us all of these results across models and across optimization decisions, can also be used with optimizers that update the weights, that do fine-tuning; essentially, a whole space of reinforcement learning opens up in front of us. We can get very small models to imitate the bigger ones: a model under a billion parameters that's many years old, T5 is from 2019, can score competitively, in fact better than closed models that were near the frontier at the time, on the given task.
The question we'll be looking at over the rest of the lecture is: what does the space of optimizers look like, what are the interesting choices we can make, and what works and what doesn't? In general, there are too many DSPy optimizers to discuss today. They vary in how they tune the prompts and weights in a program, but there's a general pattern you can see in many of them, in most of them. The first step, as we saw earlier, is that for every module you just guess an initial prompt, and in DSPy this actually happens outside the optimizer. It happens through what we call the adapter, which takes your signature and makes a mostly deterministic construction, sometimes involving a language model, of what the initial prompt should look like. The next step is rejection sampling: we take the inputs you gave us in your problem specification, maybe raise the temperature of the language model, maybe plug in a large model, or maybe just keep the system simple.
Then we run through your program with these basic prompts to collect trajectories across all of these steps that lead your metric to assign high scores. You've given us this program, a bunch of inputs, and a metric that can assess when things are good or bad. Maybe it just checks whether the answer is correct, or maybe it asks a language model, or even a DSPy program, to evaluate the system against a rubric, or any number of other design choices you're allowed to experiment with. We use those to collect traces, examples of every module's inputs and outputs that, when chained together, lead to high scores under your metric.
We can then use these examples in a bunch of different ways to update the modules of your program. The simplest thing you could think of is: if you have examples of every module that have proven to work in the past, it doesn't mean they're correct, but maybe there's some hope they're useful. The simplest thing to do is stick them into the prompt as demonstrations, as examples that say, "Hey, when I got this input and did this behavior, it seemed to work pretty well in the past." Once you do that, you can start exploring which of the examples we just built, and whether we can be intelligent about this or not, are actually useful for taking this module, plugging it into a program, and getting the program to work better on a bunch of other examples.
Another thing you can do is induce instructions: if you have a lot of examples of a module, you have a better sense of what the module should do, and of when it leads to good behavior and when it doesn't. You can give those to a language model and ask, essentially, "What is the thing we're trying to ask for here?" A thing to keep in mind is that in all of these cases, language models are being used to build these components, but we're not assuming that language models are good at it. A language model will give you 20 different things, and 19 of them are going to suck. But because we can try so many, and we can explore that space at varying levels of intelligence, we can strike gold fairly often and find the combination of pieces that leads to better prompts, in a way that's basically independent of the particular setup you have. So you can adjust your program or your language model choices, rerun the system, and have optimization happen again. And of course, once you've built these examples, if you have enough of them, you can fine-tune the model on them, or do reinforcement learning, depending on how you're sampling, or do preference fine-tuning and other approaches as well.
There are various papers of ours that explore these types of strategies. The two I put in the suggested readings were the one introducing MIPRO and the one introducing a strategy for combining prompt optimization and fine-tuning. The rest of this talk will be about the MIPRO paper.
Before we discuss MIPRO: we've covered a sequence of abstractions and, at a high level, how a set of algorithms we call optimizers take programs written with those abstractions and give you compound AI systems that work well, while letting you express them in a way that's more portable and, I think, a lot more elegant. What's cool is that these things actually work pretty well in practice. A few months ago, University of Toronto researchers participated in the MEDIQA competition against 15 other schools and industry groups, and they won the competition, which was about building question answering systems for medical domains, by a 20-point margin over the next best system. One of the biggest differences between their approach and the others' is that they used DSPy to express their system, and they used DSPy prompt optimizers, in fact an early version of MIPRO, to achieve this result; they won all three settings of the competition. A month later, folks at the University of Maryland, whose lead author also leads a prompting guide with, I think, tens of thousands of users or more, worked on a suicide detection task that I believe he got from industry.
They wrote a really nice paper about it. He worked for 20 hours, and they documented all of the prompting strategies he explored for building this system for suicide detection, which is basically a glorified classifier. In his words, he then applied DSPy in 10 minutes, and it outperformed his best system by 40 to 50%. They have a really nice paper about this from the University of Maryland. And this approach has enabled a lot of state-of-the-art systems: PATH, for training retrieval models, where we optimize the prompts that synthesize the data used to train small models, and I think there are a lot of interesting ideas under the hood there; IReRa, a system for classification with language models when you have something like 50 labeled examples but 10,000 classes, which I thought was impossible, it just didn't make sense for language models, but Karel built this in DSPy and showed that it's possible; and STORM, which generates Wikipedia-style articles from a topic you give it.
Now, having seen some of these examples, let's look more closely at the paper "Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs," by co-first authors Krista Opsahl-Ong and Michael Ryan at Stanford; a lot of the slides I show next are borrowed from Krista and Michael's work on DSPy. The problem setting, just to restate it: we are given a bunch of inputs, examples of a task, so maybe questions, and we are given the language model program that the developer built, of the kind we discussed earlier, just a function with a bunch of modules expressed in natural language. Each module might look like this: take the context and question, give me an answer. I don't know how useful the example is at this point, but "The Victorians is a documentary series written by an author born in what year?" You might want to answer 1950, by looking at the context you received in the earlier loop we have here.
We're also asking for a metric. Some metrics, by nature, require labels: for example, if we know the answer is 1950, evaluation is easy, and note that we don't need labels for the whole pipeline, just the final answer. Other times, the metric is more like, "I don't actually know what the answer is, but I want answers that are grounded in the context that got retrieved, because I trust the context, and if the answer is grounded in it and is relevant, that probably means it's correct." That's a different metric, one that does not need labels. So there's a very large space of possible metrics, but it's really important that you explicitly define what it is you're trying to maximize, and once you've done that, you don't do it as part of your prompts; you do it in a way that's independent of your program, so that you can explore along both axes pretty modularly. That's something important to keep in mind, and the abstraction really forces you to do it. The goal, then, is to produce an optimized language model program. In the optimizers we look at next, we keep the weights frozen; in fact, we'll assume we don't even have access to them. Instead, we optimize the instructions, the descriptions of the task that go to the model, as well as the demonstrations, the few-shot examples in the prompt that show the model, "Here are some inputs, and here's the corresponding output behavior that seemed to work well in the past."
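For concreteness, here's a rough sketch of the two kinds of metrics just described, in the `(example, prediction, trace)` style that DSPy metrics typically use. The label-free version assumes the prediction carries the retrieved `context` along with the `answer`, which is an assumption about how your program returns things:

```python
def exact_match(example, prediction, trace=None):
    # Labeled metric: requires a gold answer on the example (e.g., "1950").
    return prediction.answer.strip().lower() == example.answer.strip().lower()

def grounded_answer(example, prediction, trace=None):
    # Label-free metric: accept the answer only if it appears verbatim in the retrieved context.
    context = " ".join(prediction.context).lower()
    return prediction.answer.strip().lower() in context
```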
Some assumptions we're making: we don't want to assume any access to log probabilities or model weights, because you want to iterate fast and work in natural-language space with high-level APIs. We have a lot of work where we relax this assumption and look into fine-tuning, and we actually have work showing that you really need both to get the best performance in many cases, but if you had to pick one, prompt optimization tends to be more powerful. We also assume no intermediate metrics or labels, as we said earlier, and we want to be very budget-conscious. So we want to minimize two things. We don't want to ask you for a lot of example inputs for your task, because it's hard to create inputs, although technically inputs are the easiest thing to scale: you can build a demo, put it in front of your friends, and then you have a lot of inputs when they ask questions, if your privacy policy allows you to use them. But we don't want to ask for a lot of inputs, so maybe we want 50 or a couple hundred, not tens of thousands. And we don't want to call the language model too many times, because that takes time or is expensive. These are the assumptions, or the constraints, we have for this problem.
There are two types of problems we want to tackle in general. The first is: what is a prompt? It's this long, combinatorial string, and anything goes, it's all valid. So how do you even explore that space, especially if you don't have gradients and you don't want to call the model too many times? It seems basically hopeless. The second problem, which I think is even worse when you combine the two, is that if you have so many modules and you're changing things all over the place, how do you know what is leading to improvement and what's hurting? These pieces interact: you might improve the prompt in one part locally, but it actually hurts overall, because another part made an assumption about the output of the first part, and now you have to account for this kind of blame assignment. I'll cover three methods. This is not exhaustive by any stretch of the imagination, but these three are really good, quite different ways of tackling this problem.
The simplest thing you could do, in my opinion, at least for the value it gives you, is to bootstrap, or self-generate, few-shot examples by running a dumb version of your program to build examples, then plugging them in and searching over that space. The cool thing is that you can iterate: you can take the better program and build even better examples, build ensembles of them, use larger models to build them; it's a very large compositional space. But in its simplest form, you have the program, you take a training input, and you execute the Python. Some of it is just normal Python: return statements, loops, calls to a code executor, or whatever, but some of the calls are modules, so they're special; we track them and trace their inputs and outputs. These modules are generating search queries and generating answers, and if the metric at the end likes the answer and tells me it's good, then that whole trajectory of inputs and outputs seems interesting enough to keep around. That gives us a set of demonstrations for all our modules, and we can simply ask: what if we plug those into the modules of our program and then, in the simplest case, do a random search? Take a subset of these demonstrations, plug them into the respective three or four prompts, run that on a small validation set, maybe try to be smart about how you do that evaluation, then look at the score and try to maximize it. That's, I think, the simplest successful thing you could try.
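DSPy packages this bootstrap-then-random-search recipe as an optimizer. A minimal sketch of using it, with names as they appear in recent DSPy releases and with `exact_match`, `trainset`, and `MultiHop` assumed as before:

```python
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

# Bootstrap demonstrations by running the program, keep trajectories the metric likes,
# then randomly search over which subsets of demos go into which module's prompt.
optimizer = BootstrapFewShotWithRandomSearch(
    metric=exact_match,
    max_bootstrapped_demos=4,    # demos self-generated from successful traces
    max_labeled_demos=4,         # demos taken directly from the trainset, if labeled
    num_candidate_programs=16,   # how many random demo combinations to evaluate
)
compiled = optimizer.compile(MultiHop(), trainset=trainset)
```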
A different approach is to look at the prompt optimization literature, not for language model programs but for single language model calls. Those folks have a lot of rich, concurrent research, but under much stronger assumptions. In the prompt optimization literature, which is of course now morphing closer to this work, though they started pretty separately, you assume you have labels, so you have inputs and outputs, and it's one prompt; that's the whole system, there's no program involved. You also assume there's a little part, like a one-liner somewhere, that you want to search over. The space of exploration methods is very large: people do all sorts of things, from gradient-based exploration to various forms of reinforcement learning and all types of stuff. But one approach that's really cool and really high-level is called Optimization by PROmpting, or OPRO, from DeepMind.
The way that works is that they want to plug a little prefix into the prompt to get the model on the right track. They might start with "think step by step" and end up with something like "take a deep breath and think step by step," and they show that this can actually help some models do a lot better. It's a much smaller scope, but it's an interesting space to explore, and we want to see if we can take this and apply it to language model programs under the weaker assumptions we have. The way OPRO works for a single prompt is: they go to a model and say, "Here is the original instruction; give me 10 variants, or 100 variants," then evaluate all of them on a little custom evaluation set and look at which ones work best. Maybe the top 10 go back to the model: "Here are the top 10 and how well they performed; give me more." The idea is that the model might be smart enough to see patterns between the successful and unsuccessful instructions and do, essentially, a form of mutation on them, which can then be tried again, repeating the process. That's the OPRO approach.
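Here is a rough, framework-free sketch of that loop, not the authors' code: `propose` stands in for an LM call that sees scored instructions and suggests new ones, and `evaluate` scores an instruction on a small dev set.

```python
def opro_style_search(initial_instruction, propose, evaluate, rounds=5, keep=10, per_round=8):
    """Sketch of an OPRO-style loop: evaluate, show the best-scoring instructions
    to the LM, ask for new candidates, and repeat."""
    scored = [(evaluate(initial_instruction), initial_instruction)]
    for _ in range(rounds):
        top = sorted(scored, key=lambda t: t[0], reverse=True)[:keep]
        candidates = propose(top, n=per_round)            # LM sees (score, instruction) pairs
        scored += [(evaluate(c), c) for c in candidates]  # score each new candidate
    return max(scored, key=lambda t: t[0])[1]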
How could you take OPRO and naively apply it to a language model program? The simplest thing, I think, is coordinate ascent. What that means, and we won't spend too much time here, I don't know if the figure is big enough, is this: say you have two modules. The simplest thing you could do is go to the model, generate a lot of proposals, plug them in one at a time while keeping the other modules fixed at their initial instructions, evaluate all 10 resulting programs, see which one worked best, maybe freeze it, and repeat for the next stage, looping through this process of optimization. The problem is that this is horrendously expensive, because every time you're optimizing one module you're running the whole program while freezing the others, and you're making a big greedy assumption that the best choice at any given stage will remain useful once you freeze it and optimize the rest. You're not co-optimizing these pieces. It's very expensive, and it actually doesn't even work that well.
A different approach would be to give up on one assumption, namely explicit credit assignment. You don't try to fix all variables and change one at a time; you just ask the model to generate prompts for every stage all at once and hope the process converges somewhere nice. You go to the model and say, "It's a bit of a harder ask, but here is the sequence of all the modules' prompts, and I want you to propose new prompts for all of them." You do it jointly, which of course risks confusing the model, since it's a much harder task, but with an oracle language model it's strictly more powerful by far. Now, if you want this to work well for a language model program, it's really important that you're not myopic about optimization. You can't just look at a signature that says "generate a query" in isolation; the user didn't say search query, they just said query. Is it a SQL query, a search query for Google, or a query for a retrieval model? These are different, and you're at the mercy of the quality of the proposals you're getting, so you might as well make sure they're contextualized appropriately with respect to the program you have.
That's the notion that you want the proposals to be grounded. In a traditional prompt optimization setting, there is the history of the prior steps: how we got here, what we tried, and how well each attempt performed; and there may be a training set of explicit examples from the user, because the whole system has no missing pieces, it's just input and output. Our intuition for generalizing this to language model programs is that it would be very useful to provide a lot of contextual information about the program setup we have. The first piece is that, because we don't have training input-output labels for every module in general, or for any module in many cases, we can start plugging in the bootstrapped demos we looked at earlier. For all we know they're basically as good, and they seemed to be successful when doing bootstrapped few-shot, so maybe we can just plug them in as the examples we use for constructing these prompts.
That by itself gives you a lot, because you can see what works and potentially what doesn't. Here's an example: a question where you want to generate a search query, along with the reasoning the model generated when it built the search query it produced, which eventually led to a successful trajectory on that question. The other piece is that you want to give the model an understanding of the whole task, so it would be useful if the model building your instructions could see a summary of the dataset we're playing with. Here's an example of that, where the model says something like, "This looks like a multi-hop dataset consisting of factual, trivia-style questions across a wide range of topics," et cetera. You basically have a component that builds up that summary; you can think of it as a map-reduce over the dataset with the language model.
And then an interesting thing you might want to ground your system in is the pipeline itself: could the proposer actually see the whole pipeline, so it understands the role of every module in it? This works by literally inspecting the code you have for the program and giving it to a language model that understands the syntax; it builds a natural-language representation of the program (sketched below) that says, "Hey, we have a program that appears to be designed to answer complex questions by retrieving and processing information from multiple sources. In this case, it's set up for two hops," et cetera, and "this module in the program is responsible for generating a search query," et cetera. You might also try to maximize the diversity of the proposals by sampling a random instruction tip from the standard literature and plugging it into your proposer, things like "Hey, don't be afraid to be creative." All of these we can just take out of the loop; you don't have to write them. We write them once, and then we can reuse them for building these systems.
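A sketch of that code-grounding step, again with a hypothetical `llm` callable, could look like the following; the real proposer is more involved, but the core move is just handing the program's own source to the model.

import inspect

# Sketch: build a natural-language description of the pipeline from its source code.
def describe_program(llm, program_class):
    source = inspect.getsource(program_class)   # the pipeline's own Python source
    return llm("Here is a program composed of language-model modules:\n" + source +
               "\nDescribe what it is designed to do and the role of each module.")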
So we just discussed extending optimization to proposing with language models. We've now covered two general strategies for optimization: one is building examples and searching over them; the second is giving the model as much grounding as possible in the language model program and asking it to propose instructions that we can iterate over by learning from the previous attempts.
And then the last approach here is going to be a little bit different: the goal is to optimize both the instructions and the few-shot examples, but in a way that deals with credit assignment efficiently. The intuition is that language models are not particularly good yet at doing credit assignment themselves, but because we're working with spaces of discrete proposals, lots of instructions and lots of examples that we're building, we can borrow a lot of intuition from the literature on hyperparameter optimization. We can build what is called a surrogate model, a model that is optimized to predict the quality of any possible configuration of our system, and that we can use to sample proposals for the entire system that we then actually test. So MIPRO, the MIPRO optimizer, works in three steps. First, it bootstraps demonstrations of the task; we've explained what that means. Second, it builds candidate instructions using a language model program inside the optimizer that has all the pieces we looked at earlier: the dataset summarizer, the program describer, all these little pieces.
And third, it handles credit assignment by relying on a simple probabilistic model from the hyperparameter optimization literature. So what we have here is a language model program with two modules, and in every module we essentially have two bulky parameters. One is the string describing the task, the instruction, where we get a basic one from the adapter we discussed earlier; the other is the list of input-output examples, which starts empty but which we can learn. The fact that we've done the bootstrapping and the candidate proposal in a grounded way means we can start exploring this discrete space: what is the right combination, across these modules, of instructions and of lists of examples in each module that leads to the highest quality? Of course we can't try all of them, so we rely on a Bayesian optimizer to give us an acquisition function that lets us make good guesses about which combinations to try.
Once we pick a combination, we can plug it into a version of the program and evaluate it on a minibatch of the validation set. So maybe the dataset has 200 examples; we sample 30 of them, get a score, and then go back with that score, say 75% or 50%, and feed it back to update the surrogate model. It's like, "If this is the combination I chose, I think it gets me about 50%, and only 'about' because it depends on the random sample in the minibatch." In future trials we want an acquisition function with a property like, "I would like the most promising improvement from the next proposal." There are of course many such choices you could make, and we repeat this process over trials until we find a really good combination at the end of the day. What's happening here is that this thing gets smarter over time, so the quality of the proposed combinations tends to go up, as opposed to, say, random search, where you're just trying combinations.
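To make that loop concrete, here is a toy version that uses Optuna's TPE sampler as a stand-in for the surrogate model described above; `modules`, `instruction_candidates`, `demo_candidates`, `valset`, `assemble_program`, and `evaluate` are hypothetical placeholders, not names from the actual optimizer.

import random
import optuna

# Sketch: Bayesian search over per-module (instruction, demo-set) choices,
# scoring each sampled configuration on a random minibatch of the validation set.
def objective(trial):
    config = {}
    for m in modules:
        config[m] = (
            trial.suggest_categorical(f"{m}_instruction", list(range(len(instruction_candidates[m])))),
            trial.suggest_categorical(f"{m}_demos", list(range(len(demo_candidates[m])))),
        )
    minibatch = random.sample(valset, 30)                  # e.g. 30 of ~200 validation examples
    return evaluate(assemble_program(config), minibatch)   # the score updates the surrogate

study = optuna.create_study(direction="maximize", sampler=optuna.samplers.TPESampler())
study.optimize(objective, n_trials=50)
best_config = study.best_params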
So how do these things compare? We looked earlier at a slide with results for pretty powerful optimizers versus just using the off-the-shelf thing, and we also looked at results where people wrote the prompts by hand. But it's also instructive to think about ways of benchmarking these different optimizers against each other. That's actually really hard, and it's a place where we need more contributions, because you want tasks that are representative, tasks that are hard and not overfit by given language models, but where you can also build interesting and realistic compound AI systems to optimize and explore. An attempt at this is LangProBe, a language model program benchmark, which has these tasks: multi-hop stuff, classification, inference, whatever. Some of them have two modules, some have four modules, some have one module. So here are some initial results on part of the test set for a bunch of these tasks, where we're looking at optimizing instructions only, through module-level OPRO, which we discussed before, without grounding and with grounding.
And then there's zero-shot MIPRO, which is not allowed to use examples and only uses instructions. The first thing that jumps out at you is that, on average, optimizing instructions compared to using a basic adapter can help quite a bit: a few points here, a bunch of points there, a nice bump over there. Sometimes it overfits to the training set if it's really small and is actually worse, but in general it really does struggle on a lot of these tasks to optimize prompts alone, instructions alone, and it's not really obvious which of these approaches wins. If you look at optimizing with demonstrations, the picture is pretty different. You can already get ten-point jumps, several-point jumps here, several there as well, an overfit here, and it seems that demonstrations are actually quite powerful; here you can get very large gains as well.
And the interesting thing is, when you look under the hood at this process of just random search over the bootstrapped demonstrations the system generated, it's wild how much they vary. You have stuff that's worse than the initial zero-shot approach, but you have stuff on the frontier that's way better. All of these were generated in the same way, by running the program and basically doing rejection sampling, so it is interesting to see how big a difference they can make when you plug them into the program. The last thing is that you can take MIPRO, which does few-shot and instruction optimization together, and get the best results quite often, most of the time, with some nice jumps across various of these tasks. An interesting pattern is that while optimizing examples tends to be the winner, there are cases where you can spot that focusing on instruction optimization leads to more visible improvements. These are basically cases where the task has a conditional pattern: seeing one or two examples doesn't teach you the whole pattern, because there are more cases than you can cover with examples alone, or because the precise threshold or region in which a rule applies is not clear from the examples by themselves.
So, concluding here, and opening for questions more explicitly since they're open anyway, here are some lessons on what you might call natural language programming, which is what this set of abstractions in DSPy allows us to do. A big lesson, not to forget compound AI systems, is that programs can be a lot more accurate, more controllable, and more transparent than using deep neural networks alone, models alone. And you just need the declarative program. You don't need to sit down and write some 10,000-token sequence of five prompts. You can really write ten lines of code, express an objective, pick an optimizer, and run this thing on your favorite language model.
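As a rough sketch of what those ten lines might look like (the module and optimizer names follow current DSPy releases, but exact arguments may differ across versions, and `my_search` and `trainset` are hypothetical placeholders):

import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # any supported model

class MultiHopQA(dspy.Module):
    def __init__(self):
        super().__init__()
        self.gen_query = dspy.ChainOfThought("context, question -> search_query")
        self.answer = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        context = []
        for _ in range(2):  # two hops
            query = self.gen_query(context=context, question=question).search_query
            context.extend(my_search(query))  # hypothetical retriever returning passages
        return self.answer(context=context, question=question)

# Express the objective as a metric, pick an optimizer, and compile.
optimizer = dspy.MIPROv2(metric=dspy.evaluate.answer_exact_match, auto="light")
compiled = optimizer.compile(MultiHopQA(), trainset=trainset)  # trainset: list of dspy.Example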
And the high-level lesson on optimizers is that they can bootstrap examples for prompts, or propose instructions and explore that space. They're pretty bad at any individual proposal on average, but you can run a large-scale search over those proposals that actually finds things that, in many cases, outperform what you can do by hand. The real power, though, is that because you're freed from the choice of how to tune each module, you can explore the space of compound AI systems more directly, where you get a lot of value from iterating on the modules you're building, how they're connected together, and what the exact objective you want to maximize is, which in many cases is not obvious, et cetera. So DSPy makes it possible to approach this kind of natural language programming by yanking out hand-written prompts and giving us the notion of signatures; by throwing away prompting techniques and inference-time strategies that are really fuzzy (what is a Chain of Thought, really?) and giving you actual predictors you can compose as modules, which take your signatures and, in a meta-programming sense, apply the strategy on top; and by throwing away manual prompt engineering and instead working with optimizers that tune the instructions or the weights of a given strategy.
And this is something that's being widely used in production and in open source. Everything is at dspy.ai, including links to all the papers I discussed or didn't discuss. It's used at JetBlue, Databricks, Walmart, VMware, Replit, Haize Labs, Sephora, Moody's, and elsewhere. You can find a nice list of public-facing use cases, people who are fine going on the record, and it's a really great way to learn not just what people are doing with DSPy, but what kinds of compound AI systems are actually getting deployed, who's deploying them, and how they're optimizing them. Many of these folks go on podcasts and describe how they're optimizing their systems for, I don't know, law firms or other things. And there are great folks in the open source community, cool collaborators, who make all of this possible. You can pip install dspy and get started right now.
Just to conclude, some key lessons on optimization in natural language. In many cases, in isolation, nothing seems to beat building good examples of the task automatically, what we call bootstrapping; it's this notion of "show, don't tell." But on tasks with conditional rules that are scoped in hard-to-detect ways, optimizing instructions seems powerful. And I should make it super clear that the biggest, coolest thing in DSPy is that we've isolated signatures from optimizers, from adapters, from metrics, and from inference-time strategies, and you can iterate on any of these four or five things independently and compose with everything else. So for all the programs that exist right now, if you decide to build an RL-based prompt optimizer, or if you decide to introduce a new inference-time strategy as a predictor, everybody can just change one line in their code (see the sketch below), and if you do it well, you can see boosts across the entire set of use cases out there. And there are tricks that make this work that we've discussed here.
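For instance, because optimizers in DSPy share a common compile interface, switching strategies is roughly that one-line change; here `my_metric`, `program`, and `trainset` are placeholders rather than real objects.

import dspy

# Sketch: swapping one optimizer for another without touching the program itself.
optimizer = dspy.BootstrapFewShotWithRandomSearch(metric=my_metric)
# optimizer = dspy.MIPROv2(metric=my_metric, auto="light")   # the one-line swap
compiled = optimizer.compile(program, trainset=trainset)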
All right, so I had this slide, and I like to use this slide, but right before the talk, or the lecture, someone told me that many of the other lectures, and maybe I'm counted among them, proudly, are from large closed labs that don't publish, that are certainly not academics anymore, and that they seem to be leading all the progress and whatnot. So I really want to say that a big goal of DSPy, a meta goal, is to enable open research to again lead AI progress. Open research has a lot of advantages, in terms of why we'd want that to be the case, that aren't in the scope of this talk. But I'll tell you how DSPy really makes a difference in this space, and it's basically to show that progress is really going to come through modularity.
So we're not asking people to figure out how to fund billion-dollar runs in isolation, or to invent ad-hoc tricks that you apply and then, two weeks later, the model changes and they're not really relevant anymore. Instead, we've outlined this space, which I hope I've convinced you of: how do you scope out your programs well? How do you develop general inference-time strategies that act as predictors anyone can apply to their signatures? And how can we devise new optimizers that apply to any of these programs and give us systems that are stronger than the sum of their parts, if you will? Through that, I hope a lot of open research on optimizers, predictors, modules, and whatnot can lead to the type of progress we saw with neural networks, where different people developed attention, transformers, convolutions, and other things in a way that was highly distributed and obviously incredibly successful, as opposed to the way we currently iterate on large language models in a closed fashion.