
LLM Agents MOOC | UC Berkeley CS294-196 Fa24 | Agentic AI Frameworks/AutoGen & Multimodal Assistant

Berkeley RDI Center on Decentralization & AI
Agentic AI Frameworks & AutoGen by Chi Wang (AutoGen-AI); Building a Multimodal Knowledge Assistant by Jerry Liu (LlamaIndex)
Hosts: Chi, Jerry
📅September 25, 2024
⏱️01:04:48
🌐English

Disclaimer: The transcript on this page is for the YouTube video titled "LLM Agents MOOC | UC Berkeley CS294-196 Fa24 | Agentic AI Frameworks/AutoGen & Multimodal Assistant" from "Berkeley RDI Center on Decentralization & AI". All rights to the original content belong to their respective owners. This transcript is provided for educational, research, and informational purposes only. This website is not affiliated with or endorsed by the original content creators or platforms.

Watch the original video here: https://www.youtube.com/watch?v=OOdtmCMSOo4

00:00:01Chi

Okay, so here's the agenda. It's very simple: there are two parts, agentic AI frameworks in general, and then a dive into the AutoGen framework. For both parts, there are two big motivating questions I want to cover. Number one: what are the future AI applications like? Number two: how do we empower every developer to build them?

00:00:22Chi

So, let's start with the first question: what are the future AI applications like? Starting around 2022, we saw the real power of language models and other generative models. With generative AI, people have seen a superior capability for generating content like text and images, much better than the generative techniques of 15 or 20 years ago, when I was doing my PhD studying topic modeling and other older generative modeling techniques.

00:00:56Chi

These days, the new techniques have clearly passed a bar where we can think about more creative applications and higher-level AI capabilities. So what's next? What is the best way to leverage these agentic AI techniques?

00:01:14Chi

Starting early last year, we began thinking about that question, and we made a number of technical bets along the way. The most important bet is that future applications will be agentic: agents can be a new way for humans to interact with the digital world, executing increasingly complex tasks on our behalf and reaching higher and higher levels of complexity.

00:01:42Chi

At that time, many people still had doubts about whether agentic AI was a viable notion, but as time has gone by we have seen more and more confirmation and evidence. For example, earlier this year Berkeley published an article observing a shift from using a single language model to building compound AI systems. That matches one of the observations and technical bets we made.

00:02:17Chi

If we think about examples of agentic AI, personal assistants, autonomous bots, or gaming agents, you'll notice the notion itself is not new. What is new is that, with the power of generative AI techniques and large models, we can build some of those old agentic applications much more easily and make them much more capable.

00:02:47Chi

At the same time, we are also seeing very new, novel agentic applications that go beyond our earlier imagination, such as science agents that can do scientific discovery automatically, web agents that can automate web browsing and web automation tasks, and software agents that can build entire pieces of software from scratch.

00:03:15Chi

I bet you have seen a lot of demos before, but today let me show you a very recent demo I just got this morning from a startup called Zinle. This is an example of building a website that lists models from Hugging Face and lets users download them automatically. After you give the request to the AI, it starts working automatically: analyzing the task, looking for information, and installing the necessary dependencies. It uses a multi-agent framework, with agents in different roles, to finish the task, and it can handle multi-step tasks.

00:03:58Chi

Now we see the AI creating different files. All the reasoning between the agents is done automatically using both language models and tools; the AI agents are talking to each other. Now they're developing the website, and finally they build and compile it.

00:04:44Chi

Okay, here we go. This is a website built by the AI, and it looks like it's functioning. Many models are already listed on the website, and a user can search for a model, export model data, and download the models. It looks like the video is playing slower than normal; I'm not sure if that's an internet issue.

00:05:19Chi

Furthermore, what if you make a mistake? For example, suppose you remove a critical line of code from the files that were created. Let's see what happens. We run the AI again on the same task, and this time, because that critical line of code was removed, we expect to hit an error.

00:05:58Chi

Yes, we see an error message there: a missing script. Let's see what the AI will do. Within just a few steps, the AI corrects the mistake, restores the exact line that was removed, and completes the task. That shows the self-healing and self-recovery ability of AI agents.

00:06:21Chi

I hope that gives you an idea of the kind of promise agentic AI holds. In the future, we may build software in a totally different way.

00:06:32Chi

To summarize the key benefits of agentic AI: number one, agents provide a more natural interface for talking with humans, so you can tell them exactly what you want. In this case, you just tell them you want to build a website with certain requirements, and then you iterate in natural language.

00:06:54Chi

Number two, by giving agents more capability, they can carry out and finish more complex tasks with minimal human supervision, which has tremendous value for automation.

00:07:08Chi

The third point is less talked about, but I want to emphasize it strongly: agentic AI can also be a very useful software architecture that lets us build software in a totally different way, with multiple agents working with each other to finish much more complex tasks in a recursive way.

00:07:31Chi

I will use a simple example to go over these benefits. Let's look at one particular application about solving supply-chain optimization in the cloud. This application was built by my former colleague at Microsoft Research and allows users like coffee shop owners to answer questions like, "What if we change the shipping constraints? How would that affect my operating cost?" This is a difficult question they can't get answered by something like ChatGPT, because the AI needs to understand user-specific data and constraints, and even use optimization tools, to understand the question and get the answer.

00:08:13Chi

But in this case, they use AutoGen to build three agents that can solve this question very nicely. The three agents are Commander, Writer, and Safeguard. So let's see how that works.

00:08:27Chi

Initially, the user sends the question, "What if we prohibit shipping from Supplier 1 to Retailer 3?" to the Commander. The Commander receives the question and, before trying to answer it, holds the current conversation and initiates a nested chat between the Commander and the Writer. The Writer, which uses a large language model as its backend, tries to understand the question and proposes code that can be used to answer it. In this case, it's just some Python code that adds a specific constraint to an optimization problem; the Writer returns that code to the Commander.

00:09:05Chi

The Commander sees that code, again holds the conversation with the Writer, and starts another nested chat between the Commander and the Safeguard. The Safeguard is another agent that uses a language model to check whether the code is safe. In this case, the code turns out to be safe, so the Commander finishes the conversation with the Safeguard and starts executing the code.

00:09:29Chi

Here's the code execution result: it runs the optimization program with the updated constraint and checks the new result. The Commander then goes back to the previous conversation with the Writer and sends the execution result to the Writer. After the Writer sees that, it comes up with the final answer.

00:09:51Chi

This is a simple example of using multiple agents and multiple steps to finish a task. In general, all sorts of problems can arise: the code can be unsafe, in which case the Commander will not execute it; and if there's an execution error, the Commander returns the error to the Writer, and the Writer needs to rewrite the code. So in general there can be multiple turns back and forth to finish a complex task. Here we're just showing the simplest case, a smooth trace.

00:10:23Chi

And notice that the end user still has a relatively simple experience; they don't need to know there are multiple agents running on the back end. They just ask a simple question and get an answer back in natural language. The user doesn't need any knowledge of coding or optimization. So this is an example of how using multiple agents allows the end user to achieve tasks that are much harder than before.

00:10:53Chi

Now, from the programming perspective, how do they construct such a program? The way they do it is also very different from traditional programming. There are several steps. The first step is to create the agents: in this case, the Writer, Safeguard, Commander, and user, using one line each. Then they define the interaction patterns among these agents. Here, they register two nested chats with the Commander. For each chat, we define which agents are involved, who the sender and receiver are, and what triggers that conversation.

00:11:31Chi

With their behavior patterns defined this way, we finally just initiate the chat from the user proxy agent to the Commander agent by sending the initial task requirements, and every other step follows automatically. The framework handles all the steps and nested chats and eventually returns the result.
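As a rough illustration, here is a minimal sketch of that two-step pattern in an AutoGen 0.2-style API. The agent names, system messages, safety-check prompt, and the way the two nested chats are wired are my illustrative assumptions rather than the original application's code, and it assumes an OpenAI API key is available in the environment.

```python
# Minimal sketch (AutoGen 0.2-style API); prompts and wiring are illustrative assumptions.
from autogen import AssistantAgent, UserProxyAgent

llm_config = {"model": "gpt-4"}  # assumes OPENAI_API_KEY is set

# Step 1: create the agents, one line (or a few) each.
writer = AssistantAgent(
    "writer",
    system_message="Write Python code that modifies the optimization model to answer the user's what-if question.",
    llm_config=llm_config,
)
safeguard = AssistantAgent(
    "safeguard",
    system_message="Inspect the proposed code and reply SAFE or DANGER.",
    llm_config=llm_config,
)
commander = UserProxyAgent(
    "commander",
    human_input_mode="NEVER",
    code_execution_config={"work_dir": "coding", "use_docker": False},
)
user = UserProxyAgent("user", human_input_mode="ALWAYS", code_execution_config=False)

# Step 2: define interaction patterns by registering nested chats on the Commander.
# When the Writer sends code to the Commander, the Commander first consults the Safeguard.
commander.register_nested_chats(
    [{"recipient": safeguard, "message": "Is this code safe to execute?", "max_turns": 1}],
    trigger=writer,
)
# When the user sends a question to the Commander, the Commander asks the Writer for code.
commander.register_nested_chats(
    [{"recipient": writer, "message": "Please write code to answer the question.", "max_turns": 2}],
    trigger=user,
)

# Step 3: initiate the chat with the initial task; the framework drives the rest.
user.initiate_chat(commander, message="What if we prohibit shipping from Supplier 1 to Retailer 3?")
```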

00:12:00Chi

I hope that gives you an intuitive idea of what agentic programming means. I'll use two slides to summarize its benefits. Number one, multiple agents let us handle more complex tasks and improve response quality, for several reasons. Agents talking to each other is a very natural way to improve results through interaction. We can divide and conquer, decomposing a complex task into smaller ones and getting higher-quality results at each step. And we can define agents that don't necessarily depend on language models: special-purpose agents that perform grounding or validation using knowledge outside the models, so we can address weaknesses inherent in the models.

00:12:59Chi

The chart on the right shows an experiment comparing two setups: decomposing the task into two agents, Writer and Safeguard, versus putting all the instructions into a single agent, and measuring performance on the Safeguard side. For GPT-4, the multi-agent setup has 20% higher recall than the single-agent setup, and for GPT-3.5 the difference is even larger. This indicates that in certain scenarios it's beneficial to decompose tasks and have each agent perform a relatively simple job, rather than asking one agent to do too much in one try. It depends especially on task complexity and model capacity: in general, the more complex the task and the weaker the model, the stronger the need for a multi-agent workflow.

00:14:00Chi

Another perspective is the programming perspective. In general, a modular design makes a system easier to understand, maintain, and extend. For example, if you want to keep most of the agents unchanged and only change the Safeguard's behavior, say switching from a language model to tools or to a human, that's very easy to do with a modular design. It also allows natural human participation: a human can take over any agent in the system and have a natural conversation with the other agents, with no need to change how humans interact with them.

00:14:43Chi

Combining these benefits, agentic programming has promising potential to enable fast, creative experimentation and novel applications. But it's not easy to design a framework that delivers on all these promises. In general, we need to consider several factors: a unified agentic abstraction that covers all the different types of entities, support for the flexible needs of multi-agent orchestration across different applications, and effective implementations of all the agentic design patterns.

00:15:29Chi

Let me use the next few slides to explain them. First, for agentic abstraction, we want to unify the notion so that we don't have a hard time reasoning about all these different types of entities, such as humans in different roles, tools, or language models from different providers. All of these are needed for building a compound AI system, but having a single concept to think about them makes it much easier to reason about them. Let me explain more in the rest of the talk.

00:16:10Chi

One immediate benefit is that we can then use multi-agent orchestration to build genuinely complex compound AI systems. But to do that, we also need to think about the different requirements of different agent interaction patterns. For example, sometimes the developer wants a more static workflow, where they clearly define each step and the order of agents for the task. In other cases, it's hard to know all the possible situations an agent needs to handle, so we want to give agents more autonomy and enable them to create more dynamic workflows.

00:16:52Chi

Similarly, sometimes we want to use natural language to tell agents how to perform a task, and at other times we want to use a programming language for more precise control. So there's a tradeoff between flexibility and controllability, and there are many other tradeoffs to consider. For example, when do we share context among agents and when do we isolate them, say in a hierarchical setup? When do we want agents to cooperate, and when do we want them to compete to finish a task better? There are also considerations of centralized versus decentralized, and automation versus intervention.

00:17:39Chi

A good framework should be able to accommodate all these different requirements. And as people develop more and more agents, all sorts of good design patterns are emerging, so we also need to check whether a framework can meet these design-pattern requirements. There's conversation: we want agents to handle flexible conversations. There are prompting and reasoning techniques like ReAct, reflection, Chain of Thought, Tree of Thoughts, and more emerging reasoning techniques such as Monte Carlo tree search-based methods. Tool use is a very important agent design pattern, and so are planning and integrating multiple models, modalities, and memories.

00:18:32Chi

That looks like a lot, right? It's hard for a framework to consider all these different factors from the beginning. But if you think about it from first principles, you may wonder: is it possible to start from a single design point and derive all the other patterns from it? So, among all these different patterns, can you think of one single design pattern that could accommodate all of the others?

00:19:08Chi

I believe different people will have different ideas and start from different points; you could pick any one of them. Here I want to mention one personal design point of view: conversation. This has a long history. Back in college, I learned that conversation is an important way of making progress in learning and in proving theorems, and that stuck with me. When I saw the power of ChatGPT, I immediately related it back to that experience and decided to try using multi-agent conversation as the central mechanism and see how far we could go.

00:19:51Chi

It turns out we can go very far. Here I'm listing some examples of agentic AI frameworks, like AutoGen. As I mentioned, it's based on multi-agent conversation programming, and it turns out to be very comprehensive and flexible, able to integrate with all the design patterns I mentioned and with other frameworks. You will hear more from Jerry about LlamaIndex. There are also LangChain-based frameworks such as LangGraph and CrewAI; they start from different design points. LangGraph, for example, focuses on providing graph-based control flows, and CrewAI focuses on providing high-level, static agent task workflows.

00:20:38Chi

So that's a very brief overview of agent frameworks in general. Next, I will go into AutoGen specifically and explain more about it.

00:20:53Chi

AutoGen is a programming framework for agentic AI. It was initially developed inside another open-source project called FLAML, which is for automated machine learning and hyperparameter tuning. Later we spun it off into a standalone repo, and this year we also created a standalone GitHub organization with an open governance structure. So everyone is welcome to join and contribute.

00:21:20Chi

Here's a brief history of the journey. Early last year, we started prototyping the framework; initially, we just built a multi-agent conversation framework with code execution capability. In August, we published the first version of the research paper. In October, we moved it to a standalone GitHub repo, and it got a lot of recognition from the community. This year, we also received a best paper award at the ICLR 2024 LLM Agents Workshop. We're seeing more and more interesting use cases from the community, including both enterprises and startups, and we recently offered an online course on DeepLearning.AI about AI agent development with AutoGen.

00:22:12Chi

That's just a very brief history. In general, AutoGen has very rich features covering all the different design patterns, but today I mainly want to focus on the most essential concept, because it is the key to understanding all the other patterns. We want any developer to be able to reason about a very complex workflow in just two steps: first, define the agents; second, get them to talk.

00:22:43Chi

Here we see two key concepts: the conversable agent and conversation programming. The agent concept in AutoGen is very generic and can abstract many different entities. You can use a language model as the backend for an agent, or a tool, or human input, and you can also mix multiple types of entities together.

00:23:10Chi

Furthermore, if you use the multi-agent conversation patterns shown on the right-hand side, you can build agents that contain other agents inside them, have inner conversations, and then wrap them up behind a single interface that talks to other agents. In this way, you can nest multiple chats inside an agent and build more and more complex agents in a recursive way.

00:23:42Chi

Examples of conversation patterns include sequential chats, nested chats, and group chats; I will explain a few of them in the next few slides. The simplest one is a two-agent conversation. Even though it is a very simple pattern, it already allows some advanced reasoning such as reflection. For example, we can construct a writer agent for writing a blog post and a critic agent to make suggestions about the post, and have these two agents iterate to improve the quality of the blog post.
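As a rough sketch of this two-agent reflection pattern in an AutoGen 0.2-style API (the prompts, model choice, and number of turns are illustrative assumptions):

```python
# Minimal writer/critic reflection sketch (AutoGen 0.2-style API).
from autogen import AssistantAgent

llm_config = {"model": "gpt-4"}  # assumes OPENAI_API_KEY is set

writer = AssistantAgent(
    "writer",
    system_message="You write concise blog posts and revise them based on any feedback you receive.",
    llm_config=llm_config,
)
critic = AssistantAgent(
    "critic",
    system_message="You review blog posts and give concrete, actionable suggestions for improvement.",
    llm_config=llm_config,
)

# The critic starts the chat; the two agents then alternate, refining the post each round.
result = critic.initiate_chat(
    writer,
    message="Write a short blog post about agentic AI frameworks.",
    max_turns=3,
)
print(result.summary)
```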

00:24:18Chi

If that's not good enough, we can add more advanced reasoning using nested chat. For example, instead of using a single language model call to do the critique, the critic can use a nested chat. Inside it, we construct sequential chats with multiple steps: we first send the message to an SEO reviewer, then to a legal reviewer, an ethics reviewer, and so on. Finally, a meta-reviewer summarizes all the review comments from the different areas and gives the final comment. From the writer's point of view, it's still talking to a single critic agent, but underneath, the critic uses multiple other agents. That's how nested chat can extend an agent's capability using conversations.

00:25:17Chi

Now think about tool use. How do we enable tool use through conversations? Here's an example of building a game of conversational chess. We want the AIs to be able to play chess while chit-chatting and making fun of each other. If you just ask two language model agents to play chess directly, they often make mistakes and play random, illegal moves on the board, so the game isn't watchable at all.

00:25:47Chi

The solution is to add a third agent, the chessboard. This board agent is a tool-based agent: it uses a Python library to manage the chessboard and provides the tools to the other language model agents. The language model agents can then only make legal moves; otherwise, they have to iteratively refine their moves until they are legal. That way, the game can carry on nicely. Again, we're using a nested chat between a tool-proposal agent backed by a language model and a tool-execution agent backed by a tool, and having them talk to each other to complete the tool execution.
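Here is a minimal tool-use sketch in the same spirit (AutoGen 0.2-style API). The make_move function and its toy legality check are simplified assumptions; the actual conversational-chess example manages real board state with a chess library.

```python
# Minimal tool-use sketch: an LLM agent proposes tool calls, a proxy agent executes them.
from autogen import AssistantAgent, UserProxyAgent, register_function

llm_config = {"model": "gpt-4"}  # assumes OPENAI_API_KEY is set

player = AssistantAgent(
    "player_white",
    system_message="Play chess by calling make_move with a move in UCI format, e.g. e2e4.",
    llm_config=llm_config,
)
board_proxy = UserProxyAgent("board_proxy", human_input_mode="NEVER", code_execution_config=False)

LEGAL_MOVES = {"e2e4", "d2d4", "g1f3"}  # toy stand-in for real board state

def make_move(move: str) -> str:
    """Apply a move if it is legal; otherwise ask the player to retry."""
    if move in LEGAL_MOVES:
        return f"Move {move} accepted."
    return f"Move {move} is illegal; please propose a legal move."

# The language-model agent is the caller (proposes the call); the proxy agent is the executor.
register_function(
    make_move,
    caller=player,
    executor=board_proxy,
    name="make_move",
    description="Make a chess move in UCI format.",
)

board_proxy.initiate_chat(player, message="Make your opening move.", max_turns=2)
```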

00:26:31Chi

There can also be more complex workflow patterns for solving harder tasks, using planning or a more dynamic pattern like group chat. In a group chat, users only need to define agents in different roles and put them into the chat; the system automatically decides which agent speaks next depending on the current progress of the task. A group chat manager monitors the progress and selects the speakers.

00:27:04Chi

Furthermore, you can add constraints or finite-state-machine transition logic about the order the agents should follow. You don't have to impose a strict order; you can just give candidates, for example, after agent A speaks, only agents B and C may speak. There's still some decision left to the language model and some autonomy, but the constraints ensure the selection stays within scope. You can further add transition logic telling the agents, when you see certain situations, go this route, otherwise go the other way, and you can specify that in either natural language or a programming language.
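A minimal sketch of a group chat with constrained speaker transitions (AutoGen 0.2-style API); the roles and the transition map are illustrative assumptions.

```python
# Minimal group-chat sketch with a constrained speaker-transition graph.
from autogen import AssistantAgent, GroupChat, GroupChatManager

llm_config = {"model": "gpt-4"}  # assumes OPENAI_API_KEY is set

planner = AssistantAgent("planner", system_message="Break the task into steps.", llm_config=llm_config)
coder = AssistantAgent("coder", system_message="Write code for the current step.", llm_config=llm_config)
reviewer = AssistantAgent("reviewer", system_message="Review the code and point out problems.", llm_config=llm_config)

# After the planner speaks, only the coder may speak; after the coder, only the reviewer; and so on.
allowed_transitions = {
    planner: [coder],
    coder: [reviewer],
    reviewer: [planner, coder],
}

group_chat = GroupChat(
    agents=[planner, coder, reviewer],
    messages=[],
    max_round=8,
    allowed_or_disallowed_speaker_transitions=allowed_transitions,
    speaker_transitions_type="allowed",
)
manager = GroupChatManager(groupchat=group_chat, llm_config=llm_config)

planner.initiate_chat(manager, message="Build a small script that lists models from Hugging Face.")
```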

00:27:55Chi

In general, there are many other types of conversations and applications enabled by these conversation patterns. Feel free to check our website for more notebook examples; they are nicely tagged so you can search for anything. For example, search for how to integrate AutoGen with LlamaIndex, and you'll find notebook examples for that.

00:28:17Chi

In the paper, we only showed a few very simple examples using simple conversation patterns, but in general you can build much more complex applications using these building blocks. I've seen developers building all sorts of complex applications.

00:28:34Chi

Here's an overview of the categories of domains we are seeing from the community. The top two categories are software development and agent platforms, followed by research, data processing, and gaming, with a long tail that includes web browsing tasks, finance, healthcare, education, and even blockchain, which isn't shown in the list.

00:28:59Chi

Here I want to highlight a few interesting examples. The first is in the science and engineering domain. Professor Markus Buehler from MIT has done multiple projects in different science domains, including protein design and material design, using AutoGen to build teams of agents that simulate the behavior of a scientific or engineering team, all the way from collecting data and generating hypotheses to conducting experiments and verifying the hypotheses.

00:29:34Chi

They also built a very recent work called SciAgents, which uses a knowledge graph to reason about a scientific domain, make interesting connections, and do advanced reasoning over that graph. They construct a team of agents in multiple roles that starts from understanding an ontology and then uses multiple scientist and critic agents to complete a very complex workflow, including generating different concepts and candidate hypotheses and then selecting the best of them. This is a very promising use case that I think has the potential to accelerate scientific discovery. Maybe soon we'll have AI-designed medicines, AI-designed architectures, or interesting new materials.

00:30:29Chi

Here's another domain: web agents. This is an agent called Agent-E, developed by a startup called Emergence AI. They use AutoGen to build a hierarchical agent team that can perform very complex tasks on the web, like automatically booking flight tickets or filling in forms for medical clinics. It leverages a planning agent and a web-browsing agent, with a deeper understanding of the content from the HTML DOM. They haven't used any multimodal models, so they only leverage the HTML content rather than images and vision models, yet they already achieved state-of-the-art performance on the WebVoyager benchmark and outperformed previous techniques that use multimodal models.

00:31:32Chi

There's still a lot of room for improvement; the overall success rate is only about 73%. By combining multimodal models with more agentic workflows, there's potential to push it even higher. But there are some very good foundational design principles we can learn from that work.

00:31:52Chi

I also want to share a very recent quote from a company in construction. They're trying to use AutoGen to help users without expert knowledge finish their construction projects, and they cited AutoGen's benefit as being able to rapidly explore many different agentic design patterns and configurations and to conduct experiments at scale.

00:32:25Chi

To summarize, we have seen strong enterprise customer interest from pretty much every vertical domain, and we have contributors and users from universities, organizations, and companies all over the world, including contributors from Berkeley. There's a project called MemGPT created by Berkeley students, and it also has an AutoGen integration. Very nice work.

00:32:52Chi

There's still a lot of change and progress happening in AutoGen. Here are just a few examples in several categories, including evaluation, interface, learning/teaching, and optimization. In the evaluation category, we're building agent-based evaluation tools to help developers understand how effective their application is and how good their agents are. It's a really difficult task because a lot of text is generated and it's hard to tell exactly what's going on. But using agents, we can automatically come up with success criteria based on the application and task and then suggest scores for each dimension.

00:33:43Chi

Furthermore, we can extend that idea to improve agents over time by feeding the evaluation results back to the agents themselves, enabling agent-based optimization, learning, and teaching capabilities.

00:33:58Chi

A centerpiece of AutoGen is still improving the programming interface to make it even easier to build all sorts of agentic applications. I want to talk about one particular piece of research that excites me, called AutoBuild. One remaining question for many developers is: what is the most effective multi-agent workflow for my task, and which agents should I create? AutoBuild is an initial attempt to address that.

00:34:35Chi

It works this way: the user first provides a task describing the high-level requirements, and the system automatically suggests agents in different roles. These agents can be put together in a group chat to solve the task. For new, more specific tasks, we can reuse the created team without users needing to specify which agents to use.

00:35:04Chi

We can further extend that idea from a static agent team to an adaptive one, with a technique called AdaptiveBuild. In this case, we first decompose a complex task into smaller steps, and for each step we propose a specific agent team: we can choose from an existing library of agents or create new agents that didn't exist before. After finishing one step, we check what new agents are needed for the next step, so the agents are connected dynamically. As we create more agents, we add them back to the library to improve the overall system over time.

00:35:47Chi

We ran experiments on several benchmarks, including math, programming, and data analytics, and found very promising results that outperform previous techniques of a similar kind.

00:36:02Chi

This is just one example of the research we're doing. There are many more challenging questions that we need the community to work on together. The biggest is how to design an optimal multi-agent workflow for any application, considering factors like quality, monetary cost, latency, and manual effort. In general, we still want to improve agent capabilities in reasoning, planning, multimodality, and learning. We also want to ensure scalability and make sure humans have good ways to guide agents and keep them safe.

00:36:45Chi

That's the end of my lecture. I want to acknowledge all the open-source contributors. You can find our Discord community, which is very large, and the new GitHub organization. I'm happy to follow up with questions. Thank you very much.

00:37:03Jerry

Hey everyone. That was a great talk by Chi on AutoGen, and I hope to build on it. This talk is a bit less general; rather than covering lots of different agent architectures, it's about a specific use case: building a multimodal knowledge assistant. It focuses on some principles around RAG and how you extend them to build something like a research agent, a use case we've been exploring pretty deeply as a company. Of course, there's plenty of agent work out there; definitely check out AutoGen, check out LlamaIndex, try building things on your own, and see what you come up with.

00:37:46Jerry

So let's get started. First, if you're not familiar with LlamaIndex, LlamaIndex is a company that helps any developer build context-augmented LLM applications, from prototype to production. We have an open-source toolkit, a developer toolkit for building agents over your data and other types of LLM applications. These days a lot of people are building agents. We started off helping people build RAG systems, and now we're moving into territory where people build more advanced things, using LLMs for multi-step reasoning rather than just single-shot synthesis and prompt-based generation.

00:38:27Jerry

We also have an enterprise product, which I probably won't talk about too much today, but it's basically like a managed service to help you, you know, offload your data indexing and RAG and all that stuff. I'll talk a little bit about one specific piece of this, which is document parsing or, you know, data parsing. And we think this is like a pretty important piece in any sort of like context-augmented pipeline.

00:38:49Jerry

So let's get started. How many of you know what RAG is? Yes. One part of RAG is that you have a database of some knowledge. The way it works is that you take your text, chunk it, embed the chunked context, and put it into, for instance, a vector database; then you do retrieval from that database to return the relevant text. We'll talk a little more about that as well.

00:39:17Jerry

The overall goal of building a knowledge assistant is that a lot of companies have these types of use cases: they have a lot of data, a million PDFs, a bunch of PowerPoint presentations, Excel files, and you want to build an interface that takes some task as input and gives back an output. That's really it. If you think about it, that's basically a chatbot.

00:39:38Jerry

So you have a lot of data, you want your LLM to understand that data, and then you want the LLM to do things with it. An example could be generating a short answer. It could also be generating a structured output or a research report. It could take actions for you: send an email, schedule a calendar meeting, write code, and so on.

00:40:00Jerry

We talk a lot about RAG as a company, and especially if you're just getting started and follow a RAG 101 tutorial, what you end up with is what we call basic RAG. So what is basic RAG? You take your unstructured data, load it with a standard document parser, and chunk it into slices of every thousand tokens or so. You then feed each slice into an embedding model, like OpenAI embeddings, and put it into a vector store.

00:40:35Jerry

Then when you do retrieval from this database, you typically do semantic or vector search to return the most relevant items from the knowledge base and stuff them into the LLM prompt window. This entire pipeline is what I call a basic RAG pipeline. It works reasonably well for answering basic questions over that data.
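As a point of reference, here is a minimal basic-RAG sketch using the LlamaIndex core API; the data directory and question are placeholders, and it assumes an OpenAI API key is configured for the embedding model and LLM.

```python
# Minimal basic-RAG sketch with LlamaIndex.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("./data").load_data()    # load and parse the files
index = VectorStoreIndex.from_documents(documents)         # chunk, embed, and index
query_engine = index.as_query_engine(similarity_top_k=3)   # retrieve top-k chunks per query

response = query_engine.query("What does the report say about Q3 revenue?")
print(response)
```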

00:40:59Jerry

However, there are a bunch of limitations; we list four here. The first is that the data processing layer is pretty primitive. When you chunk, you're not really taking into account the different elements within the data, such as tables, images, or unusual sections, and you want to keep semantically related content together in the same chunk.

00:41:21Jerry

Second, you're only using the LLM for synthesis, not for any reasoning or planning. If you think about everything Chi just covered on agentic coordination, you're doing none of that with a basic RAG pipeline. That's a waste of LLM capabilities, especially with the latest models like GPT-4o and Claude 3.5, which can do much more than summarize a piece of text. So you want to figure out how to actually use the models for more advanced reasoning and planning.

00:41:55Jerry

The other piece is that standard RAG pipelines are typically one-shot, so they're not personalized. After you ask a question, the system forgets about it, so every new interaction is basically stateless. That has certain advantages, but if you're building a personalized knowledge assistant, you ideally want to add a memory layer.

00:42:14Jerry

So a lot of what we asked ourselves is: can we do more than a basic RAG pipeline? There are a lot of questions and tasks that a naive RAG pipeline can't answer, which leads to hallucinations for the end user. A chatbot that can't answer 80% of the questions you might want to ask it adds limited value. So how do you build a more generalized knowledge assistant that can take in questions of arbitrary complexity and answer them over arbitrary amounts of data?

00:42:47Jerry

We think a better knowledge assistant has four main ingredients. The specific focus of this talk is how to build a multimodal knowledge assistant: instead of just reasoning over standard text files, how do you reason over an entire research report full of diagrams, pictures, and images? How do you reason over all the visual data out there in addition to just text?

00:43:20Jerry

The first ingredient is a core, high-quality multimodal retrieval pipeline. Second, we want to generalize the output and think about something more complex than a standard chatbot response: generating a research report, doing data analysis, taking actions. Third is agentic reasoning over the inputs: instead of taking the user question and only using the LLM for synthesis, we apply Chain of Thought, tool use, reflection, and so on to break down the question, do some planning, and work step by step toward an overall goal. Last is deployment.

00:44:03Jerry

We'll see how long I have. I plan to probably talk for 15, 20 more minutes. So just cover some of the high-level details. And for some of the actual examples, they're basically linked in the slides in case you want to check them out.

00:44:15Jerry

The first piece is setting up multimodal RAG. If you're familiar with RAG, you're probably familiar with RAG over text data, but what we're really interested in is having RAG operate over visual data. By visual data, I don't just mean a JPEG file; even your PowerPoints or an arXiv research paper will have charts, diagrams, and unusual layouts. The issue with a lot of standard RAG pipelines is that they do a terrible job of actually extracting that information for you.

00:44:50Jerry

Like I mentioned, any LLM, RAG, or agent application is only as good as your data processing pipeline. If you're familiar with the "garbage in, garbage out" principle from traditional machine learning, LLM application development is no different: good data quality is a necessary component of any production LLM app. The ETL layer for LLMs consists of parsing the document, chunking it in a smart way, and then indexing it so the right context ends up in the LLM prompt window.

00:45:28Jerry

The data I'm talking about, this case study of complex documents, is a pretty common format across a lot of different companies. A lot of documents can be classified as complex: embedded tables, charts, images, irregular layouts, headers and footers. Often, when you apply off-the-shelf components to parse this data, it ends up in a broken format and the LLM hallucinates the answer.

00:46:00Jerry

Users want to ask different types of questions over this data. Say you have a bank of PDFs and you want to ask questions over it: they could be simple, pointed questions, multi-document comparisons, or longer-running research tasks. A research task could be, "Given these 10 arXiv papers on LLM quantization, generate a condensed summary or a survey paper." That's longer and higher-level in nature than a simple search-and-retrieval task.

00:46:31Jerry

So we'll start with the basics, which is parsing. Ideally, a document parser can structure this complex data for any downstream use case. I won't talk about it too much because the goal of this talk is really agents, but without needing to know the internals of document parsing: you need a good PDF parser. If you have a bad PDF parser, you'll load in some PowerPoint or PDF and it won't extract the right text. And when you feed in text that's been hallucinated by the parser, the LLM will have a really hard time understanding it, no matter how good the LLM is.

00:47:11Jerry

Ideally, you want a parser that can parse out text chunks, tables, and diagrams in semantically consistent ways. That is one of the things we do: we make a pretty good AI-powered PDF parser. We're at 30,000+ users right now, and if you're interested in trying it out, everybody gets about a thousand credits, or pages, per day. It's used everywhere from small companies to larger enterprises.

00:47:42Jerry

So you want a good parser, and what that enables is structuring your data in the right way. Think about different types of data: an investor slide deck, a 10-K annual financial report, an Excel sheet, a form. Having that parsing and extraction step to get the data out in a clean format makes it much easier for LLMs and retrieval processes to understand it afterwards.

00:48:14Jerry

Once you've parsed a PDF into its constituent elements in a good way, you can then leverage hierarchical indexing and retrieval to do something fancier than a standard RAG pipeline. Given this document structure of text chunks, tables, and diagrams, a standard RAG pipeline would directly try to embed each of these chunks. But we've found a better approach is to extract a bunch of different representations that point to the source chunk.

00:48:50Jerry

For instance, for a table, you might extract a variety of different summaries that point to that table. For a picture, you can't feed it into a text embedding model anyway, so you might use a model to extract a variety of text summaries that point to that picture. For bigger text chunks, you might extract smaller text chunks that point to the bigger chunk.

00:49:11Jerry

Once you extract these representations, we call them nodes, because they are what gets embedded and indexed by the vector database you're using. So if you're using a vector database like Pinecone, you can basically extract and index this metadata that is associated with the source element but is not the source element itself.

00:49:33Jerry

Then, during retrieval, given a user question, you first retrieve the nodes, and because each node holds a reference to the source document, you can dereference it and feed the resulting element into the model. Notice that most models these days are multimodal: GPT-4o, Claude 3.5 Sonnet, and the latest Gemini models from Google can take in both text and images. The nice thing is that you can still use text embedding models to represent each element, but when you feed content into the LLM, you can pass both text and images. What I just outlined is a basic way of building a multimodal RAG pipeline.
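Here is a minimal sketch of that hierarchical pattern using LlamaIndex's IndexNode and RecursiveRetriever; the table content and summaries are placeholders for what a parser and a summarization model would produce.

```python
# Minimal hierarchical-indexing sketch: embed small summary nodes, dereference to the source element.
from llama_index.core import VectorStoreIndex
from llama_index.core.retrievers import RecursiveRetriever
from llama_index.core.schema import IndexNode, TextNode

# A source element extracted by the parser (e.g. a full table).
table_node = TextNode(text="<full markdown table of quarterly revenue>", id_="table-0")

# Smaller representations that point back to the source element via index_id.
summaries = [
    IndexNode(text="Table of quarterly revenue by region for FY2023.", index_id="table-0"),
    IndexNode(text="Shows that revenue grew 12% quarter over quarter in Q3.", index_id="table-0"),
]

vector_index = VectorStoreIndex(summaries)  # only the summary nodes are embedded
retriever = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": vector_index.as_retriever(similarity_top_k=2)},
    node_dict={"table-0": table_node},  # dereference summaries back to the source element
)

for result in retriever.retrieve("How did revenue change in Q3?"):
    print(result.node.text)  # the full table is returned, not just the summary
```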

00:50:20Jerry

So a multimodal RAG pipeline can take in any sort of document with different types of visual elements, and it stores both text and image chunks. To index the image chunks, you could use CLIP embeddings, or you could do what I described: use a model to extract text representations and link each representation to its image chunk. With that set up, during retrieval you return both the text and the image chunks, and you feed both to a multimodal model.

00:50:57Jerry

I'm going to skip this example, but here is a basic example showing how to build a standard multimodal RAG pipeline. Notice that up to this point I haven't really talked about agents yet; this is just setting up the basics of multimodal RAG, so you don't get the benefits of Chain of Thought, reasoning, or tool use quite yet. All this does is let you ask a specific question over a more complex data set, like research reports or slide decks, and get answers over the visual elements on the page.

00:51:28Jerry

The next piece... I might actually skip this due to time, but the high-level idea is that a lot of the promise of agents is that they won't just give you back a chat response; they'll generate entire units of output for you, producing a PowerPoint or a PDF, or taking actions on your behalf.

00:51:51Jerry

Does anybody here use Claude instead of ChatGPT? You know how, when you ask it to write a paper for your Berkeley essay, it generates an entire document on the side? That's an example of report generation: it gives you something you can directly copy, paste, and later edit.

00:52:13Jerry

That's a pretty common use case we're seeing in the enterprise too. A lot of consultants and knowledge workers are interested in going beyond an unformatted response to getting something they can use directly on its own, whether that's code or reports. It's all very interesting, and I'll skip some of the architectures of how you actually build this for now.

00:52:40Jerry

Now we'll talk about agentic reasoning over your inputs, which is the third section. We have a multimodal RAG pipeline in place; now let's add some layers of agentic reasoning to build agentic RAG.

00:52:53Jerry

Naive RAG works well for pointed questions but fails on more complex tasks, for all the reasons I mentioned above: you're just retrieving a fixed number of chunks, and you're not really using the LLM to break down the question up front. So for summarization questions where you need the entire document instead of a set of chunks, comparison questions where you need to look at two, three, or more documents, multi-part questions, or high-level tasks, you don't really get good results with a standard RAG pipeline.

00:53:28Jerry

There's a wide spectrum of agentic applications you can build, and I think Chi gave much better coverage of how multiple agents can collaborate to achieve very advanced things. The way we think about it is that there are both simple and advanced agent components. At the far end, you have entire generalized agent architectures. This includes the ReAct loop, for instance, one of the most common agent architectures these days; it came out almost two years ago and basically uses Chain of Thought plus tool use to give you a generic agent architecture. You can plug in whatever tools you want, and it will roughly try to reason over them to solve the task at hand.

00:54:12Jerry

This also includes LLM Compiler, a paper from Berkeley, which generalizes a bit beyond ReAct by doing some pre-planning: instead of planning just the next step at a time, it plans out a DAG, optimizes it, runs it, and replans periodically.

00:54:32Jerry

What we actually see a lot of people building these days: some do use ReAct, which is a pretty easy architecture to get started with, but others take some of the existing components and build more constrained architectures. Part of this is the desire for reliability. Even though a constrained architecture is less expressive and can't do everything, many teams are still building up trust in AI and want to solve a specific use case first. By narrowing to that use case, they can leverage more specific components and solve it in a more constrained fashion.

00:55:05Jerry

This includes, by the way, tool use, leveraging a memory module, and function calling. At a lot of places, people are very interested in structured output generation, tool use, being able to call an existing API, and also basic query decomposition: given a question, breaking it down into a bunch of sub-questions, whether via Chain of Thought or in parallel.
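For the query-decomposition piece, here is a minimal sketch with LlamaIndex's SubQuestionQueryEngine; the two report folders and the question are hypothetical.

```python
# Minimal query-decomposition sketch: break a question into sub-questions routed to per-document tools.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool

index_2023 = VectorStoreIndex.from_documents(SimpleDirectoryReader("./reports/2023").load_data())
index_2024 = VectorStoreIndex.from_documents(SimpleDirectoryReader("./reports/2024").load_data())

tools = [
    QueryEngineTool.from_defaults(
        index_2023.as_query_engine(), name="report_2023",
        description="Answers questions about the 2023 annual report.",
    ),
    QueryEngineTool.from_defaults(
        index_2024.as_query_engine(), name="report_2024",
        description="Answers questions about the 2024 annual report.",
    ),
]

# The engine generates sub-questions, sends each to the right tool, then synthesizes a final answer.
engine = SubQuestionQueryEngine.from_defaults(query_engine_tools=tools)
print(engine.query("How did revenue growth differ between 2023 and 2024?"))
```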

00:55:32Jerry

We call this overall thing agentic RAG, because it really is just an agent layer on top of RAG. If you think about RAG, or retrieval from a vector database, as a tool, you can think about an agent that operates on top of those tools. So instead of feeding a query directly to the vector database, you first pass it through a general agent reasoning layer, which can decide to transform the query and decide which tools to call in order to give back the right response. The end result is a more personalized Q&A system that can handle more complex questions.
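A minimal agentic-RAG sketch with LlamaIndex's ReAct agent, where the RAG query engine is wrapped as a tool; the data directory, tool description, and model choice are assumptions.

```python
# Minimal agentic-RAG sketch: a ReAct agent that calls a RAG query engine as a tool.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import QueryEngineTool
from llama_index.llms.openai import OpenAI

index = VectorStoreIndex.from_documents(SimpleDirectoryReader("./data").load_data())
rag_tool = QueryEngineTool.from_defaults(
    index.as_query_engine(),
    name="report_search",
    description="Answers questions about the research reports in the knowledge base.",
)

# The ReAct loop interleaves chain-of-thought reasoning with tool calls until it can answer.
agent = ReActAgent.from_tools([rag_tool], llm=OpenAI(model="gpt-4o"), verbose=True)
print(agent.chat("Compare the main conclusions of the 2023 and 2024 reports."))
```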

00:56:09Jerry

Here's an example of what I mean by unconstrained versus constrained flows. A more constrained flow might just be: you have a task and a simple router prompt. A router prompt is just an LLM prompt that selects one option out of n. All it does is, given the task, feed it to one of the downstream tools based on the router's decision, then maybe pass the result through a reflection layer, and then give back a response.

00:56:45Jerry

There are no loops in this orchestration. It hits the router, goes through a tool, goes through another prompt that reflects and tries to validate whether the result is correct, and then generates a response. I define this as more constrained because most of the control flow is defined by humans, by you, rather than by the agent. Typically, more constrained programs look like if-else statements and while loops that you write yourself instead of letting the agent decide.
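Here is a minimal sketch of such a constrained flow using LlamaIndex's RouterQueryEngine; the two sub-engines (vector search for pointed questions, a summary index for document-level questions) are a common setup but an assumption here, and the reflection step is omitted.

```python
# Minimal constrained-flow sketch: a router prompt picks exactly one downstream tool.
from llama_index.core import SimpleDirectoryReader, SummaryIndex, VectorStoreIndex
from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector
from llama_index.core.tools import QueryEngineTool

docs = SimpleDirectoryReader("./data").load_data()

vector_tool = QueryEngineTool.from_defaults(
    VectorStoreIndex.from_documents(docs).as_query_engine(),
    description="Useful for pointed questions about specific facts.",
)
summary_tool = QueryEngineTool.from_defaults(
    SummaryIndex.from_documents(docs).as_query_engine(),
    description="Useful for questions that need a summary of the whole document.",
)

# The control flow is fixed by the developer; the LLM only selects one option out of n.
router = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),
    query_engine_tools=[vector_tool, summary_tool],
)
print(router.query("Give me a high-level summary of the document."))
```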

00:57:16Jerry

If you're using a more generalized agent architecture, like ReAct, LLM Compiler, or Tree of Thoughts, it's more general because you're basically saying, "I don't know the specific plan I want the agent to follow; I'm just going to give the agent a bunch of tools and let it figure it out." This is more expressive because it can technically solve a greater variety of tasks than a flow you hardcode beforehand. But it's also less reliable: it might veer off and call tools you didn't want it to call, or get stuck in an infinite loop and never converge. And it's more expensive: these architectures typically use bigger prompts and stuff in more tools at once, so the marginal token costs are much higher.

00:58:09Jerry

And so we see somewhat fewer architectures being built with very wide, unconstrained flows. A good rule of thumb: if you're interested in using ReAct or something similar, stick to around four or five tools, and try to keep it under about ten with the current models.

00:58:26Jerry

We have core capabilities in LlamaIndex to help you build workflows. We call all of these things workflows, and they're all agentic in some sense. A very rough definition of an agent—everyone has a different definition, and you might disagree with me—is just a computer program that makes a non-zero number of LLM calls. That's a very general definition. And we basically help you write those types of programs.

00:58:55Jerry

So whether you define a very constrained program where you write the if-else conditions yourself, or you let an agent handle the task, we have an event-driven orchestration system where every step can listen for a message, pass a message to a downstream step, or pass messages back and forth between two different steps. These steps can be regular Python code, they can be LLM calls, they can be anything you want. At a certain point, the program stops and gives you back a response.
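
Here's a minimal sketch of that event-driven style, based on my reading of the llama_index workflow API (Workflow, step, StartEvent, StopEvent); the event name and step bodies are invented for illustration, and the exact decorator form may vary by version.

```python
# Event-driven workflow sketch: each @step listens for an event type and emits
# the next event; StopEvent ends the run and carries the result back.
import asyncio
from llama_index.core.workflow import (
    Event,
    StartEvent,
    StopEvent,
    Workflow,
    step,
)

class Drafted(Event):
    draft: str  # the message passed to the downstream step

class AnswerFlow(Workflow):
    @step
    async def draft(self, ev: StartEvent) -> Drafted:
        # Regular Python here; in practice this might be an LLM or tool call.
        return Drafted(draft=f"draft answer for: {ev.question}")

    @step
    async def finalize(self, ev: Drafted) -> StopEvent:
        # Stop the workflow and hand the result back to the caller.
        return StopEvent(result=ev.draft.upper())

async def main() -> None:
    result = await AnswerFlow(timeout=60).run(question="What is agentic RAG?")
    print(result)

asyncio.run(main())
```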

00:59:28Jerry

We're building out this very fundamental, low-level orchestration because we believe agentic behavior has some interesting properties—it is fundamentally somewhat event-driven. It also provides a nice base for deploying workflows to production, in the event that you want to translate your program into a Python service. So check it out; there are some links here, and it's all linked in the docs.

00:59:58Jerry

I'm going to skip this piece. Some use cases that may be interesting to cover—the links are here in case you want to check them out—start with report generation, which we're very interested in. This is something we see pop up across a lot of different companies: given a bank of data, you want to produce some output from that data.

01:00:26Jerry

An example architecture for this is a researcher and a writer, and maybe a reviewer as well. You can think about this as a multi-agent system, depending on how you define it. The researcher does a bit of RAG: it retrieves relevant chunks and documents from some database—a bit like going on the internet, fetching material, and storing it in your notes—and puts that material into a data cache. That cache contains all the relevant information you need to generate the report.

01:01:00Jerry

The second step is the writer. The writer uses the data cache to make an LLM call that generates an interleaved sequence of text, image blocks, and tables, and gives you back a full output. We have an example architecture here, and we also have some example repos where you generate an entire slide deck instead of just a report.
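
To make the researcher/writer split concrete, here's a rough sketch in plain Python. The `retrieve` and `llm_complete` helpers are hypothetical placeholders for your retriever and LLM calls, and the block types are just one guess at what interleaved report output could look like.

```python
# Report-generation sketch: a researcher fills a data cache, a writer turns the
# cache into interleaved output blocks. `retrieve` and `llm_complete` are
# hypothetical placeholders.
from dataclasses import dataclass, field

@dataclass
class DataCache:
    task: str
    chunks: list[str] = field(default_factory=list)  # retrieved notes

def retrieve(query: str) -> list[str]:
    raise NotImplementedError("call your vector database / web fetcher here")

def llm_complete(prompt: str) -> str:
    raise NotImplementedError("call your LLM provider here")

def researcher(task: str) -> DataCache:
    """Do a bit of RAG: pull relevant chunks and stash them in the cache."""
    return DataCache(task=task, chunks=retrieve(task))

def writer(cache: DataCache) -> list[dict]:
    """Turn the cached notes into an interleaved sequence of report blocks."""
    summary = llm_complete(
        f"Write a report on '{cache.task}' using these notes:\n"
        + "\n".join(cache.chunks)
    )
    return [
        {"type": "text", "content": summary},
        {"type": "table", "content": "key figures extracted from the notes"},
    ]

# report = writer(researcher("Q3 revenue trends across the portfolio"))
```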

01:01:23Jerry

Another use case, by the way, which isn't in these slides but is very interesting, is customer support. If you look at practical enterprise use cases of agents, customer support for external-facing scenarios is probably number one. There's a lot of automation that can be baked into the decision flow to increase your deflection rate and ensure the user ends up having a much better experience than going through those automated phone menus. We see that popping up in a lot of different places too.

01:02:02Jerry

And then the last bit is really about running agents in production. If you start off building a lot of these components, you're probably going to start in a Jupyter notebook, and that's totally fine. When you're building a prototype, it makes sense to do something very local and narrowly scoped and see whether it works over test data.

01:02:25Jerry

An interesting design exercise is to think about what a complex multi-agent architecture looks like and how we can leverage existing production infrastructure components to achieve that vision of multi-agents in production. If you think about agent one, agent two, agent three, every agent is responsible for solving some task, and they all need to communicate with each other in some way. Ideally you encapsulate each agent's behavior behind an API interface and standardize how they communicate through some sort of core messaging layer. Then you can easily scale up the number of agents in the overall system to grow the multi-agent network, and you can handle a large volume of client requests across different sessions.

01:03:14Jerry

So this is basically what we're building. It's a work in progress, but we've made a lot of progress in the past few months on how you actually deploy agentic workflows as microservices in production. You model every agent workflow as a service API. We let you spin this up locally and also deploy it on, for instance, Kubernetes. All agent communication happens via a central message queue. You can also have human-in-the-loop as a service: if an agent needs your input, it sends a message back to you, awaits your response, and then resumes execution once you give it an input.
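
To give a flavor of "every agent workflow as a service API," here's a generic sketch using FastAPI. This is not LlamaIndex's deployment stack; the endpoint path, payload shape, and `run_workflow` hook are all assumptions, and a fuller setup would publish tasks onto a central message queue instead of calling the workflow in-process.

```python
# Generic sketch: one agent workflow exposed behind a service API with FastAPI.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class TaskRequest(BaseModel):
    session_id: str  # lets the service keep track of many client sessions
    task: str

async def run_workflow(task: str) -> str:
    """Hypothetical hook into your agent workflow (LLM calls, tools, etc.)."""
    return f"result for: {task}"

@app.post("/agents/researcher/run")
async def run_researcher(req: TaskRequest) -> dict:
    # In a fuller setup this handler would publish the task to a message queue,
    # and other agent services (or a human-in-the-loop step) would subscribe.
    result = await run_workflow(req.task)
    return {"session_id": req.session_id, "result": result}

# Run locally with: uvicorn agent_service:app --reload
```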

01:03:56Jerry

How many of you have seen the Devin demo, the Cognition Labs Devin? How many of you know what Devin is? All right, so about half of you. If you took a look at the demo, one thing it does is this coding agent will generate an entire repository for you, but sometimes it will stop. It will say, "I don't actually have enough clarity to give you the response. Can you tell me what to do next?" If you've played around with Devin, that's basically what it does. And that's an example of human-in-the-loop: this interesting back-and-forth client-server communication where the server is actually waiting on the client to send a human feedback message.

01:04:34Jerry

So that's basically it. All these components, I think, are step-by-step building blocks towards this idea of a production-grade multimodal knowledge assistant over your data. And yeah, thanks.
