CS 194/294-196 (LLM Agents) - Lecture 7, Nicolas Chapados and Alexandre Drouin
Disclaimer: The transcript on this page is for the YouTube video titled "CS 194/294-196 (LLM Agents) - Lecture 7, Nicolas Chapados and Alexandre Drouin" from "Berkeley RDI Center on Decentralization & AI". All rights to the original content belong to their respective owners. This transcript is provided for educational, research, and informational purposes only. This website is not affiliated with or endorsed by the original content creators or platforms.
Watch the original video here: https://www.youtube.com/watch?v=-yf-e-9FvOc
We're both very thrilled to be here, super happy. Thanks for the invitation. It's already lecture seven, the midpoint of this amazing series of lectures.
So so far, all of you have had the opportunity to learn a bit about the building blocks for agentic systems, things like how we do reasoning or tool use with AI agents.
So today, we want to frame how agents can be used in an enterprise context, in the context of people trying to get actual work done with agents.
And the focus will be really about understanding what are the use cases, who are the users, the human users for those agents, what role do they have, what kind of work are they trying to do, and what is the context in which these AI agents should be deployed to help the human users in the context of their work activities.
For context, both Alex and I work at a company called ServiceNow. You may or may not have heard of it. It's a global company, 25,000 employees. The headquarters are just down the bay in Santa Clara. And really, workflows are the bread and butter of what ServiceNow does. We build a platform to automate enterprise workflows. We have products in areas like IT Service Management, Human Resources, and Customer Support. And ServiceNow is also a developer platform that helps both professional developers and regular line-of-business people build new workflows.
And to give a sense of how this platform is used today (and the same is true for other enterprise workflow platforms), one way to think about how these platforms work is that they bring together people who need to get work done in the context of their daily activities with people who are doing the actual work.
So suppose, for example, that you had a bad morning and you spilled coffee on your laptop. Your laptop is no longer working. What do you do? Well, if you're at a modern workplace, you may hop onto your mobile phone and open a ticket for your IT help desk and request a new laptop to be shipped to you. Today, this ticket is going to be processed by a human user, by a human agent who will read the steps, follow instructions to get the new laptop shipped to you. There's a lot of manual steps that are involved in this process, this resolution of everyday issues that people in the workplace experience. And what we will see is that there are huge opportunities for AI agents to help partly automate some of this work.
So before we go further, both Alex and I work for ServiceNow Research. And of course, our team works a lot on agents, and we have multiple subteams that work on various aspects of agent development. But we also think about all the other layers of the stack. So for example, we have a team that trains and fine-tunes Foundation models. This is the team that worked with Hugging Face, for example, to release the StarCoder model that some of you may have heard about.
And at the other end of the stack, we have a lot of people that worry about things like AI trustworthiness, governance, safety, security, and the kind of practical concerns that we will touch on a little bit today in the context of agents, but that are absolutely necessary if you want to deploy AI successfully.
And before moving into the topic for today, our friendly legal team asked us to display this disclaimer, so here you go.
And a quick note about the agenda, what we're going to be covering today. So we're going to start with a few words of background. Of course, this is not the first lecture. You have heard many, many things about AI agents, so I will not give any kind of explanation again, but just explain the vocabulary and make sure that we agree on the terminology. And I want to discuss a little bit what's the matter with enterprise workflows and how a lot of work happens in businesses today.
Then we're going to be talking a bit more about API agents, somewhat about architectures, but more deeply about a new framework that we released as open source just a couple days ago called TapeAgents, which offers a very principled way to develop agents and optimize them. We're very excited about this new piece of work.
Then I will pass the baton to Alex who will talk to us about web agents, how we can develop them, showing that building a simple web agent is actually pretty simple to do if you know what you're doing, but measuring their performance, making them better is actually much harder. So how can we do that in a more principled way and what does the tooling to build, develop, optimize web agents look like?
Then I will conclude by talking a little bit about the potential for agents in the workplace and the kind of large-scale transformations that AI agents could bring to us in the future.
So let's start with a quick definition. Again, this is a class about LLM-powered agents. We assume that this is clear. And just a couple words of definition: we assume that those agents have a degree of autonomy, that they can plan, they can take action in an environment, they can receive feedback from the environment, and they can execute goals that are given by a human user. And usually this process happens over multiple iterations. So you have certainly heard variants of that definition earlier in this series of lectures.
Now, maybe to make things a bit more crisp and a bit more precise, we really need to distinguish LLM agents from what, if you come from a machine learning background, is an earlier generation of agents: agents that are a bit more narrow in scope, with more restricted architectures that typically have a fairly restricted action space, that could be trained using techniques like reinforcement learning, and that would be very good at playing games, for example, sometimes at superhuman performance. We're not going to be talking about that today. Those agents, although they are really, really good at playing Minecraft, for example, would be terrible at writing an email to your boss.
And in contrast, what is really nice about LLM agents is that the large language model at the core of those agents has been trained on a large fraction of the internet, and in doing that, it has seen a lot of instructions about common software that people use every day. There are a lot of web pages that talk about how to use PowerPoint, how to use Excel, and even business software like Salesforce, SAP, and even ServiceNow. There's a lot of documentation and newsgroup posts about these things. So those LLMs already have a good sense of how to use software that people use every day, and the goal of an LLM agent, to a big extent, will be to leverage this kind of knowledge to execute tasks that people want to get done.
And within LLM agents, we're going to further distinguish two types of agents. One is an API-calling agent, so it has a tool set that can be described using a somewhat formal specification, let's say a formal specification of a set of APIs or endpoints, and those would be primarily exchanging textual information.
And on the other hand, we will also talk about web agents. And those are able to use the web pretty much like you and I would, by pointing and clicking on buttons on web pages, by filling out forms, by navigating from page to page to gather information, to memorize little pieces of information that you can reuse elsewhere in the task that you need to do. So of course, each of those has pros and cons. We will touch upon both today.
And to frame this within the context of work that folks need to get done in their daily activities as employees of big companies, imagine that we're facing an issue where a user called John is not able to access a particular knowledge base article on the company intranet. So John is a good employee, he will file a ticket with his IT help desk, and this ticket gets assigned to Sandy. And Sandy looks at the ticket and tries to solve the issue for John.
So what does she, Sandy, do? Well, she first starts by seeing if there's an existing knowledge base article that shows how to resolve this issue. Maybe she doesn't find any. So she looks to see if there are earlier incidents that looked a little bit like this problem that she's trying to solve. Maybe she finds one, and in it she sees that she needs to check the access control that John has, and she will change the access control to give the right permissions because John didn't have the right controls. She changes the permissions and finally, she resolves the case.
And since Sandy is working in a very advanced workplace, she has a brand new generative AI capability that just got deployed in the last year. And lo and behold, the generative AI capability is able to do the incident summarization for her, write resolution notes, and finally close the case. Okay? So as you can see, there are many, many manual steps. Even if you're helped with some generative AI capabilities, there are many, many manual steps that remain in solving everyday work items that people need to do in the course of their daily work.
And this kind of incident is replicated millions and millions and millions of times a day across all the companies in the world. And to give a sense of how the landscape of enterprise automation has evolved, what we see is that there has been a progression in the past couple of years, going from systems that are very scripted, that are very manually programmed, with a very low degree of automation, things like manual scripting or robotic process automation. And gradually we have seen increased degrees of automation that require increased degrees of intelligence: classical AI workflows such as recommender systems, conversational workflows that have become more prevalent in the past couple of years, especially with generative AI, and today agentic workflows, which require the most intelligence but also hold the most promise for automation.
And the promise for agents in the workplace, or in enterprise automation, is that they really have a unique opportunity to automate the masses of low-volume, low-value tasks that people encounter every day. So what are those, you may ask? Well, let's start by looking at what is not low-value and low-volume. The big rocks of enterprise automation, things that have a lot of value or that occur very, very often, have probably already been solved, because people found it valuable to sit down and figure out how to program an automation for those use cases. Think of employee onboarding workflows, for example: things that occur quite often and are predictable and replicable, where there is a good chance that an existing low-code/no-code environment or robotic process automation has already been put in place to automate them.
But again, think about everything you do every day, all the little grains of sand that are too unique to be scripted or to be automated. They never occur twice in exactly the same way. So it's very, very hard to say, "Oh, I'm going to automate scheduling my tweets, or answering my emails, or updating a customer relationship management system, or even organizing a meeting for 15 people across four different companies." Those things are very hard to automate using classical technologies, and this is where agents have the potential to really shine and help do a much better job at all those little grains of sand that individually, each of them is tiny, but in aggregate, they represent a massive amount of work that people need to get done every day.
So let's take a look at a simple example of a web agent that we will be developing and expanding on a little bit later today. A couple months ago, we were invited to give a presentation at the big conference that NVIDIA puts together, the GPU Technology Conference, GTC. And of course, we did not know how to get there, so we asked our web agent for help.
And on the left side of the screen is the interaction that I, as a human, was having with the web agent. On the right is everything that the web agent is doing automatically through a Chrome-like web browser. So let's get this rolling. So I'm asking the web agent, "Hey, I'm attending GTC 2024. How can I get there from the NVIDIA headquarters?" So the agent goes off and, okay, well, it's starting with Google. It's asking Google how to get there. Good progress. It's getting to maps. And Google Maps says, "Hey, I don't know anything about GTC. It's not an item on the map." So the agent is asking me, "Hey, can you give me a bit more instructions there?" And I tell it I want to attend the GTC conference.
So the agent now has done some replanning. It's using Google to figure out the location of the conference. It's finding it as the San Jose Convention Center, and this is what it is entering into Google Maps now to figure out the way to get from the NVIDIA headquarters to the San Jose Convention Center. Okay, so this is an example of the kind of interaction that we soon will be getting, or that we hope we should be getting with our agents as systems that are allowed to ask more questions to humans, but that have some smarts in terms of planning and replanning of their own.
Okay, so let's shift gears now and talk a little bit about API agents. The typical architecture: this slide is probably a repeat of some of the material that was covered earlier, but again, just to set our minds in the right state. So we start with an LLM agent that is able to emit some actions; those are sent to an environment, and the environment sends observations back to the LLM agent.
The agent has access to some short-term memory, so what have been the past actions taken and the past responses from the environment. It also has access to some planning processes, things like ReAct and so on. There are many different strategies that have been proposed; we're not going to get too much into those today. And of course, for an agent to be useful, it has to access some tools. And the tools can be quite varied, including simple calculations, running code, accessing the web, and so on. And some more advanced agents also have the ability to synthesize new tools, which can be stored in some kind of long-term memory.
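To make this loop concrete, here is a minimal sketch in Python of an API-calling agent with short-term memory and a small tool set. The call_llm function and the tools are hypothetical placeholders for illustration, not any particular framework's API.

```python
import json

def call_llm(messages):
    """Hypothetical LLM call: returns a JSON string with either a tool invocation or a final answer."""
    raise NotImplementedError  # plug in your favorite LLM client here

# Tool registry: the formally specified "API" side of an API agent (stubs for illustration only).
TOOLS = {
    "calculator": lambda expr: str(eval(expr)),              # simple calculation (stub, not safe for production)
    "search_kb": lambda query: f"Top article for: {query}",  # stub knowledge-base search
}

def run_agent(goal, max_steps=10):
    # Short-term memory: the history of the goal, past actions, and environment observations.
    memory = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        reply = call_llm(memory)        # the agent plans and decides the next step
        step = json.loads(reply)        # e.g. {"tool": "search_kb", "args": "..."} or {"answer": "..."}
        if "answer" in step:
            return step["answer"]       # goal accomplished, stop the loop
        observation = TOOLS[step["tool"]](step["args"])       # act in the environment
        memory.append({"role": "assistant", "content": reply})
        memory.append({"role": "tool", "content": observation})  # feedback goes back into memory
    return "Gave up after max_steps."
```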
But beyond those basic concepts, we want to have the right software framework to help build agents. Okay? Just like it would be crazy for anyone today to try to program machine learning algorithms from scratch, everybody uses PyTorch because it's so much faster, it's so much easier. So we want to have the right framework to build agents.
And I want to talk about a new framework for holistic agent development, so both the tooling to help build and debug agents and the tooling to optimize agents from execution traces. We introduced this framework, called TapeAgents, just a couple days ago. In fact, the QR code is the link to the resources. All of that will be given again at the end, don't worry.
And what we're trying to accomplish with TapeAgents is to get a little bit of the best of both worlds. So on the one hand, we have frameworks that really address the software engineering side of agents. We want to be able to specify those agents to a very fine degree of granularity, to have pre-made components, to have facilities like concurrency and streaming from the LLM. And you've heard before about frameworks like LangGraph, AutoGen, and so on. So this is great for developing agents.
And on the other hand, we have a need to optimize agents. Okay? Those agents are all the result of prompting an LLM, and wherever there is prompting, there is prompt optimization and prompt engineering. Some of it is done by hand, but you can do a much better job if the machine can do it on your behalf. And in some contexts, you may even want to optimize the LLM itself to make it a really good agent LLM. So we have frameworks like DSPy that are very, very nice for doing this kind of optimization work. But until TapeAgents, there was no framework that was really good at doing those two things together.
Okay, so really what we hope to accomplish with TapeAgents is this best-of-both-worlds type of situation, where we have great tools to build, engineer, and debug agents, while at the same time getting the right primitives and the right concepts in place to enable both prompt optimization and fine-tuning of the underlying LLM.
So as a framework, TapeAgents itself builds on a single unifying abstraction that we call the tape. So what is a tape? It is simply a log, a recording of all the thoughts and actions of all agents in the system (we assume we can have more than one agent): all the agents in the system store their thoughts and their actions on the tape. And whenever the environment is called and we place an action in the environment, whatever results from the environment is also added to the tape. Okay?
So we have this single unifying log and record of things. And you could say, "Okay, well, it's just a dumb log. What is so special about it?" Well, first, it's not a dumb log; it has quite a lot of structure. And importantly, it is a piece of data itself, and we can have agents that take as input the tapes that result from the execution of other agents for further processing. Okay? So introducing this notion of the tape as a full-blown data structure, very rich in metadata and processable by further downstream algorithms, is really where the power of this framework lies.
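To make the "tape as data" idea more concrete, here is a minimal sketch of what such a structured log could look like in Python. This is purely illustrative; the actual TapeAgents data model and field names will differ, so treat the classes below as assumptions.

```python
from dataclasses import dataclass, field
from typing import Literal

@dataclass
class Step:
    author: str                                   # which agent (or the environment) produced this step
    kind: Literal["thought", "action", "observation"]
    content: dict                                 # structured payload, e.g. {"tool": "search", "query": "..."}
    metadata: dict = field(default_factory=dict)  # prompts used, LLM outputs, timestamps, parent tape, ...

@dataclass
class Tape:
    steps: list[Step] = field(default_factory=list)

    def append(self, step: Step) -> None:
        self.steps.append(step)

    def unprocessed_actions(self) -> list[Step]:
        """Actions not yet followed by an observation, i.e. what the orchestrator still has to execute."""
        last_obs = max((i for i, s in enumerate(self.steps) if s.kind == "observation"), default=-1)
        return [s for s in self.steps[last_obs + 1:] if s.kind == "action"]
```

Because a tape is just data, another agent or an optimization algorithm can take a whole Tape as input, which is exactly what makes auditing, prompt optimization, and distillation possible downstream.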
So in terms of just the execution model, to set things clearly in your mind, bring your attention to the agent on the right, Agent B. And this one gets a read of the entire history of the tape so far, and the agent selects which of its internal nodes it's going to execute at a particular point in time. And the nodes, again, just from an abstraction perspective, they are the little pieces that decide how to prompt an LLM. An agent can have as many nodes or as few nodes as you wish. It's really up to you as an agent developer to decide on the internal structure of those agents.
But the nodes then call the LLM. The LLM either suggests a few thoughts or a few actions. Those get added to the tape, and we have an orchestrator that can then go call the environment for executing any unprocessed actions, get the reply from the environment, and the loop continues. Okay? So it's a fairly straightforward execution model, but in that, you will notice that we have the ability to have multiple agents, and each agent can get, can delegate work to some sub-agents as well. So it's very modular, it leads to a very flexible structure of agent.
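And a rough sketch of that execution model, reusing the illustrative Tape class from the earlier sketch; the agent, node, and environment objects here are placeholders rather than the real TapeAgents API.

```python
def orchestrate(agent, environment, tape, max_iterations=20):
    for _ in range(max_iterations):
        # 1. The agent reads the entire tape so far and picks which of its internal nodes runs next.
        node = agent.select_node(tape)
        if node is None:
            break  # the agent considers the task finished

        # 2. The node builds a prompt from the tape and calls the LLM,
        #    yielding new thoughts and/or actions that get appended to the tape.
        for step in node.run_llm(tape):
            tape.append(step)

        # 3. The orchestrator executes any unprocessed actions in the environment,
        #    appends the resulting observations, and the loop continues.
        for action in tape.unprocessed_actions():
            observation = environment.execute(action)
            tape.append(observation)
    return tape
```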
And furthermore, we can have very, very nice things that result from this "tape as data" idea. Of course, we have full-blown auditing and debugging capabilities that are very, very obvious. And we can have some algorithms that take the tape as input and do cycles of prompt optimization to it to get better prompts. And we can have also agents that distill the work of some fairly complicated teacher agents and enable building much more cost-effective student agents as a result. So we can have an optimization of the underlying LLM as a result of that. We'll see in a minute what this looks like.
So a very simple example of that loop that I talked about... There is a full technical report. The link was provided in the pre-read to this lecture, so I will not get too, too deeply into the details, but this is more to give a flavor of what's going on here.
Assume you have a simple two-agent system, a very straightforward question-answering assistant about the financial status of companies, for example. So we have the assistant itself that talks to human users, and this agent is able to rely on a sub-agent that will perform web searches for it. Okay? We combine this simple two-agent setup with the quite straightforward execution loop that we described earlier, and this gives rise to a very detailed, granular tape recording of the interactions between all the agents, their internal monologue, their internal reasoning traces, the actions they take, and how the environment responds. And of course, the tapes in practice are quite involved. You can reproduce this example through the introductory notebook that you can run on your laptop. This is super easy to run.
And what this all gives us though, is the ability for agents to process tapes in a way that enables easy optimization. We don't have time today to go deeply into the details of this example, but at a high level, what we want to do is to provide agents that have a high degree of sophistication that can provide very, very high quality user interaction, but that are also quite cheap to run.
So we ran a case study built on the notion of an assistant that helps you fill out complex forms to request work in the context of your daily work activities. The process here is about distilling a very complex agent down into a simple agent. What this lets us do is create very cost-effective, GREAT conversational agents. Okay?
And GREAT, of course, is not a misspelling; it's on purpose. It stands for the attributes that we want to monitor in the agents that we build. We selected these attributes as covering all the quality attributes that make great conversational agents. Okay? So we want the agent to be Grounded, to never hallucinate, to always tell the truth. We want the agent to be Responsive, to be Accurate, to be Disciplined, Transparent, and Helpful. And all those attributes can be measured independently, and we can optimize for all of them together.
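As a toy illustration of measuring these attributes independently, here is one simple way to turn per-attribute judgments into a single number; this aggregation is an assumption made for illustration, not necessarily how the technical report computes its GREAT score.

```python
ATTRIBUTES = ["grounded", "responsive", "accurate", "disciplined", "transparent", "helpful"]

def great_score(judgments: dict[str, bool]) -> float:
    """Fraction of quality attributes satisfied on one conversation turn.
    Illustrative aggregation only; the technical report defines the actual metric."""
    return sum(judgments[a] for a in ATTRIBUTES) / len(ATTRIBUTES)

# Example: a judge found one turn grounded, responsive, accurate, disciplined, and transparent,
# but not helpful.
print(great_score({
    "grounded": True, "responsive": True, "accurate": True,
    "disciplined": True, "transparent": True, "helpful": False,
}))  # ~0.83
```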
So in a case study that we ran, which is detailed at length in the technical report, we built a set of agents to fulfill a task: helping human users fill out those very complex forms that they sometimes encounter in the context of their work, and making it a lot easier to fill out the forms. And what we want to do is build a fantastic conversational experience, think GPT-4 level quality or better, but much, much more cheaply.
And we ran the case study by inventing fictitious companies, with a few companies to train the model and a few companies to test the out-of-distribution effectiveness of the model. The methodology we followed is to have a very complicated, very high parameter count agent, running on the Llama 3.1 405-billion-parameter model, so very costly to execute, generate a lot of data that is recorded in the form of these discrete tapes, these discrete execution traces. And using this data, we fine-tune a Llama 8-billion-parameter model to recover the performance of the much, much bigger parent that was used as a starting point.
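Here is a hedged sketch of that distillation step: harvesting the teacher's recorded tapes into supervised fine-tuning pairs for the smaller student model. It reuses the illustrative tape structure sketched earlier, and the metadata field names are assumptions rather than the actual TapeAgents schema.

```python
def tapes_to_finetuning_data(tapes):
    """Convert teacher tapes into (prompt, completion) pairs for supervised fine-tuning.

    Illustrative only: assumes each step's metadata records the prompt that was sent
    to the teacher LLM and the text the teacher produced at that point in the tape.
    """
    examples = []
    for tape in tapes:
        for step in tape.steps:
            if step.kind in ("thought", "action") and "llm_prompt" in step.metadata:
                examples.append({
                    "prompt": step.metadata["llm_prompt"],      # what the teacher LLM saw
                    "completion": step.metadata["llm_output"],  # what the teacher LLM produced
                })
    return examples

# The resulting examples can then be fed into any standard SFT pipeline to train the 8B student.
```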
And I will go straight to the graph there. We achieved substantial success in doing that. What you see on the chart is, on the x-axis, how much it costs to run a million conversation turns with those agents, and on the y-axis, this measure of quality, the GREAT score that we introduced previously and that can be computed quite crisply.
And with no surprise, on the upper right-hand side we have the largest agents available: a Llama 405-billion-parameter model with a high GREAT score, and a zero-shot GPT-4o model that also has a high GREAT score, but both of those are very expensive to run. We're talking about $30,000 to $40,000 per million conversation turns. So those are very expensive agents.
But by using the methodology enabled by this concept of saving tapes and treating them as data to generate fine-tuning data for a simpler agent, we see that we're able to fine-tune an 8-billion-parameter model that has the same performance as the much larger model but is more than 300 times cheaper to run. Okay, so a huge gain there. And if you're familiar with other agentic frameworks, we also have a quick table comparing TapeAgents to other frameworks, but I will go over that quickly.
Now, time is flying. I will pass it to Alex who will shift gears and tell us about web agents.
Great, thank you, Nicolas. Hi everyone. Yeah, so Nicolas just told us about API agents. Those are great when you have APIs available, when you have tools that can do some of the execution for you. But now I'm going to talk about another kind of agent that promises to be more general than that, and this is web agents. Basically, a web agent is a great improviser, right? So when there are no APIs available, you can still expect an agent to interact with the user interface just like you and I would when using some software. Oops, that doesn't work.
Okay, so what's a web agent? To be original, I'll show you an agent loop that I'm sure you've seen many times by now, except it's going to be style-transferred for web agents. So suppose you have a user with a goal in mind; the web agent is just a policy that perceives an environment, which in our case is a web browser, receives observations from the environment, and then decides which actions to take. Okay? So it's a simple agentic loop where the web agent, in our case, is an LLM that chooses an action based on whatever observation comes from the browser.
This is actually quite challenging, because first of all, you need to understand the goal, you need to understand what the human wants. You need to have situational awareness, basically: you need to understand what you're seeing in the browser, what the user interface is showing.
And long-term planning is really challenging here, because most of the time, the end result you want is not on the current page, right? So you need to transition, navigate some menus, and then ultimately you'll end up on the right page, but this is not necessarily perceivable from the current set of observations available to the agent. And then once you've made a plan and you know what you want to do, you still have to execute the actions correctly. And as we'll see in the following slides, even this is quite challenging for current agents.
Okay, so how do you make a basic web agent? This is the most basic thing you can do. Suppose you have a human who's asking for something, say, "fly me to Yellowstone for the next long weekend," and there's a browser. Now what you could do is take your favorite LLM and build a prompt. In this prompt, you put the task description, so the goal. You put the observation, which here would be, for example, the HTML of a web page; you could put the raw HTML directly in the prompt if you have a long enough context window. And then you define what the action space is: is the agent allowed to click on controls? Is it allowed to fill some forms, for example? Then you give this to the LLM and ask for the actions it wants to take based on the observation. And then you actually go and apply those in the browser. So often what people use is a combination of Python and Playwright, which is a browser automation library. You also have Selenium, which can do similar things. So that's a very basic web agent.
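Here is a minimal sketch of such a basic web agent in Python. Playwright is the real browser-automation library mentioned above, but call_llm, the prompt format, and the action parsing are simplified placeholders; a practical agent would need much more robust parsing, error handling, and safety checks.

```python
from playwright.sync_api import sync_playwright

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call that returns one action, e.g. fill("#name", "Enola") or click("#submit")."""
    raise NotImplementedError

PROMPT_TEMPLATE = """You are a web agent.
Goal: {goal}
Current page HTML (truncated): {html}
Allowed actions: fill(css_selector, text) | click(css_selector) | done(answer)
Reply with exactly one action."""

def run_web_agent(goal: str, start_url: str, max_steps: int = 10):
    with sync_playwright() as p:
        page = p.chromium.launch(headless=True).new_page()
        page.goto(start_url)
        for _ in range(max_steps):
            html = page.content()[:20_000]                        # naive truncation of the observation
            action = call_llm(PROMPT_TEMPLATE.format(goal=goal, html=html))
            if action.startswith("done("):
                return action                                     # the agent says the task is finished
            name, raw_args = action.split("(", 1)
            args = [a.strip().strip('"') for a in raw_args.rstrip(")").split(",")]
            if name == "fill":
                page.fill(args[0], args[1])                       # type text into a form field
            elif name == "click":
                page.click(args[0])                               # click a button or link
```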
And concretely, here's an example from the MiniWoB dataset. I could build the prompt like that: I could say the task is to enter "Enola" in the text box and then press submit. I would put the content of the page there in the prompt. I would say the action space, what you're allowed to do here, is to fill a field given its ID. So you would say "fill ID number one" and it would fill that field, move the mouse, and so on, right? And then I would ask the LLM, "What do you want to do now? Output your actions between those brackets here." And the LLM could answer with this: fill the field with "Enola" and then click, right? So here you can see a GPT-4-based agent actually doing that when prompted with this kind of prompt.
So this seems very simple, but it's actually already surprisingly powerful. What you're seeing here is a very basic agent, built pretty much like what I just showed you. It's based on GPT-4 and ReAct-style prompting, so we're asking the agent to output its thoughts and then the actions it wants to take. You can see that at the top. And it's actually filling out Nicolas's expense report from when he traveled to a conference recently. Here you can see it actually filling out the form. This is GPT-4 acting with the very simple prompt that I just showed you.
Okay, so you can see how this already makes, at least for me, it makes me dream, right? You can say like, "Wow, this is great. We could imagine having assistants like that that will automate tasks for us in the browser." This is the next evolution of user interfaces and so on, right? So we were quite excited when we saw that.
The truth is that this is very brittle, right? So this is a prompt, it works right now. We have some great anecdotal examples like that where it works really well, but for this to be adopted in practice, it needs to work robustly, right? So it needs to work most of the time. We need to be able to ensure that it's not going to be jailbroken or hijacked by injections in the web pages, right? So this needs to be reliable.
And by the way, this is also one other limitation of this kind of agent right now is that it's quite slow. So here the video I just showed you was sped up by eight times, right? So in practice, you really need to wait like 15 seconds for the mouse to move from one point to the other, which is not yet what we want for a smooth user interface, but you can see it has potential.
Okay, so given the fact that we need this to be very robust for it to be actually applicable in practice, there's been a lot of effort on building benchmarks. So you may have seen some, I'm going to outline a few recent benchmarks, but this has been a trend mostly because we want to know beyond anecdotes how those work in practice. So here you are seeing the MiniWoB dataset, which are very simple web tasks like filling a field, clicking submit, or just dragging a box from one side of the screen to the other. This is great. It's a good start, but it's not realistic.
So people have started to propose benchmarks based on real-world websites. The first kind are trace-based benchmarks, which are mostly based on traces of actions performed by human annotators. So they receive a goal and a website, and then their actions are recorded: they clicked on that, they filled this field, and so on. So this is nice, it's more realistic, but the evaluation here is mostly based on gold traces, gold standard traces. Basically, you check if the agent is doing similar actions to what the human did.
A limitation of that is that it does not account for the fact that there may exist alternative solutions to a problem. And also, when you store those traces on the internet, there's a risk that they will be memorized by LLMs that are trained by crawling, for example.
Oh, yeah, one exception here is a brand new kind of benchmark by Mistral, where an LLM is used to generate traces. So basically, it's dropped randomly on the internet at some point, acts by performing random actions, and then retrospectively annotates what goal could have led it there. So it's a way to generate thousands of those traces without having to rely on human annotators. I thought that was a pretty interesting idea. And you can use that for evaluation, but you can also use it for fine-tuning. We'll get back to that later.
Great, so beyond trace-based datasets, there's been a trend toward making live benchmarks. So instead of evaluating based on traces of actions, you evaluate based on the end result. For example, you could say, "Well, I expect that if this task is solved correctly, the database will be in a certain state." And then instead of checking the trace of actions, you just check whether the database is in the correct state. Some other benchmarks rely on question answering, and they check whether you return the correct answer instead of checking the set of steps you took to get to that answer. This is nice because it allows for alternative solutions, and there's also a lower chance of memorization, right? Because the traces are not stored in a public repository.
So two good examples of that are the WebArena and the Visual WebArena benchmarks. So WebArena is actually a set of tasks where you receive a goal and you need to perform some actions in software like Reddit or GitLab, and there's certain terminal conditions that are checked to see if the agent was able to solve the task correctly. These are actually great because the way you benchmark on this kind of benchmark is that you run a local server and then everything is served locally. So it's great if you want to do parallel experiments. You don't have bandwidth issues. It's really nice. The only issue here is that it's limited to open-source software and you need some complex installation. You need to set up your own server locally.
Another kind of benchmark is based on remote servers, really on the open web. These are more realistic because there's latency, which is the reality when you go over the real internet. And it's also nice because you can support websites that are not necessarily open source; you don't have to host them locally. The downside is that they can be unreliable. There can be network issues, and you need an internet connection to be able to benchmark.
Yeah, so different kinds of benchmarks here. WorkArena is a benchmark that we proposed. It's based on the actual ServiceNow product. So I'm going to dig a bit deeper in that, but WorkArena is a remote-hosted benchmark, so nothing happens on your local machine. So what WorkArena is, is an open-source benchmark. So the benchmark is open source, the product is not open source, but the benchmark is. It's 600 work-related tasks that are performed on a ServiceNow product.
So we thought that, given our access to ServiceNow and the fact that millions of people use it every day in their daily job, this would be a great tool for making a benchmark to see how web agents are able to help people in the workplace. So what we did is start by making some tasks that are really simple; we call this WorkArena Level 1. This is just interacting with those six basic components. For example, sorting or filtering a list in the product, searching a knowledge base to find the answer to a question like "What's the Wi-Fi password for the office?", or reading a dashboard to answer something like "Which item do we have the most of in stock?" So these are the kind of tasks that we have in Level 1 of WorkArena, and we go from that all the way to really complex, realistic workflows. And I'll show you some examples in the coming slides.
But before I move on, I just want to explain how this remote-hosted benchmark works. Obviously, we cannot ship you a Docker image that contains ServiceNow, because that would mean giving you the codebase of the product. So what we do is rely on something called developer instances, which are clones of the product that anyone can get. You just go to a website called developer.servicenow.com, you request an instance, and you get your own copy of the product; these instances exist so developers can learn how to program custom tools for the product.
And so the benchmark is built on that, basically. You give the credentials to your instance, and then the agent interacts with the front end, so it interacts with the web pages. And we do all the validation by using backend APIs, so we check if the agent correctly solved the task by going to check the state of the database. So yeah, it's based on a real product that people use for real, but everyone can access it. The tasks in the benchmark are open source, just not the ServiceNow product.
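As an illustration of that end-state validation idea, here is a hedged sketch that checks the instance's database through a REST Table API call after the agent has acted. The table, query, and expected record here are invented for illustration; WorkArena defines its own per-task checks.

```python
import requests

def validate_task(instance_url: str, auth: tuple[str, str]) -> bool:
    """Check the database state after the agent acted, instead of comparing action traces.

    Illustrative example: verify that a catalog order for an 'Apple Watch' exists.
    Table and field names are made up; WorkArena defines its own per-task validation.
    """
    response = requests.get(
        f"{instance_url}/api/now/table/sc_req_item",   # query requested items via the instance's Table API
        params={"sysparm_query": "cat_item.nameLIKEApple Watch", "sysparm_limit": 1},
        auth=auth,                                     # (username, password) for the developer instance
        headers={"Accept": "application/json"},
    )
    response.raise_for_status()
    return len(response.json().get("result", [])) > 0  # task solved iff a matching record exists
```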
Great. So just to make this a bit more concrete, here's GPT-4 actually solving some tasks in WorkArena. You could see it ordering an Apple Watch, you can see it reading dashboards, filling out some forms, sorting some lists. So these are all tasks from WorkArena Level 1. This is the simplest kind of tasks that we have in the benchmark.
Now, I want to give you a more difficult task, an example of a more difficult task. This is actually what we call WorkArena Level 3. It's pretty hard. So here, what you see is a ticket. Okay, so it's an interface that shows a ticket that's assigned to the human. In this case, the agent is that human, and they have to solve the task. So basically, what they have to do here is to go check a dashboard and they need to find which item is the lowest in inventory, and then order some items to replenish it from a service catalog.
So you can see that all the agent receives as instruction is, "Please solve this." And it starts from this page, basically, which is a ticket. And then it says, "If you're not sure what to do, you can always refer to the knowledge base for further instructions," and it gives it instructions to which article they could check. So the first step for the agent here is to go to the knowledge base, so it needs to use the menus and navigate to this knowledge base, find it, find the right article, extract the information on how to proceed, for example, how to get to the dashboards. Then it needs to go open the dashboard, read it, find which item is the least available in stock. And then it needs to navigate all the way to a service catalog where it's able to order this item.
So three high-level steps, but with many, many low-level steps in between. There's navigation involved. It's not clear. Imagine if I drop you in a room with multiple doors and you don't really see what's behind them. You need to get to the right spot, but you can't explore too much, right, because you have a limited number of steps. So this is actually a really hard task.
And so in addition to this example, we have all kinds of capabilities that are being evaluated in these tasks. One of them, I'll give you one example, it's planning and problem solving. So for example, one of the tasks in WorkArena involves looking at a set of employees, seeing who is the most busy, and taking some work from the most busy employee and reassigning it to the least busy one. So basically, workload balancing across employees. There's other ones like scheduling with constraints, so scheduling some events with some constraints. We have tasks about budget management, all kinds of things.
Sophisticated memorization tasks basically involve collecting bits and pieces of information across many pages and in the end solving a task based on everything you gathered. And finally, we have some tasks that are just infeasible, and the agent needs to say, "Hey, this cannot be done," right? So it needs to use some critical thinking to be able to say that, "No, this doesn't have a solution." If you're interested, the paper is here and you can take a look. It's also in the pre-reading for the course.
So okay, now I'll show you how well the agents that we have right now do on this benchmark. The title is a bit of a spoiler, but what you see on the right here are the scores from a human evaluation that we performed on WorkArena Level 2 and Level 3. These are the really realistic workflows, the ones I just showed you. So you can see that humans have a really high score: the average was roughly 94% across all our human evaluators on those tasks.
And when we look at WorkArena Level 1, unfortunately we don't have a human eval for that, but you can see that our best agent achieves roughly a 42.7% success rate, which means that if you give it a task, 42.7% of the time it solves it. So that's WorkArena Level 1. And this is a decent agent: it achieves one of the highest scores on WebArena, 23.5%. That's how we know it's a decent agent.
Okay, now let's look at what it does on WorkArena Level 2. A lot of zeros there, unfortunately. So we have some success, you can see that there is something there, but most of the tasks are not solved correctly. And when we move to Level 3, which is the example you've seen before, we get zeros all across the board, despite the fact that humans get a 94% success rate on those tasks.
So what you see here is that the example I showed you at the beginning was exciting; it seems that there is potential there, but obviously there is work to be done by the community. And we think that this benchmark will really help in making progress toward that. So you may be curious: why doesn't this work? Why do we have zeros everywhere? A lot of the time it's a failure to plan: the agent only perceives what is on its local page and fails to explore. Sometimes we have hallucinations, hallucinated controls: the agent will imagine a button called "solve task," and the action it takes is "click solve task." So it invents shortcuts for itself, or it uses incorrect syntax when outputting which actions to perform.
Okay, so WorkArena is not the only benchmark that has been proposed recently. As you can see, there are many, with tens of thousands of tasks across them, all for web agents, and each evaluates different aspects. But what's unfortunate is that these benchmarks are not necessarily evaluated with the same protocol or the same kinds of agents, so it's quite hard to compare them.
So we are building tools to try to unify everything. We made this framework called BrowserGym, and essentially it's a unified evaluation platform. We've tried to regroup most of the major benchmarks, and we're still adding more. If you want to contribute, we accept pull requests. So if there's a benchmark you think is interesting, feel free to reach out, and we'll work with you to get it into the platform.
So essentially, what BrowserGym does is give you a standardized observation space. You see this agentic loop on the side: at the top you have the HTML, but you also have the accessibility tree, which is a shortened, somewhat refined version of the HTML, and then you have screenshots. We also have a chatbot modality. So there's a bunch of different observation modalities that we provide in BrowserGym.
And we also provide a standardized action space. You can see examples at the bottom here. So you can do a mouse click, or you can click on an element: the action reads "click bid", where the bid is a numerical ID that we assign automatically to every control on the page. So basically, to click on something, the agent would say "click" with bid two, and we would go and execute it using Playwright in Python. So this is BrowserGym. And again, we regroup most of the major benchmarks, and we're still adding more.
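Putting the standardized observation and action spaces together, here is a hedged sketch of driving an agent through BrowserGym's gym-style interface. The environment ID, import name, and keyword arguments follow the documented interface but may differ from the released package, so check the BrowserGym repository before relying on them.

```python
import gymnasium as gym
import browsergym.core  # registers the BrowserGym environments (import name assumed; check the docs)

def my_agent_policy(obs) -> str:
    """Placeholder policy: map the standardized observation (AXTree, screenshot, chat messages)
    to one action string from the standardized action space."""
    return "click('12')"  # e.g. click the element whose bid is 12

# Environment ID and task_kwargs are illustrative; other registered tasks cover the various benchmarks.
env = gym.make("browsergym/openended", task_kwargs={"start_url": "https://www.example.com"})
obs, info = env.reset()
done = False
while not done:
    action = my_agent_policy(obs)                                  # the agent decides based on the observation
    obs, reward, terminated, truncated, info = env.step(action)    # BrowserGym executes it via Playwright
    done = terminated or truncated
env.close()
```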
What's also cool is that BrowserGym allows you to do human evaluation. This is what we used for the human eval in WorkArena. Essentially, you get the same kind of chatbot interface with the web browser, except you can assign a curriculum of tasks to a human, and then they can ask for validation or simply give up and go to the next task, for example. So all of this is in BrowserGym. You can pip install it, or you can use this QR code to access it.
Cool. So in addition to BrowserGym, we proposed AgentLab, which is another set of tools for building web agents. The goal of AgentLab is to be a toolbox for building web agents: we provide simple building blocks to build web agents, but also tools to understand their behavior, why they're not working, to debug their performance, for example, and a lot of tools to run large-scale experiments and ensure reproducibility.
So one thing that we propose is called Agent X-ray. Here you see that I'm selecting an experiment that I conducted. I click on a task from one of the benchmarks, and I click on a random seed. Then what I see is essentially what the agent was seeing: the screenshots, and, basically, the accessibility tree of the web page. You can see the reasoning traces of the agent, you can see which actions were performed, and you also see this kind of timeline here at the top, which gives you profiling information on how much time was spent reasoning, how much time was spent performing the actions, and how many steps the agent performed. So the goal of Agent X-ray is to help you understand what's going on with your agent and try to improve it.
Great. So in addition to that, we proposed tools for reproducibility in AgentLab. Essentially, one challenge we tried to address is the fact that it's really hard to benchmark on dynamic environments that we don't control. Websites are updated continuously, API-based LLMs can be silently updated (we don't really see what's going on in the background), and Python packages evolve. So how do you make benchmarks that remain comparable through time, right?
So what we're trying to do in AgentLab (this is work in progress; there's already something in our GitHub repository, but we're building it as we speak) is to find ways to extract standardized traces: observations, actions, observations, and so on. And we want to propose public repositories where those traces can be stored, along with the properties of the agent that was used, the versions of the Python packages, the dates, and everything. So we're trying to make it as reproducible as possible and make the mechanisms as automatic as possible. In addition to that, we're trying to build leaderboards for the major benchmarks using these traces and the experimental information, with mechanisms that allow us to reproduce the results and check whether they remain valid through time, since the environment is evolving.
Okay, and so by combining AgentLab and BrowserGym, we think there's a great opportunity, because first of all, we have a standardized observation space, a standardized action space, standardized traces, and public repositories to store those traces. So with this mechanism, we have the potential to do basically distributed trace collection. Everyone could help us gather traces of web agents acting on the open web, and we might be able to build a big repository of traces that we could further use for fine-tuning and making better agents in the future. So AgentLab and BrowserGym provide the tools we need to do that.
Okay, so before I hand it over to Nicolas, I just want to finish with the challenges of web agents. Okay, there is hope. I think we'll make great progress on those benchmarks in the coming year. I just want to review the major hurdles here.
So long-context understanding. The websites are huge. So we're talking about tens of thousands of tokens, hundreds of thousands of tokens easily, even for the accessibility trees, which are supposed to be shorter than the raw HTML. This makes huge observations. The amount of relevant information in those contexts is very small, so it's already a challenge to be able to pick it out correctly.
Long-term planning, as I said, is quite hard: you are on a page with no information about what is two or three pages away, and you have to reason, "Okay, I'm going to go to this menu, and then I'm going to click on that, and eventually I'll end up in the right place." We have intuition for that. We are used to using Amazon, for example, so we know how to search and navigate it, and analogous websites are not obscure to us, but for an LLM they could be. So there's work to be done in that space. The Mistral paper that I mentioned is a good step in that direction, collecting traces of random actions at a very large scale. This might help for fine-tuning LLMs to better understand where they could get in a few steps.
Learning and adaptability. So we need to be able to learn from observation, learn from demonstrations and mistakes. So there's some work in that direction. I wanted to mention the Agent-Q work that uses Monte Carlo Tree Search and DPO to try to fine-tune agents and do inference-time search. I think this is an interesting work.
Then there is multimodality, which is very important. Actually, the Visual WebArena benchmark proposes a set of tasks that you simply cannot solve without looking at the vision modality. So if I give you a task like "order me a shirt like that," well, you need to be able to see what the shirt is to be able to solve it, right?
Cost and efficiency. As I said, the videos that we showed you were sped up by a lot, right? So we need to make this fast. We need to make it efficient. I think shrinking the context size is one direction. There's a potential to use retrieval to retrieve only the relevant elements from the observation space. There's potential for using multi-agent architectures where you have, for example, the date picker agent, which is a very small LLM that is only used to do a very small task of filling out date pickers, for example, right? So perhaps there's some multi-agent architectures that could be used there. And finally, fine-tuning smaller LLMs.
And one that is very crucial, so I'm finishing with this one, is safety. Okay, so no one wants to roll out an agent that can be easily hijacked. And the current agents, the ones I showed you in the demo, have actually been tested: if you put white-on-white text on the page that says, "Hey GPT, follow these new instructions," they will follow those new instructions, right? So we need robustness. We need agents that are robust to injections on the web page, not only in the HTML, but also in the text box, right? You don't want people to be able to write instructions there. And I think this challenge is even bigger now that we have multimodal models, because you're subject to attacks from all modalities, essentially. So it amplifies the problem.
And so this is one of Nicolas's examples from 2030, but we switched it to 2026 because I think it's not so far away. You can imagine malicious browser plugins that detect when you log into your bank's website and use JavaScript to inject instructions into the web page to hijack the agent and get it to do a wire transfer to some account, right? So this seems far-fetched, but it's quite real when you start to think about web agents. So we really need to be robust to these kinds of things.
All right, so that's it for web agents, and I'll hand back to Nicolas. Thank you.
Thank you, Alex. We're coming to the last part of this presentation, where we take a step back and look at the potential impact of these agent technologies in the broader workplace. And there is no question that AI agents are on their way, or will soon be on their way, to completely change the nature of work.
And to go back to the example we gave earlier, that was only the beginning. If you remember, Sandy was struggling to get the knowledge base permissions set so that John could see that article. And at the time, she was only able to get a little bit of help from generative AI to make her job a tiny bit faster.
So let's see how AI agents will be able to tremendously help Sandy be much, much more productive. In a world where AI agents are deployed, Sandy will not even look at the ticket right away. There will be a ticket coming her way, but before she even opens it, a set of AI agents will already have tried to partially resolve the ticket itself.
So we'll have an agent that looks at the ticket itself and tries to solve it. It's going to propose the solution as a set of steps. And the steps are going to be: "Well, maybe we can start by finding what permissions John has access to. Then we can find the right permissions for the knowledge base article and update John's permissions so he can access the KB article." And then once this plan is done, maybe we want to ask the human for confirmation.
So we have an orchestrator agent that asks Sandy for permission. Sandy looks at the plan and says, "Yeah, it looks all right to me. Let's proceed." Once Sandy gives her approval, another group of AI agents will each solve the little subtasks that need to be done. The task is then passed to a user access agent that is contemplating giving John the right access, but before doing anything destructive it asks Sandy again for confirmation: "Hey Sandy, are you okay with me giving John permission to access this knowledge base?" Sandy says, "Yes, of course." And finally, the ticket is closed.
Okay, so we see a much grander picture, with a lot more of the work being undertaken by a team of AI agents all working together to help out Sandy. In the interest of time, I will skip quickly over this beautiful chart on the state of web agents and what they can automate.
But once we take a look at the very granular ways in which agents contribute to specific workflows, we can ask ourselves, what's the potential for them to have an impact on the broader world of knowledge work? And before we do that, we need to ask ourselves, what is knowledge work? Of course, knowledge work is not a single thing. It's many, many, many different things. We have meeting scheduling and coordination, ethical hacking, language localization. We have literally thousands and thousands of tasks in the economy that are broadly categorized as knowledge work.
So how do we get a sense of the impact that AI agents are going to have on this very, very broad spectrum of occupations? Fortunately, there are tools that we can use to get started with this kind of assessment. In particular, in the US, the Department of Labor has compiled a big database called the O*NET database, which is literally a list of all the occupations in the US economy. And for each occupation, each job description, it has the set of tasks that need to be done in the daily activities of that job.
Okay, so to take one example, if we look at "Software Quality Assurance Analyst and Tester," we see the job description coming from the O*NET database (this is free, you can download it, there's no problem there), and then the set of tasks and skills the person needs. So if we want to assess the impact of AI agents, one way to go is to take a database like this and try to understand, for each task that this software quality assurance person is doing, how far it can be automated through AI agents.
Okay, this is one way to proceed. Of course, some people have thought about it before, and we have technology adoption curves. This one is from a McKinsey report from a few months ago that looks, across OECD countries in this case, at the anticipated adoption curve for the broad set of generative AI technologies, of which AI agents are a subset, across the economy, with both expected and worst-case adoption curves.
One thing to keep in mind is that when we try to analyze adoption curves like this, there are many, many factors that are gating the adoption, and technology is only one of those. The process of assessing technological maturity only reflects a partial picture of where we are going at the level of the economy.
And broadly speaking, we can think about two ways of assessing impact. There is a top-down way, which comes from a top-down analysis of a database like O*NET and asks GPT to guess, for each task of each job, "How much do you think generative AI will help automate this task?" And there is a bottom-up assessment, in which we can use a benchmark like WorkArena to track, at a very, very granular level, the real, actual automation rate of the state of technology today as it pertains to specific tasks. Okay?
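As a tiny sketch of the top-down assessment, here is what asking an LLM to rate one O*NET task might look like; the prompt wording, scale, and function names are illustrative assumptions, not the methodology of any particular report.

```python
def rate_task_automatability(call_llm, occupation: str, task: str) -> int:
    """Top-down assessment sketch: ask an LLM how automatable one O*NET task is.
    The 0-10 scale and prompt are illustrative choices only."""
    prompt = (
        f"Occupation: {occupation}\n"
        f"Task: {task}\n"
        "On a scale of 0 (not at all) to 10 (fully), how much could current "
        "generative AI agents automate this task? Reply with a single integer."
    )
    return int(call_llm(prompt).strip())

# Bottom-up, the same task can instead be mapped to concrete benchmark tasks (e.g. in WorkArena)
# and the measured success rate used as the automation estimate.
```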
So neither of those approaches is perfect. One is fuzzy, the other is incomplete, but together they can help us triangulate a little bit and get a sense of the anticipated or likely rate of technology adoption, and how fast job descriptions will change because part of them will be done by AI agents. And this kind of analysis can be used to help decision-makers and companies plan reskilling programs for employees, for example, and make sure that we navigate technology readiness in a way that benefits all people.
But all in all, this is getting to be very exciting. We can already see ways in which humans and machines are working together in new ways. There is a very influential study that came out a few months ago from Harvard Business School and BCG, the consulting firm, that looked at the ways in which people were working jointly with advanced generative AI technologies. And they found that there were two main patterns in the division of labor between human and machine.
One pattern is broadly called the centaur pattern, in which there is a strategic decomposition of labor. The human is planning the high-level task, and big chunks of work are outsourced, if you will, to AI systems. And the second way is called the cyborg, in which there is a very intimate, very granular collaboration between human and machine to solve tasks and to get work done.
And it's still very much early days. There are enormous opportunities for user experience and human-computer interaction research there, to really figure out what will be the most productive patterns for each person. Maybe it depends on the personality type, certainly on the kind of work to be done, and so on.
So finally, it's already time to conclude. We leave you with a couple of resources to dig further. We made a timeline of influential work if you want to dig further into the realm of LLM agents and benchmarks. On the top are various frameworks that were introduced prior to 2024, along with benchmarks, and then there is the work that came out this year. There has been a ton of activity in the space; this slide is for benchmarks.
And on the web agent front, we have the early research starting with the MiniWoB work in 2017 and continuing this year with a ton of new agentic frameworks for web agents, as well as the new benchmarks that Alex described.
So with this, it's time to ask our lecture agent to create the slides for our presentation next year. And it's also time to thank our amazing colleagues at ServiceNow who made all of what you saw today possible. QR codes for everything we talked about (TapeAgents, WorkArena, BrowserGym, AgentLab) are there. If you're interested, we take pull requests. All of that is open source for you to study and learn from.