Context Engineering for AI Agents with LangChain and Manus
Disclaimer: The transcript on this page is for the YouTube video titled "Context Engineering for AI Agents with LangChain and Manus" from "LangChain". All rights to the original content belong to their respective owners. This transcript is provided for educational, research, and informational purposes only. This website is not affiliated with or endorsed by the original content creators or platforms.
Watch the original video here: https://www.youtube.com/watch?v=6_BcCthVvb8
All right. Well, thank you all for coming. We'll go ahead and kick off the webinar now and I'm sure people will continue to stream in. I'm Lance, one of the founding engineers at LangChain, and I'm joined by Pete from Manus. Pete, do you want to introduce yourself quickly?
Yeah. Hey guys, I'm the co-founder and chief scientist of Manus. So basically, I designed the agent framework and a lot of things in Manus, and I'm super excited to be here today. Thanks, Lance, for having me.
Yeah, we're really excited to do this because first, Manus is a really cool product—I've been using it for a long time—but also they put out a really nice blog post on context engineering a few months ago that influenced me a lot. So, I want to give a quick overview of context engineering as I see it, and I'll reference their piece. Then Pete's actually going to give a presentation talking about some new ideas not covered in the piece. So if you've already read it, this will cover some things that are new which hopefully will be quite interesting for you. I'll kind of set the stage, hand it over to Pete, and then we'll do some Q&A.
So you might have heard this term "context engineering" and it kind of emerged earlier this year. If you look through time with Google search trends, prompt engineering was kind of initiated following ChatGPT. That's showing December 2022. When we got this new thing, a chat model, there became a great deal of interest in: how do we prompt these things? Prompt engineering kind of emerged as a discipline for working with chat models and prompting them.
Now context engineering emerged this year around May. We saw it really rising in Google trends and it corresponds a bit with this idea of the "Year of Agents." Why is that? One of the things that people have observed if you've been building agents is that context grows, and it grows in a very particular way when you build an agent. What I mean is we have an LLM bound to some number of tools. That LLM can call tools autonomously in a loop. The challenge is, for every tool called, you get a tool observation back and that's appended to this chat list. These messages grow over time and so you can kind of get this unbounded explosion of messages as agents run.
As an example, Manus talked about in their piece that typical tasks require around 50 tool calls. Anthropic has mentioned similarly that production agents can engage in conversations spanning hundreds of turns. So the challenge is that agents, because they are increasingly long-running and autonomous and they utilize tools freely, can accumulate a large amount of context through this accumulation of tool calls. Google put out a really nice report talking about context rot. The observation simply is that performance drops as context grows. So this paradox—this challenging situation—is that agents utilize lots of context because of tool calling, but we know that performance drops as context grows.
So this is a challenge that many of us have faced, and it kind of spearheaded, or I think seeded, this term of context engineering. Karpathy, of course, kind of coined it on Twitter earlier this year. You can think about context engineering as the delicate art and science of filling the context window with just the right information needed for the next step. So trying to combat this context explosion that happens when you build agents and they call tools freely. All those tool messages accumulate in your messages queue. How do we cull such that the right information is presented to the agent to make the correct next decision at all points in time?
To address this, there are a few common themes I want to highlight that we've seen across a number of different pieces of work, including Manus, which I'll mention here.
Idea one is context offloading. So we've seen this trend over and over. The central idea is you don't need all context to live in this messages history of your agent. You can take information and offload it, send it somewhere else, so it's outside the context window, but it can be retrieved, which we'll talk about later.
So, one of the most popular ideas here is just using a file system. Take the output of a tool message as an example, dump it to the file system, and send back to your agent just some minimal piece of information necessary so it can reference the full context if it needs to. But that full payload—for example, a web search result that's very token-heavy—isn't spammed into your context window for perpetuity.
So you've seen this across a number of different projects. Manus uses this. We have a project called Deep Agents that utilizes the file system. Open Deep Research utilizes actual agent state which has a similar role to an external file system. Claude Code, of course, uses this very extensively. Long-running agents utilize it very extensively. So this idea of offloading context to a file system is very common and popular across many different examples of production agents that we're seeing today.
The second idea is reducing context. So offloading is very simply taking some piece of information, like a tool message that's token-heavy, and not sending it all back to your messages list—dumping it to a file system where it can be retrieved only as needed. That's offloading. Reducing the context is similar, but instead, you're just summarizing or compressing information.
Summarizing tool call outputs is one intuitive way to do this. So we do this with Open Deep Research as an example. Pruning tool calls or tool messages is another. One thing that's very interesting is Claude has actually added this; if you look at some of their most recent releases, they now support this out of the box. So this idea of pruning old tool calls with tool outputs or tool messages is something that Claude has now kind of built into their SDK.
Summarizing or compacting full message history—you see this with Claude Code in its compaction feature once you hit a certain percentage of your overall context window. Cognition also talks about this idea of summarizing or pruning at agent-to-agent handoffs. So this idea of reducing context is a very popular theme we see across a lot of different examples, from Claude Code to our Open Deep Research, Cognition, and Claude 3.5 has incorporated this as well.
Retrieving context—now this is one of the classic debates today that you might see raging on X or Twitter: the right approach for retrieving context. Lee Robinson from Cursor just had a very nice talk at OpenAI Demo Day talking about how Cursor, for example, uses indexing and semantic search as well as more simple file-based search tools like glob and grep. Claude Code force-only uses the file system and simple search tools, notably glob and grep. So there are different ways to retrieve context on demand for your agent. Indexing via semantic search versus file system and simple file search tools—both can be highly effective. There are pros and cons we could talk about in the Q&A, but of course, context retrieval is central for building effective agents.
Context isolation is the other major theme we've seen quite a bit of, in particular splitting context across multi-agents. So what's the point here? Each sub-agent has its own context window and sub-agents allow for separation of concerns. Manus talks about this. Our Deep Agents work uses this. Open Deep Research uses it. Claude sub-agents are utilized in their researcher and also Claude Ghost supports sub-agents. So sub-agents are a very common way to perform context isolation we've seen across many different projects.
Now one thing I thought was very interesting is caching context, and Manus talks about this quite a bit. I'll let Pete speak to this a bit later but I think it's a very interesting trick as well.
So I'll just show a brief example that we've seen across Open Deep Research. This is a very popular repo that we have. It's basically an open-source Deep Research implementation and it performs on par with some of the best implementations out there. You can check our repo, and we have results from Deep Research Bench showing that we're top 10. It has three phases: scoping of the research, the research phase itself using a multi-agent architecture, and then a final one-shot writing phase.
We use offloading. So we basically create a brief to scope our research plan. We offload that. So we don't just save that in the context window because that context window is going to get peppered with other things. We offload it, so it's saved independently. It can be accessed, in our case from the LangGraph state, but it could also be from a file system; it's the same idea. So you create a research plan, you offload it, it's always accessible. You go do a bunch of work, and you can pull that back in on demand—so you can put it kind of at the end of your message list so it's accessible and readily available to your agent to perform, for example, the writing phase.
We use offloading, as you can see, to help steer the research and writing phases. We use reduction to summarize observation from token-heavy surf tool calls. That's done inside research itself. And we use context isolation across sub-agents within research itself.
And this is kind of a summary of a bunch of different of these various ideas across a bunch of different projects. And actually, Pete is going to speak to Manus in particular and some of the lessons they've learned. This just kind of sets up the stage. This summarizes what I talked about—these different themes of offloading, reducing context, retrieving context, isolating, caching, and a number of popular projects and kind of where they're used. I will share these slides to the notes. And I do want to let Pete go ahead and present now because I want to make sure we have plenty of time for him and for questions. But this just sets the stage. And Pete, I'll let you take it from here. I'll stop sharing.
Okay. Can you see my slides?
Yeah.
Okay. Perfect. Thank you, Lance. I'm super excited to be here today to share some fresh lessons on context engineering that we learned from building Manus. Here I say "fresh lessons" because I realized that the last blog post that you mentioned I wrote about context engineering was back in July. And yeah, it's the Year of the Agent, so July is basically the last century. And of course, before this session, I went back and read it again, and luckily I think most of what I wrote in that blog still holds up today. But I just don't want to waste everybody's time by just repeating what's already inside that blog. So today I think instead I want to dig into some areas that I either didn't go deep enough on before or didn't touch at all. So actually, we'll be focusing on the "Discourage" column in Lance's earlier slides because I think exploring those non-consensus ideas often leads to the biggest inspirations.
Yeah. So here's the topic for today's talk. First, we'll cover a bit about the bigger question of why we need context engineering, and then we'll have more on context reduction, more on context isolation, and finally some new stuff about context offloading which we are testing internally here at Manus. Everything I'm sharing today is in production in Manus; it's battle-tested. But I don't know how long it will last because, you know, things are changing super fast.
Okay, let's start with the first big question: why do we even need context engineering, especially when fine-tuning or post-training models has become much more accessible today? For example, folks at the Thinking Machine team just released the Tinker API, which I like a lot. But for me, the question "why context engineering" actually came through several painful stages of realization.
Before starting Manus, I've already spent over 10 years in natural language processing (NLP), which is basically what we call building language models. But before ChatGPT—and Manus is actually my second or third company—at my previous startup, we trained our own language model from scratch to do open domain information extraction and building knowledge graph and semantic search engines on top of them. And it was painful. Our product's innovation speed was completely capped by the model's iteration speed. Even back then, the models were much smaller compared to today, but still, a single training plus evaluation cycle could take maybe one or two weeks. The worst part is that at that time we hadn't reached PMF (Product Market Fit) yet and we were spending all that time improving benchmarks that might not even matter for the product. So I think instead of building specialized models too early, startups really should lean on general models and context engineering for as long as possible.
Well, of course, I guess now that's some kind of common wisdom. But as your product matures and open-source base models get stronger, I know it's very tempting to think, "Hey, maybe I should just pick a strong base model, fine-tune it with my data, and make it really good at my use case." We've tried that too. And guess what? It's another trap. To make RL work really well, you usually fix an action space, design a reward around your current product behavior, and generate tons of on-policy rollouts and feedback. But this is also dangerous because we're still in the early days of AI and agents. Everything can shift under our feet overnight.
For us, the classic example was the launch of MCP (Model Context Protocol). Actually, it completely changed the design of Manus from a compact static action space to something infinitely extensible. And if you have ever trained your own model, you know that this kind of open domain problem is super hard to optimize. Well, of course, you could pour massive effort into post-training that ensures generalization, but then aren't you basically trying to become an LLM company yourself? Because you're basically rebuilding the same layer that they have already built. And that's a duplication of effort. So maybe after all that buildup, here's my point: Be firm about where you draw the line. Right now, context engineering is the clearest and most practical boundary between application and model. So trust your choice.
All right, enough philosophy and let's talk about some real tech. First topic: context reduction. Here I want to clarify two different kinds of compaction operations because we think context reduction is fascinating but it's also a new concept. There's a lot of ways to do this and here in Manus, we divide them into compaction and summarization.
For compaction in Manus, every tool call and tool result actually has two different formats: a full format and a compact one. The compact version strips out any information that can be reconstructed from the file system or external state. For example here, let's say you have a tool that writes to a file and it probably has two fields: a path and a content field. But once the tool returns, you can ensure that the file already exists in the environment. So in the compact format, we can safely drop the super long content field and just keep the path. And if your agent is smart enough, whenever it needs to read that file again, it can simply retrieve it via the path. So no information is truly lost; it's just externalized. We think this kind of reversibility is crucial because agents do chain predictions based on previous actions and observations, and you never know which past action will suddenly become super important 10 steps later. You cannot predict it. So this is a reversible reduction by using compaction.
Of course, compaction only takes you so far. Eventually, your context will still grow and will hit the ceiling. And that's when we combine compaction with the more traditional summarization, but we do it very carefully. For example here, before summarizing, we might offload key parts of the context into files. And sometimes we even do more aggressively—we can dump the entire pre-summary context as a text file or simply a log file into the file system so that we can always recover it later. Like Lance just mentioned, some people just use glob and grep. glob also works for log files. So if the model is smart enough, it even knows how to retrieve those pre-summarized contexts.
The difference here is that compaction is reversible but summarization isn't. Both reduce context lengths but they behave very differently. To make both methods coexist, we have to track some context length thresholds. At the top, you'll have your model's hard context limit, say 1 million tokens—pretty common today. But in reality, most models start degrading much earlier, typically maybe around 200k, and you'll begin to see what we call "context rot"—repetitions, slower inferences, degraded quality. So by doing a lot of evaluation, it's very important for you to identify that pre-rot threshold. It's typically 128K to 200K, and use it as the trigger for context reduction.
Whenever your context size approaches it, you have to trigger context reduction, but starting from compaction, not summarization. And compaction doesn't mean compressing the entire history. We might compact the oldest 50% of tool calls while keeping the newer ones in full detail so the model still has fresh few-shot examples of how to use tools properly. Otherwise, in the worst case, the model will imitate the behavior and output those compact formats with missing fields, and that's totally wrong.
After compaction, we have to check how much free context that we actually gain from this compaction operation. Sometimes after multiple rounds of compaction, the gain is tiny because even if it's compact, it still uses context. And that's when we go for summarization. But also keep in mind that when summarizing, we always use the full version of the data, not the compact one. And we still keep the last few tool calls and tool results in full detail, not summary, because it allows the model to know where it left off and continue more smoothly. Otherwise, you'll see after summarization sometimes the model will change its style, change its tone. We find out keeping a few tool call/tool result examples really helps.
Okay, now we've covered reduction. Let's talk about isolation. I really agree with Cognition's blog where they warn against using multi-agent setups because when you have multiple agents, syncing information between them becomes a nightmare. But this isn't a new problem. Multiprocess or multi-thread coordination has been a classic challenge in the early days of computer programming. And I think we could borrow some wisdom here.
I don't know how many Golang coders are here today, but in the Go programming language community, there's a famous quote from this gopher: "Do not communicate by sharing memory; instead, share memory by communicating." Of course, this isn't directly about agents and it's sometimes even wrong for agents, but I think the important thing is it highlights two distinct patterns here: by communicating or by sharing memory. If we translate the term "memory" here into context, we can see that parallel pretty clearly.
"By communicating" is the easier one to understand because it is the classic sub-agent setup here. For example, the main agent writes a prompt and the prompt is sent to a sub-agent, and the sub-agent's entire context only consists of that instruction. We think if a task has a short, clear instruction and only the final output matters—say searching a codebase for a specific snippet—then just use the communication pattern and keep it simple. Because the main agent doesn't care how the sub-agent finds the code, it only needs the result. And this is what Claude Code does typically, using its task tool to delegate a separated clear task to some sub-agents.
But for more complex scenarios, "by sharing memory" means that the sub-agent can see the entire previous context—all the tool usage history—but the sub-agent has its own system prompt and its own action space. For example, imagine a Deep Research scenario: the final report depends on a lot of intermediate searches and notes. In that case, you should consider using the shared memory pattern, or in our language, "by sharing context." Even if you can save all those notes and searches into a file and make the sub-agent read everything again, you're just wasting latency and context. And if you count the amount of tokens, maybe you're using even more tokens to do this. So we think for those scenarios that require a full history, just use a shared memory pattern. But be aware that sharing context is kind of expensive because each sub-agent has a larger input to prefill, which means you'll spend more on input tokens. And since the system prompt and the action space differs, you cannot reuse the KV cache, so you have to pay the full price.
Finally, let's talk a little bit about context offloading. When people say offload, they usually mean moving parts of the working context into external files. But as your system grows, especially if you decide to integrate MCP, one day you realize that the tools themselves can also take up a lot of context, and having too many tools in context leads to confusion. We call it "context confusion," and the model might call the wrong ones or even non-existing ones. So we have to find a way to also offload the tools.
A common approach right now is doing dynamic RAG (Retrieval-Augmented Generation) on tool descriptions. For example, loading tools on demand based on the current task or the current status. But that also causes two issues. First, since tool definitions sit at the front of the context, your KV resets every time. And most importantly, the model's past calls to removed tools are still in the context, so it might fool the model into calling invalid tools or using invalid parameters.
So to address this, we're experimenting with a new layered action space in Manus. Essentially, we can let Manus choose from three different levels of abstractions: Number one, function calling; Number two, sandbox utilities; and Number three, packages and APIs. We go deeper into these three layers of layered action space.
Let's start from level one: function calling. And this is a classic, everyone knows it. It is schema-safe thanks to constrained decoding. But we all know the downsides. For example, breaking the cache and maybe too many tools causing confusion. So in Manus right now, we only use a fixed number of atomic functions. For example, reading and writing files, executing shell commands, searching files in the internet, and maybe some browser operations. We think these atomic functions have super clear boundaries and they can work together to compose much more complex workflows. Then we offload everything else to the next layer, which is the sandbox utilities.
As you know, each Manus session runs inside a full virtual machine sandbox. It's running on our own customized Linux system, and that means Manus can use the shell commands to run pre-installed utilities that we develop for Manus. For example, we have some format converters, we have speech recognition utilities, and even a very special one—we call it the Manus MCP CLI—which is how we call MCP. We do not inject MCP tools into the function calling space. Instead, we do everything inside that sandbox through the command line interface. Utilities are great because you can add new capabilities without touching the model's function calling space. It's just some commands pre-installed in your computer, and if you're familiar with Linux, you always know how to find those new commands and you can even run --help to figure out how to use a new tool.
Another good thing is for larger outputs, they can just write to files or return the result in pages, and you can use all those Linux tools like grep, cat, less, more to process that result on the fly. The trade-off here is it's super good for large outputs but it's not that good for low latency back-and-forth interactions with the front end, because you always have to visualize the interactions of your agent and show it to the user. So this is pretty tricky here, but we think it already offloads a lot of things.
And then we have another layer, the final layer, we call it Packages and APIs. Here Manus can write Python scripts to call pre-authorized API or custom packages. For example, Manus might use a 3D designing library for modeling or call a financial API to fetch market data. And here actually, we've purchased all these APIs on behalf of a user and pay the money for them; it's included in the subscription. So basically we have a lot of API keys pre-installed in Manus and Manus can access these APIs using the keys. I think these are perfect for tasks that require lots of computation in memory but do not need to push all that data into the model context.
For example, imagine if you're analyzing a stock's entire year of price data. You don't feed the model all the numbers. Instead, you should let the script compute it and only put the summary back into the context. And you know, since code and APIs are super composable, you can actually chain a lot of things in one step. For example, in a typical API, you can do "get city names," "get city ID," "get weather" all in one Python script. There's also a paper from one of my friends called CodeAct. A lot of people were discussing it. I think it's the same idea because code is composable and it can do a lot of things in one step, but also it's not schema-safe. It's very hard to do constrained decoding on CodeAct. So we think you should find the right scenario for these features. For us, everything that can be handled inside a compiler or interpreter runtime, we do that using code; otherwise, we use sandbox utilities or function calls. And the good thing is, from the model's point of view, all three levels still go through the standard function calls. So the interface stays simple, cache friendly, and orthogonal across functions.
So let's zoom out and connect the five dimensions: offload, reduce, retrieve, isolate, and cache. You can find out that they are not independent. We can see that offload and retrieve enables more efficient reduction, and stable retrieve makes isolation safe. But isolation also slows down context and reduces the frequency of reduction. However, more isolation and reduction also affects cache efficiency and the quality of output. So at the end of the day, I think context engineering is the science and art that requires a perfect balance between multiple potentially conflicting objectives. It's really hard.
All right. Before we wrap up, I want to leave you with maybe one final thought, and it's kind of the opposite of everything I just said: Please avoid context over-engineering. Looking back at the past six or seven months since Manus launched, actually the biggest leap we've ever seen didn't come from adding more fancy context management layers or clever retrieval hacks. They all came from simplifying, or from removing unnecessary tricks and trusting the model a little more. Every time we simplify the architecture, the system got faster, more stable, and smarter. We think the goal of context engineering is to make the model's job simpler but not harder. So if you take one thing from today, I think it should be: Build less and understand more. Thank you so much everyone and thanks again to Lance and the LangChain team for having me. Can't wait to see what you guys all build next. Now back to Lance.
Yeah, amazing. Thank you for that. So we have a nice set of questions here. Maybe we can just start hitting them and we can kind of reference back to the slides if needed. And Pete, are your slides available to everyone?
Oh yeah. Yeah, I can share the PDF version afterwards.
Yes, sounds good. Yeah. Well, why don't I start looking through some of the questions and maybe we can start with the more recent ones first. So how does the LLM call the various shell tools? How does it know which tools exist and how to invoke them? Maybe you can explain a little bit about the multi-tier sandboxing setup that you use with Manus.
Yeah. I think imagine you're the person using a new computer. For example, if you know Linux, you can imagine all the tools are located in /usr/bin. So actually we do two things. First of all, we have a hint in the system prompt telling Manus that hey, there's a lot of pre-installed command line utilities located in some specific folder. And also, for the most frequently used ones, we already injected them in the system prompt, but it's super compact. We do not tell the agent how to use the tools. We only list them and we can tell the agent that you can use the --help flag safely because all the utilities are developed by our team and they have the same format.
Got it. How about, I know you talked a lot about using the file system. What's your take on using indexing? And do you utilize like... do you spin up vector stores on the fly if the context you're working with gets sufficiently large? How do you approach that?
Yeah, I think there's no right and wrong in this space like you've mentioned. But at Manus, we do not use index databases because right now, you know, every sandbox in a Manus session is a new one and users want to interact with things fast. So actually we don't have the time to build the index on the fly. So we're more like Claude Code; we rely on grep and glob. But I think if you consider building something like more long-term memory or if you want to integrate some enterprise knowledge base, you still have to rely on that external vector index because it's about the amount of information that you can access. But for Manus, it operates in a sandbox and for coding agents you operate in the codebase. So it depends on the scale.
Yeah. So that's a good follow-up then. So let's say I'm a user. I have my Manus account. I interact with Manus across many sessions. Do you have the notion of memory? So Claude has .claudemd files; they persist across all the different sessions of Claude Code. How about you guys? How do you handle kind of long-term memory?
Yeah. Actually in Manus we have a concept called "Knowledge" which is kind of like explicit memory. For example, every time you can tell Manus, "Hey, remember, every time I ask for something, deliver it in maybe in Excel," and it's not automatically inserted into some memory. It will pop up a dialogue and say, "Here's what I learned from our previous conversation, and would you like to accept it or reject it?" So this is the explicit one. It requires user confirmation.
But also, we are discovering new ways to do it more automatically. For example, a pretty interesting thing in agents is that compared to chat bots, users often correct the agent more often. For example, a common mistake that Manus makes is when doing data visualization. You know, if you're using Chinese, Japanese or Korean, a lot of time there will be some font issues and there will be errors in those rendered visualizations. So the user will often say, "Hey you should use Noto CJK font." And for these kind of things, a different user will have the same correction. We need to maybe find out a way to leverage these kind of collective feedback and use it. That's kind of like what we call a self-improving agent with online learning, but in a parameter-free way.
Yeah. How about a different question that was raised here and also I think about quite a bit. You mentioned towards the end of your talk that you gained a lot from removing things, and a lot of that is probably because of the fact that also the models are getting better. So model capabilities are increasing and so you can kind of remove scaffolding over time. How do you think about this? Because this is one of the biggest challenges that I've faced is like, over time the model gets better and I can remove things like certain parts of my scaffolding. So you're building on top of this foundation that's like the water's rising. Do you revisit your architecture every some number of months with new releases and just delete as the models get better? And how do you approach that problem?
Yeah, this is a super good question here because you know, actually we have already refactored Manus five times. And we've launched Manus in March and now it's October—already five times. So we think you cannot stop because models are not only improving but they are changing. Models' behavior is changing over time. One way is you can work closely with those model providers, but we also have another internal theory for how we evaluate or how we design our agent architecture. I covered a little bit on Twitter before. It's basically like, we do not care about the performance of a static benchmark. Instead, we fix the AI agent architecture and we switch between models. If your architecture can gain a lot from switching from a weaker model to a stronger model, then somehow your architecture is more future-proof because the weaker model tomorrow might be as good as a stronger model today. Yeah. So we think switching between weaker and stronger models can give you some early signals of what will happen next year and give you some time to prepare your architecture. So for Manus, we often do this kind of review every one or two months and we often do some research internally using open-source models and maybe early access to proprietary models to prepare the next release, even before the launch of the next model.
Yeah. It's a good observation. You can actually do testing of your architecture by toggling different models that exist today. Yeah, that makes a lot of sense. What about best practices or considerations for formats for storing data? So like markdown files, plain text, log... anything you prefer in particular? How do you think about that kind of file formats?
Yeah. I think it's not about plain text or markdown, but we always prioritize line-based formats because it allows the models to use grep or read from a range of lines. And also markdown can sometimes cause some troubles. You know, models are trained to use markdown really well and sometimes it will... maybe for some model, I don't want to say that name, but they often output too many bullet points if you use markdown too often. So actually we want to use more plain text.
Yeah, makes sense. How about on the topic of compaction versus summarization? Let's hit on summarization. This is an interesting one that I've been asked a lot before. How do you prompt to produce good summaries? So, for example, summarization, like you said, it's irreversible. So if you don't prompt it properly, you can actually lose information. The best answer I came up with is just tuning your prompt for high recall. But how do you approach this? So summarization, how do you think about prompting for summarization?
Yeah, actually we tried a lot of optimizing the prompt for summarization. But it turns out a simple approach works really well: you do not use a free-form prompt to let the AI generate everything. Instead, you could define a kind of a schema. It's just a form. There are a lot of fields and let the AI fill them. For example, "Here are the files that I've modified," "Here's the goal of the user," "Here's where I left off." And if you use this kind of more structured schema, at least the output is kind of stable and you can iterate on this. So just do not use free-form summarizations.
Got it. Yeah, that's a great observation. So use structured outputs rather than free-form summarization to enforce certain things are always summarized. Yeah, that makes a lot of sense. How about with compaction then? And actually, I want to make sure I understood that. So with compaction, let's say it's like a search tool. You have the raw search tool output and would that be your raw message, and then the compaction would just be like a file name or something? Is that right?
Yeah, it is. It's not only about the tool call. It also applies to the result of the tool. Interestingly, we find out that almost every action in Manus is kind of reversible if you can offload it to the file system or an external state. And for most of these tasks, you already have a unique identifier for it. For example, for file operations, of course, you have the file path; for browser operations, you have the URL; and even for search actions, you have the query. So it's naturally already there.
Yeah. Okay. That's a great one and I just want to hit that again because I've had this problem a lot. So, for example, I'm an agent that uses search. I perform a tool call, it returns a token-heavy tool call. I don't want to return that whole tool message to the agent. I've done things like some kind of summarization or compaction and sent the summary back. But how do you approach that? Because you might want all that information to be accessible for the agent for his next decision. But you don't want that huge context block to live inside your message history.
So how do you approach that? You could send the whole message back but then remove it later. That's what Claude does now. You could do a summarization first and send the summary over. You could send everything and then do compaction so that later on you don't have the whole context in your message history. You only have like a link to the file. How do you think about that specifically if you see what I'm saying?
Yeah, I know. Actually, it depends on the scenario. For example, for complex search—I mean for complex search, it's not just one query. For example, you have multiple queries and you want to gather some important things and drop everything else. In this case, I think we should use sub-agents or internally we call it "agent as tool." So from the model's perspective, it's still a kind of function, maybe called "Advanced Search." It's a function call "Advanced Search." But what it triggers is actually another sub-agent. But that sub-agent is more like a workflow or agentic workflow that has a fixed output schema and that is the result that returns to the agent.
But for other kinds of more simpler search, for example just searching Google, we just use the full detail format and append it into the context and rely on the compaction thing. But also we always instruct the model to write down the intermediate insights or key findings into files in case that the compaction happens earlier than the model expected. And if you do this really well, actually you don't lose a lot of information by compaction because sometimes those old tool calls are irrelevant after time.
Yeah, that makes sense. Um and I like the idea of "agent as tool." We do that quite a bit and that makes that is highly effective. But that brings up another interesting point about agent-agent communication. How do you address that? So Walden Yan from Cognition had a very nice blog post talking about this as like a major problem that they have with Devin—communication between agents. How do you think about that problem and ensuring sufficient information is transferred but not overloading, like you said, the prefill of the sub-agent with too much context? So how do you think about that?
Yeah. You know, at Manus we've launched a feature called Wide Research a month ago. Internally we call it "Agentic MapReduce" because we got inspired from the design of MapReduce. And it's kind of special for Manus because there's a full virtual machine behind the session. So one way we pass information or pass context from the main agent to sub-agent is by sharing the same sandbox, so the file system is there and you can only pass different paths here.
I think sending information to a sub-agent is not that hard. The more complex thing is about how to have the correct output from different agents. And what we did here is we have a trick: for every time if the main agent wants to spawn up a new sub-agent or maybe 10 sub-agents, you have to let the main agent define the output schema. And in the sub-agent perspective, you have a special tool called "Submit Result." And we use constrained decoding to ensure that what the sub-agent submits back to the main agent is the schema that is defined by the main agent. Yeah. So you can imagine that this kind of map-reduce operation... it will generate a kind of spreadsheet and the spreadsheet is constrained by the schema.
That's an interesting theme that seems to come up a lot with how you design Manus. You use schemas and structured outputs both for summarization and for this agent-agent communication. So it's kind of like using schemas as contracts between agent/sub-agent or between a tool and your agent to ensure that sufficient information is passed in a structured way, in a complete way. Like when you're doing summarization you use a schema as well.
Yeah.
Okay fantastic. This is very helpful. I'm poking around some other interesting questions here. Any thoughts on models like... I think you guys use Anthropic but do you work with open models? Do you do fine-tuning? You talked a lot about kind of working with KV cache, so for that maybe using open models. How do you think about model choice?
Yeah, actually right now we don't use any open-source model right now because I think it's not about quality, it's interestingly about cost. You know, we often think that open-source models can lower the cost, but if you're at the scale of Manus and if you're building a real agent where the input is way longer than the output, then KV cache is super important. And distributed KV cache is very hard to implement if you use open-source solutions. And if you use those frontier LLM providers, they have more solid infrastructure for distributed cache globally. So sometimes if you do the math, at least for Manus we find out that using these flagship models can sometimes be even cheaper than using open-source models.
Right now we're not only using Anthropic—of course Anthropic's model is the best choice for agentic tasks—but we're also seeing the progress in Gemini and in OpenAI's new model. I think right now these frontier labs are not converging in directions. For example, if you're doing coding, of course you should use Claude; and if you want to do more multimodality things you should use Gemini; and OpenAI models are super good at complex math and reasoning. So I think for application companies like us, one of our advantages is that we do not have to build on top of only one model. You can do some task-level routing or maybe even subtask or step-level routing if you can pull in that kind of KV cache validation. So I think it's an advantage for us and we do a lot of evaluations internally to know which models to use for which subtask.
Yeah. Yeah, that makes a lot of sense. I want to clarify one little thing. So with KV cache, so what specific features from the providers are you using for cache management? So okay, I know like Anthropic has input caching as an example. Yeah, that that's what you mean. Okay, got it.
Cool. I'm just looking through some of the other questions. Yeah, tool selection is a good one. Right. So, you were talking about this. You don't use like indexing of tool descriptions and fetching tools on the fly based on semantic similarity. How do you handle that? Like what's the threshold for too many tools? Yeah, tool choice is a classic. How do you think about that?
Yeah. First of all, it depends on the model. Different models have different capacity for tools. But I think a rule of thumb is try not to include more than 30 tools. It's just a random number in my mind. But actually, I think like if you're building a general AI agent like Manus, you want to make sure those native functions are super atomic. So actually there are not that many atomic functions that we need to put inside the action space. So for Manus, right now we only have like 10 or 20 atomic functions and everything else is in the sandbox. Yeah. So we don't have to pull things dynamically.
Yeah good point actually. Let's explain that a little bit more. So you have let's say 10 tools that can be called directly by the agent. But then I guess it's like you said, the agent can also choose to for example write a script and then execute a script. So that expands its action space hugely without giving it like... you don't have an independent tool for each possible script. Of course that's insane. So a very general tool to like write a script and then run it does a lot. Is that what you mean?
Yeah. Yeah. Exactly. Because you know why we are super confident to call Manus a "general agent"? Because it runs on a computer and computers are Turing complete. The computer is the best invention of humans. Theoretically, an agent can do anything that maybe a junior intern can do using a computer. So with the shell tool and the text editor, we think it's already complete. So you can offload a lot of things to the sandbox.
Yeah. Okay, that makes a lot of sense, right? And then how does Manus... so is are all... so okay, maybe I'll back up. You mentioned code with code agents. My understanding is the model will actually always produce a script and that'll then be run inside a code sandbox. So every tool call is effectively like a script is generated and run. It sounds like you do some hybrid where sometimes Manus can just call tools directly but other times it can actually choose to do something in the sandbox. Is that right? So it's kind of a hybrid approach.
Yeah. I think this is super important because actually we tried to use CodeAct entirely for Manus, but the problem is if you're using code, you cannot leverage constraint decoding and things can go wrong. But you know, CodeAct has some special use cases as I mentioned earlier in slides. For example, processing a large amount of data. You don't have to put everything in the tool result; instead you put it inside the runtime memory of Python and you only get the result back to the model. So we think you should do it in a hybrid way.
Got it. Allow for tool calling and you have some number of tools, maybe 10 or something, that just get called directly, and some number of tools that actually run in the sandbox itself. Perfect. That makes a ton of sense. Very interesting.
Um and then maybe... how do you keep a reference of all the previously gen... I guess you have so you basically will generate a bunch of files. Oh actually sorry maybe I'll talk about something else. How about planning? Tell me about planning and and I know Manus has this "to-do" tool where it generates a to-do list and start of tasks. Yeah, tell me about that.
Yeah, I think this is very interesting because at the beginning Manus uses that todo.md paradigm. It's kind of... I don't want to use the word "stupid," but actually it wastes a lot of turns. You know, back in maybe March or April, if you check the log of some Manus task, maybe one third of the action is about updating the to-do list. It wastes a lot of tokens. Yeah. So right now we are using a more structuralized planning. For example, if you use Manus, there's a planner at the bottom of the system. Internally, it's also kind of a tool call—we implemented using the "agent as tool" paradigm so that there's a separate agent that is managing the plan. So actually right now the latest version of Manus, we are no longer using that todo.md thing. Of course todo.md still works and it can generate good results, but if you want to save tokens you can find another way.
Got it. Yeah. So you have like a planner agent and it's more like for a subtask it'll be more like "agent as tool call" type things.
Yeah. Got it. And you know it is very important to have a separate agent that has a different perspective so it can do some external reviews. And you can use different models for planning. For example, sometimes O1 (OpenAI o1) can generate some very interesting insights.
Yeah. Well that's a great one actually. So think about multi-agent then, and so like how do you think about that? So you might have like a planning agent with its own context window, makes a plan, produces like some kind of plan object, maybe it's a file or maybe it just calls sub-agents directly. How do you think about that? Like and how many different sub-agents do you typically recommend using?
Yeah, I think this is also depends on your design, but here at Manus, actually Manus is not kind of like the typical multi-agent system. For example, we've seen a lot of different agents that divide by role. For example, you have a "designer agent," "programming agent," "manager agent." We don't do that because we think why we have this is because this is how human companies work and this is due to the limitation of human context. So in Manus, Manus is a multi-agent system but we do not divide by role. We only have very few agents. For example, we have a huge general executor agent and a planner agent and a knowledge management agent and maybe some data API registration agent. Yeah. So we are very cautious about adding more sub-agents because of the reason that we've mentioned before: communication is very hard. And we implement more kinds of sub-agents as "agent as tools" as we mentioned before.
Yeah, that's a great point. I see this mistake a lot, or I don't know if it's a mistake, but you see anthropomorphizing agents a lot—like "it's my designer agent"—and I think it's kind of a forced analogy to think about like a human org chart in your sub-agents. So got it. So for you it's like a planner and knowledge manager. A knowledge manager might do what? Like what will be the task of knowledge manager?
Yeah, it's even more simple as we mentioned like we have a knowledge system in Manus. What the knowledge agent does is that it reviews the conversation between the user and the agent and figures out what should be saved in the long-term memory. So it's that simple.
Got it. Yeah. Okay. It's like a memory manager, planner, and then you have sub-agents that could just take on like a general executor sub-agent that could just call all the tools or actions in the sandbox. That makes sense. Keep it simple. I like that a lot. That makes a lot of sense.
How about guardrailing? Someone asked a question about kind of safety and guardrailing. How do you think about this? I guess that's the nice thing about a sandbox, but tell me a little bit about that. How you think about it?
Yeah, I think this is a very sensitive question because like you know, if you have a sandbox that's connected to the internet, everything is dangerous. Yeah. So we have put a lot of effort in guardrailing. At least we do not let the information get out of the sandbox. For example, if you got prompt injected, we have some checks on outgoing traffic. For example, we'll ensure that no token things will go out of the sandbox. And if the user wants to print something out of the sandbox, we have those kind of removing things to ensure that no information goes out of the sandbox.
But you know, for another kind of thing is that we have a browser inside of Manus and the browser is very complicated. For example, if you log into your websites, you can choose to let Manus persist your login state and this turns out to be very tricky because sometimes the content of the web page can also be malicious. Maybe they're doing prompt injection and this I think is somehow out of scope for application companies. So we're working very closely with those computer use model providers, for example Anthropic and Google. They're adding a lot of guardrails here. So right now in Manus, every time you do some sensitive operations whether inside the browser or in the sandbox, Manus will require a manual confirmation and you must accept it or otherwise you have to take over it to finish it yourself. So I think it's pretty hard for us to design a well-designed solution but it's a progressive approach. So right now we're letting the user take over more frequently, but if the guardrail itself in the model gets better, we can do less.
Yeah. How about the topic of evals? This has been discussed a lot quite a bit online if you probably seen you know Claude Code. They talked a lot about just doing less formal evals at least for code because code evals are more or less saturated; lots of internal dog fooding. How do you think about evals? Are they useful? What evals are actually useful? What's your approach?
Yeah. You know at the beginning at the launch of Manus we were using public academic benchmarks like GAIA, but then after launching to the public we find out that it's super misaligned. You know models that get high scores on GAIA, the user don't like it. So right now we use three different kinds of evaluations. First of all, most importantly is that for every completed session in Manus, we'll request the user to give a feedback—to give one to five stars. This is the gold standard; we always care about the average user rating. This is number one.
And number two, we're still using some internal automated tests with verifiable results. For example, we have created our own data set with clear answers. But also we still use a lot of public academic benchmarks but we also created some data sets that's more focused on execution because most benchmarks out there are more about read-only tasks. So we designed some executing tasks or transactional tasks because we have the sandbox we can frequently reset the test environment. So these are the automated parts. And most importantly number three, we have a lot of interns. You know, you have to use a lot of real human interns to do evaluations on things like website generation or data visualization because it's very hard to design a good reward model that knows whether the output is visually appealing—it's about the taste. So we still rely on a lot of that.
Perfect. Yeah. Let me ask you I know we're coming up on time, but I do want to ask you about this emerging trend of reinforcement learning with verifiable rewards versus just building tool calling agents. So like Claude Code, extremely good, and they have the benefit because they built the harness and they can perform RL on their harness and it can get really really good with the tools they provide in the harness. Do you guys do RL or how do you think about that? Because of course in that case you would have... using open models. I've been playing with this quite a bit lately. How do you think about that? Just like using tool calling out of the box with model providers versus doing RL yourself inside your environment with your harness.
Yeah. I mentioned like before starting Manus I was kind of a model training guy. I've been doing pre-training, post-training, RL for a lot of years but I have to say that right now if you have sufficient resources you can try, but actually as I mentioned earlier, MCP is a big changer here. Because if you want to support MCP, you're not using a fixed action space. And if it's not a fixed action space, it's very hard to design a good reward and you cannot generate a lot of the rollouts and feedbacks will be unbalanced. So if you want to build a model that supports MCP, you are literally building a foundation model by yourself. So I think everyone in the community—model companies—they're doing the same thing for you. So right now, I don't think we should spend that much time on doing RL right now. But like as I mentioned earlier, we are just discovering exploring new ways to do maybe call it personalization or some sort of online learning but using parameter-free ways, for example collective feedbacks.
Yeah. One little one along those lines is: is it the case that for example Anthropic's done reinforcement learning with verified rewards on some set of tools using Claude Code... Have you found that you can kind of mock your harness to use similar tool names to kind of unlock the same capability if that makes sense? Like for example, I believe they've obviously performed utilized glob, uses grep, uses some other set of tools for manipulating the file system. Can you effectively reproduce that same functionality by having the exact same tools with the same tool name, same descriptions in your harness? Or kind of how do you think about that—like unlocking the... Yeah. Right. You see what I'm saying?
Yeah. I know the clear answer here, but for us, we actually try not to use the same name because it will... if you design your own function, you maybe have different requirements for that function and the parameters, the input arguments might be different. So you don't want to confuse the model. If the model is trained on a lot of post-training data that has some internal tools, you don't want to let the models be confused.
Okay. Okay. Got it. Got it. Perfect. Um well, I think we're actually at time and I want to respect your time because I know it's early. You're in Singapore. It's very early for you. So well this was really good. Thank you. We'll definitely make sure this recording is available. We'll make sure slides are available. Any parting things you want to mention, things you want to call out, calls to action? Yeah, people should go use Manus, but the floor is yours.
Yeah. I just want to say everybody try this. We have a free tier.
Yeah. Yeah. Absolutely. Hey, thanks a lot, Pete. I'd love to do this again sometime.
Yeah. Thanks for having me.
Yep. Okay. Bye. Bye.