Context Engineering for AI Agents with LangChain and Manus

LangChain

Join us for a deep dive into context engineering – the critical practice that determines how well your AI agents perform in production. Lance Martin from LangChain and Manus co-founder Yichao "Peak" Ji share battle-tested strategies for managing context windows, optimizing performance, and building agents that scale. Peak was recently named one of MIT's Innovators Under 35 for his work on AI agents. Here, we cover Manus's context engineering approach. Strategies include: (1) **Context reduction** via dual-form tool results (full/compact) with policy-based compaction and schema-driven summarization; (2) **Context offloading** through layered action spaces (function calling → sandbox utils → packages/APIs) with filesystem-based state management and shell utilities instead of vectorstore indexing; (3) **Context isolation** using minimal sub-agents (planner, knowledge manager, executor) with agent-as-tool paradigm and constrained decoding for schema-based inter-agent communication. 📊 Access the Presentations: Lance Martin's slides (LangChain): https://docs.google.com/presentation/d/16aaXLu40GugY-kOpqDU4e-S0hD1FmHcNyF0rRRnb1OU/edit?slide=id.p#slide=id.p Yichao "Peak" Ji's slides (Manus): https://drive.google.com/file/d/1QGJ-BrdiTGslS71sYH4OJoidsry3Ps9g/view?usp=sharing Ready to start building reliable agents? Sign up for LangSmith, our agent observability & evals platform: https://www.langchain.com/langsmith/?utm_medium=social&utm_source=youtube&utm_campaign=q4-2025_meetup-manus_co Chapters 0:01:00 Introduction to context engineering 0:12:00 Why context engineering in Manus 0:15:00 Context reduction in Manus 0:19:20 Context isolation in Manus 0:22:17 Context offloading in Manus 0:29:00 Avoid context over-engineering 0:31:00 Q&A: Explain sandbox utils in Manus 0:31:55 Q&A: Indexing (vectorstore) vs just using files 0:32:50 Q&A: Memory in Manus 0:34:30 Q&A: Manus and The Bitter Lesson 0:36:44 Q&A: Data format 0:37:45 Q&A: Summarization tips 0:40:00 Q&A: Sub-agents as tools 0:43:57 Q&A: Model choice 0:46:20 Q&A: Tool selection 0:49:48 Q&A: Planning 0:53:35 Q&A: Guardrails 0:55:39 Q&A: Evals 0:57:15 Q&A: Using RL

Hosts: Pete, Lance

📺Watch on YouTube

📅October 14, 2025

⏱️01:00:53

🌐English

🤍0 likes

Disclaimer: The transcript on this page is for the YouTube video titled "Context Engineering for AI Agents with LangChain and Manus" from "LangChain". All rights to the original content belong to their respective owners. This transcript is provided for educational, research, and informational purposes only. This website is not affiliated with or endorsed by the original content creators or platforms.

Watch the original video here: https://www.youtube.com/watch?v=6_BcCthVvb8

00:00:06Lance

All right. Well, thank you all for coming. We'll go ahead and kick off the webinar now and I'm sure people will continue to stream in. I'm Lance, one of the founding engineers at LangChain, and I'm joined by Pete from Manus. Pete, do you want to introduce yourself quickly?

🤍0 likes💬 0 comments

Add to My Notes

00:00:26Pete

Yeah. Hey guys, I'm the co-founder and chief scientist of Manus. So basically, I designed the agent framework and a lot of things in Manus, and I'm super excited to be here today. Thanks, Lance, for having me.

🤍0 likes💬 0 comments

Add to My Notes

00:00:36Lance

Yeah, we're really excited to do this because first, Manus is a really cool product—I've been using it for a long time—but also they put out a really nice blog post on context engineering a few months ago that influenced me a lot. So, I want to give a quick overview of context engineering as I see it, and I'll reference their piece. Then Pete's actually going to give a presentation talking about some new ideas not covered in the piece. So if you've already read it, this will cover some things that are new which hopefully will be quite interesting for you. I'll kind of set the stage, hand it over to Pete, and then we'll do some Q&A.

🤍0 likes💬 0 comments

Add to My Notes

00:01:10Lance

So you might have heard this term "context engineering" and it kind of emerged earlier this year. If you look through time with Google search trends, prompt engineering was kind of initiated following ChatGPT. That's showing December 2022. When we got this new thing, a chat model, there became a great deal of interest in: how do we prompt these things? Prompt engineering kind of emerged as a discipline for working with chat models and prompting them.

🤍0 likes💬 0 comments

Add to My Notes

00:01:38Lance

Now context engineering emerged this year around May. We saw it really rising in Google trends and it corresponds a bit with this idea of the "Year of Agents." Why is that? One of the things that people have observed if you've been building agents is that context grows, and it grows in a very particular way when you build an agent. What I mean is we have an LLM bound to some number of tools. That LLM can call tools autonomously in a loop. The challenge is, for every tool called, you get a tool observation back and that's appended to this chat list. These messages grow over time and so you can kind of get this unbounded explosion of messages as agents run.

🤍1 like💬 0 comments

Add to My Notes

00:02:26Lance

As an example, Manus talked about in their piece that typical tasks require around 50 tool calls. Anthropic has mentioned similarly that production agents can engage in conversations spanning hundreds of turns. So the challenge is that agents, because they are increasingly long-running and autonomous and they utilize tools freely, can accumulate a large amount of context through this accumulation of tool calls. Google put out a really nice report talking about context rot. The observation simply is that performance drops as context grows. So this paradox—this challenging situation—is that agents utilize lots of context because of tool calling, but we know that performance drops as context grows.

🤍1 like💬 0 comments

Add to My Notes

00:03:08Lance

So this is a challenge that many of us have faced, and it kind of spearheaded, or I think seeded, this term of context engineering. Karpathy, of course, kind of coined it on Twitter earlier this year. You can think about context engineering as the delicate art and science of filling the context window with just the right information needed for the next step. So trying to combat this context explosion that happens when you build agents and they call tools freely. All those tool messages accumulate in your messages queue. How do we cull such that the right information is presented to the agent to make the correct next decision at all points in time?

🤍1 like💬 0 comments

Add to My Notes

00:03:45Lance

To address this, there are a few common themes I want to highlight that we've seen across a number of different pieces of work, including Manus, which I'll mention here.

🤍0 likes💬 0 comments

Add to My Notes

00:03:56Lance

Idea one is context offloading. So we've seen this trend over and over. The central idea is you don't need all context to live in this messages history of your agent. You can take information and offload it, send it somewhere else, so it's outside the context window, but it can be retrieved, which we'll talk about later.

🤍1 like💬 0 comments

Add to My Notes

00:04:21Lance

So, one of the most popular ideas here is just using a file system. Take the output of a tool message as an example, dump it to the file system, and send back to your agent just some minimal piece of information necessary so it can reference the full context if it needs to. But that full payload—for example, a web search result that's very token-heavy—isn't spammed into your context window for perpetuity.

🤍1 like💬 0 comments

Add to My Notes

00:04:44Lance

So you've seen this across a number of different projects. Manus uses this. We have a project called Deep Agents that utilizes the file system. Open Deep Research utilizes actual agent state which has a similar role to an external file system. Claude Code, of course, uses this very extensively. Long-running agents utilize it very extensively. So this idea of offloading context to a file system is very common and popular across many different examples of production agents that we're seeing today.

🤍1 like💬 0 comments

Add to My Notes

00:05:13Lance

The second idea is reducing context. So offloading is very simply taking some piece of information, like a tool message that's token-heavy, and not sending it all back to your messages list—dumping it to a file system where it can be retrieved only as needed. That's offloading. Reducing the context is similar, but instead, you're just summarizing or compressing information.

🤍0 likes💬 0 comments

Add to My Notes

00:05:38Lance

Summarizing tool call outputs is one intuitive way to do this. So we do this with Open Deep Research as an example. Pruning tool calls or tool messages is another. One thing that's very interesting is Claude has actually added this; if you look at some of their most recent releases, they now support this out of the box. So this idea of pruning old tool calls with tool outputs or tool messages is something that Claude has now kind of built into their SDK.

🤍1 like💬 0 comments

Add to My Notes

00:06:07Lance

Summarizing or compacting full message history—you see this with Claude Code in its compaction feature once you hit a certain percentage of your overall context window. Cognition also talks about this idea of summarizing or pruning at agent-to-agent handoffs. So this idea of reducing context is a very popular theme we see across a lot of different examples, from Claude Code to our Open Deep Research, Cognition, and Claude 3.5 has incorporated this as well.

🤍1 like💬 0 comments

Add to My Notes

00:06:32Lance

Retrieving context—now this is one of the classic debates today that you might see raging on X or Twitter: the right approach for retrieving context. Lee Robinson from Cursor just had a very nice talk at OpenAI Demo Day talking about how Cursor, for example, uses indexing and semantic search as well as more simple file-based search tools like glob and grep. Claude Code force-only uses the file system and simple search tools, notably glob and grep. So there are different ways to retrieve context on demand for your agent. Indexing via semantic search versus file system and simple file search tools—both can be highly effective. There are pros and cons we could talk about in the Q&A, but of course, context retrieval is central for building effective agents.

🤍1 like💬 0 comments

Add to My Notes

00:07:23Lance

Context isolation is the other major theme we've seen quite a bit of, in particular splitting context across multi-agents. So what's the point here? Each sub-agent has its own context window and sub-agents allow for separation of concerns. Manus talks about this. Our Deep Agents work uses this. Open Deep Research uses it. Claude sub-agents are utilized in their researcher and also Claude Ghost supports sub-agents. So sub-agents are a very common way to perform context isolation we've seen across many different projects.

🤍1 like💬 0 comments

Add to My Notes

00:08:05Lance

Now one thing I thought was very interesting is caching context, and Manus talks about this quite a bit. I'll let Pete speak to this a bit later but I think it's a very interesting trick as well.

🤍0 likes💬 0 comments

Add to My Notes

00:08:15Lance

So I'll just show a brief example that we've seen across Open Deep Research. This is a very popular repo that we have. It's basically an open-source Deep Research implementation and it performs on par with some of the best implementations out there. You can check our repo, and we have results from Deep Research Bench showing that we're top 10. It has three phases: scoping of the research, the research phase itself using a multi-agent architecture, and then a final one-shot writing phase.

🤍0 likes💬 0 comments

Add to My Notes

00:08:41Lance

We use offloading. So we basically create a brief to scope our research plan. We offload that. So we don't just save that in the context window because that context window is going to get peppered with other things. We offload it, so it's saved independently. It can be accessed, in our case from the LangGraph state, but it could also be from a file system; it's the same idea. So you create a research plan, you offload it, it's always accessible. You go do a bunch of work, and you can pull that back in on demand—so you can put it kind of at the end of your message list so it's accessible and readily available to your agent to perform, for example, the writing phase.

🤍1 like💬 0 comments

Add to My Notes

00:09:16Lance

We use offloading, as you can see, to help steer the research and writing phases. We use reduction to summarize observation from token-heavy surf tool calls. That's done inside research itself. And we use context isolation across sub-agents within research itself.

🤍1 like💬 0 comments

Add to My Notes

00:09:37Lance

And this is kind of a summary of a bunch of different of these various ideas across a bunch of different projects. And actually, Pete is going to speak to Manus in particular and some of the lessons they've learned. This just kind of sets up the stage. This summarizes what I talked about—these different themes of offloading, reducing context, retrieving context, isolating, caching, and a number of popular projects and kind of where they're used. I will share these slides to the notes. And I do want to let Pete go ahead and present now because I want to make sure we have plenty of time for him and for questions. But this just sets the stage. And Pete, I'll let you take it from here. I'll stop sharing.

🤍0 likes💬 0 comments

Add to My Notes

00:10:25Pete

Okay. Can you see my slides?

🤍0 likes💬 0 comments

Add to My Notes

00:10:28Lance

Yeah.

🤍0 likes💬 0 comments

Add to My Notes

00:10:30Pete

Okay. Perfect. Thank you, Lance. I'm super excited to be here today to share some fresh lessons on context engineering that we learned from building Manus. Here I say "fresh lessons" because I realized that the last blog post that you mentioned I wrote about context engineering was back in July. And yeah, it's the Year of the Agent, so July is basically the last century. And of course, before this session, I went back and read it again, and luckily I think most of what I wrote in that blog still holds up today. But I just don't want to waste everybody's time by just repeating what's already inside that blog. So today I think instead I want to dig into some areas that I either didn't go deep enough on before or didn't touch at all. So actually, we'll be focusing on the "Discourage" column in Lance's earlier slides because I think exploring those non-consensus ideas often leads to the biggest inspirations.

🤍0 likes💬 0 comments

Add to My Notes

00:11:28Pete

Yeah. So here's the topic for today's talk. First, we'll cover a bit about the bigger question of why we need context engineering, and then we'll have more on context reduction, more on context isolation, and finally some new stuff about context offloading which we are testing internally here at Manus. Everything I'm sharing today is in production in Manus; it's battle-tested. But I don't know how long it will last because, you know, things are changing super fast.

🤍0 likes💬 0 comments

Add to My Notes

00:11:58Pete

Okay, let's start with the first big question: why do we even need context engineering, especially when fine-tuning or post-training models has become much more accessible today? For example, folks at the Thinking Machine team just released the Tinker API, which I like a lot. But for me, the question "why context engineering" actually came through several painful stages of realization.

🤍0 likes💬 0 comments

Add to My Notes

00:12:26Pete

Before starting Manus, I've already spent over 10 years in natural language processing (NLP), which is basically what we call building language models. But before ChatGPT—and Manus is actually my second or third company—at my previous startup, we trained our own language model from scratch to do open domain information extraction and building knowledge graph and semantic search engines on top of them. And it was painful. Our product's innovation speed was completely capped by the model's iteration speed. Even back then, the models were much smaller compared to today, but still, a single training plus evaluation cycle could take maybe one or two weeks. The worst part is that at that time we hadn't reached PMF (Product Market Fit) yet and we were spending all that time improving benchmarks that might not even matter for the product. So I think instead of building specialized models too early, startups really should lean on general models and context engineering for as long as possible.

🤍1 like💬 0 comments

Add to My Notes

00:13:30Pete

Well, of course, I guess now that's some kind of common wisdom. But as your product matures and open-source base models get stronger, I know it's very tempting to think, "Hey, maybe I should just pick a strong base model, fine-tune it with my data, and make it really good at my use case." We've tried that too. And guess what? It's another trap. To make RL work really well, you usually fix an action space, design a reward around your current product behavior, and generate tons of on-policy rollouts and feedback. But this is also dangerous because we're still in the early days of AI and agents. Everything can shift under our feet overnight.

🤍1 like💬 0 comments

Add to My Notes

00:14:10Pete

For us, the classic example was the launch of MCP (Model Context Protocol). Actually, it completely changed the design of Manus from a compact static action space to something infinitely extensible. And if you have ever trained your own model, you know that this kind of open domain problem is super hard to optimize. Well, of course, you could pour massive effort into post-training that ensures generalization, but then aren't you basically trying to become an LLM company yourself? Because you're basically rebuilding the same layer that they have already built. And that's a duplication of effort. So maybe after all that buildup, here's my point: Be firm about where you draw the line. Right now, context engineering is the clearest and most practical boundary between application and model. So trust your choice.

🤍1 like💬 0 comments

Add to My Notes

00:15:00Pete

All right, enough philosophy and let's talk about some real tech. First topic: context reduction. Here I want to clarify two different kinds of compaction operations because we think context reduction is fascinating but it's also a new concept. There's a lot of ways to do this and here in Manus, we divide them into compaction and summarization.

🤍0 likes💬 0 comments

Add to My Notes

00:15:22Pete

For compaction in Manus, every tool call and tool result actually has two different formats: a full format and a compact one. The compact version strips out any information that can be reconstructed from the file system or external state. For example here, let's say you have a tool that writes to a file and it probably has two fields: a path and a content field. But once the tool returns, you can ensure that the file already exists in the environment. So in the compact format, we can safely drop the super long content field and just keep the path. And if your agent is smart enough, whenever it needs to read that file again, it can simply retrieve it via the path. So no information is truly lost; it's just externalized. We think this kind of reversibility is crucial because agents do chain predictions based on previous actions and observations, and you never know which past action will suddenly become super important 10 steps later. You cannot predict it. So this is a reversible reduction by using compaction.

🤍1 like💬 0 comments

Add to My Notes

00:16:29Pete

Of course, compaction only takes you so far. Eventually, your context will still grow and will hit the ceiling. And that's when we combine compaction with the more traditional summarization, but we do it very carefully. For example here, before summarizing, we might offload key parts of the context into files. And sometimes we even do more aggressively—we can dump the entire pre-summary context as a text file or simply a log file into the file system so that we can always recover it later. Like Lance just mentioned, some people just use glob and grep. glob also works for log files. So if the model is smart enough, it even knows how to retrieve those pre-summarized contexts.

🤍1 like💬 0 comments

Add to My Notes

00:17:14Pete

The difference here is that compaction is reversible but summarization isn't. Both reduce context lengths but they behave very differently. To make both methods coexist, we have to track some context length thresholds. At the top, you'll have your model's hard context limit, say 1 million tokens—pretty common today. But in reality, most models start degrading much earlier, typically maybe around 200k, and you'll begin to see what we call "context rot"—repetitions, slower inferences, degraded quality. So by doing a lot of evaluation, it's very important for you to identify that pre-rot threshold. It's typically 128K to 200K, and use it as the trigger for context reduction.

🤍1 like💬 0 comments

Add to My Notes

00:18:04Pete

Whenever your context size approaches it, you have to trigger context reduction, but starting from compaction, not summarization. And compaction doesn't mean compressing the entire history. We might compact the oldest 50% of tool calls while keeping the newer ones in full detail so the model still has fresh few-shot examples of how to use tools properly. Otherwise, in the worst case, the model will imitate the behavior and output those compact formats with missing fields, and that's totally wrong.

🤍1 like💬 0 comments

Add to My Notes

00:18:38Pete

After compaction, we have to check how much free context that we actually gain from this compaction operation. Sometimes after multiple rounds of compaction, the gain is tiny because even if it's compact, it still uses context. And that's when we go for summarization. But also keep in mind that when summarizing, we always use the full version of the data, not the compact one. And we still keep the last few tool calls and tool results in full detail, not summary, because it allows the model to know where it left off and continue more smoothly. Otherwise, you'll see after summarization sometimes the model will change its style, change its tone. We find out keeping a few tool call/tool result examples really helps.

🤍1 like💬 1 comment

Add to My Notes

00:19:22Pete

Okay, now we've covered reduction. Let's talk about isolation. I really agree with Cognition's blog where they warn against using multi-agent setups because when you have multiple agents, syncing information between them becomes a nightmare. But this isn't a new problem. Multiprocess or multi-thread coordination has been a classic challenge in the early days of computer programming. And I think we could borrow some wisdom here.

🤍0 likes💬 0 comments

Add to My Notes

00:19:54Pete

I don't know how many Golang coders are here today, but in the Go programming language community, there's a famous quote from this gopher: "Do not communicate by sharing memory; instead, share memory by communicating." Of course, this isn't directly about agents and it's sometimes even wrong for agents, but I think the important thing is it highlights two distinct patterns here: by communicating or by sharing memory. If we translate the term "memory" here into context, we can see that parallel pretty clearly.

🤍1 like💬 1 comment

Add to My Notes

00:20:25Pete

"By communicating" is the easier one to understand because it is the classic sub-agent setup here. For example, the main agent writes a prompt and the prompt is sent to a sub-agent, and the sub-agent's entire context only consists of that instruction. We think if a task has a short, clear instruction and only the final output matters—say searching a codebase for a specific snippet—then just use the communication pattern and keep it simple. Because the main agent doesn't care how the sub-agent finds the code, it only needs the result. And this is what Claude Code does typically, using its task tool to delegate a separated clear task to some sub-agents.

🤍1 like💬 0 comments

Add to My Notes

00:21:09Pete

But for more complex scenarios, "by sharing memory" means that the sub-agent can see the entire previous context—all the tool usage history—but the sub-agent has its own system prompt and its own action space. For example, imagine a Deep Research scenario: the final report depends on a lot of intermediate searches and notes. In that case, you should consider using the shared memory pattern, or in our language, "by sharing context." Even if you can save all those notes and searches into a file and make the sub-agent read everything again, you're just wasting latency and context. And if you count the amount of tokens, maybe you're using even more tokens to do this. So we think for those scenarios that require a full history, just use a shared memory pattern. But be aware that sharing context is kind of expensive because each sub-agent has a larger input to prefill, which means you'll spend more on input tokens. And since the system prompt and the action space differs, you cannot reuse the KV cache, so you have to pay the full price.

🤍1 like💬 0 comments

Add to My Notes

00:22:18Pete

Finally, let's talk a little bit about context offloading. When people say offload, they usually mean moving parts of the working context into external files. But as your system grows, especially if you decide to integrate MCP, one day you realize that the tools themselves can also take up a lot of context, and having too many tools in context leads to confusion. We call it "context confusion," and the model might call the wrong ones or even non-existing ones. So we have to find a way to also offload the tools.

🤍1 like💬 0 comments

Add to My Notes

00:22:54Pete

A common approach right now is doing dynamic RAG (Retrieval-Augmented Generation) on tool descriptions. For example, loading tools on demand based on the current task or the current status. But that also causes two issues. First, since tool definitions sit at the front of the context, your KV resets every time. And most importantly, the model's past calls to removed tools are still in the context, so it might fool the model into calling invalid tools or using invalid parameters.

🤍1 like💬 0 comments

Add to My Notes

00:23:23Pete

So to address this, we're experimenting with a new layered action space in Manus. Essentially, we can let Manus choose from three different levels of abstractions: Number one, function calling; Number two, sandbox utilities; and Number three, packages and APIs. We go deeper into these three layers of layered action space.

🤍1 like💬 0 comments

Add to My Notes

00:23:44Pete

Let's start from level one: function calling. And this is a classic, everyone knows it. It is schema-safe thanks to constrained decoding. But we all know the downsides. For example, breaking the cache and maybe too many tools causing confusion. So in Manus right now, we only use a fixed number of atomic functions. For example, reading and writing files, executing shell commands, searching files in the internet, and maybe some browser operations. We think these atomic functions have super clear boundaries and they can work together to compose much more complex workflows. Then we offload everything else to the next layer, which is the sandbox utilities.

🤍1 like💬 0 comments

Add to My Notes

00:24:28Pete

As you know, each Manus session runs inside a full virtual machine sandbox. It's running on our own customized Linux system, and that means Manus can use the shell commands to run pre-installed utilities that we develop for Manus. For example, we have some format converters, we have speech recognition utilities, and even a very special one—we call it the Manus MCP CLI—which is how we call MCP. We do not inject MCP tools into the function calling space. Instead, we do everything inside that sandbox through the command line interface. Utilities are great because you can add new capabilities without touching the model's function calling space. It's just some commands pre-installed in your computer, and if you're familiar with Linux, you always know how to find those new commands and you can even run --help to figure out how to use a new tool.

🤍1 like💬 0 comments

Add to My Notes

00:25:25Pete

Another good thing is for larger outputs, they can just write to files or return the result in pages, and you can use all those Linux tools like grep, cat, less, more to process that result on the fly. The trade-off here is it's super good for large outputs but it's not that good for low latency back-and-forth interactions with the front end, because you always have to visualize the interactions of your agent and show it to the user. So this is pretty tricky here, but we think it already offloads a lot of things.

🤍1 like💬 0 comments

Add to My Notes

00:25:58Pete

And then we have another layer, the final layer, we call it Packages and APIs. Here Manus can write Python scripts to call pre-authorized API or custom packages. For example, Manus might use a 3D designing library for modeling or call a financial API to fetch market data. And here actually, we've purchased all these APIs on behalf of a user and pay the money for them; it's included in the subscription. So basically we have a lot of API keys pre-installed in Manus and Manus can access these APIs using the keys. I think these are perfect for tasks that require lots of computation in memory but do not need to push all that data into the model context.

🤍1 like💬 0 comments

Add to My Notes

00:26:42Pete

For example, imagine if you're analyzing a stock's entire year of price data. You don't feed the model all the numbers. Instead, you should let the script compute it and only put the summary back into the context. And you know, since code and APIs are super composable, you can actually chain a lot of things in one step. For example, in a typical API, you can do "get city names," "get city ID," "get weather" all in one Python script. There's also a paper from one of my friends called CodeAct. A lot of people were discussing it. I think it's the same idea because code is composable and it can do a lot of things in one step, but also it's not schema-safe. It's very hard to do constrained decoding on CodeAct. So we think you should find the right scenario for these features. For us, everything that can be handled inside a compiler or interpreter runtime, we do that using code; otherwise, we use sandbox utilities or function calls. And the good thing is, from the model's point of view, all three levels still go through the standard function calls. So the interface stays simple, cache friendly, and orthogonal across functions because you know uh we mentioned sandbox utilities you're still accessing these tools using the shell tool. Accessing these tools using the shell function and also like if you're using APIs in thirdparty applications you're just using the file function to write or read file and then execute it using the shell function so you think it does not add like add overhead to the model it's still all the things that models are trained and they're already familiar with.

🤍1 like💬 0 comments

Add to My Notes

00:28:24Pete

So let's zoom out and connect the five dimensions: offload, reduce, retrieve, isolate, and cache. You can find out that they are not independent. We can see that offload and retrieve enables more efficient reduction, and stable retrieve makes isolation safe. But isolation also slows down context and reduces the frequency of reduction. However, more isolation and reduction also affects cache efficiency and the quality of output. So at the end of the day, I think context engineering is the science and art that requires a perfect balance between multiple potentially conflicting objectives. It's really hard.

🤍1 like💬 0 comments

Add to My Notes

00:29:06Pete

All right. Before we wrap up, I want to leave you with maybe one final thought, and it's kind of the opposite of everything I just said: Please avoid context over-engineering. Looking back at the past six or seven months since Manus launched, actually the biggest leap we've ever seen didn't come from adding more fancy context management layers or clever retrieval hacks. They all came from simplifying, or from removing unnecessary tricks and trusting the model a little more. Every time we simplify the architecture, the system got faster, more stable, and smarter. We think the goal of context engineering is to make the model's job simpler but not harder. So if you take one thing from today, I think it should be: Build less and understand more. Thank you so much everyone and thanks again to Lance and the LangChain team for having me. Can't wait to see what you guys all build next. Now back to Lance.

🤍0 likes💬 0 comments

Add to My Notes

00:30:10Lance

Yeah, amazing. Thank you for that. So we have a nice set of questions here. Maybe we can just start hitting them and we can kind of reference back to the slides if needed. And Pete, are your slides available to everyone?

🤍0 likes💬 0 comments

Add to My Notes

00:30:25Pete

Oh yeah. Yeah, I can share the PDF version afterwards.

🤍0 likes💬 0 comments

Add to My Notes

00:30:28Lance

Yes, sounds good. Yeah. Well, why don't I start looking through some of the questions and maybe we can start with the more recent ones first. So how does the LLM call the various shell tools? How does it know which tools exist and how to invoke them? Maybe you can explain a little bit about the multi-tier sandboxing setup that you use with Manus.

🤍0 likes💬 0 comments

Add to My Notes

00:31:00Pete

Yeah. I think imagine you're the person using a new computer. For example, if you know Linux, you can imagine all the tools are located in /usr/bin. So actually we do two things. First of all, we have a hint in the system prompt telling Manus that hey, there's a lot of pre-installed command line utilities located in some specific folder. And also, for the most frequently used ones, we already injected them in the system prompt, but it's super compact. We do not tell the agent how to use the tools. We only list them and we can tell the agent that you can use the --help flag safely because all the utilities are developed by our team and they have the same format.

🤍0 likes💬 0 comments

Add to My Notes

00:31:42Lance

Got it. How about, I know you talked a lot about using the file system. What's your take on using indexing? And do you utilize like... do you spin up vector stores on the fly if the context you're working with gets sufficiently large? How do you approach that?

🤍0 likes💬 0 comments

Add to My Notes

00:31:55Pete

Yeah, I think there's no right and wrong in this space like you've mentioned. But at Manus, we do not use index databases because right now, you know, every sandbox in a Manus session is a new one and users want to interact with things fast. So actually we don't have the time to build the index on the fly. So we're more like Claude Code; we rely on grep and glob. But I think if you consider building something like more long-term memory or if you want to integrate some enterprise knowledge base, you still have to rely on that external vector index because it's about the amount of information that you can access. But for Manus, it operates in a sandbox and for coding agents you operate in the codebase. So it depends on the scale.

🤍1 like💬 0 comments

Add to My Notes

00:32:43Lance

Yeah. So that's a good follow-up then. So let's say I'm a user. I have my Manus account. I interact with Manus across many sessions. Do you have the notion of memory? So Claude has .claudemd files; they persist across all the different sessions of Claude Code. How about you guys? How do you handle kind of long-term memory?

🤍1 like💬 0 comments

Add to My Notes

00:32:59Pete

Yeah. Actually in Manus we have a concept called "Knowledge" which is kind of like explicit memory. For example, every time you can tell Manus, "Hey, remember, every time I ask for something, deliver it in maybe in Excel," and it's not automatically inserted into some memory. It will pop up a dialogue and say, "Here's what I learned from our previous conversation, and would you like to accept it or reject it?" So this is the explicit one. It requires user confirmation.

🤍1 like💬 0 comments

Add to My Notes

00:33:25Pete

But also, we are discovering new ways to do it more automatically. For example, a pretty interesting thing in agents is that compared to chat bots, users often correct the agent more often. For example, a common mistake that Manus makes is when doing data visualization. You know, if you're using Chinese, Japanese or Korean, a lot of time there will be some font issues and there will be errors in those rendered visualizations. So the user will often say, "Hey you should use Noto CJK font." And for these kind of things, a different user will have the same correction. We need to maybe find out a way to leverage these kind of collective feedback and use it. That's kind of like what we call a self-improving agent with online learning, but in a parameter-free way.

🤍1 like💬 0 comments

Add to My Notes

00:34:15Lance

Yeah. How about a different question that was raised here and also I think about quite a bit. You mentioned towards the end of your talk that you gained a lot from removing things, and a lot of that is probably because of the fact that also the models are getting better. So model capabilities are increasing and so you can kind of remove scaffolding over time. How do you think about this? Because this is one of the biggest challenges that I've faced is like, over time the model gets better and I can remove things like certain parts of my scaffolding. So you're building on top of this foundation that's like the water's rising. Do you revisit your architecture every some number of months with new releases and just delete as the models get better? And how do you approach that problem?

🤍0 likes💬 0 comments

Add to My Notes

00:34:55Pete

Yeah, this is a super good question here because you know, actually we have already refactored Manus five times. And we've launched Manus in March and now it's October—already five times. So we think you cannot stop because models are not only improving but they are changing. Models' behavior is changing over time. One way is you can work closely with those model providers, but we also have another internal theory for how we evaluate or how we design our agent architecture. I covered a little bit on Twitter before. It's basically like, we do not care about the performance of a static benchmark. Instead, we fix the AI agent architecture and we switch between models. If your architecture can gain a lot from switching from a weaker model to a stronger model, then somehow your architecture is more future-proof because the weaker model tomorrow might be as good as a stronger model today. Yeah. So we think switching between weaker and stronger models can give you some early signals of what will happen next year and give you some time to prepare your architecture. So for Manus, we often do this kind of review every one or two months and we often do some research internally using open-source models and maybe early access to proprietary models to prepare the next release, even before the launch of the next model.

🤍1 like💬 1 comment

Add to My Notes

00:36:26Lance

Yeah. It's a good observation. You can actually do testing of your architecture by toggling different models that exist today. Yeah, that makes a lot of sense. What about best practices or considerations for formats for storing data? So like markdown files, plain text, log... anything you prefer in particular? How do you think about that kind of file formats?

🤍0 likes💬 0 comments

Add to My Notes

00:36:50Pete

Yeah. I think it's not about plain text or markdown, but we always prioritize line-based formats because it allows the models to use grep or read from a range of lines. And also markdown can sometimes cause some troubles. You know, models are trained to use markdown really well and sometimes it will... maybe for some model, I don't want to say that name, but they often output too many bullet points if you use markdown too often. So actually we want to use more plain text.

🤍1 like💬 0 comments

Add to My Notes

00:37:27Lance

Yeah, makes sense. How about on the topic of compaction versus summarization? Let's hit on summarization. This is an interesting one that I've been asked a lot before. How do you prompt to produce good summaries? So, for example, summarization, like you said, it's irreversible. So if you don't prompt it properly, you can actually lose information. The best answer I came up with is just tuning your prompt for high recall. But how do you approach this? So summarization, how do you think about prompting for summarization?

🤍0 likes💬 0 comments

Add to My Notes

00:37:57Pete

Yeah, actually we tried a lot of optimizing the prompt for summarization. But it turns out a simple approach works really well: you do not use a free-form prompt to let the AI generate everything. Instead, you could define a kind of a schema. It's just a form. There are a lot of fields and let the AI fill them. For example, "Here are the files that I've modified," "Here's the goal of the user," "Here's where I left off." And if you use this kind of more structured schema, at least the output is kind of stable and you can iterate on this. So just do not use free-form summarizations.

🤍1 like💬 0 comments

Add to My Notes

00:38:30Lance

Got it. Yeah, that's a great observation. So use structured outputs rather than free-form summarization to enforce certain things are always summarized. Yeah, that makes a lot of sense. How about with compaction then? And actually, I want to make sure I understood that. So with compaction, let's say it's like a search tool. You have the raw search tool output and would that be your raw message, and then the compaction would just be like a file name or something? Is that right?

🤍0 likes💬 0 comments

Add to My Notes

00:38:56Pete

Yeah, it is. It's not only about the tool call. It also applies to the result of the tool. Interestingly, we find out that almost every action in Manus is kind of reversible if you can offload it to the file system or an external state. And for most of these tasks, you already have a unique identifier for it. For example, for file operations, of course, you have the file path; for browser operations, you have the URL; and even for search actions, you have the query. So it's naturally already there.

🤍1 like💬 0 comments

Add to My Notes

00:39:29Lance

Yeah. Okay. That's a great one and I just want to hit that again because I've had this problem a lot. So, for example, I'm an agent that uses search. I perform a tool call, it returns a token-heavy tool call. I don't want to return that whole tool message to the agent. I've done things like some kind of summarization or compaction and sent the summary back. But how do you approach that? Because you might want all that information to be accessible for the agent for his next decision. But you don't want that huge context block to live inside your message history.

🤍0 likes💬 0 comments

Add to My Notes

00:40:01Lance

So how do you approach that? You could send the whole message back but then remove it later. That's what Claude does now. You could do a summarization first and send the summary over. You could send everything and then do compaction so that later on you don't have the whole context in your message history. You only have like a link to the file. How do you think about that specifically if you see what I'm saying?

🤍0 likes💬 0 comments

Add to My Notes

00:40:27Pete

Yeah, I know. Actually, it depends on the scenario. For example, for complex search—I mean for complex search, it's not just one query. For example, you have multiple queries and you want to gather some important things and drop everything else. In this case, I think we should use sub-agents or internally we call it "agent as tool." So from the model's perspective, it's still a kind of function, maybe called "Advanced Search." It's a function call "Advanced Search." But what it triggers is actually another sub-agent. But that sub-agent is more like a workflow or agentic workflow that has a fixed output schema and that is the result that returns to the agent.

🤍1 like💬 0 comments

Add to My Notes

00:41:01Pete

But for other kinds of more simpler search, for example just searching Google, we just use the full detail format and append it into the context and rely on the compaction thing. But also we always instruct the model to write down the intermediate insights or key findings into files in case that the compaction happens earlier than the model expected. And if you do this really well, actually you don't lose a lot of information by compaction because sometimes those old tool calls are irrelevant after time.

🤍1 like💬 0 comments

Add to My Notes

00:41:32Lance

Yeah, that makes sense. Um and I like the idea of "agent as tool." We do that quite a bit and that makes that is highly effective. But that brings up another interesting point about agent-agent communication. How do you address that? So Walden Yan from Cognition had a very nice blog post talking about this as like a major problem that they have with Devin—communication between agents. How do you think about that problem and ensuring sufficient information is transferred but not overloading, like you said, the prefill of the sub-agent with too much context? So how do you think about that?

🤍0 likes💬 0 comments

Add to My Notes

00:42:08Pete

Yeah. You know, at Manus we've launched a feature called Wide Research a month ago. Internally we call it "Agentic MapReduce" because we got inspired from the design of MapReduce. And it's kind of special for Manus because there's a full virtual machine behind the session. So one way we pass information or pass context from the main agent to sub-agent is by sharing the same sandbox, so the file system is there and you can only pass different paths here.

🤍1 like💬 0 comments

Add to My Notes

00:42:35Pete

I think sending information to a sub-agent is not that hard. The more complex thing is about how to have the correct output from different agents. And what we did here is we have a trick: for every time if the main agent wants to spawn up a new sub-agent or maybe 10 sub-agents, you have to let the main agent define the output schema. And in the sub-agent perspective, you have a special tool called "Submit Result." And we use constrained decoding to ensure that what the sub-agent submits back to the main agent is the schema that is defined by the main agent. Yeah. So you can imagine that this kind of map-reduce operation... it will generate a kind of spreadsheet and the spreadsheet is constrained by the schema.

🤍1 like💬 0 comments

Add to My Notes

00:43:21Lance

That's an interesting theme that seems to come up a lot with how you design Manus. You use schemas and structured outputs both for summarization and for this agent-agent communication. So it's kind of like using schemas as contracts between agent/sub-agent or between a tool and your agent to ensure that sufficient information is passed in a structured way, in a complete way. Like when you're doing summarization you use a schema as well.

🤍1 like💬 0 comments

Add to My Notes

00:43:46Pete

Yeah.

🤍0 likes💬 0 comments

Add to My Notes

00:43:46Lance

Okay fantastic. This is very helpful. I'm poking around some other interesting questions here. Any thoughts on models like... I think you guys use Anthropic but do you work with open models? Do you do fine-tuning? You talked a lot about kind of working with KV cache, so for that maybe using open models. How do you think about model choice?

🤍0 likes💬 0 comments

Add to My Notes

00:44:08Pete

Yeah, actually right now we don't use any open-source model right now because I think it's not about quality, it's interestingly about cost. You know, we often think that open-source models can lower the cost, but if you're at the scale of Manus and if you're building a real agent where the input is way longer than the output, then KV cache is super important. And distributed KV cache is very hard to implement if you use open-source solutions. And if you use those frontier LLM providers, they have more solid infrastructure for distributed cache globally. So sometimes if you do the math, at least for Manus we find out that using these flagship models can sometimes be even cheaper than using open-source models.

🤍1 like💬 0 comments

Add to My Notes

00:44:51Pete

Right now we're not only using Anthropic—of course Anthropic's model is the best choice for agentic tasks—but we're also seeing the progress in Gemini and in OpenAI's new model. I think right now these frontier labs are not converging in directions. For example, if you're doing coding, of course you should use Claude; and if you want to do more multimodality things you should use Gemini; and OpenAI models are super good at complex math and reasoning. So I think for application companies like us, one of our advantages is that we do not have to build on top of only one model. You can do some task-level routing or maybe even subtask or step-level routing if you can pull in that kind of KV cache validation. So I think it's an advantage for us and we do a lot of evaluations internally to know which models to use for which subtask.

🤍1 like💬 1 comment

Add to My Notes

00:45:41Lance

Yeah. Yeah, that makes a lot of sense. I want to clarify one little thing. So with KV cache, so what specific features from the providers are you using for cache management? So okay, I know like Anthropic has input caching as an example. Yeah, that that's what you mean. Okay, got it.

🤍0 likes💬 0 comments

Add to My Notes

00:46:05Lance

Cool. I'm just looking through some of the other questions. Yeah, tool selection is a good one. Right. So, you were talking about this. You don't use like indexing of tool descriptions and fetching tools on the fly based on semantic similarity. How do you handle that? Like what's the threshold for too many tools? Yeah, tool choice is a classic. How do you think about that?

🤍0 likes💬 0 comments

Add to My Notes

00:46:34Pete

Yeah. First of all, it depends on the model. Different models have different capacity for tools. But I think a rule of thumb is try not to include more than 30 tools. It's just a random number in my mind. But actually, I think like if you're building a general AI agent like Manus, you want to make sure those native functions are super atomic. So actually there are not that many atomic functions that we need to put inside the action space. So for Manus, right now we only have like 10 or 20 atomic functions and everything else is in the sandbox. Yeah. So we don't have to pull things dynamically.

🤍1 like💬 0 comments

Add to My Notes

00:47:14Lance

Yeah good point actually. Let's explain that a little bit more. So you have let's say 10 tools that can be called directly by the agent. But then I guess it's like you said, the agent can also choose to for example write a script and then execute a script. So that expands its action space hugely without giving it like... you don't have an independent tool for each possible script. Of course that's insane. So a very general tool to like write a script and then run it does a lot. Is that what you mean?

🤍0 likes💬 0 comments

Add to My Notes

00:47:43Pete

Yeah. Yeah. Exactly. Because you know why we are super confident to call Manus a "general agent"? Because it runs on a computer and computers are Turing complete. The computer is the best invention of humans. Theoretically, an agent can do anything that maybe a junior intern can do using a computer. So with the shell tool and the text editor, we think it's already complete. So you can offload a lot of things to the sandbox.

🤍1 like💬 0 comments

Add to My Notes

00:48:09Lance

Yeah. Okay, that makes a lot of sense, right? And then how does Manus... so is are all... so okay, maybe I'll back up. You mentioned code with code agents. My understanding is the model will actually always produce a script and that'll then be run inside a code sandbox. So every tool call is effectively like a script is generated and run. It sounds like you do some hybrid where sometimes Manus can just call tools directly but other times it can actually choose to do something in the sandbox. Is that right? So it's kind of a hybrid approach.

🤍0 likes💬 0 comments

Add to My Notes

00:48:46Pete

Yeah. I think this is super important because actually we tried to use CodeAct entirely for Manus, but the problem is if you're using code, you cannot leverage constraint decoding and things can go wrong. But you know, CodeAct has some special use cases as I mentioned earlier in slides. For example, processing a large amount of data. You don't have to put everything in the tool result; instead you put it inside the runtime memory of Python and you only get the result back to the model. So we think you should do it in a hybrid way.

🤍1 like💬 0 comments

Add to My Notes

00:49:18Lance

Got it. Allow for tool calling and you have some number of tools, maybe 10 or something, that just get called directly, and some number of tools that actually run in the sandbox itself. Perfect. That makes a ton of sense. Very interesting.

🤍0 likes💬 0 comments

Add to My Notes

00:49:34Lance

Um and then maybe... how do you keep a reference of all the previously gen... I guess you have so you basically will generate a bunch of files. Oh actually sorry maybe I'll talk about something else. How about planning? Tell me about planning and and I know Manus has this "to-do" tool where it generates a to-do list and start of tasks. Yeah, tell me about that.

🤍0 likes💬 0 comments

Add to My Notes

00:49:52Pete

Yeah, I think this is very interesting because at the beginning Manus uses that todo.md paradigm. It's kind of... I don't want to use the word "stupid," but actually it wastes a lot of turns. You know, back in maybe March or April, if you check the log of some Manus task, maybe one third of the action is about updating the to-do list. It wastes a lot of tokens. Yeah. So right now we are using a more structuralized planning. For example, if you use Manus, there's a planner at the bottom of the system. Internally, it's also kind of a tool call—we implemented using the "agent as tool" paradigm so that there's a separate agent that is managing the plan. So actually right now the latest version of Manus, we are no longer using that todo.md thing. Of course todo.md still works and it can generate good results, but if you want to save tokens you can find another way.

🤍1 like💬 0 comments

Add to My Notes

00:50:45Lance

Got it. Yeah. So you have like a planner agent and it's more like for a subtask it'll be more like "agent as tool call" type things.

🤍0 likes💬 0 comments

Add to My Notes

00:50:52Pete

Yeah. Got it. And you know it is very important to have a separate agent that has a different perspective so it can do some external reviews. And you can use different models for planning. For example, sometimes O1 (OpenAI o1) can generate some very interesting insights.

🤍1 like💬 0 comments

Add to My Notes

00:51:07Lance

Yeah. Well that's a great one actually. So think about multi-agent then, and so like how do you think about that? So you might have like a planning agent with its own context window, makes a plan, produces like some kind of plan object, maybe it's a file or maybe it just calls sub-agents directly. How do you think about that? Like and how many different sub-agents do you typically recommend using?

🤍1 like💬 0 comments

Add to My Notes

00:51:29Pete

Yeah, I think this is also depends on your design, but here at Manus, actually Manus is not kind of like the typical multi-agent system. For example, we've seen a lot of different agents that divide by role. For example, you have a "designer agent," "programming agent," "manager agent." We don't do that because we think why we have this is because this is how human companies work and this is due to the limitation of human context. So in Manus, Manus is a multi-agent system but we do not divide by role. We only have very few agents. For example, we have a huge general executor agent and a planner agent and a knowledge management agent and maybe some data API registration agent. Yeah. So we are very cautious about adding more sub-agents because of the reason that we've mentioned before: communication is very hard. And we implement more kinds of sub-agents as "agent as tools" as we mentioned before.

🤍1 like💬 0 comments

Add to My Notes

00:52:23Lance

Yeah, that's a great point. I see this mistake a lot, or I don't know if it's a mistake, but you see anthropomorphizing agents a lot—like "it's my designer agent"—and I think it's kind of a forced analogy to think about like a human org chart in your sub-agents. So got it. So for you it's like a planner and knowledge manager. A knowledge manager might do what? Like what will be the task of knowledge manager?

🤍0 likes💬 0 comments

Add to My Notes

00:52:47Pete

Yeah, it's even more simple as we mentioned like we have a knowledge system in Manus. What the knowledge agent does is that it reviews the conversation between the user and the agent and figures out what should be saved in the long-term memory. So it's that simple.

🤍1 like💬 0 comments

Add to My Notes

00:53:02Lance

Got it. Yeah. Okay. It's like a memory manager, planner, and then you have sub-agents that could just take on like a general executor sub-agent that could just call all the tools or actions in the sandbox. That makes sense. Keep it simple. I like that a lot. That makes a lot of sense.

🤍0 likes💬 0 comments

Add to My Notes

00:53:20Lance

How about guardrailing? Someone asked a question about kind of safety and guardrailing. How do you think about this? I guess that's the nice thing about a sandbox, but tell me a little bit about that. How you think about it?

🤍0 likes💬 0 comments

Add to My Notes

00:53:39Pete

Yeah, I think this is a very sensitive question because like you know, if you have a sandbox that's connected to the internet, everything is dangerous. Yeah. So we have put a lot of effort in guardrailing. At least we do not let the information get out of the sandbox. For example, if you got prompt injected, we have some checks on outgoing traffic. For example, we'll ensure that no token things will go out of the sandbox. And if the user wants to print something out of the sandbox, we have those kind of removing things to ensure that no information goes out of the sandbox.

🤍1 like💬 0 comments

Add to My Notes

00:54:20Pete

But you know, for another kind of thing is that we have a browser inside of Manus and the browser is very complicated. For example, if you log into your websites, you can choose to let Manus persist your login state and this turns out to be very tricky because sometimes the content of the web page can also be malicious. Maybe they're doing prompt injection and this I think is somehow out of scope for application companies. So we're working very closely with those computer use model providers, for example Anthropic and Google. They're adding a lot of guardrails here. So right now in Manus, every time you do some sensitive operations whether inside the browser or in the sandbox, Manus will require a manual confirmation and you must accept it or otherwise you have to take over it to finish it yourself. So I think it's pretty hard for us to design a well-designed solution but it's a progressive approach. So right now we're letting the user take over more frequently, but if the guardrail itself in the model gets better, we can do less.

🤍1 like💬 0 comments

Add to My Notes

00:55:20Lance

Yeah. How about the topic of evals? This has been discussed a lot quite a bit online if you probably seen you know Claude Code. They talked a lot about just doing less formal evals at least for code because code evals are more or less saturated; lots of internal dog fooding. How do you think about evals? Are they useful? What evals are actually useful? What's your approach?

🤍0 likes💬 1 comment

Add to My Notes

00:55:43Pete

Yeah. You know at the beginning at the launch of Manus we were using public academic benchmarks like GAIA, but then after launching to the public we find out that it's super misaligned. You know models that get high scores on GAIA, the user don't like it. So right now we use three different kinds of evaluations. First of all, most importantly is that for every completed session in Manus, we'll request the user to give a feedback—to give one to five stars. This is the gold standard; we always care about the average user rating. This is number one.

🤍1 like💬 0 comments

Add to My Notes

00:56:15Pete

And number two, we're still using some internal automated tests with verifiable results. For example, we have created our own data set with clear answers. But also we still use a lot of public academic benchmarks but we also created some data sets that's more focused on execution because most benchmarks out there are more about read-only tasks. So we designed some executing tasks or transactional tasks because we have the sandbox we can frequently reset the test environment. So these are the automated parts. And most importantly number three, we have a lot of interns. You know, you have to use a lot of real human interns to do evaluations on things like website generation or data visualization because it's very hard to design a good reward model that knows whether the output is visually appealing—it's about the taste. So we still rely on a lot of that.

🤍1 like💬 0 comments

Add to My Notes

00:57:11Lance

Perfect. Yeah. Let me ask you I know we're coming up on time, but I do want to ask you about this emerging trend of reinforcement learning with verifiable rewards versus just building tool calling agents. So like Claude Code, extremely good, and they have the benefit because they built the harness and they can perform RL on their harness and it can get really really good with the tools they provide in the harness. Do you guys do RL or how do you think about that? Because of course in that case you would have... using open models. I've been playing with this quite a bit lately. How do you think about that? Just like using tool calling out of the box with model providers versus doing RL yourself inside your environment with your harness.

🤍1 like💬 1 comment

Add to My Notes

00:57:50Pete

Yeah. I mentioned like before starting Manus I was kind of a model training guy. I've been doing pre-training, post-training, RL for a lot of years but I have to say that right now if you have sufficient resources you can try, but actually as I mentioned earlier, MCP is a big changer here. Because if you want to support MCP, you're not using a fixed action space. And if it's not a fixed action space, it's very hard to design a good reward and you cannot generate a lot of the rollouts and feedbacks will be unbalanced. So if you want to build a model that supports MCP, you are literally building a foundation model by yourself. So I think everyone in the community—model companies—they're doing the same thing for you. So right now, I don't think we should spend that much time on doing RL right now. But like as I mentioned earlier, we are just discovering exploring new ways to do maybe call it personalization or some sort of online learning but using parameter-free ways, for example collective feedbacks.

🤍1 like💬 0 comments

Add to My Notes

00:58:54Lance

Yeah. One little one along those lines is: is it the case that for example Anthropic's done reinforcement learning with verified rewards on some set of tools using Claude Code... Have you found that you can kind of mock your harness to use similar tool names to kind of unlock the same capability if that makes sense? Like for example, I believe they've obviously performed utilized glob, uses grep, uses some other set of tools for manipulating the file system. Can you effectively reproduce that same functionality by having the exact same tools with the same tool name, same descriptions in your harness? Or kind of how do you think about that—like unlocking the... Yeah. Right. You see what I'm saying?

🤍1 like💬 0 comments

Add to My Notes

00:59:39Pete

Yeah. I know the clear answer here, but for us, we actually try not to use the same name because it will... if you design your own function, you maybe have different requirements for that function and the parameters, the input arguments might be different. So you don't want to confuse the model. If the model is trained on a lot of post-training data that has some internal tools, you don't want to let the models be confused.

🤍1 like💬 0 comments

Add to My Notes

01:00:01Lance

Okay. Okay. Got it. Got it. Perfect. Um well, I think we're actually at time and I want to respect your time because I know it's early. You're in Singapore. It's very early for you. So well this was really good. Thank you. We'll definitely make sure this recording is available. We'll make sure slides are available. Any parting things you want to mention, things you want to call out, calls to action? Yeah, people should go use Manus, but the floor is yours.

🤍0 likes💬 0 comments

Add to My Notes

01:00:33Pete

Yeah. I just want to say everybody try this. We have a free tier.

🤍0 likes💬 0 comments

Add to My Notes

01:00:38Lance

Yeah. Yeah. Absolutely. Hey, thanks a lot, Pete. I'd love to do this again sometime.

🤍0 likes💬 0 comments

Add to My Notes

01:00:43Pete

Yeah. Thanks for having me.

🤍0 likes💬 0 comments

Add to My Notes

01:00:45Lance

Yep. Okay. Bye. Bye.

🤍0 likes💬 0 comments

Add to My Notes

Video Player

My Notes📝

Highlighted paragraphs will appear here