Transcribed Webinar on Context Engineering by Yichao (Peak) Ji from Manus

I transcribed the conversation from a highly informative webinar on context engineering with Yichao “Peak” Ji of Manus. You can watch the full session on YouTube.

Lance: All right. Thank you all for coming. We’ll go ahead and kick off the webinar now. I’m sure people will continue to stream in. I’m Lance, one of the founding engineers at LangChain. I’m joined by Peak from Manus. Peak, do you want to introduce yourself quickly?

Yichao (Peak): Yeah. Hey guys, I’m the co-founder and chief scientist of Manus. So, basically, I designed the agent framework and a lot of things in Manus, and I’m super excited to be here today. Thanks, Lance, for having me.

Lance: Yeah, we’re really excited to do this. First, Manus is a really cool product; I’ve been using it for a long time. Also, they put out a really nice blog post on context engineering a few months ago that influenced me a lot. So I want to give a quick overview of context engineering as I see it, and I’ll reference their piece. And then Peak is actually going to give a presentation talking about some new ideas not covered in the piece. So if you’ve already read it, Peak is going to cover some things that are new, which will hopefully be quite interesting for you. But I’ll kind of set the stage and I’ll hand it over to Peak. And then we’ll do some Q&A.

Lance: So, you might have heard this term ‘context engineering’ that emerged earlier this year. If you look through time with Google search trends, prompt engineering kind of took off following ChatGPT, so that’s showing December 2022. When we got this new thing, a chat model, there was a great deal of interest in how to prompt these things, and prompt engineering emerged as a discipline for working with chat models and prompting them. Now, context engineering emerged this year around May. We saw it really rising in Google Trends, and it corresponds a bit with this idea of the year of agents. And so why is that? One of the things that people have observed, if you’ve been building agents, is that context grows, and it grows in a very particular way. What I mean is we have an LLM bound to some number of tools that the LLM can call autonomously in a loop. The challenge is that for every tool call, you get a tool observation back, and that’s appended to this message list. These messages grow over time, and so you can get this unbounded explosion of messages as agents run. As an example, Manus mentioned in their piece that typical tasks require around 50 tool calls. Anthropic mentioned similarly that production agents can engage in conversations spanning hundreds of turns. So the challenge is that because agents are increasingly long-running and autonomous, and they utilize tools freely, you can accumulate a large amount of context through this accumulation of tool calls. Chroma put out a really nice report documenting the observation, simply, that performance drops as context grows. So there’s this paradox, this challenging situation: agents utilize lots of context because of tool calling, but we know that performance drops as context grows. This is a challenge that many of us have faced. As for the term itself, Karpathy of course kind of coined ‘context engineering’ on Twitter earlier this year. You can think about context engineering as the delicate art and science of filling the context window with just the right information needed for the next step. So it’s trying to combat this context explosion that happens when you build agents which call tools freely, where all those tool messages accumulate in your message queue: how do we curate things so that the right information is presented to the agent to make the correct next decision at all points in time? To address this, there are a few common themes I want to highlight that we’ve seen across a number of different pieces of work, including Manus, which I’ll mention here.
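
To make that loop concrete, here is a minimal sketch in Python of the pattern Lance is describing: an LLM bound to tools, called in a loop, with every tool observation appended to the message list. The `llm` and `tools` callables and the message format are placeholders, not any particular framework’s API.

```python
def run_agent(llm, tools, user_task, max_steps=50):
    """Drive an LLM tool-calling loop; `llm` and `tools` are placeholder callables."""
    messages = [{"role": "user", "content": user_task}]
    for _ in range(max_steps):
        action = llm(messages, tools=list(tools))  # model picks a tool call or a final answer
        if action["type"] == "final_answer":
            return action["content"]
        observation = tools[action["tool"]](**action["args"])
        # Both the call and its (often token-heavy) observation stay in the history,
        # which is why context grows with every step the agent takes.
        messages.append({"role": "assistant", "tool_call": action})
        messages.append({"role": "tool", "content": observation})
    return None
```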

Lance: Idea one is context offloading. We’ve seen this trend over and over. The central idea is that you don’t need all context to live in the message history of your agent. You can take information and offload it, send it somewhere else, so it’s outside the context window but can be retrieved, which we’ll talk about later. One of the most popular approaches here is just using a file system. Take the output of a tool message as an example: dump it to the file system and send back to your agent just the minimal piece of information necessary, so it can reference the full context if it needs to, but that full payload, for example a web search result that’s very token-heavy, isn’t spammed into your context window in perpetuity. You’ve seen this across a number of different projects. Manus uses this. We have a project called Deep Agents that utilizes the file system. Open Deep Research utilizes it; actually, agent state plays a similar role to an external file system there. Claude Code of course uses this very extensively, and long-running agents utilize it very extensively. So this idea of offloading context to a file system is very common and popular across many different examples of production agents that we’re seeing today.
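
A hedged sketch of what offloading can look like in practice: write the full, token-heavy payload to disk and hand the agent back only a path plus a short preview. The `workspace` directory and the field names are made up for illustration.

```python
import hashlib
from pathlib import Path

WORKSPACE = Path("workspace")  # hypothetical scratch directory for offloaded payloads

def offload_tool_result(tool_name: str, payload: str, preview_chars: int = 500) -> dict:
    """Write the full payload to disk; return only a small reference plus a preview."""
    WORKSPACE.mkdir(exist_ok=True)
    digest = hashlib.sha1(payload.encode()).hexdigest()[:12]
    path = WORKSPACE / f"{tool_name}_{digest}.txt"
    path.write_text(payload)
    return {
        "path": str(path),                   # the agent can re-read the file if it needs to
        "preview": payload[:preview_chars],  # just enough to decide the next step
        "total_chars": len(payload),
    }
```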

Lance: The second idea is reducing context. So offloading is simply taking some piece of information, a tool message that’s token-heavy, and not sending it all back to your message list, instead dumping it to a file system where it can be retrieved only as needed. That’s offloading. Reducing context is similar, but instead you’re summarizing or compressing information. Summarizing tool call outputs is one intuitive way to do this; we do this with Open Deep Research as an example. One thing that’s very interesting is that Claude 4.5 has actually added this: if you look at some of their most recent releases, they now support this out of the box. So this idea of pruning old tool calls, tool outputs, or tool messages is something Claude has now kind of built into their SDK. Summarizing or compacting the full message history: you see this with Claude Code in its compaction feature, once you hit a certain percentage of your overall context window. Cognition also talks about the idea of summarizing or pruning at agent-to-agent handoffs. So this idea of reducing context is a very popular theme we see across a lot of different examples, from Claude Code to our Open Deep Research to Cognition, and Claude 4.5 has incorporated this as well.
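
For reduction, a minimal illustrative helper might look like the following: summarize only when an observation is large, and prompt for the concrete details worth keeping. The threshold and prompt wording are assumptions, not anyone’s production values.

```python
def reduce_tool_output(llm, tool_output: str, max_chars: int = 4000) -> str:
    """Summarize a token-heavy observation before it enters the message history."""
    if len(tool_output) <= max_chars:
        return tool_output  # small outputs go back verbatim
    prompt = (
        "Summarize the following tool output for an agent. "
        "Preserve concrete facts, URLs, file paths, and numbers; drop boilerplate.\n\n"
        + tool_output
    )
    return llm(prompt)  # `llm` is a placeholder completion function
```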

Lance: Retrieving context. Now, this is one of the classic debates today that you might see raging on X or Twitter: the right approach for retrieving context. Lee Robinson from Cursor just gave a very nice talk at OpenAI demo day (I’ll make sure these slides are all shared so you can see these links), talking about how Cursor, for example, uses indexing and semantic search as well as simpler file-based search tools like glob and grep. Claude Code, of course, only uses the file system and simple search tools, notably glob and grep. So there are different ways to retrieve context on demand for your agent. Indexing and semantic search, or a file system with simple file search tools: both can be highly effective, and there are pros and cons we could talk about in the Q&A. But of course, context retrieval is central for building effective agents.
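
For flavor, here is a tiny grep-style retrieval tool of the kind such file-based agents rely on: plain text search over files, no index. It is a simplified sketch, not how Cursor or Claude Code actually implement search.

```python
import re
from pathlib import Path

def grep_tool(pattern: str, root: str = ".", max_hits: int = 50) -> list[str]:
    """Simple file-based retrieval in the spirit of grep: return 'path:line: text' hits."""
    hits = []
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        try:
            for i, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
                if re.search(pattern, line):
                    hits.append(f"{path}:{i}: {line.strip()}")
                    if len(hits) >= max_hits:
                        return hits
        except OSError:
            continue  # skip unreadable files
    return hits
```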

Lance: Context isolation is the other major theme we’ve seen quite a bit of, in particular splitting context across multiple agents. So what’s the point here? Each sub agent has its own context window, and sub agents allow for separation of concerns. Manus’ Wide Research speaks to this. Our Deep Agents uses this. Open Deep Research uses it. Sub agents are utilized in Claude’s multi-agent researcher, and Claude Code supports sub agents as well. So sub agents are a very common way to perform context isolation; we’ve seen this across many different projects.

Lance: Now one thing I thought was very interesting is caching context. Manus talks about this quite a bit. I’ll let Peak speak to this a bit later. But I think it’s a very interesting trick as well.

Lance: So I’ll just show a brief example from Open Deep Research. This is a very popular repo that we have. It’s basically an open-source deep research implementation, and it performs on par with some of the best implementations out there. You can check our repo, and we have results from Deep Research Bench showing that we’re top 10. It has three phases: scoping of the research, the research phase itself using a multi-agent architecture, and then a final one-shot writing phase. We use offloading: we basically create a brief to scope our research plan and offload that, so we don’t just save it in the context window, because that context window is going to get peppered with other things. We offload it so it’s saved independently and can be accessed, in our case from the LangGraph state, but it could also be from the file system; it’s the same idea. So you create a research plan, you offload it, and it’s always accessible. You go do a bunch of work, and you can pull it back in on demand, putting it at the end of your message list so it’s accessible and readily available to your agent to perform, for example, the writing phase. We use offloading, as you can see, to help steer the research and writing phases. We use reduction to summarize observations from token-heavy search tool calls; that’s done inside research itself. And we use context isolation across sub agents within research itself. This is kind of a summary of these various ideas across a bunch of different projects. And actually, Peak is going to speak to Manus in particular and some of the lessons they’ve learned. This just kind of sets the stage, and this slide summarizes what I talked about: these different themes of offloading, reducing, retrieving, isolating, and caching context, a number of popular projects, where they were used, and a few different links. I will share these slides in the notes. I do want to let Peak go ahead and present now, because I want to make sure we have plenty of time for him and for questions. But this just sets the stage. Peak, I’ll let you take it from here and I’ll stop sharing.

Yichao (Peak): Okay. Can you see my slides?

Lance: Yeah. Okay. Perfect.

Yichao (Peak): Okay. Thank you, Lance. I’m super excited to be here today to share some fresh lessons on context engineering that we learned from building Manus. I say fresh lessons because I realized that the last blog post I wrote about context engineering, the one you mentioned, was back in July. And it’s the year of agents, so July is basically the last century. Of course, before this session I went back and read it again, and luckily I think most of what I wrote in that blog still holds up today. But I don’t want to waste everybody’s time by just repeating what’s already inside that blog. So today, instead, I want to dig into some areas that I either didn’t go deep enough on before or didn’t touch at all. We’ll actually be focusing on the ‘discouraged’ column in Lance’s earlier slide, because personally I think exploring those non-consensus ideas often leads to the biggest inspirations.

Yichao (Peak): So here are the topics for today’s talk. First, we’ll cover a bit of the bigger question of why we need context engineering. Then we’ll have more on context reduction, more on context isolation, and finally some new stuff about context offloading, which we are testing internally here at Manus. Everything I’m sharing today is in production in Manus. It’s battle-tested, but I don’t know how long it will last, because things are changing super fast.

Yichao (Peak): Okay, let’s start with the first big question: why do we even need context engineering, especially when fine-tuning or post-training models has become much more accessible today? For example, the folks at Thinking Machines just released the Tinker API, which I like a lot. Beautiful design. But for me, the answer to why context engineering came through several painful stages of realization. Before starting Manus, I had already spent over 10 years in natural language processing, or NLP, which is basically what we called building language models before ChatGPT. And Manus is actually my second or third company. At my previous startup, we trained our own language models from scratch to do open-domain information extraction, building knowledge graphs and semantic search engines on top of them, and it was painful. Our product’s innovation speed was completely capped by the model’s iteration speed, even though back then the models were much smaller compared to today. A single training-plus-evaluation cycle could still take maybe one or two weeks. And the worst part is that, at that time, we hadn’t reached PMF yet, and we were spending all that time improving benchmarks that might not even matter for the product. So I think, instead of building specialized models too early, startups really should lean on general models and context engineering for as long as possible. Of course, I guess that’s now some kind of common wisdom. But as your product matures and open-source base models get stronger, I know it’s very tempting to think: hey, maybe I should just pick a strong base model, fine-tune it with my data, and make it really good at my use case. We’ve tried that too. And guess what? It’s another trap. To make RL work really well, you usually fix an action space, design a reward around your current product behavior, and generate tons of on-policy rollouts and feedback. But this is also dangerous, because we’re still in the early days of AI and agents, and everything can shift under your feet overnight. For us, the classic example was the launch of MCP. It completely changed the design of Manus from a compact, static action space to something that’s infinitely extensible. And if you have ever trained your own model, you know this kind of open-domain problem is super hard to optimize. Of course, you could pour massive effort into post-training that ensures generalization, but then aren’t you basically trying to become an LLM company yourself? You’re rebuilding the same layer they have already built, and that’s a duplication of effort. So maybe after all that buildup, here’s my point: be firm about where you draw the line. Right now, context engineering is the clearest and most practical boundary between application and model. So trust your choice.

Yichao (Peak): All right, enough philosophy; let’s talk about some real tech. First topic: context reduction. Here I want to distinguish two different kinds of reduction operations, because we think context reduction is fascinating, but it’s also a new concept and there are a lot of ways to do it. Here at Manus, we divide them into compaction and summarization. For compaction, every tool call and tool result in Manus actually has two different formats: a full format and a compact one. The compact version strips out any information that can be reconstructed from the file system or external state. For example, let’s say you have a tool that writes to a file, and it probably has two fields: a path and a content field. Once the tool returns, you can be sure that the file already exists in the environment. So in the compact format, we can safely drop the super long content field and just keep the path. And if your agent is smart enough, whenever it needs to read that file again, it can simply retrieve it via the path. So no information is truly lost; it’s just externalized. We think this kind of reversibility is crucial, because agents do chained predictions based on previous actions and observations, and you never know which past action will suddenly become super important 10 steps later. You cannot predict it. So this is a reversible reduction by using compaction.
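
Here is a rough sketch of that full-versus-compact idea; the record shape and field names are illustrative, not Manus’ actual schema.

```python
# A full tool record: the content field is large but fully recoverable from disk.
FULL_RECORD = {
    "tool": "write_file",
    "args": {"path": "report/draft.md", "content": "<several thousand tokens of text...>"},
    "result": {"ok": True},
}

def compact(record: dict) -> dict:
    """Reversible reduction: drop fields that can be reconstructed from external state."""
    compacted = {"tool": record["tool"], "args": dict(record["args"]), "result": record["result"]}
    if record["tool"] == "write_file":
        # The file now exists on disk, so the content can be re-read via the path.
        compacted["args"].pop("content", None)
    return compacted

# compact(FULL_RECORD) ->
# {"tool": "write_file", "args": {"path": "report/draft.md"}, "result": {"ok": True}}
```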

Yichao (Peak): Of course, compaction only takes you so far. Eventually your context will still grow and hit the ceiling, and that’s when we combine compaction with more traditional summarization. But we do it very carefully. For example, before summarizing, we might offload key parts of the context into files. And sometimes we go even more aggressive: we dump the entire pre-summary context as a text file, or simply a log file, into the file system, so that we can always recover it later. Like Lance just mentioned, some people just use glob and grep, and glob and grep also work for log files. So if the model is smart enough, it even knows how to retrieve that pre-summary context. I think the difference here is that compaction is reversible but summarization isn’t. Both reduce context length, but they behave very differently.

Yichao (Peak): And to make both methods coexist, we have to track some context length thresholds. At the top you have your model’s hard context limit, say 1 million tokens, pretty common today. But in reality, most models start degrading much earlier, typically around 200K, and you’ll begin to see what we call context rot: repetitions, slower inference, degraded quality. So by doing a lot of evaluations, it’s very important to identify that pre-rot threshold, typically 128K to 200K, and use it as the trigger for context reduction. Whenever your context size approaches it, you trigger context reduction, but starting with compaction, not summarization. And compaction doesn’t mean compressing the entire history. We might compact the oldest 50% of tool calls while keeping the newer ones in full detail, so the model still has fresh few-shot examples of how to use tools properly. Otherwise, in the worst case, the model will imitate the behavior and output that compact format with missing fields, and that’s totally wrong. After compaction we check how much free context we actually gained from the operation. Sometimes, as in this graph, after multiple rounds of compaction the gain is tiny, because even though it’s compact, it still uses context. And that’s when we go for summarization. But also keep in mind that when summarizing, we always use the full version of the data, not the compact one. And we still keep the last few tool calls and tool results in full detail, not summarized, because that allows the model to know where it left off and continue more smoothly. Otherwise, after summarization, sometimes the model will change its style or its tone, and we found that keeping a few tool call and tool result examples really helps.
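
Putting the thresholds, compaction, and summarization together, the policy Peak describes might be sketched like this. The token numbers, the 50% split, and the helper callables are illustrative; Manus’ real values and code are not public.

```python
def reduce_context(history, count_tokens, compact, summarize,
                   pre_rot_limit=170_000, keep_tail=5, min_gain=10_000):
    """Sketch of a compaction-first, summarization-second reduction policy.

    `history` is a list of tool-call/result records; `count_tokens`, `compact`,
    and `summarize` are caller-supplied callables.
    """
    if count_tokens(history) < pre_rot_limit:
        return history                                   # below the pre-rot threshold

    # 1) Compact the oldest 50%, keeping recent records in full as few-shot examples.
    cutoff = len(history) // 2
    compacted = [compact(r) for r in history[:cutoff]] + history[cutoff:]
    if count_tokens(history) - count_tokens(compacted) >= min_gain:
        return compacted

    # 2) Gain too small: summarize from the FULL records (not the compact ones),
    #    keeping the last few tool calls verbatim so the model knows where it left off.
    #    (Manus also dumps the full pre-summary context to a log file first,
    #    so it stays recoverable via grep.)
    summary = summarize(history[:-keep_tail])            # irreversible step
    return [{"role": "system", "content": summary}] + history[-keep_tail:]
```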

Yichao (Peak): Okay, now we’ve covered reduction; let’s talk about isolation. I really agree with Cognition’s blog where they warn against using multi-agent setups, because when you have multiple agents, syncing information between them becomes a nightmare. But this isn’t a new problem. Multiprocess and multi-thread coordination has been a classic challenge since the early days of computer programming, and I think we can borrow some wisdom here. I don’t know how many Go coders are here today, but in the Go programming language community, there’s a famous gopher quote: “Do not communicate by sharing memory; instead, share memory by communicating.” Of course this isn’t directly about agents, and it’s sometimes even wrong for agents. But the important thing is that it highlights two distinct patterns: by communicating, or by sharing memory. If we translate ‘memory’ here into ‘context’, we can see the parallel pretty clearly. By communicating is the easier one to understand, because it’s the classic sub agent setup. Here, for example, the main agent writes a prompt, the prompt is sent to a sub agent, and the sub agent’s entire context consists only of that instruction. We think if a task has a short, clear instruction and only the final output matters, say searching a codebase for a specific snippet, then just use the communication pattern and keep it simple, because the main agent doesn’t care how the sub agent finds the code. It only needs the result. This is what Claude Code does, typically using its task tool to delegate a separate, clear task to a sub agent. For more complex scenarios, in contrast, by sharing memory means that the sub agent can see the entire previous context, meaning all the tool usage history, but the sub agent has its own system prompt and its own action space. For example, imagine a deep research scenario where the final report depends on a lot of intermediate searches and notes. In that case, you should consider using the shared-memory pattern, or in our language, sharing context. Because even though you could save all those notes and searches into files and make the sub agent read everything again, you’re just wasting latency and context, and if you count the tokens, you might be using even more tokens to do it. So we think for those scenarios that require the full history, just use the shared-memory pattern. But be aware that sharing context is kind of expensive, because each sub agent has a larger input to prefill, which means you’ll spend more on input tokens. And since the system prompt and the action space differ, you cannot reuse the KV cache, so you have to pay the full price.
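
The two isolation patterns can be sketched as two ways of constructing a sub agent’s input. `run_agent` here is a placeholder for whatever actually drives the sub agent, not a real API.

```python
def spawn_by_communicating(run_agent, instruction: str) -> str:
    """'Communicate' pattern: the sub agent's context is only the instruction it receives."""
    return run_agent(
        system_prompt="You are a focused search sub agent.",
        messages=[{"role": "user", "content": instruction}],
    )

def spawn_by_sharing_context(run_agent, full_history: list, sub_system_prompt: str) -> str:
    """'Share memory' pattern: the sub agent sees the whole prior context but has its
    own system prompt and action space, so the KV cache cannot be reused."""
    return run_agent(system_prompt=sub_system_prompt, messages=list(full_history))
```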

Yichao (Peak): And finally, let’s talk a little bit about context offloading. When people say offload, they usually mean moving parts of the working context into external files. But as your system grows, especially if you decide to integrate MCP, one day you realize that the tools themselves can also take up a lot of context. And having too many tools in context leads to confusion, which we call context confusion: the model might call the wrong tools or even non-existent ones. So we have to find a way to also offload the tools. A common approach right now is doing dynamic RAG on tool descriptions, for example loading tools on demand based on the current task or the current status. But that causes two issues. First of all, since tool definitions sit at the front of the context, your KV cache resets every time. And most importantly, the model’s past calls to removed tools are still in the context, so they might few-shot the model into calling invalid tools or using invalid parameters. So to address this, we’re experimenting with a new layered action space in Manus. Essentially, we let Manus choose from three different levels of abstraction: number one, function calling; number two, sandbox utilities; and number three, packages and APIs. We’ll go deeper into these three layers of the action space.

Yichao (Peak): Let’s start from level one, function calling. This is classic, everyone knows it, and it is schema-safe thanks to constrained decoding. But we all know the downsides: as we mentioned, breaking the cache, and too many tools will cause some confusion. So in Manus right now, we only use a fixed number of atomic functions, for example reading and writing files, executing shell commands, searching files and the internet, and maybe some browser operations. We think these atomic functions have super clear boundaries, and they can work together to compose much more complex workflows. Then we offload everything else to the next layer, which is the sandbox utilities. As you know, each Manus session runs inside a full virtual machine sandbox, running on our own customized Linux system, and that means Manus can use shell commands to run pre-installed utilities that we develop for Manus. For example, we have format converters, we have speech recognition utilities, and even a very special Manus MCP CLI, which is how we call MCP. We do not inject MCP tools into the function calling space. Instead, we do everything inside the sandbox through the command line interface. And utilities are great, because you can add new capabilities without touching the model’s function calling space; it’s just commands pre-installed on your computer. If you’re familiar with Linux, you always know how to find those new commands, and you can even run --help to figure out how to use a new tool. Another good thing is that for larger outputs, they can just write to files or return results in pages, and you can use all those Linux tools, like grep, cat, less, more, to process the results on the fly. The trade-off here is that it’s super good for large outputs, but not that good for low-latency back-and-forth interactions with the front end, because you always have to visualize the interactions of your agent and show them to the user. So this is pretty tricky, but we think it already offloads a lot of things. And then we have the final layer, which we call packages and APIs. Here Manus can write Python scripts to call pre-authorized APIs or custom packages; for example, Manus might use a 3D design library for modeling, or call a financial API to fetch market data. We’ve actually purchased all these APIs on behalf of the user and pay for them; it’s included in the subscription. So we basically have a lot of API keys pre-installed in Manus, and Manus can access these APIs using the keys. I think these are perfect for tasks that require lots of computation in memory but don’t need to push all that data into the model context. For example, imagine you’re analyzing a stock’s entire year of price data: you don’t feed the model all the numbers. Instead, you let the script compute it, and only put the summary back into the context. And since code and APIs are super composable, you can chain a lot of things in one step. For example, with a typical API, you can do get city name, get city ID, get weather, all in one Python script. There’s also a paper from one of my friends called CodeAct, which a lot of people were discussing, and I think it’s the same idea, because code is composable and it can do a lot of things in one step. But it’s also not schema-safe; it’s very, very hard to do constrained decoding on CodeAct. So we think you should find the right scenarios for these features. For us, as we mentioned, everything that can be handled inside a compiler or interpreter runtime, we do using code; otherwise we use sandbox utilities or function calls. And the good thing is, if you have these three layers, from the model’s point of view all three levels still go through the standard function calls. So the interface stays simple, cache-friendly, and orthogonal across functions. As we mentioned, for sandbox utilities you’re still accessing these tools using the shell function, and if you’re using APIs or third-party packages, you’re just using the file function to write a script and then executing it using the shell function. So it does not add overhead to the model. It’s still all things that models are trained on and already familiar with.
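
As a rough illustration of the three layers all surfacing through the same few atomic functions, here is what a short action sequence might look like. The tool names (`read_file`, `shell`, `write_file`) and the `manus-mcp-cli` command are stand-ins, not Manus’ real interface.

```python
plan = [
    # Level 1: atomic function call (schema-safe via constrained decoding)
    {"tool": "read_file", "args": {"path": "data/prices.csv"}},

    # Level 2: sandbox utility, reached through the generic shell function
    {"tool": "shell", "args": {"command": "manus-mcp-cli call weather.get_forecast --city Tokyo"}},

    # Level 3: packages and APIs. Write a script, then execute it; only the
    # small printed summary ever enters the model context.
    {"tool": "write_file", "args": {"path": "analyze.py", "content": (
        "import pandas as pd\n"
        "prices = pd.read_csv('data/prices.csv')          # a year of data stays in memory\n"
        "print(prices['close'].describe().to_string())    # only this summary goes back\n"
    )}},
    {"tool": "shell", "args": {"command": "python analyze.py"}},
]
```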

Yichao (Peak): So let’s zoom out and connect the five dimensions: offload, reduce, retrieve, isolate, and cache. You’ll find that they are not independent. Offloading and retrieval enable more efficient reduction, and stable retrieval makes isolation safe. Isolation also slows down context growth and reduces the frequency of reduction. However, more isolation and reduction also affect cache efficiency and the quality of the output. So at the end of the day, I think context engineering is the science and art of finding a balance between multiple, potentially conflicting objectives. It’s really hard.

Yichao (Peak): All right. Before we wrap up, I want to leave you with maybe one final thought, and it’s kind of the opposite of everything I just said: please avoid context over-engineering. Looking back at the past six or seven months since Manus’ launch, the biggest leaps we’ve seen didn’t come from adding more fancy context management layers or clever retrieval hacks. They all came from simplifying, from removing unnecessary tricks, and from trusting the model a little more. Every time we simplified the architecture, the system got faster, more stable, and smarter, because we think the goal of context engineering is to make the model’s job simpler, not harder. So if you take one thing from today, I think it should be: build less and understand more. Well, thank you so much, everyone, and thanks again to Lance and the LangChain team for having me. Can’t wait to see what you all build next. Now back to Lance.

Lance: Yeah, amazing. Thank you for that. So we have a nice set of questions here. Maybe we can just start hitting them and we can kind of reference back to the slides if needed. And Peak, are your slides available to everyone?

Yichao (Peak): Oh yeah. I can share the PDF version afterwards.

Lance: Sounds good. Well, I’m going to start looking through some of the questions. Maybe we can start with the more recent ones first. So how does Manus call the various shell tools? How does it know which tools exist, and how to invoke them? Maybe you can explain a little bit about kind of the multi-tier kind of sandbox setup that you use with Manus.

Yichao (Peak): Yeah. Imagine you’re a person using a new computer. If you know Linux, you can imagine all the tools are located in /usr/bin. So we actually do two things. First of all, we have it in the system prompt, telling Manus that, hey, there are a lot of pre-installed command line utilities located in a specific folder. And for the most frequently used ones, we already inject them into the system prompt, but in a super compact way. We do not tell the agent how to use the tools. We only list them, and we tell the agent that it can use the --help flag safely, because all the utilities are developed by our team and they have the same format.

Lance: Got it. I know you talked a lot about using file system. What’s your take on using indexing? Do you spin up vector stores on the fly, if the context you’re working with gets sufficiently large? How do you approach that?

Yichao (Peak): Yeah, I think there’s no right and wrong in this space, as you’ve mentioned. But at Manus we do not use vector indexes, because right now every sandbox in a Manus session is a new one, and users want to interact with things fast, so we don’t have the time to build an index on the fly. We’re like Claude Code: we rely on grep and glob. But I think if you’re considering building something with more long-term memory, or if you want to integrate an enterprise knowledge base, you still have to rely on an external vector index, because it’s really about the amount of information you need to access. For Manus, it operates in a sandbox, and for a coding agent, you operate in a codebase. So it depends on the scale.

Lance: Yeah. So here’s a follow-up then. Let’s say I’m a user, I have my Manus account, and I interact with Manus across many sessions. Do you have the notion of memory? Claude has CLAUDE.md files that persist across all the different sessions of Claude Code. How about you guys? How do you handle long-term memory?

Yichao (Peak): Yeah. Actually, in Manus we have a concept called knowledge, which is kind of like explicit memory. For example, you can tell Manus: hey, remember, every time I ask for something, deliver it in Excel. And it’s not automatically inserted into some memory; it will pop up a dialog saying, here’s what I learned from our previous conversation, would you like to accept it or reject it? So this is the explicit one; it requires user confirmation. But we are also exploring new ways to do it more automatically. For example, a pretty interesting thing about agents is that, compared to chatbots, users correct the agent much more often. A common mistake Manus makes is in data visualization: if you’re using Chinese, Japanese, or Korean, a lot of times there will be font issues, and there will be errors in the rendered visualizations. So the user will often say, hey, you should use the Noto Sans CJK font. And for these kinds of things, different users will make the same correction, and we need to find a way to leverage that kind of collective feedback. We call it a self-improving agent with online learning, but in a parameter-free way.

Lance: Yeah. How about a different question that was raised here, and one I also think about quite a bit. You mentioned towards the end of your talk that you gained a lot from removing things, and a lot of that is probably because the models are getting better. So model capability is increasing, and you can remove scaffolding over time. How do you think about this? Because this is one of the biggest challenges that I’ve faced: over time, the model gets better, and I can remove certain parts of my scaffolding. So you’re building on top of this foundation, the water’s rising. Do you revisit your architecture every some number of months with new releases and just delete as the models get better? How do you approach that problem?

Yichao (Peak): Yeah, this is a super good question. Actually, we have already refactored Manus four or five times. We launched Manus in March, and now it’s October: already five times. So we think you cannot stop, because models are not only improving, they are changing. Models’ behavior changes over time. One way is to work closely with the model providers. But we also have an internal theory for how we evaluate and design our agent architecture, which I covered a little bit on Twitter before. Basically, we do not care about performance on a static benchmark. Instead, we fix our agent architecture and we switch between models. If your architecture can gain a lot from switching from a weaker model to a stronger model, then your architecture is more future-proof, because the weaker model of tomorrow might be as good as a stronger model today. So we think switching between weaker and stronger models can give you some early signal of what will happen next year, and give you some time to prepare your architecture. For Manus, we do this kind of review every one or two months. And we do some research internally, using open-source models and maybe early access to proprietary models, to prepare the next release even before the launch of the next model.

Lance: Yeah. It’s a good observation. You can actually do testing of your architecture by toggling different models that exist today. Yeah, that makes a lot of sense. What about best practices or considerations for format for storing data? So, markdown files, plain text, log, anything you prefer in particular? How do you think about that kind of file formats?

Yichao (Peak): Yeah. I think it’s not about plain text versus markdown, but we always prioritize line-based formats, because that allows the model to use grep, or to read from a range of lines. Also, markdown can sometimes cause trouble. Models are trained to use markdown really well, and sometimes, for some models I don’t want to name, they output too many bullet points if you use markdown too often. So actually, we want to use more plain text.

Lance: Yeah, makes sense. How about on the topic of compaction versus summarization? Let’s hit on summarization. This is an interesting one that I’ve been asked a lot before. How do you prompt to produce good summaries? So, for example, summarization, like you said, is irreversible. So, if you don’t prompt it properly, you can actually lose information. The best answer I came up with is just tuning your prompt for high recall. But how do you approach this? So summarization, how do you think about prompting for summarization?

Yichao (Peak): Yeah, we actually tried a lot to optimize the prompt for summarization. But it turns out a simple approach that works really well is to not use a free-form prompt and let the AI generate everything. Instead, you can define a kind of schema. It’s just a form with a lot of fields, and you let the AI fill them in. For example: here are the files I’ve modified, here’s the goal of the user, here’s where I left off. If you use this kind of more structured schema, at least the output is stable, and you can iterate on it. So just don’t use free-form summarization.
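
A sketch of what such a summarization schema might look like, here using Pydantic purely for convenience. The field names are illustrative, not the form Manus actually uses.

```python
from pydantic import BaseModel

class ContextSummary(BaseModel):
    """Illustrative summary schema: a form the model fills in, not free-form prose."""
    user_goal: str             # what the user ultimately wants
    modified_files: list[str]  # files created or changed so far
    key_findings: list[str]    # facts and results that must survive summarization
    open_questions: list[str]  # unresolved items
    next_step: str             # where the agent left off, so it can resume smoothly
```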

Lance: Got it. Yeah, that’s a great observation. So you use structured outputs rather than free-form summarization to enforce that certain things are always summarized. That makes a lot of sense. How about compaction then? Actually, I want to make sure I understood that. So with compaction, let’s say it’s a search tool: you have the raw search tool output, and that would be your raw message, and then the compaction would just be a file name or something. Is that right?

Yichao (Peak): Yeah, it is. And it’s not only the tool call; it’s also applied to the result of the tool. Interestingly, we found that almost every action in Manus is reversible in this way, if you can offload it to the file system or external state. And for most of these actions, you already have a unique identifier: for file operations you have the file path, for browser operations you have the URL, and even for search actions you have the query. So it’s already there.

Lance: Yeah. Okay. That’s a great one. I just want to hit that again, because I’ve had this problem a lot. So, for example, I have an agent that uses search, and it returns a token-heavy tool call. I don’t want to return that whole tool message to the agent. I’ve done things like summarization or compaction, and sent the summary back. But how do you approach that? Because you might want all that information to be accessible to the agent for its next decision, but you don’t want that huge context block to live inside your message history. So how do you approach that? You could send the whole message back but then remove it later; that’s what Claude does now. You could do a summarization first and send the summary over. Or you could send everything and then do compaction, so that later on you don’t have the whole context in your message history, you only have, like, a link to the file. How do you think about that specifically, if you see what I’m saying?

Yichao (Peak): Yeah, I know. Actually, it depends on the scenario. For complex search, I mean not just one query but multiple queries where you want to gather the important things and drop everything else, I think you should use sub agents, or what we internally call “agent as tool”. From the model’s perspective, it’s still a function, maybe called advanced search, but what it triggers is actually another sub agent. That sub agent is more like a workflow, an agentic workflow with a fixed output schema, and that is the result returned to the main agent. But for other, simpler kinds of search, for example just searching Google, we use the full detailed format, append it to the context, and rely on compaction. But also, we always instruct the model to write down intermediate insights or key findings into files, in case compaction happens earlier than the model expects. And if you do this really well, you actually don’t lose a lot of information through compaction, because those old tool calls often become irrelevant over time.
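
A hedged sketch of the “agent as tool” idea: the main agent sees an ordinary function, while behind it an agentic workflow runs several searches and returns a fixed-schema digest. The `run_subagent` callable and the schema fields are assumptions for illustration, not Manus’ implementation.

```python
from pydantic import BaseModel

class SearchDigest(BaseModel):
    """Fixed output schema for a hypothetical `advanced_search` agent-as-tool."""
    queries_run: list[str]
    key_findings: list[str]
    sources: list[str]  # URLs or file paths worth keeping around

def advanced_search(topic: str, run_subagent) -> SearchDigest:
    """Looks like a plain function to the main agent, but is backed by a sub agent
    workflow that runs multiple searches and returns only the structured digest."""
    raw = run_subagent(
        system_prompt="Run multiple searches, take notes, then emit a SearchDigest.",
        task=topic,
        output_schema=SearchDigest,  # enforced via structured output / constrained decoding
    )
    return SearchDigest.model_validate(raw)
```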

Lance: Yeah, that makes sense. And I like the idea of agent as tool. We do that quite a bit and it is highly effective. But that brings up another interesting point that you referenced a little bit: agent-to-agent communication. How do you address that? Walden Yan from Cognition had a very nice blog post talking about this; it’s a major problem they have with Devin, the kind of communication between agents. How do you think about that problem: ensuring sufficient information is transferred, but not overloading, like you said, the prefill of the sub agent with too much context?

Yichao (Peak): Yeah. At Manus, we launched a feature called Wide Research a month ago. Internally we call it agentic MapReduce, because we were inspired by the design of MapReduce. And it’s kind of special for Manus, because there’s a full virtual machine behind the session. So one way we pass information or context from the main agent to the sub agents is by sharing the same sandbox: the file system is there, and you only need to pass the relevant paths. And I think sending information to the sub agents is not that hard; the more complex thing is how to get correct output back from the different agents. What we did here is a trick: whenever the main agent wants to spawn a new sub agent, or maybe 10 sub agents, we make the main agent define the output schema. And from the sub agent’s perspective, there’s a special tool called submit result, and we use constrained decoding to ensure that what the sub agent submits back to the main agent matches the schema defined by the main agent. So you can imagine that this kind of MapReduce operation will generate a spreadsheet, and the spreadsheet is constrained by the schema.
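
Here is a rough sketch of that agentic MapReduce contract: the main agent supplies the row schema, each sub agent must submit through a schema-constrained tool, and the rows assemble into a table. `spawn_subagent` and the example schema are illustrative placeholders, not the actual Wide Research interface.

```python
def wide_research(spawn_subagent, items: list[str], output_schema: dict) -> list[dict]:
    """Map step: one sub agent per item, each forced to answer via submit_result.
    Reduce step: collect the schema-conforming rows into a spreadsheet-like table."""
    rows = []
    for item in items:
        submission = spawn_subagent(
            task=f"Research this item and submit your result: {item}",
            tools=[{
                "name": "submit_result",
                "description": "Submit your findings in the required schema.",
                "parameters": output_schema,  # constrained decoding enforces this shape
            }],
        )
        rows.append(submission)  # one row per sub agent
    return rows

# Example row schema the main agent might define (illustrative):
ROW_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "summary": {"type": "string"},
        "source_url": {"type": "string"},
    },
    "required": ["name", "summary", "source_url"],
}
```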

Lance: That’s an interesting theme that seems to come up a lot in how you design Manus: you use schemas and structured outputs, both for summarization and for this agent-to-agent communication. So it’s kind of using schemas as contracts, between agent and sub agent, or between a tool and your agent, to ensure that sufficient information is passed in a structured, complete way. In summarization, you use a schema as well. Okay, fantastic. This is very, very helpful. I’m poking around some other interesting questions here. Any thoughts on models? I think you guys are using Anthropic, but do you work with open models? Do you do fine-tuning? You talked a lot about working with the KV cache; do you use open models for that? How do you think about model choice?

Yichao (Peak): Yeah. Actually, right now we don’t use any open-source models, and interestingly it’s not about quality, it’s about cost. We often think open-source models can lower the cost, but if you’re at the scale of Manus, and if you’re building a real agent where the input is way longer than the output, then the KV cache is super important, and distributed KV cache is very hard to implement with open-source solutions. The frontier LLM providers have more solid infrastructure for globally distributed caching. So if you do the math, at least for Manus, we find that using these flagship models can sometimes be even cheaper than using open-source models. And right now we’re not only using Anthropic. Of course, Anthropic’s model is the best choice for agentic tasks, but we’re also seeing progress in Gemini and in the OpenAI models. I think right now these frontier labs are not converging in their directions: if you’re doing coding, of course you should use Claude; if you want to do more multimodal things, you should use Gemini; and the OpenAI models are super good at complex math and reasoning. So for application companies like us, one of our advantages is that we don’t have to build on top of only one model. You can do task-level routing, or maybe even subtask- or step-level routing, if you can handle that kind of KV cache invalidation. So I think it’s an advantage for us, and we do a lot of evaluations internally to know which models to use for which subtask.

Lance: Yeah, yeah. That makes a lot of sense. I want to clarify one little thing. With the KV cache, what from the providers are you using for cache management? I know Anthropic has prompt caching, as an example. Is that what you mean?

Yichao (Peak): Yeah.

Lance: Okay, got it. Yeah, cool. Okay, perfect. Cool. I’m just looking through some of the other questions. Yeah, tool selection is a good one. Right. So, you were talking about this. You don’t use indexing of tool descriptions, and fetching tools on the fly based on semantic similarity. How do you handle that? What’s the threshold for too many tools? Yeah, tool choice is a classic. How do you think about that?

Yichao (Peak): Yeah. First of all, it depends on the model; different models have different capacities for tools. But I think a rule of thumb is to try not to include more than 30 tools. That’s just a random number in my mind. But actually, if you’re building what we call a general AI agent like Manus, you want to make sure those native functions are super atomic, and then there aren’t that many atomic functions you need to put inside the action space. For Manus right now, we only have like 10 or 20 atomic functions, and everything else is in the sandbox. So we don’t have to pull things in dynamically.

Lance: Yeah, good point actually. Let’s explain that a little bit more. So you have, let’s say, 10 tools that can be called directly by the agent. But then, like you said, the agent can also choose to, for example, write a script and then execute that script, so that expands its action space hugely without giving it a separate tool for everything. You don’t have an independent tool for each possible script, of course; that would be insane. So a very general tool, to write a script and then run it, does a lot. Is that what you mean?

Yichao (Peak): Yeah. Yeah. Exactly. You know why we’re super confident calling Manus a general agent? Because it runs on a computer, and computers are Turing complete. The computer is the best invention of humanity. Theoretically, an agent can do anything that a junior intern can do using a computer. So with the shell tool and the text editor, we think it’s already Turing complete, and you can offload a lot of things to the sandbox.

Lance: Yeah. Okay. That makes a lot of sense. Right. You mentioned CodeAct. With code agents, my understanding is the model will actually always produce a script, and that’ll then be run inside a code sandbox; for every tool call, effectively a script is generated and run. It sounds like you do some hybrid, where sometimes Manus can just call tools directly, but other times it can actually choose to do something in the sandbox. Is that right? So it’s kind of a hybrid approach.

Yichao (Peak): Yeah. I think this is super important, because we actually tried to use CodeAct entirely for Manus. The problem is, if you’re using code, you cannot leverage constrained decoding, and things can go wrong. But CodeAct has some special use cases, as I mentioned earlier in the slides, for example processing a large amount of data. You don’t have to put everything in the tool result; you put it inside the Python runtime’s memory, and you only return the result to the model. So we think you should do it in a hybrid way.

Lance: Got it. Allow for tool calling, and you have some number of tools, maybe 10 or so, that are just called directly, and some number of tools that actually run in the sandbox itself. Perfect. That makes a ton of sense. Very interesting. How about planning? Tell me about planning. I know Manus has this to-do tool, or it generates a to-do list at the start of tasks. Yeah, tell me about that.

Yichao (Peak): Yeah, I think this is very interesting. At the beginning, Manus used that todo.md paradigm. It’s kind of, I don’t want to use the word stupid, but it actually wastes a lot of turns. Back in maybe March or April, if you checked the log of some Manus tasks, maybe one third of the actions were about updating the to-do list. It wastes a lot of tokens. So right now we’re using more structured planning. For example, if you use Manus, there’s a planner at the bottom of the system. Internally it’s also a kind of tool call; we implemented it using the agent-as-tool paradigm, so there’s a separate agent that manages the plan. So actually, in the latest version of Manus, we are no longer using that todo.md thing. Of course, todo.md still works and it can generate good results, but if you want to, say, save tokens, you can find another way.

Lance: Got it. Yeah. So you have a planner agent, and for a subtask, it’ll be more like an agent-as-tool-call type of thing. Yeah. Got it.

Yichao (Peak): And it’s very important to have a separate agent with a different perspective, so it can do some external review. And you can use different models for planning; for example, sometimes Grok can generate some very interesting insights.

Lance: Yeah. That’s a great one actually. So thinking about multi-agent then. And so how do you think about that? So you might have a planning agent with its own context window, makes a plan, produces some kind of plan object, maybe it’s a file. Or maybe it just calls sub agents directly. How do you think about that? And how many different sub agents do you typically recommend using?

Yichao (Peak): Yeah, I think this also depends on your design. But Manus is not the typical multi-agent system. We’ve seen a lot of agent designs that divide by role: you have a designer agent, a programmer agent, a manager agent. We don’t do that, because we think the reason that pattern exists is that it’s how human companies work, and that’s due to the limitations of human context. So Manus is a multi-agent system, but we do not divide by role. We only have very few agents: a huge general executor agent, a planner agent, a knowledge management agent, and maybe a data API registration agent. We are very, very cautious about adding more sub agents, for the reason we mentioned before: communication is very hard. And we implement most sub agents as agents-as-tools, as we mentioned before.

Lance: Yeah, that’s a great point. I see this mistake a lot. Or I don’t know if it’s a mistake. But you see anthropomorphizing agents a lot. It’s my designer agent. And I think it’s kind of a forced analogy to think about, like a human org chart in your sub agents. So got it. So for you, it’s like a planner and knowledge manager. A knowledge manager might do what? What will be the task of knowledge manager?

Yichao (Peak): Yeah, it’s even simpler. As we mentioned, we have a knowledge system in Manus. What the knowledge agent does is review the conversation between the user and the agent and figure out what should be saved to long-term memory. So it’s that simple.

Lance: Got it. Yeah. Okay. So it’s a memory manager, a planner, and then a general executor sub agent that can just call all the tools or actions in the sandbox. That makes sense. Keep it simple. I like that a lot. Right. That makes a lot of sense. Yeah, let me see if there’s anything else. There’s a bunch of questions here, but we did hit a lot. So, how about guardrailing? Someone asked a question about safety and guardrailing. How do you think about this? I guess that’s the nice thing about a sandbox, but tell me a little bit about that. How do you think about it?

Yichao (Peak): Yeah, I think this is a very sensitive question, because if you have a sandbox that’s connected to the internet, everything is dangerous. So we have put a lot of effort into guardrailing. At a minimum, we do not let information get out of the sandbox. For example, if you get prompt-injected, we have checks on outgoing traffic; we ensure that no tokens or anything like that go out of the sandbox. And if the user wants to get something out of the sandbox, we have a step for, as we call it, removing things, to ensure that no sensitive information goes out. The other issue is that we have a browser inside Manus, and the browser is very complicated. For example, if you log into some of your websites, you can choose to let Manus persist your login state, and this turns out to be very tricky, because sometimes the content of a web page can also be malicious, maybe doing prompt injection, and this is somewhat out of scope for an application company. So we’re working very closely with the computer-use model providers, for example Anthropic and Google, and they’re adding a lot of guardrails here. Right now in Manus, every time it does some sensitive operation, whether inside the browser or in the sandbox, Manus will require a manual confirmation, and you must accept it or else take over and finish it yourself. So I think it’s pretty hard for us to have a perfectly designed solution, but it’s a progressive approach: right now we let the user take over more frequently, but as the guardrails in the models themselves get better, we can do less.

Lance: Yeah. How about the topic of evals? This has been discussed a lot, quite a bit online, as you’ve probably seen. Claude Code talked a lot about just doing less formal evals, at least for code, because code evals are more or less saturated; lots of internal dogfooding. How do you think about evals? Are they useful? Which evals are actually useful? What’s your approach?

Yichao (Peak): Yes. Yeah. At the beginning, at the launch of Manus, we were using public academic benchmarks, like GAIA. But after launching to the public, we found they were super misaligned: models that get high scores on GAIA, the users don’t like. So right now we have three different kinds of evaluation. First, and most importantly, for every completed session in Manus we ask the user to give feedback, one to five stars. This is the gold standard; we always care about the average user rating. That’s number one. Number two, we still use internal automated tests with verifiable results. We’ve created our own datasets with clear answers, and we still use a lot of public academic benchmarks, but we’ve also created some datasets that are more focused on execution, because most benchmarks out there are about read-only tasks. So we designed some execution or transactional tasks, and because we have the sandbox, we can frequently reset the test environment. Those are the automated parts. And number three, we have a lot of interns. You have to use a lot of real human evaluators for things like website generation or data visualization, because it’s very hard to design a good reward model that knows whether the output is visually appealing. It’s about taste. So we still rely a lot on humans.

Lance: Perfect. Yeah. I know we’re coming up on time, but I do want to ask you about this emerging trend of reinforcement learning with verifiable rewards versus just building tool-calling agents. So Claude Code is extremely good, and they have an advantage because they built the harness and they can perform RL on their harness, so it can get really, really good with the tools they provide in the harness. Do you guys do RL? Or how do you think about that? Because, of course, in that case you would have to use open models. I’ve been playing with this quite a bit lately. How do you think about just using tool calling out of the box with model providers, versus doing RL yourself inside your environment with your harness?

Yichao (Peak): Yeah, as I mentioned, before starting Manus I was kind of a model training guy; I did pre-training, post-training, and RL for a lot of years. I have to say that right now, if you have sufficient resources, you can try it. But as I mentioned earlier, MCP is a big game changer here. If you want to support MCP, you’re not using a fixed action space, and if it’s not a fixed action space, it’s very, very hard to design a good reward, and you cannot generate enough rollouts; the rollouts and feedback will be unbalanced. So if you want to build a model that supports MCP, you are literally building a foundation model by yourself. And everyone in the community, the model companies, they’re doing the same thing; they’re doing that work for you. So I don’t think we should spend that much time on RL right now. But as I mentioned earlier, we are exploring new ways to do, maybe call it personalization, or some sort of online learning, but in a parameter-free way, for example with collective feedback.

Lance: Yeah. One little one along those lines. Is it the case that, for example, Anthropic has done reinforcement learning with verifiable rewards on some set of tools used in Claude Code? Have you found that you can kind of model your harness to use similar tool names, to kind of unlock the same capability, if that makes sense? For example, I believe it obviously utilizes glob, grep, and some other set of tools for manipulating the file system. Can you effectively reproduce that same functionality by having the exact same tools, with the same tool names and same descriptions, in your harness? Or how do you think about that unlocking? You see what I’m saying?

Yichao (Peak): Yeah. Yeah. I don’t know the clear answer here. But for us, we actually try not to use the same names, because if you design your own function, you may have different requirements for it, and the parameters, the input arguments, might be different. So you don’t want to confuse the model, if the model is trained on a lot of post-training data that includes some internal tools. You don’t want the model to get confused.

Lance: Okay. Okay. Got it. Got it. Perfect. Um well, I think we’re actually at time, and I want to respect your time, because I know it’s early, you’re in Singapore, it’s very early for you. So well, this was really good. Thank you. We’ll definitely make sure this recording is available. We’ll make sure slides are available. Any parting things you want to mention, things you want to call out, calls to action? Yeah, people should go use Manus, but the floor is yours.

Yichao (Peak): Yeah. I just want to say everybody try this. We have a free tier. Yeah.

Lance: Yeah. Absolutely. Hey, thanks a lot, Peak. I’d love to do this again sometime.

Yichao (Peak): Yeah. Thanks for having me.

Lance: Bye.

Yichao (Peak): Bye.