How do general-purpose task agents, such as Manus, work?

AI Answers · 5 months ago · Published by AI Sharing Circle


General-purpose task agents such as Manus are designed to mimic human problem-solving: they understand the user's intent, decompose complex tasks, and coordinate several components to reach the goal. The core of Manus is its multi-agent architecture, in which multiple agents divide up the work and collaborate on the general-purpose tasks a user submits. The workflow can be summarized in the following key steps:

Intent recognition: the first step in understanding user needs

The starting point for task execution is an accurate understanding of the user's needs. Manus's intent recognition module first receives the user's input, e.g. a text instruction, and then performs intent recognition and keyword extraction on it. For example, if the user enters "I want to travel to Japan and need a travel plan", Manus extracts the keyword "japan-trip" and recognizes the task type as "travel".

When the user's request is too general for the system to recognize the intent accurately, Manus adopts a guiding strategy: it engages the user in several rounds of dialogue to clarify the requirements step by step. The system also lets users upload documents, pictures, and other materials as auxiliary input for intent recognition, so that it can understand the user's intent more fully.
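
To make the shape of this step concrete, here is a minimal sketch of what an intent recognition result might look like. It is not Manus's implementation: the `Intent` record, the `recognize_intent` function, and the toy word lists are all assumptions, and the set lookups stand in for whatever model Manus actually uses. The point is only the output (a task type plus a keyword) and the fallback to a clarifying dialogue.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Intent:
    task_type: Optional[str]   # e.g. "travel"; None when unrecognized
    keyword: Optional[str]     # e.g. "japan-trip"; reused later as a folder name
    needs_clarification: bool  # True means: ask the user a follow-up question


# Toy vocabularies standing in for a real model (assumptions, not Manus data).
TRAVEL_WORDS = {"travel", "trip", "itinerary"}
KNOWN_PLACES = {"japan", "france", "italy"}


def recognize_intent(text: str) -> Intent:
    words = {w.strip(".,!?\"'").lower() for w in text.split()}
    places = sorted(words & KNOWN_PLACES)
    if words & TRAVEL_WORDS and places:
        return Intent("travel", f"{places[0]}-trip", False)
    # Too vague to classify: fall back to a clarifying dialogue.
    return Intent(None, None, True)


print(recognize_intent("I want to travel to Japan and need a travel plan"))
print(recognize_intent("Help me with something"))
```

A request that matches nothing comes back with `needs_clarification=True`, which is where the multi-round dialogue described above would begin.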

Task initialization: building an isolated execution environment

After grasping the user's intent, Manus enters the task initialization phase. The system uses the recognized task keyword, such as "japan-trip", to automatically create a separate task folder, which stores all intermediate artifacts and final results produced during execution.

Manus also starts a separate Docker container for each task, so that each task runs in a clean, isolated environment and different tasks cannot interfere with each other. The system automatically cleans up the Docker container once the task is complete.
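
A rough sketch of this initialization step follows. It creates the task folder and builds, but deliberately does not run, a `docker run` command that mounts the folder into a fresh container. The function name `init_task`, the container name, the `/workspace` mount point, and the `python:3.12-slim` image are illustrative assumptions; the article does not say which image or options Manus uses.

```python
import tempfile
from pathlib import Path


def init_task(keyword: str, root: Path) -> tuple[Path, list[str]]:
    """Create the task folder and build (but do not run) a container command."""
    task_dir = root / keyword
    task_dir.mkdir(parents=True, exist_ok=True)
    docker_cmd = [
        "docker", "run",
        "--rm",                          # remove the container when it exits
        "--name", f"task-{keyword}",     # hypothetical naming scheme
        "-v", f"{task_dir}:/workspace",  # mount the task folder (needs an absolute path)
        "-w", "/workspace",              # start in the mounted folder
        "python:3.12-slim",              # placeholder image
        "sleep", "infinity",             # keep the container alive for later commands
    ]
    return task_dir, docker_cmd


with tempfile.TemporaryDirectory() as tmp:
    task_dir, cmd = init_task("japan-trip", Path(tmp))
    print(task_dir.name, task_dir.is_dir())
    print(" ".join(cmd))
```

Passing the list to `subprocess.run(cmd)` would start the container on a machine with Docker installed; the `--rm` flag is one simple way to get the automatic cleanup described above.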

Step-by-step planning: reasoning models to decompose complex tasks

After task initialization comes step planning: Manus uses a powerful reasoning model to break the task down into detailed steps, a key part of automating complex tasks. The reasoning model combines the results of intent recognition with task-related context to decompose a large goal into a series of executable subtasks.

For example, for the request "Japan travel planning", the reasoning model may break it down into steps such as "search for Japan travel tips", "check flight and hotel information", "make a detailed itinerary", and so on. The resulting steps are written to the [todo.md](https://t.co/tYosIUPa9o) file in the task folder, forming a structured task list that guides subsequent execution.
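
As an illustration, the planned steps could be rendered into a Markdown task list along these lines. The exact layout of Manus's todo.md is not documented in this article, so the heading line and the `- [ ]` list syntax are assumptions, and `render_todo` is a hypothetical helper.

```python
def render_todo(title: str, steps: list[str]) -> str:
    """Render planned steps as a Markdown task list with unchecked boxes."""
    lines = [f"# {title}", ""]
    lines += [f"- [ ] {step}" for step in steps]
    return "\n".join(lines) + "\n"


steps = [
    "Search for Japan travel tips",
    "Check flight and hotel information",
    "Make a detailed itinerary",
]
print(render_todo("japan-trip", steps))
```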

Task Execution: Multi-Agent Collaboration for Efficient Operation

The task execution phase is the core of Manus. The system traverses the [todo.md](https://t.co/tYosIUPa9o) file, which contains a list of tasks in Markdown format: [ ] marks a task still to be performed, and [x] marks a completed task.

Manus's task scheduling center, which can be thought of as the main thread, reads the pending tasks one by one and issues a so-called "function call" with the task's context. Here, "function call" means that the system invokes predefined function modules, i.e., the various agents, according to the task's requirements. Manus has a variety of built-in agents, such as a search agent, a code agent, and a data-analysis agent, each of which focuses on a specific type of task.

Based on the result of the "function call", Manus schedules the corresponding agent to execute the task. Any artifacts the agent generates during execution, such as search results, code files, and analysis reports, are written to the task folder in the Docker container so that data is managed and stored in one place. After the task is executed, the main thread updates the [todo.md](https://t.co/tYosIUPa9o) file, marks the task as completed, and moves on to the next task in the list until all steps are complete.
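
The scheduling loop described above can be sketched as follows. This is a simplified illustration under several assumptions: the agents are stub functions, `choose_agent` replaces the model's function call with a naive keyword check, and the task list uses the `- [ ]` / `- [x]` Markdown syntax. None of these names come from Manus itself.

```python
import re
from typing import Callable

# Matches "- [ ] task" (pending) and "- [x] task" (done).
TODO_LINE = re.compile(r"^- \[( |x)\] (.+)$")


def search_agent(task: str) -> str:
    return f"search results for: {task}"


def code_agent(task: str) -> str:
    return f"code written for: {task}"


AGENTS: dict[str, Callable[[str], str]] = {
    "search": search_agent,
    "code": code_agent,
}


def choose_agent(task: str) -> str:
    """Stand-in for the model's function call: route by a naive keyword check."""
    return "code" if "script" in task.lower() else "search"


def run_todo(todo: str) -> tuple[str, list[str]]:
    """Execute every pending task in order and tick it off in the list."""
    outputs = []
    lines = todo.splitlines()
    for i, line in enumerate(lines):
        match = TODO_LINE.match(line)
        if match and match.group(1) == " ":
            task = match.group(2)
            outputs.append(AGENTS[choose_agent(task)](task))
            lines[i] = f"- [x] {task}"  # mark the task as completed
    return "\n".join(lines), outputs


todo = "- [ ] Search for Japan travel tips\n- [ ] Write a script to total the budget"
updated, outputs = run_todo(todo)
print(updated)
print(outputs)
```

In a real system each agent would write its artifacts into the task folder rather than return a string, and the updated list would be written back to todo.md after every step.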

Summarizing: outputting results and collecting user feedback

After all the tasks in the [todo.md](https://t.co/tYosIUPa9o) file are marked as complete, Manus enters the final stage: summarizing and organizing. The main thread integrates all the artifacts generated during execution according to the user's initial requirements and forms the final structured output.

The final task results will be presented in various forms, such as documents, code, images, links, etc., and will be made available for users to browse or download. In order to continuously optimize system performance and user experience, Manus also collects user satisfaction with the quality of task completion and final results, providing valuable reference for subsequent iterations and upgrades.

Search Agent Workflow Explained: Simulating Human Browsing Behavior

The core of the Manus solution lies in the design of the agent that executes tasks and the scheduling process of the main thread. Taking the search agent as an example, a deeper understanding of its execution steps when handling tasks like "Japan Travel Plan" can help us better understand how Manus works.

Keyword Extraction and Search: The search agent first obtains keyword information such as "japan-trip" and calls a third-party search API, such as Google's, to retrieve 10-20 relevant search results.
Simulated web browsing: The search agent then simulates a user browsing the web. It "clicks" the first link in the search results, uses a headless browser to load the page, captures the page text, and takes a screenshot to obtain visual information. (Note: A headless browser is a browser that runs without a graphical user interface and is commonly used for web automation and data crawling.)
Multimodal Information Extraction: Next, the search agent calls a model that supports multimodal input. (Note: Multimodal models can process multiple types of data, such as text and images, at the same time.) Taking the current task requirements and the web page information as input, the model extracts useful information from the page, e.g., determining whether the content contains results that meet the travel plan requirements. If the current page does not contain enough information, the agent also analyzes the structure of the page to find and return the next button element that might lead to useful information.
Iterative Information Gathering: The Search agent simulates the user's clicks and scrolls to obtain additional web content and visual information. This process is repeated several times until the information collected satisfies the task requirements.
Content saving: Finally, the search agent saves all the collected information to the task folder to provide data support for the subsequent steps.

The core of the search agent is simulating how real users browse web pages, which lets it locate and extract the required information from the vast amount of content on the Internet much as a human would. The headless browser and the multimodal model are the key technologies that make this possible.
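
The browse, extract, and follow loop from the steps above can be sketched without any network access. In this sketch the `Page` record, the `gather` function, and the in-memory `FAKE_WEB` are all assumptions: `fetch` stands in for the headless browser, `is_relevant` for the multimodal model's judgment, and `next_url` for the button or link the model chooses to follow.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Page:
    url: str
    text: str
    next_url: Optional[str] = None  # link judged worth following, if any


def gather(
    start_url: str,
    fetch: Callable[[str], Page],
    is_relevant: Callable[[str], bool],
    enough: int = 2,
    max_pages: int = 5,
) -> list[str]:
    """Visit pages one by one until enough relevant notes are collected."""
    notes: list[str] = []
    url: Optional[str] = start_url
    for _ in range(max_pages):
        if url is None or len(notes) >= enough:
            break
        page = fetch(url)
        if is_relevant(page.text):
            notes.append(page.text)
        url = page.next_url  # simulate clicking through to the next page
    return notes


# Tiny in-memory "web" so the sketch runs offline.
FAKE_WEB = {
    "a": Page("a", "Tokyo itinerary ideas", "b"),
    "b": Page("b", "Cookie banner", "c"),
    "c": Page("c", "Kyoto itinerary and hotels", None),
}

notes = gather("a", FAKE_WEB.__getitem__, lambda text: "itinerary" in text.lower())
print(notes)
```

The `max_pages` cap is a design choice worth keeping in any real version: without it, a page that always offers another link could keep the agent browsing indefinitely.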

Code Agent and Data-Analysis Agent: Streamlining Code Tasks and Data Analysis

Compared to the search agent, the code agent and the data-analysis agent have a relatively simple but equally efficient workflow.

The code agent is mainly responsible for code generation and execution. When it receives a coding task, the code agent creates a local code file, such as a Python or HTML file, according to the task requirements and writes the generated code into it. For data analysis tasks, the code agent may generate Python code; for presenting results, it may generate HTML for visualization. The code agent then executes the code through system calls and saves the results to the task folder. To make it easier for users to see the result, Manus also provides a code-preview service for previewing the content of HTML files.
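
A minimal sketch of the write, execute, and save cycle is shown below, assuming the generated code is a Python script. The helper name `run_generated_code` and the `.out.txt` naming are illustrative, and a hard-coded one-line script stands in for model-generated code. In Manus the execution would happen inside the task's Docker container rather than directly on the host.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_generated_code(task_dir: Path, filename: str, source: str) -> str:
    """Write generated code to the task folder, run it, and save its output."""
    script = task_dir / filename
    script.write_text(source, encoding="utf-8")
    result = subprocess.run(
        [sys.executable, script.name],  # run with the current interpreter
        capture_output=True,
        text=True,
        timeout=30,                     # do not let a bad script hang forever
        cwd=task_dir,
    )
    output = result.stdout if result.returncode == 0 else result.stderr
    (task_dir / f"{script.stem}.out.txt").write_text(output, encoding="utf-8")
    return output


with tempfile.TemporaryDirectory() as tmp:
    out = run_generated_code(Path(tmp), "budget.py", "print(6 * 7)\n")
    print(out.strip())
```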

The data-analysis agent focuses on data processing and analysis tasks. Its workflow is similar to that of the code agent; the main difference is that the data-analysis agent concentrates on implementing the analysis logic and mining insights from the data.

Future Prospects: Continuously Evolving Multi-Agent Intelligence

While Manus has demonstrated strong capabilities as a general-purpose task agent, there is still plenty of room for improvement in such multi-agent products.

First, in the area of task dependency management, the tasks in the current [todo.md](https://t.co/tYosIUPa9o) file have mostly linear dependencies. In the future, DAGs (directed acyclic graphs) could be introduced to express more complex and flexible task dependencies and cope with more complex real-world scenarios. (Note: A DAG is a graph model for representing task dependencies and execution order, capable of expressing more complex task flows.)
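
To illustrate the idea, here is a small sketch that uses Python's standard `graphlib.TopologicalSorter` to turn a dependency graph into batches of tasks that could run in parallel. The task names and the `execution_batches` helper are made up for the example; nothing here reflects how Manus would actually implement DAG scheduling.

```python
from graphlib import TopologicalSorter


def execution_batches(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group tasks into batches; tasks in one batch do not depend on each other."""
    sorter = TopologicalSorter(deps)  # maps each task to its prerequisites
    sorter.prepare()                  # raises graphlib.CycleError on a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)           # unlock tasks that depended on this batch
    return batches


deps = {
    "search travel tips": set(),
    "check flights": {"search travel tips"},
    "check hotels": {"search travel tips"},
    "make itinerary": {"check flights", "check hotels"},
}
for batch in execution_batches(deps):
    print(batch)
```

With a linear todo list, flights and hotels would be checked one after the other; in the graph form they land in the same batch and could be handed to two agents at once.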

Second, in terms of the accuracy and reliability of task execution, an automated test agent can be introduced, which can automatically evaluate and judge the results of the task, and if the rating of a step is too low, the system can go back to a previous task node and re-execute the relevant steps, so as to realize the automatic correction and optimization of the task.
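
One way to picture such a test agent is a loop that scores each result and retries when the score is too low. The sketch below is a simplification: it re-runs the same step rather than rolling back to an earlier task node, and the `run_with_review` function, the 0.7 threshold, and the toy executor and scorer are all assumptions made for illustration.

```python
from typing import Callable


def run_with_review(
    step: str,
    execute: Callable[[str, int], str],
    score: Callable[[str], float],
    threshold: float = 0.7,
    max_attempts: int = 3,
) -> tuple[str, int]:
    """Re-run a step until the reviewer's score reaches the threshold."""
    for attempt in range(1, max_attempts + 1):
        result = execute(step, attempt)
        if score(result) >= threshold:
            return result, attempt
    raise RuntimeError(
        f"step {step!r} still below threshold after {max_attempts} attempts"
    )


# Toy executor and reviewer: the draft gains detail with each attempt.
def toy_execute(step: str, attempt: int) -> str:
    return f"{step}: " + "detail " * attempt


def toy_score(result: str) -> float:
    return min(1.0, result.count("detail") / 2)


result, attempts = run_with_review("make itinerary", toy_execute, toy_score)
print(attempts, result)
```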

In addition, blending human-computer collaboration modes is another important direction. Manus could allow a mix of fully automated and user-intervened modes. For example, after a step has been performed, the system can ask the user for feedback and continue automatically if the user does not respond within a certain period of time, thus balancing automation and flexibility.
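
The "wait for feedback, then continue" behavior can be sketched with a queue and a timeout. This is only one possible design, with hypothetical names: `inbox` stands in for whatever channel delivers user messages, and the very short timeout is there just to keep the example fast.

```python
import queue
from typing import Optional


def wait_for_feedback(inbox: "queue.Queue[str]", timeout_s: float) -> Optional[str]:
    """Return the user's feedback, or None if none arrives before the timeout."""
    try:
        return inbox.get(timeout=timeout_s)
    except queue.Empty:
        return None


def next_action(feedback: Optional[str]) -> str:
    if feedback is None:
        return "continue automatically"
    return f"revise using: {feedback}"


inbox: "queue.Queue[str]" = queue.Queue()
print(next_action(wait_for_feedback(inbox, timeout_s=0.05)))  # no reply arrives

inbox.put("prefer trains over flights")
print(next_action(wait_for_feedback(inbox, timeout_s=0.05)))
```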

Summary and challenges

Overall, Manus has made significant progress in its engineering implementation, and its overall interactive experience compares favorably to other similar products. However, from a technical point of view, Manus still relies heavily on the capability of the underlying models. It is hypothesized that Manus uses lightweight models for intent recognition, while task planning and reasoning may rely on large language models such as DeepSeek-R1. For image recognition and code generation, advanced models such as Claude-3.7-Sonnet also appear to be Manus's choice.

High token consumption means that cost control will be a key challenge for the wider adoption of applications like Manus. In the future, reducing token costs while improving task execution accuracy and user satisfaction will be the key direction that all multi-agent products, including Manus, need to keep exploring. Whether Manus can be used at scale and win broad recognition in the market remains to be seen in more real-world applications.