VitaBench - Meituan LongCat's Open-Source Interactive Agent Evaluation Benchmark

Published 6 months ago by AI Sharing Circle

What is VitaBench?

VitaBench is an interactive agent evaluation benchmark for complex everyday-life scenarios, released by Meituan's LongCat team and described as the first of its kind. It evaluates the overall capabilities of LLM-based agents in realistic settings. Built around three high-frequency scenarios (food delivery ordering, in-store dining, and travel), it provides an interactive evaluation environment with 66 tools and covers complex tasks involving tool invocation, multi-source information processing, and user interaction. According to its authors, it is the first benchmark to systematically quantify task complexity along three dimensions: reasoning complexity, tool complexity, and interaction complexity. Indicators such as the size of the observation space, the length of tool-call chains, and the dynamics of user profiles are used to measure how well an agent copes with real-life scenarios.

Features of VitaBench

Realistic life-service scenarios: Three high-frequency scenarios (food delivery ordering, in-store dining, and travel) serve as the basis for a complex task environment.
Rich tool set: The environment contains 66 tools spanning areas such as map navigation, voice transcription, and payment interfaces, forming a complete toolchain for digital life services.
Multidimensional complexity quantification: Agent tasks are broken down quantitatively along three dimensions (deep reasoning, tool use, and user interaction), which allows complex problems to be constructed in a controlled way.
Realistic user simulator: A user simulator models the behavior and preferences of different users, so agents must adapt to diverse user behavior across multi-turn conversations.
Fine-grained evaluation: Drawing on recent research, each task goal is broken down into a set of atomic evaluation criteria (a rubric). The complete conversation trajectory is scanned with an overlapping sliding window, and task completion is judged by a strict all-or-nothing criterion: a task counts as successful only if every criterion is met.
Cross-scenario task design: The benchmark includes 100 cross-scenario tasks and 300 single-scenario tasks, testing an agent's ability to switch between scenarios and integrate information across them.
Open source: The project homepage, paper, code repository, and dataset are all publicly available, giving researchers and developers a rich set of resources.
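
The fine-grained evaluation described above can be illustrated with a small sketch. This is not VitaBench's actual evaluator: the function names (`sliding_windows`, `evaluate`, `keyword_judge`), the window size, and the overlap are all hypothetical, and a simple keyword match stands in for whatever judge the benchmark really uses. The sketch also assumes that a rubric item, once satisfied in any window, stays satisfied.

```python
from typing import Callable, Dict, List

# Hypothetical judge type: given a window of turns and one rubric item,
# report whether the window shows that item being satisfied.
Judge = Callable[[List[str], str], bool]


def sliding_windows(turns: List[str], size: int, overlap: int) -> List[List[str]]:
    """Split a trajectory into windows of `size` turns that share `overlap` turns."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    windows = []
    start = 0
    while True:
        windows.append(turns[start:start + size])
        if start + size >= len(turns):
            break
        start += step
    return windows


def evaluate(turns: List[str], rubric: List[str], judge: Judge,
             size: int = 4, overlap: int = 1) -> Dict[str, object]:
    """Scan every window and mark rubric items as met (assumed sticky once met)."""
    satisfied = {item: False for item in rubric}
    for window in sliding_windows(turns, size, overlap):
        for item in rubric:
            if not satisfied[item] and judge(window, item):
                satisfied[item] = True
    # All-or-nothing: the task succeeds only if every rubric item is met.
    return {"rubric": satisfied, "success": all(satisfied.values())}


def keyword_judge(window: List[str], item: str) -> bool:
    """Toy stand-in for a real judge: case-insensitive substring match."""
    return any(item.lower() in turn.lower() for turn in window)


if __name__ == "__main__":
    trajectory = [
        "user: I want spicy noodles delivered by 7pm",
        "agent: searching nearby restaurants",
        "agent: ordered spicy noodles",
        "agent: delivery scheduled for 7pm",
        "agent: booked a table for two",
    ]
    print(evaluate(trajectory, ["spicy noodles", "7pm", "table for two"], keyword_judge))
```

Overlapping the windows means a criterion whose evidence straddles a window boundary is still visible in full to at least one window, provided the evidence spans no more than `overlap + 1` turns.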

Core Benefits of VitaBench

Realistic scenario simulation: The evaluation is built on high-frequency everyday scenarios such as ordering food delivery, dining in restaurants, and traveling, so that results stay close to real application requirements.
Multidimensional complexity quantification: Task complexity is quantified along three dimensions (deep reasoning, tool use, and user interaction), giving a comprehensive measure of how agents perform on complex tasks.
Realistic user simulator: A user simulator built from real data reproduces diverse user behaviors and preferences, testing how well agents adapt in realistic interactions.
Fine-grained evaluation mechanism: Atomic rubric criteria and a sliding-window evaluator assess agent behavior at a fine granularity across the whole interaction, improving the accuracy and interpretability of the evaluation.
Cross-scenario task design: Rich cross-scenario composite tasks test an agent's ability to switch between scenarios and integrate information, revealing the shortcomings of existing models.

What is the official website of VitaBench?

Project website: https://vitabench.github.io
GitHub repository: https://github.com/meituan-longcat/vitabench
arXiv technical paper: https://arxiv.org/abs/2509.26490
HuggingFace dataset: https://huggingface.co/datasets/meituan-longcat/VitaBench

Who VitaBench is for

AI researchers: Researchers who develop and optimize agents can use VitaBench to test and evaluate agent performance on complex tasks and push agent technology forward.
Large model developers: Teams that build and improve large language models can use VitaBench to evaluate how well their models apply to real-life scenarios and to identify and address shortcomings.
Application developers: Developers of agent-based applications can use VitaBench to test how agents perform in real-world use and improve their applications' user experience.
Enterprise technical teams: Teams applying agent technology to business operations can use VitaBench to assess whether an agent meets their company's needs.
Universities and research institutions: Institutions working on artificial intelligence and machine learning can use VitaBench as a tool for teaching, research, and training.
Technology enthusiasts: Individuals interested in agents and AI can use VitaBench to explore how agents perform on complex tasks.