
Agent AI: Exploring the Frontiers of Multimodal Interaction [Fei-Fei Li - A Classic Must-Read]

Agent AI: Surveying the Horizons of Multimodal Interaction.

Original: https://ar5iv.labs.arxiv.org/html/2401.03568


 

Summary

Multimodal AI systems are likely to become a ubiquitous presence in our daily lives. A promising approach to making these systems more interactive is to embody them as agents within physical and virtual environments. Currently, systems leverage existing foundation models as the basic building blocks for creating embodied agents. Embedding agents in such environments helps the models process and interpret visual and contextual data, which is critical for creating more sophisticated and context-aware AI systems. For example, a system that can perceive user actions, human behavior, environmental objects, audio expressions, and the collective sentiment of a scene can be used to inform and guide an agent's responses within a given environment. To accelerate research on agent-based multimodal intelligence, we define "Agent AI" as a class of interactive systems that can perceive visual stimuli, language inputs, and other environmentally grounded data, and can produce meaningful embodied actions. In particular, we explore systems that aim to improve agents' prediction of the next embodied action by incorporating external knowledge, multi-sensory inputs, and human feedback. We argue that developing agent AI systems in grounded environments can also mitigate the hallucinations of large foundation models and their tendency to generate environmentally incorrect outputs. The emerging field of "Agent AI" encompasses the broader embodied and agentic aspects of multimodal interaction. Beyond agents that act and interact in the physical world, we envision a future in which people can easily create any virtual reality or simulation scenario and interact with agents embodied within the virtual environment.

Figure 1: Overview of an Agent AI system that can perceive and act across many different domains and applications. Agent AI is emerging as a promising route toward artificial general intelligence (AGI). Agent AI training has demonstrated the ability to perform multimodal understanding in the physical world, and it provides a framework for reality-agnostic training by leveraging generative AI together with multiple independent data sources. Large foundation models trained for agent- and action-related tasks can be applied to both physical and virtual worlds when trained on cross-reality data. We present a general overview of an Agent AI system that can perceive and act in many different domains and applications, potentially serving as a route toward AGI using the agent paradigm.

Contents

  1. Introduction
    1. 1.1 Motivation
    2. 1.2 Background
    3. 1.3 Overview
  2. Agent AI Integration
    1. 2.1 Infinite AI Agents
    2. 2.2 Agent AI with Large Foundation Models
      1. 2.2.1 Hallucinations
      2. 2.2.2 Biases and Inclusivity
      3. 2.2.3 Data Privacy and Usage
      4. 2.2.4 Interpretability and Explainability
      5. 2.2.5 Reasoning Augmentation
      6. 2.2.6 Regulation
    3. 2.3 Agent AI for Emergent Abilities
  3. Agent AI Paradigm
    1. 3.1 Large Language Models and Visual Language Models
    2. 3.2 Agent Transformer Definition
    3. 3.3 Agent Transformer Creation
  4. Agent AI Learning
    1. 4.1 Strategies and Mechanisms
      1. 4.1.1 Reinforcement Learning (RL)
      2. 4.1.2 Imitation Learning (IL)
      3. 4.1.3 Traditional RGB
      4. 4.1.4 In-Context Learning
      5. 4.1.5 Optimization in Agent Systems
    2. 4.2 Agent Systems (zero-shot and few-shot levels)
      1. 4.2.1 Agent Modules
      2. 4.2.2 Agent Infrastructure
    3. 4.3 Agent Foundation Models (pre-training and fine-tuning levels)
  5. Agent AI Categorization
    1. 5.1 Generalist Agent Areas
    2. 5.2 Embodied Agents
      1. 5.2.1 Action Agents
      2. 5.2.2 Interactive Agents
    3. 5.3 Simulation and Environment Agents
    4. 5.4 Generative Agents
      1. 5.4.1 AR/VR/Mixed-Reality Agents
    5. 5.5 Knowledge and Logical Inference Agents
      1. 5.5.1 Knowledge Agents
      2. 5.5.2 Logic Agents
      3. 5.5.3 Agents for Emotional Reasoning
      4. 5.5.4 Neuro-Symbolic Agents
    6. 5.6 Large Language Model and Visual Language Model Agents
  6. Agent AI Application Tasks
    1. 6.1 Agents for Gaming
      1. 6.1.1 NPC Behavior
      2. 6.1.2 Human-NPC Interaction
      3. 6.1.3 Agent-Based Analysis of Gaming
      4. 6.1.4 Scene Synthesis for Gaming
      5. 6.1.5 Experiments and Results
    2. 6.2 Robotics
      1. 6.2.1 LLM/VLM Agents for Robotics
      2. 6.2.2 Experiments and Results
    3. 6.3 Healthcare
      1. 6.3.1 Current Healthcare Capabilities
    4. 6.4 Multimodal Agents
      1. 6.4.1 Image-Language Understanding and Generation
      2. 6.4.2 Video and Language Understanding and Generation
      3. 6.4.3 Experiments and Results
    5. 6.5 Video-Language Experiments
    6. 6.6 Agents for Natural Language Processing
      1. 6.6.1 LLM Agents
      2. 6.6.2 General LLM Agents
      3. 6.6.3 Instruction-Following LLM Agents
      4. 6.6.4 Experiments and Results
  7. Agent AI Across Modalities, Domains, and Realities
    1. 7.1 Agents for Cross-Modal Understanding
    2. 7.2 Agents for Cross-Domain Understanding
    3. 7.3 Interactive Agents for Cross-Modality and Cross-Reality
    4. 7.4 Sim-to-Real Transfer
  8. Continuous and Self-Improvement for Agent AI
    1. 8.1 Human Interaction Data
    2. 8.2 Foundation Model Generated Data
  9. Agent Datasets and Leaderboards
    1. 9.1 The "CuisineWorld" Dataset for Multi-Agent Gaming
      1. 9.1.1 Benchmark
      2. 9.1.2 Tasks
      3. 9.1.3 Metrics and Judging
      4. 9.1.4 Evaluation
    2. 9.2 Audio-Video-Language Pre-training Dataset
  10. Broader Impact Statement
  11. Ethical Considerations
  12. Diversity Statement
  13. GPT-4V Agent Prompt Details
  14. GPT-4V for Bleeding Edge
  15. GPT-4V for Microsoft Flight Simulator
  16. GPT-4V for Assassin's Creed Odyssey
  17. GPT-4V for GEARS of WAR 4
  18. GPT-4V for Starfield

 

1 Introduction

1.1 Motivation

Historically, at the 1956 Dartmouth Conference, AI systems were defined as artificial life forms capable of gathering information from the environment and interacting with it in useful ways. Inspired by this definition, Minsky's group at MIT built a robotic system in 1970, known as the "Copy Demo," which observed a "blocks world" scene and successfully reconstructed the observed polyhedral block structures. The system, comprising observation, planning, and manipulation modules, revealed that each of these subproblems was highly challenging and required further research. The AI field subsequently fragmented into specialized subfields that have made great progress on these and other problems, but this over-reductionism has obscured the overarching goals of AI research.

To move beyond the status quo, it is necessary to return to AI fundamentals motivated by Aristotelian holism. Fortunately, the recent revolution in large language models (LLMs) and visual language models (VLMs) has made it possible to create new types of AI agents consistent with that holistic ideal. Seizing this opportunity, this paper explores models that integrate language proficiency, visual cognition, contextual memory, intuitive reasoning, and adaptability, and it examines the potential of accomplishing this holistic synthesis using LLMs and VLMs. In our exploration, we also revisit system design based on Aristotle's final cause, the teleological "reason a system exists," which may have been overlooked in previous rounds of AI development.

The renaissance of natural language processing and computer vision has been catalyzed by the emergence of powerful pre-trained large language models and visual language models. LLMs now demonstrate a remarkable ability to decipher the nuances of real-world linguistic data, often with capabilities that match or even surpass human expertise OpenAI (2023). Recently, researchers have shown that LLMs can be extended to act as agents in a variety of environments, performing complex actions and tasks when paired with domain-specific knowledge and modules Xi et al. (2023). These scenarios involve complex reasoning, understanding of the agents' roles and their environments, and multi-step planning, and they test an agent's ability to make highly nuanced and intricate decisions within the constraints of its environment Wu et al. (2023); Meta Fundamental AI Research (FAIR) Diplomacy Team et al. (2022).

Building on these initial efforts, the AI community is on the cusp of a major paradigm shift, away from creating AI models for passive, structured tasks and toward models that can assume the role of dynamic agents in diverse and complex environments. In this context, this paper investigates the enormous potential of using LLMs and VLMs as agents, emphasizing models that combine language proficiency, visual cognition, contextual memory, intuitive reasoning, and adaptability. Utilizing LLMs and VLMs as agents, especially in domains such as gaming, robotics, and healthcare, not only provides a rigorous platform for evaluating state-of-the-art AI systems but also foreshadows the transformative impact that agent-centered AI will have on society and industry. When fully harnessed, agent models can redefine the human experience and raise operational standards. The potential for sweeping automation from these models heralds dramatic shifts in industry and socioeconomic dynamics. These advances will be intertwined with multifaceted considerations, not only technological but also ethical, as we elaborate in Section 11. We delve into the overlapping subfields of agent AI and illustrate their interconnectedness in Figure 1.

1.2 Background

We now present the relevant research supporting the concept, theoretical background, and modern implementations of Agent AI.

Large foundation models.

Large language models and visual language models have been driving efforts to develop general-purpose intelligent machines (Bubeck et al., 2023; Mirchandani et al., 2023). Although they are trained on large text corpora, their superior problem-solving abilities are not confined to canonical language-processing domains. LLMs can potentially tackle complex tasks previously presumed to be the exclusive preserve of human experts or domain-specific algorithms, ranging from mathematical reasoning (Imani et al., 2023; Wei et al., 2022; Zhu et al., 2022) to answering specialized legal questions (Blair-Stanek et al., 2023; Choi et al., 2023; Nay, 2022). Recent research has shown that it is possible to use LLMs to generate complex plans for robots and game AI (Liang et al., 2022; Wang et al., 2023a, b; Yao et al., 2023a; Huang et al., 2023a), which marks an important milestone in the use of LLMs as general-purpose agents.

Embodied AI.

Several works have used large language models to perform task planning (Huang et al., 2022a; Wang et al., 2023b; Yao et al., 2023a; Li et al., 2023a), in particular leveraging the LLMs' Internet-scale domain knowledge and emergent zero-shot embodied abilities to perform complex task planning and reasoning. Recent robotics research also uses LLMs for task planning (Ahn et al., 2022a; Huang et al., 2022b; Liang et al., 2022) by decomposing natural-language instructions into a sequence of subtasks, either in natural language or as Python code, which are then executed by a low-level controller. In addition, they incorporate environmental feedback to improve task performance (Huang et al., 2022b; Liang et al., 2022; Wang et al., 2023a; Ikeuchi et al., 2023).

Interactive learning.

AI agents designed for interactive learning operate using a combination of machine learning techniques and user interaction. Initially, an AI agent is trained on a large dataset. This dataset contains various types of information, depending on the agent's intended function; for example, an AI designed for language tasks would be trained on a large corpus of text data. Training uses machine learning algorithms, which may include deep learning models such as neural networks. These trained models enable the AI to recognize patterns, make predictions, and generate responses based on the data on which it was trained. AI agents can also learn from real-time interaction with users. This interactive learning can occur in several ways: 1) Feedback-based learning: the AI adjusts its responses based on direct user feedback Li et al. (2023b); Yu et al. (2023a); Parakh et al. (2023); Zha et al. (2023); Wake et al. (2023a, b, c). For example, if a user corrects the AI's response, the AI can use this information to improve future responses Zha et al. (2023); Liu et al. (2023a). 2) Observational learning: the AI observes user interactions and learns implicitly. For example, if users frequently ask similar questions or interact with the AI in particular ways, the AI may adapt its responses to better fit these patterns. This allows AI agents to understand and process human language and multimodal settings, interpret cross-reality contexts, and generate responses for human users. Over time, an agent's performance typically continues to improve through additional user interaction and feedback. This process is usually overseen by human operators or developers, who ensure that the AI is learning appropriately and not developing biases or incorrect patterns.
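To make the feedback-based learning loop above concrete, here is a minimal illustrative sketch (not from the paper): a hypothetical agent caches user corrections and prefers them over its base responder when a sufficiently similar query appears again. A production system would instead update model weights or a retrieval index.

```python
# Minimal sketch of feedback-based interactive learning (illustrative only).
# A real system would update model weights or a retrieval index; here we
# simply cache user corrections and reuse them for similar future queries.

from difflib import SequenceMatcher


class FeedbackAgent:
    def __init__(self, base_responder):
        self.base_responder = base_responder     # e.g., a call into an LLM
        self.corrections = {}                    # query -> corrected answer

    def _most_similar(self, query):
        best, best_score = None, 0.0
        for past_query in self.corrections:
            score = SequenceMatcher(None, query, past_query).ratio()
            if score > best_score:
                best, best_score = past_query, score
        return best, best_score

    def respond(self, query):
        past_query, score = self._most_similar(query)
        if past_query is not None and score > 0.8:
            return self.corrections[past_query]  # prefer the human-corrected answer
        return self.base_responder(query)

    def receive_feedback(self, query, corrected_answer):
        # Feedback-based learning: remember the user's correction.
        self.corrections[query] = corrected_answer


if __name__ == "__main__":
    agent = FeedbackAgent(base_responder=lambda q: "I am not sure yet.")
    print(agent.respond("What is the capital of Australia?"))
    agent.receive_feedback("What is the capital of Australia?", "Canberra")
    print(agent.respond("What is the capital of Australia?"))
```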

1.3 Overview

Multimodal Agent AI (MAA) is a family of systems that generate effective actions in a given environment based on an understanding of multimodal sensory input. With the emergence of large language models (LLMs) and visual language models (VLMs), numerous multimodal agent AI systems have been proposed in fields ranging from basic research to applications. While these research areas are evolving rapidly through integration with the traditional techniques of each domain (e.g., visual question answering and vision-and-language navigation), they share common concerns such as data collection, benchmarking, and ethics. In this paper, we focus on a few representative research areas of multimodal agent AI, namely multimodality, gaming (VR/AR/MR), robotics, and healthcare, and we aim to provide comprehensive knowledge of the common concerns discussed in these areas. We expect readers to learn the fundamentals of multimodal agent AI and to gain insights that advance its research. Specific learning outcomes include:

  • An overview of multimodal Agent AI: an in-depth look at its principles and role in contemporary applications, providing researchers with a comprehensive grasp of its importance and uses.
  • Methodologies: case studies from gaming, robotics, and healthcare detailing how large language models and visual language models can enhance multimodal agent AI.
  • Performance evaluation: guidance on evaluating multimodal agent AI with relevant datasets, focusing on effectiveness and generalization.
  • Ethical considerations: a discussion of the societal implications and ethical challenges of deploying agent AI, highlighting responsible development practices.
  • Emerging trends and future directions: a categorization of the latest developments in each field and a discussion of future directions.

Computer-based action and generalist agents (GAs) are useful for many tasks. For a GA to be truly valuable to its users, it must be able to interact naturally and generalize across a broad range of contexts and modalities. Our goal is to cultivate a vibrant research ecosystem within the agent AI community and create a shared sense of identity and purpose. Multimodal agent AI has the potential to be applied across a wide range of contexts and modalities, including input from humans. We therefore believe that this field can attract a diverse range of researchers, fostering a dynamic agent AI community and shared goals. Led by renowned experts from academia and industry, we expect this paper to offer an interactive and enriching experience, including agent tutorials, case studies, task sessions, and experimental discussions, ensuring a comprehensive and engaging learning experience for all researchers.

The purpose of this paper is to provide general and comprehensive knowledge of current research in the field of Agent AI. To this end, the rest of the paper is organized as follows. Section 2 outlines how Agent AI benefits from integration with related emerging technologies, in particular large foundation models. Section 3 describes a new paradigm and framework that we propose for training Agent AI. Section 4 provides an overview of widely used methods for training Agent AI. Section 5 categorizes and discusses various types of agents. Section 6 introduces applications of Agent AI in gaming, robotics, and healthcare. Section 7 explores the research community's efforts to develop versatile Agent AI that can be applied across modalities and domains and bridge the sim-to-real gap. Section 8 discusses the potential of Agent AI that not only relies on pre-trained foundation models but also continuously learns and improves itself by leveraging interaction with the environment and users. Section 9 introduces our new datasets designed for training multimodal Agent AI. Section 11 discusses open questions around AI agents, the limitations of our paper, and the ethical considerations surrounding its societal impact.

 

2 Agent AI Integration

As noted in prior work, foundation models based on LLMs and VLMs still exhibit limited performance in embodied AI, particularly in understanding, generating, editing, and interacting within unseen environments or scenes Huang et al. (2023a); Zeng et al. (2023). These limitations lead to suboptimal agent outputs. Current agent-centric AI modeling approaches focus on directly accessible, well-defined data (e.g., textual or string representations of world state) and generally use domain- and environment-independent patterns learned from large-scale pre-training to predict action outputs for each environment Xi et al. (2023); Wang et al. (2023c); Gong et al. (2023a); Wu et al. (2023). In Huang et al. (2023a), we investigate knowledge-guided collaborative and interactive scene generation by combining large foundation models, and we show promising results indicating that knowledge-grounded LLM agents can improve the performance of 2D and 3D scene understanding, generation, and editing, as well as other human-agent interactions. By integrating an Agent AI framework, large foundation models are able to understand user input more deeply, enabling complex and adaptive human-computer interaction systems. The emergent capabilities of LLMs and VLMs play a largely unseen role in human-computer interaction for generative AI, embodied AI, knowledge augmentation for multimodal learning, mixed-reality generation, text-to-vision editing, and 2D/3D simulation in gaming or robotics tasks. Recent progress in Agent AI foundation models offers an imminent catalyst for unlocking general intelligence in embodied agents. Large action models, or agent vision-language models, open new possibilities for general-purpose embodied systems, including planning, problem solving, and learning in complex environments. Agent-based embodied AI takes a further step into the metaverse and points the way toward early versions of AGI.

Figure 2: Multimodal Agent AI for 2D/3D embodied generation and editing interaction across realities.

2.1 Infinite AI Agents

AI agents have the ability to interpret, predict, and respond based on their training and input data. While these capabilities are advanced and continually improving, it is important to recognize their limitations and the influence of the underlying data on which they are trained. AI agent systems typically have the following capabilities: 1) Predictive modeling: AI agents can predict likely outcomes or suggest next steps based on historical data and trends. For example, they may predict the continuation of a text, the answer to a question, a robot's next action, or the resolution of a scenario. 2) Decision making: in some applications, AI agents can make decisions based on their inferences. Typically, an agent bases its decision on what is most likely to achieve a specified goal. For AI applications such as recommender systems, an agent can decide which products or content to recommend based on inferences about user preferences. 3) Handling ambiguity: AI agents can often handle ambiguous input by inferring the most likely interpretation from context and training. However, this ability is limited by the scope of their training data and algorithms. 4) Continuous improvement: while some AI agents can learn from new data and interactions, many large language models do not continuously update their knowledge base or internal representations after training. Their inferences are typically based only on the data available up to their last training update.

In Figure 2 we show an enhanced interactive agent for multimodal, cross-reality-agnostic integration with an emergent mechanism. AI agents require large amounts of training data to be collected for each new task, which can be costly or impossible for many domains. In this study, we develop an infinite agent that learns to transfer memorized information from general foundation models (e.g., GPT-X, DALL-E) to new domains or scenarios for scene understanding, generation, and interactive editing in physical or virtual worlds.

One robotics application of such infinite agents is RoboGen Wang et al. (2023d). In this study, the authors present a pipeline that autonomously runs cycles of task proposal, environment generation, and skill learning. RoboGen is an effort to transfer the knowledge embedded in large-scale models to robotics.

2.2 Agent AI with Large Foundation Models

Recent studies have shown that large foundation models play a crucial role in creating data that serve as benchmarks for determining an agent's actions within the constraints imposed by the environment. Examples include the use of foundation models for robot manipulation Black et al. (2023); Ko et al. (2023) and navigation Shah et al. (2023a); Zhou et al. (2023a). To illustrate, Black et al. employ an image-editing model as a high-level planner that generates images of future subgoals to guide a low-level policy Black et al. (2023). For robot navigation, Shah et al. propose a system that employs an LLM to identify landmarks from text and a VLM to associate these landmarks with visual input, thereby enhancing navigation with natural-language instructions Shah et al. (2023a).
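As a rough sketch of the landmark-grounding idea attributed to Shah et al. (2023a) above, the snippet below is our own simplification: a hypothetical `llm_extract_landmarks` call parses landmarks out of an instruction, and a hypothetical `vlm_similarity` call (standing in for a CLIP-style model) scores candidate camera views against the next landmark.

```python
# Illustrative sketch of LLM + VLM landmark grounding for navigation.
# `llm_extract_landmarks` and `vlm_similarity` are hypothetical stand-ins for
# real model calls (an LLM that parses instructions and a CLIP-style VLM that
# scores image/text similarity).

from typing import Callable, List, Sequence


def select_next_waypoint(
    instruction: str,
    candidate_images: Sequence[object],
    llm_extract_landmarks: Callable[[str], List[str]],
    vlm_similarity: Callable[[object, str], float],
):
    """Pick the candidate view that best matches the next landmark to reach."""
    landmarks = llm_extract_landmarks(instruction)   # e.g., ["blue mailbox", "park gate"]
    if not landmarks:
        return None
    target = landmarks[0]                            # the next landmark in the route
    scores = [vlm_similarity(image, target) for image in candidate_images]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, target, scores[best]
```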

There is also growing interest in generating conditioned human motions tailored to language and environmental factors. Several AI systems have been proposed to generate motions and actions customized to specific language instructions Kim et al. (2023); Zhang et al. (2022); Tevet et al. (2022) and adapted to a variety of 3D scenes Wang et al. (2022a). This line of research highlights the growing capability of generative models to enhance the adaptability and responsiveness of AI agents across scenarios.

2.2.1 Hallucinations

Agents that generate text are often prone to hallucinations, i.e., situations in which the generated text is nonsensical or unfaithful to the provided source content Raunak et al. (2021); Maynez et al. (2020). Hallucinations can be divided into two categories, intrinsic and extrinsic Ji et al. (2023). Intrinsic hallucinations contradict the source material, whereas extrinsic hallucinations occur when the generated text contains additional information that was not originally included in the source material.

Some promising avenues for reducing the rate of hallucination in language generation include retrieval-augmented generation Lewis et al. (2020); Shuster et al. (2021) and other methods that ground natural-language output via external knowledge retrieval Dziri et al. (2021); Peng et al. (2023). In general, these approaches aim to enhance language generation by retrieving additional source material and providing mechanisms to check for contradictions between the generated response and the sources.
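The retrieval-augmented pattern described above can be sketched generically as follows. This is not the cited systems' implementation; `embed` and `generate` are hypothetical stand-ins for an embedding model and a language model.

```python
# Generic retrieval-augmented generation sketch for reducing hallucinations.
# Documents are retrieved by embedding similarity and prepended to the prompt
# so the generator can ground its answer in the retrieved source material.

import math
from typing import Callable, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query: str, docs: List[str],
             embed: Callable[[str], List[float]], k: int = 3) -> List[str]:
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(embed(d), q), reverse=True)
    return ranked[:k]


def grounded_answer(query: str, docs: List[str],
                    embed: Callable[[str], List[float]],
                    generate: Callable[[str], str]) -> str:
    context = "\n".join(retrieve(query, docs, embed))
    prompt = (
        "Answer using ONLY the sources below. If the answer is not present, "
        "say you do not know.\n\nSources:\n" + context + "\n\nQuestion: " + query
    )
    return generate(prompt)
```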

Within multimodal agent systems, visual language models have also been shown to hallucinate Zhou et al. (2023b). A common cause of hallucination in vision-based language generation is over-reliance on co-occurrences of objects and visual cues in the training data Rohrbach et al. (2018). AI agents that rely exclusively on pre-trained LLMs or VLMs, with only limited environment-specific fine-tuning, may be especially prone to hallucination because they depend on the internal knowledge base of the pre-trained model to generate actions and may not accurately understand the dynamics of the world state in which they are deployed.

2.2.2 Biases and Inclusivity

AI agents based on large language models (LLMs) or large multimodal models (LMMs) exhibit biases due to multiple factors inherent in their design and training process. When designing these AI agents, we must be mindful of inclusivity and aware of the needs of all end users and stakeholders. In the context of AI agents, inclusivity refers to the measures and principles adopted to ensure that an agent's responses and interactions are inclusive, respectful, and sensitive to a wide range of users from diverse backgrounds. We list key aspects of agent bias and inclusivity below.

  • Training data: The underlying models are trained on large amounts of textual data collected from the Internet, including books, articles, websites, and other textual sources. These data often reflect biases that exist in human society, which the model may inadvertently learn and reproduce. This includes stereotypes, prejudices, and biased views related to race, gender, ethnicity, religion, and other personal attributes. In particular, by training on Internet data, and often using only English text for training, models implicitly learn the cultural norms of Western, Educated, Industrialized, Rich, and Democratic (WEIRD) societies Henrich et al. (2010), which have a disproportionate presence on the Internet. However, it is important to recognize that human-created datasets cannot be completely free of bias, as they often reflect societal biases as well as those of the individuals who originally generated and/or compiled the data.
  • Historical and cultural bias: AI models are trained on large datasets from diverse content. As such, training data often includes historical texts or materials from different cultures. In particular, training data from historical sources may contain offensive or derogatory language that represents the cultural norms, attitudes, and prejudices of a particular society. This may result in models that perpetuate outdated stereotypes or fail to fully understand contemporary cultural changes and nuances.
  • Language and contextual constraints: Language models may have difficulty understanding and accurately representing nuances in language, such as irony, humor, or cultural allusions. This can lead to misunderstandings or biased responses in some cases. Additionally, many aspects of spoken language are not captured by text-only data, leading to a potential disconnect between how humans understand language and how models understand it.
  • Policies and guidelines: AI agents operate under strict policies and guidelines intended to ensure fairness and inclusivity. For example, when generating images, there are rules to diversify depictions of people and avoid stereotypes related to race, gender, and other attributes.
  • Overgeneralization: These models tend to generate responses based on patterns seen in the training data. This can lead to overgeneralization, where models produce responses that appear to stereotype certain groups or make sweeping assumptions.
  • Continuous monitoring and updating: The AI system is continually monitored and updated to address any emerging issues of bias or inclusivity. Feedback from users and ongoing research in AI ethics play a critical role in this process.
  • Amplifying the mainstream view: Since training data usually contains more content from the dominant culture or group, the model may be more biased in favor of these views and thus may underestimate or distort the views of minority groups.
  • Ethical and Inclusive Design: AI tools should be designed with ethical considerations and inclusiveness as core principles. This includes respecting cultural differences, promoting diversity and ensuring that AI does not perpetuate harmful stereotypes.
  • User guidance: Users are also guided on how to interact with the AI in ways that promote inclusivity and respect. This includes avoiding requests that may lead to biased or inappropriate output. It can also help mitigate situations in which the model learns harmful material from user interactions.

Despite these measures, biases persist in AI agents. Ongoing work in AI research and development focuses on further reducing these biases and enhancing the inclusivity and fairness of agent AI systems. Efforts to reduce bias include:

  • Diverse and inclusive training data: Efforts are being made to include more diverse and inclusive sources in the training data.
  • Bias detection and correction: Ongoing research focuses on detecting and correcting bias in model responses.
  • Ethical guidelines and policies: Models are often bound by ethical guidelines and policies designed to mitigate bias and ensure respectful and inclusive interactions.
  • Diverse representation: Ensure that the content generated or responses provided by AI agents represent a broad range of human experiences, cultures, ethnicities, and identities. This is particularly relevant in scenarios such as image generation or narrative construction.
  • Bias mitigation: Actively work to reduce bias in AI responses. This includes bias related to race, gender, age, disability, sexual orientation, and other personal characteristics. The goal is to provide a fair and balanced response, not to perpetuate stereotypes or biases.
  • Cultural sensitivity: AI's designs are culturally sensitive, recognizing and respecting the diversity of cultural norms, practices and values. This includes understanding and responding appropriately to cultural references and nuances.
  • Accessibility: Ensure that AI agents are accessible to users with different abilities, including people with disabilities. This may involve incorporating features that make interaction easier for people with visual, auditory, motor, or cognitive impairments.
  • Language-based inclusivity: Provide support for multiple languages and dialects to serve a global user base, and be sensitive to nuances and variations within languages Liu et al. (2023b).
  • Ethical and respectful interaction: The agent is programmed to interact ethically and respectfully with all users, avoiding responses that could be perceived as offensive, harmful, or disrespectful.
  • User feedback and adaptation: Incorporate user feedback to continuously improve the inclusivity and effectiveness of AI agents. This includes learning from interactions to better understand and serve a diverse user base.
  • Compliance with inclusivity guidelines: Adhere to established guidelines and standards for inclusivity in AI agents, typically set by industry groups, ethics committees, or regulatory bodies.

Despite these efforts, it is important to remain aware of potential biases in responses and to interpret them critically. Continuous improvements in agent AI technology and ethical practice aim to reduce these biases over time. One of the overarching goals of agent AI inclusivity is to create an agent that is respectful of and accessible to all users, regardless of their background or identity.

2.2.3 Data Privacy and Usage

A key ethical consideration for AI agents involves understanding how these systems process, store, and potentially retrieve user data. We discuss the key aspects below:

Data collection, use and purpose.

When user data are used to improve model performance, model developers have access to the data that AI agents collect in production while interacting with users. Some systems allow users to view their data through a user account or by submitting a request to the service provider. It is important to recognize what data AI agents collect during these interactions. This may include text input, usage patterns, personal preferences, and sometimes more sensitive personal information. Users should also understand how the data collected from their interactions is used. If, for whatever reason, the AI holds incorrect information about a specific individual or group, there should be a mechanism for users to help correct that error once it is recognized. This matters for accuracy and for respecting all users and groups. Common uses of retrieved and analyzed user data include improving user interactions, personalizing responses, and optimizing the system. It is important that developers ensure data is not used for purposes to which users have not consented (e.g., unsolicited marketing).

Storage and security.

Developers should be aware of where user interaction data is stored and of the security measures in place to protect it from unauthorized access or disclosure. This includes encryption, secure servers, and data protection protocols. It is important to determine whether, and under what conditions, agent data is shared with third parties. This should be transparent and usually requires user consent.

Data deletion and retention.

It is also important for users to understand how long their data is stored and how they can request that it be deleted. Many data protection laws give users the right to be forgotten, meaning they can request that their data be deleted. AI agents must comply with data protection laws such as the EU's GDPR and California's CCPA. These laws govern data-processing practices and users' rights over their personal data.

Data portability and privacy policy.

Additionally, developers must create privacy policies for AI agents that document and explain to users how their data will be handled. These should detail data collection, use, storage, and user rights. Developers should ensure they obtain user consent for data collection, especially for sensitive information. Users can often opt out or restrict the data they provide. In some jurisdictions, users may even have the right to request a copy of their data in a format that can be transferred to another service provider.

Anonymization.

Data used for broader analytics or AI training should ideally be anonymized to protect individual identities. Developers must also understand how their AI agents retrieve and use historical user data during interactions, which may be done for personalization or to improve the relevance of responses.

In summary, understanding data privacy for AI agents means understanding how user data is collected, used, stored, and protected, and ensuring that users are aware of their rights to access, correct, and delete their data. Understanding the data-retrieval mechanisms between users and AI agents is also critical to a full understanding of data privacy.

2.2.4 Interpretability and Explainability

Imitation Learning → Decoupling

Agents are typically trained with continuous feedback loops in reinforcement learning (RL) or imitation learning (IL), starting from a randomly initialized policy. However, this approach hits a bottleneck when obtaining initial rewards in unfamiliar environments, especially when rewards are sparse or only available at the end of long-horizon interactions. A superior alternative is therefore an infinite-memory agent trained via imitation learning, which can learn policies from expert data, leading to better exploration and exploitation of unseen environment spaces, as well as the emergent infrastructure shown in Figure 3. Expert-derived features help the agent explore and exploit unseen environment spaces more effectively, and agent AI can learn policies and new paradigm processes directly from expert data.

Traditional imitation learning lets an agent learn a policy by imitating the behavior of an expert demonstrator. However, directly learning the expert policy may not always be ideal, because the agent may not generalize well to unseen situations. To address this, we propose learning an agent with contextual cues or an implicit reward function that captures the key aspects of the expert's behavior, as shown in Figure 3. This equips the infinite-memory agent with physical-world behavior data learned from expert demonstrations for task execution, and it helps overcome drawbacks of existing imitation learning such as the need for large amounts of expert data and the potential for errors in complex tasks. The key idea involves two components: 1) an infinite agent that collects physical-world expert demonstrations as state-action pairs, and 2) a virtual environment with an imitation-agent generator. The imitation agent generates actions that mimic the expert's behavior, and the agent learns a policy mapping from states to actions via a loss function that reduces the difference between the expert's actions and the actions generated by the learned policy.
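A minimal behavior-cloning loss of the kind described above, where the policy is trained to minimize the difference between expert actions and its own predicted actions, might look like the following PyTorch sketch. The two-layer policy network, MSE loss, and toy data shapes are illustrative assumptions.

```python
# Behavior cloning sketch: learn a policy that maps states to actions by
# minimizing the difference between expert actions and predicted actions.

import torch
import torch.nn as nn


class Policy(nn.Module):
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)


def behavior_cloning(expert_states, expert_actions, epochs: int = 100):
    policy = Policy(expert_states.shape[1], expert_actions.shape[1])
    optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()                         # continuous actions assumed
    for _ in range(epochs):
        optimizer.zero_grad()
        predicted = policy(expert_states)
        loss = loss_fn(predicted, expert_actions)  # distance to expert actions
        loss.backward()
        optimizer.step()
    return policy


if __name__ == "__main__":
    # Toy expert data: 256 (state, action) pairs with 8-D states and 2-D actions.
    states = torch.randn(256, 8)
    actions = torch.randn(256, 2)
    trained_policy = behavior_cloning(states, actions)
```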

Decoupling → Generalization

Instead of relying on a task-specific reward function, the agent learns from expert demonstrations that provide a diverse set of state-action pairs covering various aspects of the task. The agent then learns a policy that maps states to actions by imitating the expert's behavior. Decoupling in imitation learning refers to separating the learning process from the task-specific reward function, allowing the policy to generalize across tasks without explicitly depending on a task-specific reward. Through decoupling, an agent can learn from expert demonstrations and acquire a policy that adapts to various situations. Decoupling also enables transfer learning, in which a policy learned in one domain can be adapted to other domains with minimal fine-tuning. By learning a generalized policy that is not tied to a specific reward function, the agent can apply knowledge gained in one task to perform well in other, related tasks. Because the agent does not depend on a specific reward function, it can adapt to changes in the reward function or the environment without extensive retraining, making the learned policy more robust and generalizable across environments. In this context, decoupling refers to separating two parts of the learning process: learning the reward function and learning the optimal policy.
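The decoupling described above can be illustrated schematically: first fit a reward model that prefers expert state-action pairs, then optimize a policy against that learned reward without any task-specific reward function. The sketch below is our own simplification under those assumptions, not a full inverse-RL or GAIL pipeline.

```python
# Sketch of decoupling: (1) learn a reward model from expert demonstrations,
# then (2) separately optimize a policy against the learned reward.

import torch
import torch.nn as nn


def mlp(in_dim: int, out_dim: int) -> nn.Module:
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))


def learn_reward(expert_sa: torch.Tensor, other_sa: torch.Tensor, steps: int = 200):
    """Fit a reward model that scores expert (state, action) pairs above others."""
    reward = mlp(expert_sa.shape[1], 1)
    opt = torch.optim.Adam(reward.parameters(), lr=1e-3)
    bce = nn.BCEWithLogitsLoss()
    labels = torch.cat([torch.ones(len(expert_sa), 1), torch.zeros(len(other_sa), 1)])
    for _ in range(steps):
        opt.zero_grad()
        logits = torch.cat([reward(expert_sa), reward(other_sa)])
        bce(logits, labels).backward()
        opt.step()
    return reward


def learn_policy(states: torch.Tensor, reward: nn.Module, action_dim: int, steps: int = 200):
    """Optimize a policy against the learned reward; no task-specific reward is used."""
    for p in reward.parameters():          # freeze the learned reward model
        p.requires_grad_(False)
    policy = mlp(states.shape[1], action_dim)
    opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
    for _ in range(steps):
        opt.zero_grad()
        actions = policy(states)
        loss = -reward(torch.cat([states, actions], dim=1)).mean()
        loss.backward()
        opt.step()
    return policy


if __name__ == "__main__":
    expert = torch.randn(128, 10)          # toy (state, action) pairs: 8 state + 2 action dims
    random_sa = torch.randn(128, 10)
    reward_model = learn_reward(expert, random_sa)
    policy = learn_policy(torch.randn(128, 8), reward_model, action_dim=2)
```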

Figure 3: Example of an emergent interaction mechanism that uses an agent to identify image-relevant text from candidate texts. The task involves a multimodal AI agent integrating external world information from the web together with manually annotated knowledge-interaction samples.

Generalization → Emergent Behavior

Generalization describes how emergent properties or behaviors arise from simpler components or rules. The key idea is to identify the basic elements or rules that govern the system's behavior, such as individual neurons or basic algorithms, and then observe how these simple components or rules interact with one another. Their interactions often give rise to complex behaviors that cannot be predicted by examining the individual components alone. Generalizing across different levels of complexity allows the system to learn general principles that apply at each of those levels, leading to emergent properties. This lets the system adapt to new situations, demonstrating the emergence of more complex behaviors from simpler rules. Moreover, the ability to generalize across levels of complexity facilitates the transfer of knowledge from one domain to another, which helps complex behaviors emerge in new environments as the system adapts.

2.2.5 Reasoning Augmentation

The reasoning capabilities of AI agents lie in their ability to interpret, predict, and respond based on their training and input data. While these capabilities are advanced and improving, it is important to recognize their limitations and the influence of the underlying data on which they are trained. In the context of large language models (LLMs) in particular, reasoning refers to their ability to draw conclusions, make predictions, and generate responses based on the data they were trained on and the inputs they receive. Reasoning augmentation in AI agents refers to using additional tools, techniques, or data to enhance an AI's native reasoning capabilities in order to improve its performance, accuracy, and utility. This is especially important in complex decision-making scenarios or when dealing with nuanced or specialized content. We list particularly important sources of reasoning augmentation below:

Data richness.

Incorporating additional, often external, data sources to provide more context or background can help AI agents make more informed inferences, especially in areas where their training data may be limited. For example, AI agents can infer meaning from the context of a conversation or text: they analyze the given information and use it to understand the intent and relevant details of a user's query. These models are also good at recognizing patterns in data, and they use this ability to infer information about language, user behavior, or other relevant phenomena based on patterns learned during training.

Algorithmic Enhancement.

Improving the underlying algorithms of AI for better reasoning. This may involve using more advanced machine learning models, integrating different types of AI (e.g., combining natural language processing (NLP) with image recognition), or updating algorithms to better handle complex tasks. Reasoning in language modeling involves understanding and generating human language. This includes grasping tone, intent, and the nuances of different language structures.

Human in the Loop (HITL).

Involving humans to augment the AI's reasoning can be particularly useful in areas where human judgment is critical, such as ethical considerations, creative tasks, or ambiguous scenarios. Humans can provide guidance, correct errors, or offer insights that the agent cannot infer on its own.

Real-time feedback integration.

Using real-time feedback from the user or environment to augment reasoning is another promising approach to improving performance during inference. For example, an AI may adjust its recommendations based on real-time user responses or changing conditions in a dynamic system. Alternatively, if the AI takes an action that violates certain rules in a simulated environment, feedback may be dynamically provided to the AI to help it correct itself.

Cross-disciplinary knowledge transfer.

Leveraging knowledge or models from one domain to improve reasoning in another is particularly useful when generating output in specialized disciplines. For example, techniques developed for language translation might be applied to code generation, or insights from medical diagnostics could enhance predictive maintenance in machinery.

Customization for specific use cases.

Tailoring an AI's reasoning capabilities to a specific application or industry may involve training the AI on specialized datasets or fine-tuning its model to better suit a particular task, such as legal analysis, medical diagnosis, or financial forecasting. Because the specialized language or information within a domain may differ markedly from that of other domains, fine-tuning the agent on domain-specific knowledge can be beneficial.
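A domain-specific adaptation step of the kind suggested above might look like the following sketch, where a frozen, pre-trained backbone is paired with a small task head fine-tuned on domain examples. The toy backbone, dimensions, and data are hypothetical placeholders for a real pre-trained model and a domain corpus (e.g., legal or medical text).

```python
# Sketch of adapting a pre-trained model to a specific domain by fine-tuning
# only a small task head on domain-specific examples (placeholder data).

import torch
import torch.nn as nn


class DomainClassifier(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int, num_labels: int):
        super().__init__()
        self.backbone = backbone                   # pre-trained encoder, kept frozen
        for p in self.backbone.parameters():
            p.requires_grad = False
        self.head = nn.Linear(feat_dim, num_labels)  # trained on domain data only

    def forward(self, x):
        with torch.no_grad():
            feats = self.backbone(x)
        return self.head(feats)


def finetune(model: DomainClassifier, features, labels, epochs: int = 20):
    opt = torch.optim.AdamW(model.head.parameters(), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(features), labels)
        loss.backward()
        opt.step()
    return model


if __name__ == "__main__":
    # Stand-in "pre-trained" encoder and a toy domain dataset.
    backbone = nn.Sequential(nn.Linear(32, 64), nn.ReLU())
    model = DomainClassifier(backbone, feat_dim=64, num_labels=3)
    x, y = torch.randn(128, 32), torch.randint(0, 3, (128,))
    finetune(model, x, y)
```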

Ethical and bias considerations.

It is important to ensure that the augmentation process does not introduce new biases or ethical issues. This requires careful consideration of how additional data sources or new reasoning-augmentation algorithms affect fairness and transparency. When reasoning, especially about sensitive topics, AI agents must also take ethical considerations into account, including avoiding harmful stereotypes, respecting privacy, and ensuring fairness.

Continuous learning and adaptation.

Regularly update and refine AI's capabilities to keep up with new developments, changing data landscapes, and evolving user needs.

In summary, reasoning augmentation in AI agents involves enhancing their native reasoning capabilities with additional data, improved algorithms, human input, and other techniques. Depending on the use case, such augmentation is often critical for handling complex tasks and ensuring the accuracy of an agent's output.

2.2.6 Regulation

Recent significant advances in agent AI and its integration with embodied systems have opened up new possibilities for interacting with agents through more immersive, dynamic, and engaging experiences. To accelerate this process and ease the labor involved in developing agent AI, we propose developing a next-generation, AI-powered pipeline for agent interaction: a human-machine collaboration system that enables humans and machines to communicate and interact meaningfully. Such a system could use the conversational abilities and broad action repertoire of a large language model (LLM) or visual language model (VLM) to converse with human players and identify their needs, and would then perform appropriate actions to assist the human player as required.

When using LLMs/VLMs for human-robot collaboration systems, it is important to note that these models operate as black boxes and produce unpredictable output. This uncertainty can become critical in physical settings (e.g., operating an actual robot). One way to address this challenge is to constrain the LLM/VLM's focus through prompt engineering. For example, in robot task planning from instructions, including environmental information in the prompt has been reported to produce more stable output than relying on text alone Gramopadhye and Szafir (2022). This finding is supported by Minsky's frame theory of AI Minsky (1975), which suggests that the problem space the LLM/VLM must solve is defined by the given prompt. Another approach is to design prompts so that the LLM/VLM includes explanatory text, enabling the user to understand what the model is focusing on or recognizing. In addition, implementing a higher layer that allows pre-execution validation and modification under human guidance can facilitate the operation of systems that work under such oversight (Figure 4).
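The prompting strategy described above, constraining the LLM/VLM with explicit environmental information and requesting explanatory text that a human can review before execution, can be sketched as a prompt template. The template fields, the `ask_llm` call, and the `confirm` callback are hypothetical stand-ins, not part of the cited system.

```python
# Sketch of a constrained robot task-planning prompt that embeds environment
# state and asks for an explanation a human can review before execution.
# `ask_llm` and `confirm` are hypothetical stand-ins for an LLM API call and
# a human-review step.

import json

PROMPT_TEMPLATE = """You are a robot task planner.
Environment (objects and their states):
{environment}

Instruction: {instruction}

Respond with JSON containing:
  "steps": an ordered list of actions chosen ONLY from {allowed_actions},
  "explanation": a short description of what you focused on and why.
Do not invent objects that are not listed in the environment."""


def build_prompt(instruction, environment, allowed_actions):
    return PROMPT_TEMPLATE.format(
        environment=json.dumps(environment, indent=2),
        instruction=instruction,
        allowed_actions=allowed_actions,
    )


def plan_with_review(instruction, environment, allowed_actions, ask_llm, confirm):
    """Generate a plan, show it to a human, and only return it if approved."""
    plan = json.loads(ask_llm(build_prompt(instruction, environment, allowed_actions)))
    if confirm(plan):            # pre-execution validation under human guidance
        return plan["steps"]
    return []                    # rejected plans are never executed
```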

Figure 4: Robot teaching system developed in Wake et al. (2023c). (Left) The system workflow consists of task planning, in which ChatGPT plans robot tasks from instructions and environmental information, and demonstration, in which the user visually demonstrates the action sequences. All steps are reviewed by the user, and if any step fails or shows deficiencies, earlier steps can be revisited as needed. (Right) A web application that supports uploading demonstration data and interaction between the user and ChatGPT.

2.3 Agent AI for Emergent Abilities

Despite the growing adoption of interactive agent AI systems, most proposed methods still struggle to generalize to unseen environments or scenarios. Current modeling practice requires developers to prepare large datasets for each domain in order to fine-tune or pre-train models; this process is costly, or even impossible, when the domain is new. To address this problem, we build interactive agents that leverage the knowledge memory of general foundation models (ChatGPT, DALL-E, GPT-4, etc.) for novel scenarios, specifically for generating a collaboration space between humans and agents. We identify an emergent mechanism, which we call mixed reality with knowledge-inference interaction, that facilitates collaboration with humans to solve challenging tasks in complex real-world environments and enables the exploration of unseen environments for adaptation to virtual reality. Under this mechanism, the agent learns: i) micro-reactions across modalities, collecting relevant individual knowledge for each interaction task (e.g., understanding unseen scenes) from explicit web sources and implicitly inferring it from the outputs of pre-trained models; and ii) macro-behavior in a reality-agnostic manner, refining the dimensions and patterns of interaction in the language and multimodal domains and reasoning over them according to the represented role, the relevant goal variables, mixed reality, and the diversity of collaborative information in the LLM. We investigate knowledge-guided interactive synergy for scene generation in collaboration with various OpenAI models, and we show how an interactive agent system can further boost the promising results of large foundation models in our setting, integrating and improving the depth of generalization, awareness, and interpretability of complex adaptive AI systems.

 

3 Agent AI Paradigm

In this section, we discuss a new paradigm and framework for training agent AI. We aim to achieve several goals with the proposed framework:

  • Utilizing existing pre-trained models and pre-training strategies to effectively bootstrap our agents' understanding of important modalities, such as textual or visual inputs.
  • Supporting sufficient capacity for long-horizon task planning.
  • Incorporating a memory framework that allows learned knowledge to be encoded and retrieved later.
  • Allowing environmental feedback to be used to effectively train the agent to learn which actions to take.

We show a high-level diagram of the new agent paradigm in Figure 5, outlining the important submodules of such a system.

Figure 5: Our proposed new agent paradigm for multimodal generalist agents. There are five main modules: 1) Environment and Perception, including task planning and skill observation; 2) Agent Learning; 3) Memory; 4) Agent Action; and 5) Cognition.

3.1 Large Language Models and Visual Language Models

We can use either a large language model (LLM) or a visual language model (VLM) to bootstrap the components of an agent, as shown in Figure 5. In particular, LLMs have been shown to perform well at task planning Gong et al. (2023a), to contain substantial world knowledge Yu et al. (2023b), and to display impressive logical reasoning Creswell et al. (2022). In addition, VLMs such as CLIP Radford et al. (2021) provide a general-purpose visual encoder aligned with language, along with zero-shot visual recognition capabilities. For example, state-of-the-art open-source multimodal models such as LLaVA Liu et al. (2023c) and InstructBLIP Dai et al. (2023) rely on a frozen CLIP model as their visual encoder.
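The "frozen encoders plus learned bridge" design used by models such as LLaVA and InstructBLIP can be caricatured as follows. The dimensions, the single linear projection, and the toy stand-in modules are illustrative assumptions, not the actual architectures.

```python
# Sketch of bridging a frozen vision encoder to a frozen language model with a
# small trainable projection, in the spirit of LLaVA/InstructBLIP-style designs.
# Dimensions and modules are illustrative placeholders, not the real models.

import torch
import torch.nn as nn


class VisionLanguageBridge(nn.Module):
    def __init__(self, vision_encoder, language_model, vis_dim=768, txt_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder      # frozen, e.g., a CLIP-style ViT
        self.language_model = language_model      # frozen LLM
        for module in (self.vision_encoder, self.language_model):
            for p in module.parameters():
                p.requires_grad = False
        # Only this projection is trained: it maps image features into the
        # LLM's token-embedding space so they can be consumed as "visual tokens".
        self.projection = nn.Linear(vis_dim, txt_dim)

    def forward(self, image, text_embeddings):
        with torch.no_grad():
            vis_feats = self.vision_encoder(image)          # (B, N, vis_dim)
        visual_tokens = self.projection(vis_feats)          # (B, N, txt_dim)
        inputs = torch.cat([visual_tokens, text_embeddings], dim=1)
        return self.language_model(inputs)


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    vision = nn.Sequential(nn.Flatten(2), nn.Linear(16, 768))   # fake patch encoder
    llm = nn.Linear(4096, 4096)                                 # fake LLM body
    bridge = VisionLanguageBridge(vision, llm)
    img = torch.randn(2, 49, 4, 4)        # 49 "patches" of 4x4, flattened to 16 dims
    txt = torch.randn(2, 10, 4096)
    out = bridge(img, txt)
```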

Figure 6: The current paradigm for creating multimodal AI agents by combining large language models (LLMs) with large vision models (LVMs). These models typically take visual or language input and use frozen, pre-trained vision and language models, learning smaller sub-networks that connect and bridge the modalities. Examples include Flamingo Alayrac et al. (2022), BLIP-2 Li et al. (2023c), InstructBLIP Dai et al. (2023), and LLaVA Liu et al. (2023c).

3.2 Agent Transformer Definition

Instead of using frozen LLMs and VLMs as the AI agent, it is also possible to use a single agent Transformer model that takes visual tokens and language tokens as input, similar to Gato Reed et al. (2022). In addition to vision and language, we add a third, general type of input, which we call agent tokens. Conceptually, agent tokens reserve a specific subspace of the model's input and output space for agentic behaviors. For robots or game playing, this can represent the controller's input action space. Agent tokens can also be used when training an agent to use specific tools, such as image-generation or image-editing models, or to make other API calls. As shown in Figure 7, we can combine agent tokens with visual and language tokens to produce a unified interface for training multimodal agent AI. Using an agent Transformer has several advantages over using a large proprietary LLM as the agent. First, the model can easily be customized to very specific agent tasks that may be difficult to express in natural language (e.g., controller inputs or other highly specific actions), so the agent can learn from environmental interactions and domain-specific data to improve performance. Second, access to the probabilities of the agent tokens makes it easier to understand why the model does or does not take a particular action. Third, certain domains (e.g., healthcare and law) have strict data-privacy requirements. Finally, a relatively small agent Transformer may be far cheaper than a larger proprietary language model.
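One way to read the agent-token idea above is as reserving a block of new token IDs for discrete actions alongside the text vocabulary, so that a single Transformer can emit either word pieces or actions. The sketch below is our own illustration of that bookkeeping, with a hypothetical action set; it is not the paper's implementation.

```python
# Illustrative sketch of reserving "agent tokens" for discrete actions in the
# same vocabulary used for text, so one Transformer can emit text or actions.

class AgentTokenizer:
    def __init__(self, text_vocab_size: int, actions):
        self.text_vocab_size = text_vocab_size
        # Agent tokens occupy a reserved block of IDs after the text vocabulary.
        self.action_to_id = {a: text_vocab_size + i for i, a in enumerate(actions)}
        self.id_to_action = {i: a for a, i in self.action_to_id.items()}

    def encode_action(self, action: str) -> int:
        return self.action_to_id[action]

    def decode(self, token_id: int):
        if token_id >= self.text_vocab_size:
            return ("action", self.id_to_action[token_id])
        return ("text", token_id)   # would map to a word piece in a real tokenizer


if __name__ == "__main__":
    tok = AgentTokenizer(text_vocab_size=32000,
                         actions=["MOVE_FORWARD", "TURN_LEFT", "GRASP", "RELEASE"])
    ids = [tok.encode_action("GRASP"), 1234]
    print([tok.decode(i) for i in ids])
    # [('action', 'GRASP'), ('text', 1234)]
```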

Agent AI: Exploring the Boundaries of Multimodal Interaction-5
Fig. 7: Unified multimodal agent Transformer model. Instead of connecting frozen sub-modules and using existing base models as building blocks, we propose a unified, end-to-end training paradigm for agent systems. We can still initialize the sub-modules with the large language model and large vision model of Figure 6, but we also use Intelligent Body Tokens, specialized tokens that train the model to perform agent behaviors in a specific domain (e.g., robotics). For more details on Intelligent Body Tokens, see Section 3.2.

3.3 Creating an Intelligent Body Transformer

As shown in Figure 5 above, we can combine the new agent paradigm, in which agents are guided by large language models and visual language models, with data generated by those large base models to train an Intelligent Body Transformer that learns to achieve specific goals. In this process, the agent model is customized for a particular task and domain, which lets it leverage the capabilities and knowledge learned by pre-existing base models. We give a simplified overview of the process in two steps below:

Define goals within the domain.

In order to train an Intelligent Body Transformer, the agent's goal and action space must be explicitly defined for each particular environment. This includes determining which specific tasks or actions the agent needs to perform and assigning a unique Intelligent Body Token to each task or action. In addition, any automated rules or procedures that can recognize successful task completion can significantly increase the amount of data available for training. Otherwise, data generated by a base model or manually annotated data will be needed to train the model. Once data has been collected and the agent's performance can be evaluated, the process of continuous improvement can begin.
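
The sketch below shows one way this first step might look in code: an explicit task definition that names the action space, assigns each action a unique reserved token id, and attaches an automated success check. The environment, actions, and success rule are hypothetical examples, not a specification from the paper.

```python
# Minimal sketch: declaring a domain's goals, action space, and success rule
# before training an Intelligent Body Transformer. Everything here is illustrative.
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class TaskSpec:
    name: str
    instruction: str
    actions: Dict[str, int]                # action name -> reserved token id
    success_check: Callable[[dict], bool]  # automated rule over environment state

def make_kitchen_task(agent_token_offset: int = 40_192) -> TaskSpec:
    action_names = ["PICK", "PLACE", "OPEN_DRAWER", "CLOSE_DRAWER", "DONE"]
    actions = {name: agent_token_offset + i for i, name in enumerate(action_names)}

    # An automated success rule lets us label rollouts without human annotation.
    def success(state: dict) -> bool:
        return state.get("apple_location") == "bowl" and state.get("drawer") == "closed"

    return TaskSpec(
        name="put_apple_in_bowl",
        instruction="Put the apple in the bowl and close the drawer.",
        actions=actions,
        success_check=success,
    )

task = make_kitchen_task()
example_final_state = {"apple_location": "bowl", "drawer": "closed"}
print(task.actions)
print("rollout successful:", task.success_check(example_final_state))
```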

Continuous improvement.

Continuously monitoring the performance of the model and collecting feedback are fundamental steps in the process. Feedback should be used for further fine-tuning and updating. It is also critical to ensure that the model does not perpetuate biased or unethical results. This requires scrutinizing the training data, periodically checking for bias in the output, and, if needed, training the model to identify and avoid bias. Once the model reaches satisfactory performance, it can be deployed into the intended application. Continuous monitoring remains critical to ensure that the model is performing as expected and to facilitate necessary adjustments. See Section 8 for more details on this process, sources of training data, and more details on continuous learning of AI for intelligences.

 

4 Intelligent Body AI Learning

4.1 Strategies and Mechanisms

The strategy for interactive AI in different domains extends the paradigm of invoking large base models with a trained agent that actively collects user feedback, action information, and knowledge useful for generation and interaction. Sometimes the large language model or visual language model does not need to be retrained at all: we improve the agent's performance simply by providing it with better contextual prompts at test time. In other cases, modeling the interaction of knowledge, reasoning, and common sense involves a combination of three systems: one performs knowledge retrieval from multimodal queries, a second generates interactions from the relevant agent, and a third trains a new, better-informed agent through self-supervised training or pre-training, further improved with reinforcement learning or imitation learning.

4.1.1 Reinforcement Learning (RL)

There is a long history of using reinforcement learning (RL) to train interactive agents that exhibit intelligent behavior. Reinforcement learning learns the optimal mapping between states and actions from the rewards (or punishments) received for taking those actions. It is a highly scalable framework that has been applied to numerous applications, including robotics; however, it typically faces several well-known challenges, and large language models/visual language models have shown potential for mitigating or overcoming some of these difficulties:

  • Reward design: The efficiency of policy learning depends heavily on the design of the reward function. Designing a reward function requires not only an understanding of the reinforcement learning algorithm but also deep knowledge of the nature of the task, so the function is usually designed from expert experience. Several studies have explored the use of large language models/visual language models to design reward functions Yu et al. (2023a); Katara et al. (2023); Ma et al. (2023).
  • Data collection and efficiency: Given its exploratory nature, policy learning with reinforcement learning requires large amounts of data Padalkar et al. (2023). The need for data becomes particularly evident when the policy involves managing long sequences or integrating complex operations, because these scenarios require more nuanced decision making and learning from a wider range of situations. In recent studies, efforts have been made to enhance data generation to support policy learning Kumar et al. (2023); Du et al. (2023). Furthermore, some studies have integrated these models into the reward function to improve policy learning Sontakke et al. (2023). In parallel, other work has focused on achieving parameter efficiency in the learning process using visual language models Tang et al. (2023); Li et al. (2023d) and large language models Shi et al. (2023).
  • Long horizons: Related to data efficiency, reinforcement learning becomes more challenging as the length of the action sequence grows. This is due to the unclear relationship between actions and rewards (known as the credit assignment problem), as well as the growth in the number of states to be explored, which demands a large amount of time and data. A typical approach for long, complex tasks is to decompose them into a sequence of subgoals and apply pre-trained policies to solve each subgoal (e.g., Takamatsu et al. (2022)); see the sketch after this list. This idea falls under the Task and Motion Planning (TAMP) framework Garrett et al. (2021). Task and motion planning consists of two main components: task planning, which identifies high-level sequences of operations, and motion planning, which finds physically consistent, collision-free trajectories that achieve the goals of the task plan. Large language models are well suited to task and motion planning, and recent research typically uses a large language model for high-level task planning while low-level control is handled by reinforcement-learning-based policies Xu et al. (2023); Sun et al. (2023a); Li et al. (2023b); Parakh et al. (2023). The advanced capabilities of large language models allow them to efficiently decompose abstract instructions into subgoals Wake et al. (2023c), thus contributing to better language understanding in robotic systems.
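
The sketch below illustrates the TAMP-style pattern described in the last item: a language model proposes high-level subgoals and pre-trained low-level policies execute them. The call_llm stub and the skill library are placeholders standing in for a real LLM endpoint and real controllers.

```python
# Minimal sketch: TAMP-style decomposition where a language model proposes
# high-level subgoals and pre-trained low-level policies execute them.
# `call_llm` and the skill library are placeholders, not a real API.
from typing import Callable, Dict, List

def call_llm(prompt: str) -> str:
    """Placeholder for any chat/completions endpoint; returns one subgoal per line."""
    return "walk_to(table)\npick(mug)\nwalk_to(sink)\nplace(mug, sink)"

def decompose(instruction: str) -> List[str]:
    prompt = (
        "Decompose the instruction into subgoals, one per line, "
        f"using only known skills.\nInstruction: {instruction}\nSubgoals:"
    )
    return [line.strip() for line in call_llm(prompt).splitlines() if line.strip()]

# Hypothetical pre-trained policies, keyed by skill name; each returns True on success.
SKILLS: Dict[str, Callable[[str], bool]] = {
    "walk_to": lambda arg: True,
    "pick":    lambda arg: True,
    "place":   lambda arg: True,
}

def execute(subgoal: str) -> bool:
    skill, _, args = subgoal.partition("(")
    return SKILLS.get(skill, lambda a: False)(args.rstrip(")"))

for sg in decompose("Bring the mug from the table to the sink."):
    ok = execute(sg)
    print(f"{sg}: {'ok' if ok else 'failed'}")
    if not ok:
        break  # a real system would replan here
```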

4.1.2 Imitation Learning (IL)

While reinforcement learning aims to train strategies based on exploring behaviors and maximizing the rewards of interacting with the environment, imitation learning (IL) aims to use expert data to mimic the behavior of experienced intelligences or experts. For example, one of the main frameworks for imitation-based learning in robotics is behavioral cloning (BC). Behavioral cloning is an approach to train a robot to imitate an expert's actions through direct copying. In this approach, the actions of an expert in performing a specific task are recorded and the robot is trained to replicate these actions in similar situations. Recent approaches based on behavioral cloning typically combine techniques from large language modeling/visual language modeling, resulting in more advanced end-to-end models. For example, Brohan et al. present RT-1 Brohan et al. (2022) and RT-2 Brohan et al. (2023), Transformer-based models that take a series of images and language as inputs and output sequences of base and arm actions. These models are reported to show high generalization performance due to being trained on a large amount of training data.
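
To make the behavioral-cloning recipe concrete, here is a minimal PyTorch sketch of a BC training loop over (observation, expert action) pairs. The network, dimensions, and random stand-in data are illustrative assumptions, not the RT-1/RT-2 setup.

```python
# Minimal sketch: behavioral cloning (BC) on (observation, expert action) pairs.
# Random tensors stand in for demonstration data; dimensions are arbitrary.
import torch
import torch.nn as nn

OBS_DIM, N_ACTIONS, BATCH, STEPS = 32, 7, 64, 200

policy = nn.Sequential(
    nn.Linear(OBS_DIM, 128), nn.ReLU(),
    nn.Linear(128, N_ACTIONS),          # logits over discrete expert actions
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(STEPS):
    obs = torch.randn(BATCH, OBS_DIM)                       # stand-in for state features
    expert_action = torch.randint(0, N_ACTIONS, (BATCH,))   # stand-in for expert labels
    logits = policy(obs)
    loss = loss_fn(logits, expert_action)                   # imitate the expert's choice
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 50 == 0:
        print(f"step {step}: loss {loss.item():.3f}")
```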

4.1.3 Traditional RGB

Learning agent behavior from image inputs has been of interest for many years Mnih et al. (2015). The inherent challenge of using RGB input is the curse of dimensionality. To address this problem, researchers either use more data Jang et al. (2022); Ha et al. (2023) or introduce inductive biases into the model design to improve sample efficiency. In particular, several works integrate 3D structure into the model architecture used for manipulation Zeng et al. (2021); Shridhar et al. (2023); Goyal et al. (2023); James and Davison (2022). For robot navigation, Chaplot et al. (2020a, b) utilize maps as a representation. Maps can be learned by a neural network that aggregates all previous RGB inputs, or by 3D reconstruction methods (e.g., neural radiance fields) Rosinol et al. (2022).

To obtain more data, researchers have used graphics simulators to synthesize data Mu et al. (2021); Gong et al. (2023b) and have attempted to close the sim2real gap Tobin et al. (2017); Sadeghi and Levine (2016); Peng et al. (2018). Recently, concerted efforts have been made to curate large-scale datasets that aim to solve the data-scarcity problem Padalkar et al. (2023); Brohan et al. In parallel, data augmentation techniques have been extensively investigated to improve sample efficiency Zeng et al. (2021); Rao et al. (2020); Haarnoja et al. (2023); Lifshitz et al.

4.1.4 In-context learning

In-context learning has been shown to be an effective way to solve natural language processing tasks with large language models such as GPT-3 Brown et al. (2020); Min et al. (2022). By providing task examples in the context of the prompt, few-shot prompting effectively conditions the model's output on the task at hand across a variety of natural language processing tasks. Factors such as the diversity and quality of the examples presented in context can improve the quality of the model's output An et al. (2023); Dong et al. (2022). In the setting of multimodal base models, models such as Flamingo and BLIP-2 Alayrac et al. (2022); Li et al. (2023c) have been shown to be effective on a wide variety of visual understanding tasks when given only a small number of examples. In-context learning for agents in an environment can be further improved by incorporating environment-specific feedback when certain actions are taken Gong et al. (2023a).
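
The sketch below shows what such a few-shot prompt for an agent might look like in code, including a slot for feedback from the previous action. The example demonstrations, observation strings, and feedback message are all hypothetical.

```python
# Minimal sketch: building a few-shot prompt for an agent, optionally appending
# environment feedback from the previous attempt. All strings are illustrative.
from typing import Optional

FEW_SHOT_EXAMPLES = [
    {"observation": "door is closed, key on table", "action": "pick_up(key)"},
    {"observation": "holding key, door is closed",  "action": "unlock(door)"},
]

def build_prompt(observation: str, feedback: Optional[str] = None) -> str:
    lines = ["You control a household agent. Choose the next action."]
    for ex in FEW_SHOT_EXAMPLES:                      # in-context demonstrations
        lines.append(f"Observation: {ex['observation']}\nAction: {ex['action']}")
    if feedback:                                      # environment-specific feedback
        lines.append(f"Feedback on last action: {feedback}")
    lines.append(f"Observation: {observation}\nAction:")
    return "\n\n".join(lines)

print(build_prompt("holding key, door is locked",
                   feedback="unlock(door) failed: wrong key"))
```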

4.1.5 Optimization in Intelligent Body Systems

Optimization of intelligent body systems can be divided into spatial and temporal aspects. Spatial optimization considers how intelligent bodies operate in physical space to perform tasks. This includes coordination between robots, resource allocation, and maintaining an organized space.

To optimize agent AI systems efficiently, especially those in which many agents run in parallel, previous work has used high-throughput reinforcement learning Shacklett et al. (2023). Because task-specific multi-agent interaction datasets are rare, self-play reinforcement learning lets a team of agents improve over time. However, this can also produce very brittle agents that only work under self-play and not with humans or other independent agents, because they overfit to the self-play training paradigm. To address this problem, we can instead discover a diverse set of conventions Cui et al. (2023); Sarkar et al. (2023) and train an agent that is aware of the various conventions. Base models can further help establish conventions with humans or other independent agents, enabling smooth coordination with new agents.

On the other hand, temporal optimization focuses on how agents perform tasks over time, including task scheduling, sequencing, and timeline efficiency. For example, optimizing the trajectory of a robot arm between successive tasks is an instance of efficiently optimizing motion Zhou et al. (2023c). At the level of task scheduling, approaches such as LLM-DP Dagan et al. (2023) and ReAct Yao et al. (2023a) address effective task planning by interactively incorporating environmental factors.
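
For illustration, the following is a minimal ReAct-style loop in the spirit of Yao et al. (2023a): the model alternates a Thought and an Action, and the environment returns an Observation that is appended to the transcript. The call_llm and environment_step functions are stubs, not real APIs.

```python
# Minimal sketch of a ReAct-style loop: the model alternates a Thought and an
# Action, the environment returns an Observation, and the transcript grows.
# `call_llm` and `environment_step` are stubs, not real APIs.
def call_llm(transcript: str) -> str:
    """Placeholder: a real LLM would return 'Thought: ...\\nAction: ...'."""
    return "Thought: the pot is empty, I should add the onion.\nAction: put(onion, pot)"

def environment_step(action: str) -> str:
    """Placeholder environment: returns a textual observation for the action."""
    return f"Executed {action}; the pot now contains an onion."

transcript = "Task: cook onion soup.\n"
for turn in range(3):                        # small fixed budget for the sketch
    step = call_llm(transcript)
    action = step.split("Action:", 1)[1].strip()
    observation = environment_step(action)
    transcript += f"{step}\nObservation: {observation}\n"
    if action.startswith("finish"):
        break
print(transcript)
```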

4.2 Agent systems (zero-shot and few-shot levels)

4.2.1 Agent Module

Our initial exploration of the Agent paradigm involved developing Agent AI "modules" for interactive multimodal Agents that use either a Large Language Model (LLM) or a Visual Language Model (VLM). Our initial Agent modules can be used with training or with in-context learning, and adopt a minimalist design intended to demonstrate an Agent's ability to schedule and coordinate effectively. We also explored initial prompt-based memory techniques that help with better planning and can inform future action approaches in the field. To illustrate this, our "MindAgent" infrastructure consists of 5 main modules: 1) Environment Awareness with Task Planning, 2) Agent Learning, 3) Memory, 4) Generalized Agent Action Prediction, and 5) Cognition, as shown in Figure 5.
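
The skeleton below shows how these five modules might be wired together in code. It is a hypothetical outline for exposition only, not the actual MindAgent implementation; all class names, plans, and outcomes are placeholders.

```python
# Minimal sketch: a skeleton mirroring the five modules of Figure 5.
# This is a hypothetical outline, not the actual MindAgent implementation.
class Perception:                       # 1) environment awareness + task planning
    def sense_and_plan(self, env_state: dict) -> list:
        return ["chop(onion)", "boil(water)"]        # placeholder plan

class AgentLearning:                    # 2) agent learning (fine-tuning / in-context)
    def update(self, trajectory: list) -> None:
        pass                                          # no-op in this sketch

class Memory:                           # 3) memory
    def __init__(self):
        self.events = []
    def store(self, event: str) -> None:
        self.events.append(event)

class ActionPrediction:                 # 4) generalized agent action prediction
    def next_action(self, plan: list, memory: Memory) -> str:
        return plan[len(memory.events) % len(plan)]

class Cognition:                        # 5) cognition: reflect on the outcome
    def reflect(self, action: str, outcome: str) -> str:
        return f"{action} -> {outcome}"

perception, learner = Perception(), AgentLearning()
memory, actor, cognition = Memory(), ActionPrediction(), Cognition()
plan = perception.sense_and_plan({"scene": "kitchen"})
for _ in range(2):
    action = actor.next_action(plan, memory)
    memory.store(cognition.reflect(action, "ok"))
learner.update(memory.events)
print(memory.events)
```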

4.2.2 Agent infrastructure

Agent-based AI is a large and rapidly growing community spanning entertainment, research, and industry. The development of large base models has significantly improved the performance of Agent AI systems. However, creating Agents in this way is limited by the increasing effort and overall cost required to create high-quality datasets. At Microsoft, building a high-quality Agent infrastructure has had a significant impact on multimodal Agent copilots through the use of advanced hardware, diverse data sources, and robust software libraries. As Microsoft continues to push the boundaries of Agent technology, the AI Agent platform is expected to remain a dominant force in multimodal intelligence for years to come. Nonetheless, Agent AI interaction currently remains a complex process requiring a combination of skills. Recent advances in large-scale generative AI models have the potential to significantly reduce the high cost and time currently required for interactive content, enabling both large studios and smaller, independent content creators to design high-quality experiences beyond their current capabilities. Multimodal Agents inside current HCI systems are largely rule-based. They do exhibit intelligent behavior in response to human/user actions and possess some degree of web knowledge. However, these interactions are usually limited by the cost of software development, which prevents particular behaviors from being implemented in the system. In addition, current models are not designed to help users achieve goals when they are unable to accomplish specific tasks. Therefore, an Agent AI system infrastructure is needed that analyzes user behavior and provides appropriate support when needed.

4.3 Agent-based base models (pre-training and fine-tuning levels)

The use of pre-trained base models provides significant advantages in terms of broad applicability across a variety of use cases. The integration of these models enables the development of customized solutions for a variety of applications, thus avoiding the need to prepare large labeled datasets for each specific task.

A notable example in the field of navigation is the LM-Nav system Shah et al. (2023a), which combines GPT-3 and CLIP in a novel way. It uses textual landmarks generated by a language model and grounds them in the images acquired by the robot in order to navigate. This approach demonstrates a seamless fusion of textual and visual data that significantly enhances the robot's navigation capabilities while maintaining broad applicability.

In robot manipulation, several studies have proposed using off-the-shelf LLMs (e.g., ChatGPT) together with an open-vocabulary object detector. Combining an LLM with an advanced object detector (e.g., Detic Zhou et al. (2022)) helps the system understand human commands while grounding the textual information in the scene Parakh et al. (2023). Furthermore, recent advances demonstrate the potential of combining prompt engineering with advanced multimodal models such as GPT-4V(ision) Wake et al. (2023b). This technique opens the way to multimodal task planning and highlights the versatility and adaptability of pre-trained models across a variety of environments.
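
As a rough illustration of this detector-plus-LLM pattern, the sketch below feeds a list of detected, labeled boxes into a planning prompt. The detections, the call_llm function, and the plan format are placeholders; a real system might use Detic or another open-vocabulary detector and any LLM endpoint.

```python
# Minimal sketch: grounding a command by pairing an open-vocabulary detector's
# output with an LLM planner. The detections and `call_llm` are placeholders.
from typing import Dict, List

def detect_objects(image_path: str) -> List[Dict]:
    """Placeholder detector: returns labeled boxes for the current camera frame."""
    return [
        {"label": "red apple", "box": [312, 140, 360, 188]},
        {"label": "table",     "box": [0, 200, 640, 480]},
        {"label": "mug",       "box": [420, 150, 470, 210]},
    ]

def call_llm(prompt: str) -> str:
    """Placeholder LLM: would return a grounded step-by-step plan."""
    return "1. move_to(box=[312,140,360,188])\n2. grasp(red apple)\n3. lift()"

def ground_command(command: str, image_path: str) -> str:
    scene = "\n".join(f"- {d['label']} at {d['box']}" for d in detect_objects(image_path))
    prompt = (
        "Objects visible in the scene:\n" + scene +
        f"\n\nHuman command: {command}\n"
        "Produce a numbered plan that refers only to the listed objects."
    )
    return call_llm(prompt)

print(ground_command("pick up the red apple from the table", "frame.png"))
```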

 

5 Agent AI Classification

5.1 Generic Agent Domain

Computer-based action and Generalized Agents (GA) are useful for many tasks. Recent advances in large base models and interactive AI have enabled new capabilities for GAs. However, for a GA to be truly valuable to its users, it must be easy to interact with and must generalize to a broad range of environments and modalities. We expand on the main areas of Agent-based AI in depth in Section 6, especially as they relate to the following general topics:

Multimodal Agent AI (MMA) is an upcoming forum^1 (current URL: https://multimodalagentai.github.io/) for our research and industry communities to interact with each other and with the broader Agent AI research and technology community. Recent advances in large base models and interactive AI have enabled new capabilities for General Purpose Agents (GA), such as predicting user behavior and task planning in constrained environments (e.g., MindAgent Gong et al. (2023a)), fine-grained multimodal video understanding Luo et al. (2022), and robotics Ahn et al. (2022b); Brohan et al. (2023), as well as providing users with chat companions that incorporate knowledge feedback (e.g., web-based customer support for healthcare systems Peng et al. (2023)). Representative recent work is described below. We hope to share our vision for the future of MMA and inspire future researchers to work in this field. This paper and our forum cover, but are not limited to, the following major topics:

  • Main Theme: Multimodal Agent AI, Generalized Agent AI
  • Secondary theme: Embodied Agents, Motion Agents, Language-based Agents, Visual and Linguistic Agents, Knowledge and Reasoning Agents, Agents for Gaming, Robotics, Healthcare, and more.
  • Extended Theme: Visual navigation, simulated environments, rearrangement, Agent-based modeling, VR/AR/MR, embodied vision, and language.

Next, we list the following representative Agent categories:

5.2 Embodied Agent

Our biological brain exists in our bodies, and our bodies move through an ever-changing world. The goal of embodied AI is to create Agents, such as robots, that learn to creatively solve challenging tasks that require interaction with the environment. While this is a huge challenge, major advances in deep learning and the growing popularity of large datasets such as ImageNet have enabled superhuman performance on a wide range of AI tasks that were previously considered difficult to handle. Computer vision, speech recognition, and natural language processing have undergone transformative revolutions in passive input-output tasks such as language translation and image categorization, while reinforcement learning has achieved world-class performance in interactive tasks such as gaming. These advances provide a powerful impetus for embodied AI, enabling more and more users to move quickly toward intelligent agents that can interact with machines.

5.2.1 Action Agent

Action Agents are agents that need to perform physical actions in a simulated physical environment or in the real world; in particular, they need to be actively involved in activities with the environment. We broadly categorize Action Agents into two different classes based on their application areas: game AI and robotics.

In game AI, the Agent interacts with the game environment and other independent entities. In these settings, natural language enables smooth communication between the Agent and humans. Depending on the game, there may be a specific task to accomplish that provides a genuine reward signal. For example, in the competitive game Diplomacy, human-level play has been achieved by combining human dialogue data with language models and reinforcement-learning action policies Meta Fundamental AI Research (FAIR) Diplomacy Team et al. (2022).

In some cases, Agents will act like ordinary residents of a town Park et al. (2023a) without trying to optimize for specific goals. Base models are useful in these settings because they can simulate more natural-looking interactions by mimicking human behavior. When augmented with external memory, they produce convincing Agents that can carry on conversations, organize daily activities, build relationships, and have virtual lives.

5.2.2 Interactive Agent

Interactive Agents simply refer to Agents that can interact with the world, a broader class than Action Agents. Their form of interaction does not necessarily require physical actions; it may involve conveying information to the user or modifying the environment. For example, an embodied Interactive Agent could answer a user's questions about a topic through dialogue or help the user parse existing information, similar to a chatbot. By extending an Agent's capabilities to include information sharing, the core designs and algorithms of Agent AI can be adapted effectively to a range of applications, such as diagnostics Lee et al. (2023) and knowledge-retrieval Agents Peng et al. (2023).

5.3 Simulation and Environment Agent

An effective way for an AI Agent to learn how to act in an environment is through repeated trial-and-error interaction with the environment. A representative approach is reinforcement learning (RL), which requires a large number of trials (and failures) to train the Agent. While methods exist that use physical Agents Kalashnikov et al. (2018), using physical Agents is both time-consuming and expensive. Moreover, when failures in real environments can be dangerous (e.g., autonomous driving, underwater vehicles), training in physical environments is often infeasible. Therefore, learning policies in simulators is a common approach.

A number of simulation platforms have been proposed for embodied AI research, ranging from navigation Tsoi et al. (2022); Deitke et al. (2020); Kolve et al. (2017) to object manipulation Wang et al. (2023d); Mees et al. (2022); Yang et al. (2023a); Ehsani et al. (2021). One example is Habitat Savva et al. (2019); Szot et al. (2021), which provides a 3D indoor environment in which human and robot Agents can perform tasks such as navigation, instruction following, and question answering. Another representative simulation platform is VirtualHome Puig et al. (2018), which supports human avatars manipulating objects in a 3D indoor environment. In the gaming domain, Carroll et al. introduced "Overcooked-AI", a benchmark environment for studying cooperative tasks between humans and AI Carroll et al. (2019). Similarly, some work aims to incorporate realistic human intervention beyond the focus on Agent-environment interaction Puig et al. (2023); Li et al. (2021a); Srivastava et al. (2022). These simulators facilitate policy learning for real-world settings involving human-robot interaction, as well as IL-based policy learning from human demonstrations.

In some cases, policy learning may require specialized functionality to be integrated in the simulator. For example, learning image-based policies often requires realistic rendering to facilitate transfer to real environments Mittal et al. (2023); Zhong et al. (2023). Using a realistic rendering engine is effective for generating images that reflect a variety of conditions (e.g., lighting). In addition, simulators with physics engines are needed to simulate physical interactions with objects Liu and Negrut (2021). Integrating physics engines into simulation has been shown to help acquire skills that are applicable to real-world scenarios Saito et al. (2023).

5.4 Generative Agent

Recent advances in the area of large-scale generative AI models have the potential to significantly reduce the high cost and time currently required for interactive content, both for large game studios and for smaller independent studios to create high-quality experiences beyond their current capabilities. Additionally, embedding large-scale AI models into sandbox environments will allow users to author their own experiences and express their creativity in ways that are currently unattainable.

The goal of this Agent is not just to add interactive 3D content to the scene, but also to:

  • Add arbitrary behaviors and interaction rules to objects, allowing users to create their own VR rules with minimal prompting.
  • Generate entire level geometry from paper sketches by using multimodal GPT-4V models, as well as other model chains involving visual AI models.
  • Re-texture scene contents using a diffusion model.
  • Create custom shaders and visual effects from simple user prompts.

One potential short-term application is VR creation storyboarding/prototyping tools that allow a single user to create a rough (but fully functional) sketch of an experience/game at an order of magnitude faster than is currently feasible. Such a prototype could then be extended and improved using these tools.

5.4.1 AR/VR/Mixed Reality Agent

AR/VR/Mixed Reality (collectively XR) settings currently require skilled artists and animators to create the characters, environments, and objects used to simulate interactions in the virtual world. This is a costly process involving concept art, 3D modeling, texturing, rigging, and animation. XR Agents can assist in this process by facilitating interaction between creators and the building tools, helping to construct the final virtual environment.

Our early experiments have shown that GPT models can be used few-shot within the Unity engine (without any additional fine-tuning) to invoke engine-specific methods: making API calls to download 3D models from the Internet, placing them into the scene, and assigning behaviors and animations to them in a state tree Huang et al. (2023a). This behavior may arise because similar code appears in open-source game repositories that use Unity. As a result, a GPT model is able to build rich visual scenes by loading many objects into the scene from simple user prompts.
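
The experiments above used GPT models inside Unity itself; the Python sketch below only illustrates the general pattern of turning a model's structured output into engine calls. The JSON schema, the call_llm stub, and the place_object function are assumptions made for illustration, not the engine's actual API.

```python
# Minimal sketch: turning a model's structured output into scene-building calls.
# `call_llm` returns placeholder JSON; `place_object` stands in for an
# engine-specific call (e.g., instantiating a prefab in Unity).
import json
from typing import List, Optional

def call_llm(prompt: str) -> str:
    """Placeholder: a real call would ask the model for scene objects as JSON."""
    return json.dumps([
        {"asset": "pine_tree", "position": [2.0, 0.0, 5.0], "animation": None},
        {"asset": "campfire",  "position": [0.0, 0.0, 3.0], "animation": "flicker"},
    ])

def place_object(asset: str, position: List[float], animation: Optional[str]) -> None:
    """Stand-in for an engine call that spawns the asset in the scene."""
    suffix = f" with animation '{animation}'" if animation else ""
    print(f"spawn {asset} at {position}{suffix}")

user_prompt = "Build a small forest clearing with a campfire."
scene_spec = json.loads(call_llm(
    f"Return a JSON list of objects (asset, position, animation) for: {user_prompt}"
))
for obj in scene_spec:
    place_object(obj["asset"], obj["position"], obj["animation"])
```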

The goal of this class of Agents is to build a platform and a set of tools that provide an efficient interface between large AI models (the GPT family as well as diffusion image models) and rendering engines. We explore two main avenues here:

  • Integrate large models into the various editor tools in the Agent infrastructure to dramatically speed up development.
  • Control the rendering engine from the user experience by generating code that follows the user's instructions and compiles at runtime, thus allowing the user to edit the VR/simulation they are interacting with in any way they want, and even introducing new Agent mechanisms.

Introducing an AI co-pilot focused on XR setups is useful for XR creators, who can use the co-pilot for tedious tasks such as providing simple assets or writing code samples, allowing creators to focus on their creative vision and iterate on ideas quickly.

In addition, the Agent can help users modify their environment interactively by adding new assets, changing the dynamics of the environment, or building new settings. This form of runtime dynamic generation can also be specified by the creator so that the user's experience feels fresh and evolves over time.

5.5 Knowledge and logical reasoning intelligences

The ability to reason and apply knowledge is a defining characteristic of human cognition, particularly evident in complex tasks such as logical deduction and understanding theories of mind (see https://plato.stanford.edu/entries/cognitive-science). Reasoning over knowledge ensures that an AI's responses and behaviors are consistent with known facts and logical principles. This coherence is a key mechanism for maintaining trust and reliability in AI systems, especially in critical applications such as medical diagnosis or legal analysis. Here, we present several kinds of agents that combine knowledge and reasoning and that are designed to address specific facets of intelligence and reasoning.

5.5.1 Knowledge Intelligence

Knowledge intelligences reason about their acquired knowledge systems in two ways: implicitly and explicitly. Implicit knowledge is typically the knowledge encapsulated by large language models like the GPT family Brown et al. (2020); OpenAI (2023) after being trained on large amounts of textual data. These models can generate responses that give the impression of understanding because they utilize patterns and information implicitly learned during training. In contrast, explicit knowledge is structured and can be directly queried, such as information found in knowledge bases or databases, and has traditionally augmented AI's reasoning by referencing verifiable external resources.

Despite advances in language models, their implicit knowledge is static and becomes outdated as the world evolves Lewis et al. (2020); Peng et al. (2023). This limitation makes it necessary to integrate continuously updated sources of explicit knowledge to ensure that AI systems can provide accurate and up-to-date responses. The integration of implicit and explicit knowledge equips AI intelligences with a more nuanced understanding and the ability to apply knowledge in context like human intelligence Gao et al. (2022). This integration is critical for building knowledge-centered AI intelligences that not only have information, but can understand, interpret, and use that information, thus bridging the gap between extensive learning and deep knowledge (Marcus and Davis, 2019; Gao et al., 2020). These intelligences aim to reason in a flexible manner and with dynamic information about the world, thus enhancing their robustness and adaptability (Marcus, 2020).

5.5.2 Logical intelligences

Typically, logical intelligences are components of a system designed to apply logical reasoning to process data or solve tasks specific to logical reasoning or logical argumentation. In the context of a large base model like GPT-4, logical intelligences refer to specialized components or submodules designed to handle logical reasoning tasks. These tasks typically involve understanding and manipulating abstract concepts, inferring conclusions from given premises, or solving problems that require a structured, logical approach. Broadly speaking, base models like GPT-4 are trained on large corpora of textual data and learn to perform a variety of tasks, including those that require some form of logical reasoning. As such, their logical reasoning capabilities are integrated into the overall architecture, and they typically do not have separate, isolated "logical intelligences". While GPT-4 and similar models can perform tasks involving logic, their approach is fundamentally different from the way humans or traditional logic-based systems operate. They do not follow formal rules of logic or have an explicit understanding of logic; instead, they generate responses based on patterns learned from training data. As a result, their performance in logic tasks can be impressive, but can also be inconsistent or limited by the nature of the training data and the inherent limitations of the model design. An example of embedding a separate logic submodule in the architecture is (Wang et al., 2023e), which modifies the Token embedding process used by the LLM during pre-training by parsing the text into logical segments and explicitly modeling the logical hierarchy in the Token embedding.

5.5.3 Intelligentsia for Emotional Reasoning

Emotional understanding and empathy are important skills for agents in many human-computer interactions. To illustrate this point, an important goal in creating engaging conversational agents is for the agent to act with greater emotion and empathy while minimizing socially inappropriate or offensive outputs. To advance this goal for conversational agents, we released the Neural Image Commenting with Empathy (NICE) dataset Chen et al. (2021), which contains nearly 2 million images with corresponding human-generated comments and a set of human sentiment annotations. We also provide a novel pre-trained model, Modeling Affect Generation for Image Comments (MAGIC) Chen et al. (2021), which aims to generate comments for images, conditioned on linguistic representations that capture style and emotion, to help produce comments that are more empathetic, emotional, engaging, and socially appropriate. Our experiments show that this approach is effective for training more human-like and engaging image-commenting agents. Developing empathy-aware agents is a promising direction for interactive agents, and it is important to create agents with emotional understanding across diverse groups and populations, especially given that many current language models exhibit bias in their emotional understanding and empathetic reasoning abilities Mao et al. (2022); Wake et al. (2023d).

5.5.4 Neuro-symbolic intelligences

Neuro-symbolic agents operate on hybrid systems of neural and symbolic components d'Avila Garcez and Lamb (2020). Solving problems posed in natural language is challenging because it requires explicitly capturing the discrete symbolic structure implicit in the input. However, most general-purpose neural sequence models do not explicitly capture such structural information, which limits their performance on these tasks. The work of Chen et al. (2020) proposes TP-N2F, a novel encoder-decoder model based on structured neural representations. The TP-N2F encoder uses TPR "binding" to encode natural-language symbolic structure in vector space, and the decoder uses TPR "unbinding" to generate, in symbolic space, a sequential program represented as relational tuples, each containing a relation (or operation) and a number of arguments.

Instruction-following visual language (VL) models such as GPT-4 provide a flexible interface that supports a wide range of multimodal tasks in a zero-shot manner. However, interfaces that operate on whole images do not directly let the user "point to" and access specific regions within an image. This capability is important not only for supporting reference-grounded VL benchmarks, but also for real-world applications that require precise within-image reasoning. In (Park et al., 2023b), we build a localized visual commonsense model that allows users to specify one or more regions as input. We train our model by sampling localized commonsense knowledge from a large language model (LLM): specifically, we prompt the LLM to produce commonsense knowledge given a global literal image description and a set of local literal region descriptions generated automatically by a VL model. This pipeline is scalable and fully automated, since no aligned or manually written image-text pairs are required. By separately training a critic model to select high-quality examples, we find that training on the localized commonsense corpus expanded from images alone can successfully refine existing VL models so that they support a reference-as-input interface. Empirical results in zero-shot settings and human evaluation show that our refinement method yields VL models that reason more precisely than baselines that are passed generated referring expressions.

5.6 Large Language Modeling and Visual Language Modeling Intelligentsia

Much work uses large language models as agents to perform task planning (Huang et al., 2022a; Wang et al., 2023b; Yao et al., 2023a; Li et al., 2023a), leveraging the Internet-scale domain knowledge and zero-shot planning abilities of large language models to carry out agent tasks such as planning and reasoning. Recent robotics research also uses large language models for task planning (Ahn et al., 2022a; Huang et al., 2022b; Liang et al., 2022) by decomposing natural-language commands into sequences of subtasks, expressed either in natural language or in Python code, and then executing those subtasks with a low-level controller. In addition, Huang et al. (2022b), Liang et al. (2022), and Wang et al. (2023a) incorporate environmental feedback to improve task performance. There is also a large body of work demonstrating that general-purpose, visually aligned large language models trained on large-scale text, image, and video data can serve as a foundation for creating multimodal agents that are embodied and can act in a variety of environments Baker et al. (2022); Driess et al. (2023); Brohan et al.

 

6 Intelligent Body AI Application Tasks

Agent AI: Exploring the Boundaries of Multimodal Interaction-2
Figure 8: Minecraft Dungeons game scene simulation and generation via GPT-4V for embodied agents, used for user-interaction game action prediction and interaction editing.

6.1 Game Intelligentsia

Games provide a unique sandbox in which to test the behavior of agents built on large language models and visual language models, pushing the boundaries of their ability to collaborate and make decisions. In particular, we describe three domains that highlight agents' ability to interact with human players and other agents and to take meaningful actions in their environment.

6.1.1 NPC Behavior

In modern game systems, the behavior of non-player characters (NPCs) is largely determined by predefined scripts written by developers. These scripts contain a set of reactions and interactions based on various triggers or player actions in the game environment. However, this scripted nature often results in predictable or repetitive NPC behavior that does not evolve based on player actions or the dynamic environment of the game. This rigidity hinders the immersive experience expected in a dynamic game environment. As a result, there is increasing interest in leveraging large language models to induce autonomy and adaptability in NPC behavior to make interactions more nuanced and engaging. Artificial intelligence-driven NPCs can learn from player behavior, adapt to different strategies, and provide more challenging and unpredictable game experiences. Large Language Models (LLMs) can make a significant contribution to the evolution of NPC behavior in games. By processing large amounts of text, LLMs can learn patterns and generate more varied and human-like responses. They can be used to create dynamic dialog systems that make interactions with NPCs more engaging and less predictable. In addition, Big Language Models can be trained with player feedback and in-game data to continuously improve NPC behavior to better match player expectations and game dynamics.

6.1.2 Man-machine NPC interaction

The interaction between human players and NPCs is a key aspect of the gaming experience. Traditional interaction models are largely one-dimensional, with NPCs reacting to player input in predetermined ways. This limitation stifles the potential for more natural, richer interactions (similar to human-to-human interaction) within the virtual realm. The emergence of large language model and visual language model technologies promises to change this paradigm. By employing these technologies, game systems can analyze and learn from human behavior in order to provide more human-like interactions. This not only enhances the realism and engagement of games, but also provides a platform for exploring and understanding human-computer interaction in controlled yet complex environments.

Agent AI: Exploring the Boundaries of Multimodal Interaction-3
Figure 9: The GPT-4V can effectively predict high-level next moves when given the "Action History" and "Game Goal" in the cue. In addition, the GPT-4V accurately recognizes that the player is holding a wooden log in his hand and can incorporate this perceived information into his future plans of action. Although GPT-4V appears to be able to predict some low-level actions (e.g., pressing the "E" key to open a backpack), the model's outputs are not inherently suited for primitive low-level action prediction (including mouse movements) and may require supplemental modules for low-level action control.

6.1.3 Analysis of Intelligent Body Based Games

Gaming is an integral part of everyday life, attracting an estimated half of the world's population (https://www.dfcint.com/global-video-game-audience-reaches-3-7-billion/). It also has a positive impact on mental health (https://news.microsoft.com/source/features/work-life/mind-games-how-gaming-can-play-a-positive-role-in-mental-health/). However, contemporary game systems are deficient in their interactions with human players because their behaviors are largely handcrafted by game developers. These pre-programmed behaviors often fail to adapt to players' needs. New AI systems are therefore needed in games that can analyze player behavior and provide appropriate support when necessary. Intelligent interaction systems have the potential to revolutionize the way gamers interact with game systems. NPCs' interactions with gamers are no longer constrained by a limited set of rules designed by game developers; they have the potential to adapt seamlessly to the gamer's experience, providing timely feedback to enrich the gaming experience and elevate the synergy of human-computer interaction.

Big Language Models can be powerful tools for analyzing in-game textual data, including chat transcripts, player feedback, and narrative content. They can help identify patterns in player behavior, preferences, and interactions, which can be invaluable to game developers in improving game mechanics and narratives. In addition, visual language models can parse large amounts of image and video data from game sessions to help analyze user intent and actions in the game world. In addition, large language models and visual language models can facilitate the development of intelligent in-game intelligences that can communicate with the player and other intelligences in a sophisticated and human-like manner, thus enhancing the overall game experience. In addition to large language models and visual language models, user input data offers a promising avenue for creating game intelligences that model perception, game play, and game understanding by mimicking human players. By combining player interaction and feedback, pixel input, and natural language planning and comprehension, intelligent body models can help to continuously improve game dynamics, thereby driving game environments to evolve in a more player-centric manner.

6.1.4 Game Scene Synthesis

Agent AI: Exploring the Boundaries of Multimodal Interaction-3
Figure 10: Masked video prediction on unseen Minecraft video. From left to right: original frame, masked frame, reconstructed frame, and reconstructed frame with patch.

Scene synthesis is an essential part of creating and enhancing immersive game environments. It entails the automatic or semi-automatic generation of three-dimensional (3D) scenes and environments within a game. This process includes generating terrain, placing objects, creating realistic lighting, and sometimes even dynamic weather systems.

Modern games often feature vast open-world environments, and designing these landscapes manually can be time-consuming and resource-intensive. Automated terrain generation (often using procedural or AI-driven techniques) can produce complex, realistic landscapes with little human intervention. Large language models and visual language models can draw on Internet-scale knowledge to formulate rules for designing visually impressive, non-repetitive landscapes. In addition, they can be used to ensure the semantic consistency and variability of the generated assets. Placing objects such as buildings, vegetation, and other elements in a scene in a realistic and aesthetically pleasing manner is critical for immersion.

Agent AI: Exploring the World of Multimodal Interaction [Fei-Fei Li - Classic Must Read]-1
Figure 11: Low-level next-action prediction using a pre-trained model of small intelligences in a Minecraft game scenario.

Visual Language Models and Big Language Models can speed up the level design process by following predefined or learned rules and aesthetics to aid in object placement. Visual Language Models and Big Language Models can be further trained to understand the principles of design and aesthetics that can help in procedural generation of content. They can help develop rules or guidelines that program algorithms can follow to generate objects and scenes that are visually appealing and appropriate in context.

Realistic lighting and atmospheric effects are the foundation for creating believable and engaging game environments. Advanced algorithms can simulate natural lighting conditions and dynamic weather effects to enhance the realism and atmosphere of a scene. Big Language Modeling can help develop systems to achieve more realistic lighting and atmospheric effects in several innovative ways. Visual language models can analyze large data sets from real-world lighting and atmospheric conditions to help develop more realistic algorithms to simulate these effects in games. By understanding the patterns and complexity of natural lighting and weather, these models can contribute to the development of algorithms that closely mimic reality. Large language models and visual language models can also be used to develop systems that adjust lighting and atmospheric effects in real time based on player actions, game state, or external inputs. They can process natural language commands from the player to modify the game environment to provide a more interactive and immersive experience.

6.1.5 Experiments and results

Zero-shot/few-shot learning using a large language model or a visual language model.

As shown in Figures 8 and 9, we use GPT-4V for high-level description and action prediction. Figure 8 shows qualitative examples of using GPT-4V to generate action descriptions and edits. Agent-augmented text opens a new route to 3D scene generation that uses game-action priors to improve the naturalness of the scene. As a result, GPT-4V generates high-level descriptions that are relevant to the game video.

Pre-training small agent models.

To demonstrate our agent visual-language architecture, we first investigate its application to pre-training on Minecraft data, a domain widely used for game agents. As shown in Figure 7, given input agent actions, video keyframes, and the corresponding text, a standard encoder-decoder can convert the agent actions and images into action-text tokens and image-patch tokens, which the agent visual-language decoder then turns into an action-prediction sentence. The overall architecture is shown in Figure 7. We evaluate our approach on several Minecraft demonstrations. The Minecraft video data consists of 5-minute clips; the pre-training pool contains 78K videos, of which we use 5K (about 6% of the pre-training data) for the first round of pre-training. We train a 250-million-parameter model on 16 NVIDIA V100 GPUs for one day and visualize the model's outputs in Figures 10 and 11. Figure 10 shows that our relatively small agent architecture can produce reasonable outputs for Minecraft scenes not seen during training. Figure 11 compares the model's predictions with the actions of real human players, demonstrating the potential low-level understanding of our small agent model.
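
The toy sketch below conveys the flavor of this pipeline: image-patch tokens and action-text tokens are embedded into one input sequence, and an encoder-decoder Transformer is trained to emit an action-prediction sentence. The dimensions, random tokens, and loss setup are stand-ins for exposition, not the 250-million-parameter model described above.

```python
# Minimal sketch: a toy encoder-decoder forward pass over image-patch tokens and
# action-text tokens, in the spirit of the agent pre-training described above.
# Dimensions are tiny and arbitrary; this is not the actual model.
import torch
import torch.nn as nn

VOCAB, D_MODEL, N_PATCHES, N_ACTION_TOKENS = 1_000, 64, 16, 6

embed = nn.Embedding(VOCAB, D_MODEL)
transformer = nn.Transformer(
    d_model=D_MODEL, nhead=4, num_encoder_layers=2, num_decoder_layers=2,
    dim_feedforward=128, batch_first=True,
)
to_vocab = nn.Linear(D_MODEL, VOCAB)

# Placeholder tokens: discretized keyframe patches plus the current action text.
patch_tokens = torch.randint(0, VOCAB, (1, N_PATCHES))
action_tokens = torch.randint(0, VOCAB, (1, N_ACTION_TOKENS))
encoder_input = torch.cat([patch_tokens, action_tokens], dim=1)

# The decoder is trained (teacher-forced) to emit the action-prediction sentence.
target_sentence = torch.randint(0, VOCAB, (1, 8))
decoder_input, labels = target_sentence[:, :-1], target_sentence[:, 1:]

hidden = transformer(embed(encoder_input), embed(decoder_input))
logits = to_vocab(hidden)                 # (batch, target_len - 1, vocab)
loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB), labels.reshape(-1))
print("toy pre-training loss:", float(loss))
```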

Agent AI: Exploring the Boundaries of Multimodal Interaction-5
Figure 12: The MindAgent in-context learning game infrastructure. Planning skills and tool usage: the game environment requires diverse planning skills and tool usage to accomplish tasks; this component produces the relevant game information and converts the game data into a structured text format that the large language model can process. Large language model: the main workhorse of our infrastructure, which makes the decisions and thus acts as the dispatcher of the multi-agent system. Memory history: a utility for storing relevant information. Action module: extracts actions from the text input, converts them into a domain-specific language, and validates the DSL so that no errors occur during execution.

Multi-agent infrastructure.

As shown in the agent paradigm of Figure 5, we designed a new infrastructure for a new game scenario called "CuisineWorld" Gong et al. (2023a). We describe our approach in detail in Figure 12. Our infrastructure enables multi-agent collaboration by using GPT-4 as the central planner, and it works across multiple game domains. We investigate the system's multi-agent planning capabilities and deploy the infrastructure into real-world video games to demonstrate its effectiveness for multi-agent and human-AI collaboration. In addition, we present "CuisineWorld", a text-based benchmark for multi-agent collaboration that provides a new automated metric, the Collaboration Score (CoS), to quantify collaboration efficiency.

For more examples and details of game descriptions, high-level action prediction, and GPT-4V prompting, see the Appendix. We show examples for Bleeding Edge in Figure 32 and Appendix B; Microsoft Flight Simulator in Figure 33 and Appendix C; ASSASSIN'S CREED ODYSSEY in Figure 34 and Appendix D; GEARS of WAR 4 in Figure 35 and Appendix E; and Starfield in Figure 36 and Appendix F. We also provide a detailed screenshot of the GPT-4V prompting process used to generate the Minecraft examples in Figure 31 and Appendix A.

6.2 Robotics

Robots are typical intelligences that need to interact effectively with their environment. In this section, we introduce the key elements of efficient robot operation, review the research topics where the latest Large Language Modeling/Visual Language Modeling (LLM/VLM) techniques have been applied, and share the results of our recent research.

Visual Motion Control.

Visual-motor control refers to the integration of visual perception and motor actions to perform tasks efficiently in a robotic system. This integration is critical because it allows the robot to interpret visual data from its environment and adjust its motor actions accordingly to accurately interact with the environment. For example, on an assembly line, a robot equipped with visual motion control can sense the position and orientation of objects and accurately adjust its robotic arm to interact with those objects. This capability is critical to ensuring the accuracy and effectiveness of robotic operations in applications ranging from industrial automation to assisting the elderly with daily chores. In addition, visual motion control helps robots adapt to dynamic environments, where the state of the environment can change rapidly, requiring real-time adjustments to motor actions based on visual feedback.

Furthermore, in the context of safe operation, visual information is crucial for detecting execution errors and confirming pre- and post-conditions for each robot action. In uncontrolled environments, such as unknown home environments, robots are more likely to be exposed to unintended outcomes due to unpredictable factors such as changing furniture shapes, lighting changes and slippage. Executing pre-planned motion scenarios in a feed-forward manner only may pose significant risks in these environments. Therefore, utilizing visual feedback to continuously validate the outcome of each step is key to ensuring robust and reliable operation of the robotic system.

Language-conditioned operation.

Language-conditioned operation refers to the ability of a robotic system to interpret and perform tasks based on linguistic instructions. This aspect is particularly important for creating intuitive and user-friendly human-robot interfaces. Through natural language commands, users can specify goals and tasks to the robot in a manner similar to human-to-human communication, thus lowering the threshold for operating the robotic system. For example, in a real-world scenario, a user can instruct a service robot to "pick up the red apple from the table," and the robot will parse this command, recognize the object it is referring to, and perform the task of picking up the object Wake et al. (2023c). The core challenge is to develop robust natural language processing and understanding algorithms that can accurately interpret a wide range of instructions, from direct commands to more abstract instructions, and enable the robot to translate these instructions into actionable tasks. In addition, ensuring that robots can generalize these instructions to different tasks and environments is critical to increasing the versatility and utility of robots in real-world applications. The use of linguistic inputs to guide task planning for robots has received attention in a robotics framework called Task and Motion Planning Garrett et al. (2021).
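
As a rough illustration of the parsing step, the sketch below maps a command such as "pick up the red apple from the table" to a structured manipulation task. A real system would rely on an LLM/VLM for parsing and a perception module for grounding (as in Wake et al. (2023c)); the toy pattern and data structure here are assumptions for exposition.

```python
# Minimal sketch: turning a natural-language command into a structured robot task.
# A real system would use an LLM/VLM for parsing and a perception module for
# grounding; a simple regular expression stands in for both here.
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class ManipulationTask:
    verb: str
    target_object: str
    source_location: Optional[str] = None

def parse_command(command: str) -> Optional[ManipulationTask]:
    # Toy grammar: "<verb> the <object> [from the <location>]"
    m = re.match(r"(pick up|grasp|take)\s+the\s+(.+?)(?:\s+from\s+the\s+(.+))?$",
                 command.strip().lower())
    if not m:
        return None
    return ManipulationTask(verb=m.group(1), target_object=m.group(2),
                            source_location=m.group(3))

task = parse_command("Pick up the red apple from the table")
print(task)
# A downstream controller would then ground `red apple` in the camera image
# and sequence skills such as approach -> grasp -> lift.
```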

Skill Optimization.

Recent research has emphasized the effectiveness of large language models (LLMs) in robotic task planning. However, optimal execution of tasks, especially those involving physical interactions such as grasping, requires a deeper understanding of the environment than just interpreting human instructions. For example, robotic grasping requires precise contact points Wake et al. (2023e) and arm postures Sasabuchi et al. (2021) for effective execution of subsequent actions. While these elements-precise contact points and arm poses-are intuitive to humans, expressing them through language is challenging. Despite advances in Internet-scale visual language modeling (VLM), capturing these subtle indirect cues from a scene and effectively translating them into robotic skills remains a significant challenge. In response, the robotics community has increasingly focused on collecting augmented datasets (e.g., Wang et al. (2023d); Padalkar et al. (2023)) or developing methods to acquire skills directly from human demonstrations Wake et al. (2021a). Frameworks including Learning from Demonstration and Learning by Imitation are leading these developments and play a crucial role in optimizing physics skills.

6.2.1 Large Language Model/Visual Language Model (LLM/VLM) Intelligentsia for Robotics.

Recent research has demonstrated the potential of Large Language Models/Visual Language Models (LLM/VLM) for robotic intelligences involved in interacting with humans in the environment. Research topics aimed at utilizing the state-of-the-art Large Language Model/Visual Language Model (LLM/VLM) technology include:

Multimodal systems: Recent research has focused on developing end-to-end systems that use the latest large language model (LLM) and visual language model (VLM) technologies as encoders of the input information. In particular, modifying these base models to handle multimodal information has become an important trend Jiang et al. (2022); Brohan et al. (2023, 2022); Li et al. (2023d); Ahn et al. (2022b); Shah et al. (2023b); Li et al. (2023e). This adaptation aims to guide robot actions based on verbal commands and visual cues, achieving effective embodiment.

Mission planning and skills training: In contrast to end-to-end systems, task and motion planning (TAMP)-based systems first compute high-level task plans, which are then realized through low-level robot control (called skills).

The advanced language processing capabilities of large language models (LLMs) have been shown to interpret commands and decompose them into robot action steps, greatly advancing task planning techniques Ni et al. (2023); Li et al. (2023b); Parakh et al. (2023); Wake et al. (2023c). For skill training, several studies have explored the use of large language models (LLMs)/visual language models (VLMs) to design reward functions Yu et al. (2023a); Katara et al. (2023); Ma et al. (2023), to generate data for policy learning Kumar et al. (2023); Du et al. (2023), or to serve as part of a reward function Sontakke et al. (2023). Combined with training frameworks such as reinforcement learning and imitation learning, these efforts help develop efficient robot controllers.
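To make the task-decomposition idea above concrete, here is a minimal Python sketch of prompting an LLM to map a language instruction onto a fixed robot-skill vocabulary. The skill list, the prompt wording, and the call_llm helper are illustrative assumptions, not the interface of any cited system.

    import json

    # Hypothetical skill vocabulary exposed by the robot controller.
    ROBOT_SKILLS = ["move_to(object)", "grasp(object)", "release(object)", "open(object)"]

    def plan_tasks(instruction: str, scene_objects: list, call_llm) -> list:
        """Ask an LLM to decompose an instruction into a sequence of predefined skills."""
        prompt = (
            "Decompose the instruction into a JSON list of steps, each of the form "
            '{"skill": ..., "argument": ...}, using only these skills:\n'
            f"{ROBOT_SKILLS}\n"
            f"Objects in the scene: {scene_objects}\n"
            f"Instruction: {instruction}"
        )
        return json.loads(call_llm(prompt))  # call_llm: str -> str, supplied by the caller

    # Example (hypothetical output):
    # plan_tasks("put the apple in the fridge", ["apple", "fridge", "table"], call_llm)
    # -> [{"skill": "open", "argument": "fridge"}, {"skill": "move_to", "argument": "apple"}, ...]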

On-site optimization: Executing long task sequences in robotics can be difficult because of unexpected and unpredictable environmental conditions. A major challenge in robotics therefore consists in dynamically adapting and refining robotic skills by combining task planning with real-time environmental data. For example, (Ahn et al., 2022b) proposed a method that estimates the feasibility (i.e., affordance) of an action from visual information and compares it against the planned task. In addition, some approaches focus on having a large language model (LLM) output the preconditions and postconditions of a task step (e.g., the states of objects and their interrelationships) in order to optimize its execution Zhou et al. (2023c), or on detecting precondition errors so that the task plan can be revised accordingly Raman et al. (2023). These strategies aim to achieve environment-grounded robot execution by integrating environmental information and adapting the robot's actions at the task-plan or controller level.
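The precondition/postcondition checking described above can be summarized in a short control loop. The sketch below is a simplification under assumed interfaces (preconditions_hold, postconditions_hold, and replan are hypothetical), not the method of any specific paper.

    def execute_with_checks(plan, perceive_state, execute_skill, replan, max_replans=3):
        """Run a task plan, re-planning whenever pre- or post-conditions fail."""
        state = perceive_state()
        steps = list(plan)
        replans = 0
        while steps:
            step = steps[0]
            if not step.preconditions_hold(state):       # e.g. object moved, gripper occupied
                if replans >= max_replans:
                    raise RuntimeError("unable to recover from failed preconditions")
                steps = replan(state, remaining=steps)    # ask the planner (e.g. an LLM) again
                replans += 1
                continue
            execute_skill(step)
            state = perceive_state()
            if not step.postconditions_hold(state):       # execution did not have intended effect
                if replans >= max_replans:
                    raise RuntimeError("unable to recover from failed postconditions")
                steps = replan(state, remaining=steps)
                replans += 1
                continue
            steps = steps[1:]
        return state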

Conversational intelligences: When creating conversational robots, large language models (LLMs) can facilitate natural, contextually appropriate interactions with humans Ye et al. (2023a); Wake et al. (2023f). These models process and generate responses that mimic human conversation, enabling robots to engage in meaningful dialog. In addition, Large Language Models (LLMs) play an important role in the estimation of conceptual Hensel et al. (2023); Teshima et al. (2022) and affective attributes Zhao et al. (2023); Yang et al. (2023b); Wake et al. (2023d). These attributes help in understanding human intentions and generating meaningful gestures, contributing to the naturalness and effectiveness of human-robot communication.

Navigation intelligences: Robot navigation has a long history of research focusing on core aspects such as map-based path planning and simultaneous localization and mapping (SLAM) for building maps of the environment. These features have become standard in widely used robot middleware such as the Robot Operating System (ROS) Guimarães et al. (2016).

While classical navigation techniques remain prevalent in many robotic applications, they typically rely on static or pre-built maps. Recently, there has been increasing interest in advanced techniques that enable robots to navigate more challenging environments by leveraging breakthroughs in areas such as computer vision and natural language processing. A representative task is object navigation Chaplot et al. (2020a); Batra et al. (2020); Gervet et al. (2023); Ramakrishnan et al. (2022); Zhang et al. (2021), in which robots use object names rather than map coordinates for navigation, requiring visual grounding of object names in the environment. More recently, attention has turned to techniques for navigating robots through completely unfamiliar, novel environments in a zero-shot manner on top of a base model, so-called zero-shot object navigation Gadre et al. (2023); Dorbala et al. (2023); Cai et al. (2023). In addition, Visual Language Navigation (VLN) Anderson et al. (2018a) is a representative task in which an intelligent body is navigated by natural language commands through previously unseen real environments Shah et al. (2023a); Zhou et al. (2023a); Dorbala et al. (2022); Liang et al. (2023); Huang et al. (2023b). Visual Language Navigation (VLN) interprets full sentences rather than object names, e.g., "walk to the bathroom on your left," and therefore requires greater capability for parsing the input text Wang et al. (2019).

The emergence of base models has contributed to the development of such adaptive, on-the-fly navigation techniques by enhancing the understanding of human verbal commands and the visual interpretation of environmental information. For a more detailed explanation of representative Visual Language Navigation (VLN) research, see 6.2.2.

6.2.2 Experiments and results.

Figure 13: Overview of a robotic teaching system integrating a ChatGPT-driven task planner. The process consists of two steps: task planning, where the user creates a sequence of actions using the task planner and adjusts the results with feedback as needed; and demonstration, where the user visually demonstrates the sequence of actions to provide the information needed for robot operation. The vision system collects the visual parameters that will be used for robot execution.

There is growing evidence that recent Visual Language Models (VLMs) and Large Language Models (LLMs) have good capabilities for symbolic task planning (e.g., what to do). However, each task also requires low-level control policies (e.g., how to do it) to achieve successful interaction with the environment. While reinforcement learning and imitation learning are promising approaches for learning policies in a data-driven manner, another promising approach is to acquire policies directly from humans through live demonstrations, an approach called learning-from-observation Wake et al. (2021a); Ikeuchi et al. In this section, we present a study in which we used ChatGPT for task planning and enriched the plan by parameterizing it with affordance information to facilitate effective and precise execution (Figure 13).

The process consists of two modules: task planning and parameterization. In task planning, the system receives linguistic instructions and a description of the work environment. These instructions, together with predefined robot actions and output specifications, are compiled into a comprehensive prompt for ChatGPT, which then generates a sequence of decomposed tasks with textual descriptions (Fig. 13, left pane). Notably, we use a prompting-based approach without task-specific training, meaning ChatGPT is not fine-tuned for this task; this is advantageous for applicability because it removes the need for hardware-dependent data collection and model training. In addition, the textual descriptions in the output enable the user to check and adjust the results as needed, a key feature for safe and reliable operation. Figure 14 shows qualitative results of intelligent-body simulations performed in VirtualHome Puig et al. (2018). The results show reasonable task plans and flexibility in adjusting the output, demonstrating the broad applicability of our approach.
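As a rough illustration of how such a planning prompt might be assembled, consider the sketch below. The section headers and parameter names are assumptions made for illustration; the authors' actual prompts (for the GPT-4V variant) are linked later in this section.

    def build_planner_prompt(instruction, environment_description, robot_actions, output_spec):
        """Compile instruction, environment, action vocabulary, and output spec into one prompt."""
        return "\n\n".join([
            "You are a robot task planner.",
            f"Available robot actions: {', '.join(robot_actions)}",
            f"Work environment: {environment_description}",
            f"Required output format: {output_spec}",
            f"Instruction: {instruction}",
            "Return the decomposed task sequence plus a short textual description of each step.",
        ])

    # The textual descriptions in the response can then be shown to the user, and any
    # corrective feedback appended to the conversation before re-querying the model.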

Figure 14: Example of tuning an output sequence with automatically generated feedback. We used the open-source simulator VirtualHome for our experiments. Given the instruction "Pick up the pie on the table and heat it on the stove," the task planner plans a sequence of functions provided in VirtualHome. If an error is detected during execution, the task planner corrects its output using an automatically generated error message.

While the task planner ensures consistency across task sequences, detailed parameters are required for successful operation in the real world. For example, the grasp type is crucial for carrying a container without spilling its contents, a parameter that is often ignored in simulators (see the pie-grasping example in Fig. 14). In our robotic system, the user therefore demonstrates each action visually (Fig. 13, right pane). The tasks have predefined parameters required for execution, which our vision system extracts from the video Wake et al. (2021b). Notably, our robotic system is not designed to accurately replicate human motions (i.e., teleoperation); rather, it is designed to cope with varying real-world conditions, such as changes in object position. Therefore, the parameters extracted from the human demonstration include not only precise motion paths but also affordance information indicative of effective interaction with the environment (e.g., waypoints for collision avoidance Wake et al. (2023a), grasp types Wake et al. (2023e), and upper-limb postures Sasabuchi et al. (2021); Wake et al. (2021a)). Upper-limb posture is critical for robots with high degrees of freedom and is designed to present predictable postures to humans coexisting with the operating robot. Task sequences with affordances are converted into a set of reusable robotic skills that are acquired through reinforcement learning and executed by the robot Takamatsu et al. (2022).

Figure 15: Overview of the multimodal task planner utilizing GPT-4V and GPT-4. The system processes video presentations and textual instructions to generate task plans for robot execution.

Figure 16: Example of video analyzer output. Five frames are extracted at fixed intervals and fed into the GPT-4V. We describe the entire process in Section 6.2.2.

By integrating them with Visual Language Models (VLMs), task planners driven by Large Language Models (LLMs) can be extended toward more general robotic systems. Here we show an example in which we use GPT-4V(ision) to extend the above task planner to a multimodal input setting (Fig. 15), where a human performs actions intended to be replicated by a robot. Only part of the prompt is shown here. The full prompts can be found at microsoft.github.io/GPT4Vision-Robot-Manipulation-Prompts.

The pipeline receives a demonstration video and text, and then outputs a sequence of robot actions. The video analyzer is designed to understand the actions performed by the human in the video. We use GPT-4V with a prompt that produces textual instructions in a style similar to human-to-human communication. Figure 16 illustrates how text input allows the user to provide corrective feedback on GPT-4V's recognition results. This feature, designed to improve the accuracy of the recognition results, also enables more robust operation.
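A minimal sketch of the frame-sampling step that feeds such a video analyzer is shown below, using OpenCV for decoding; query_vlm is a placeholder for a GPT-4V-style multimodal call, and the prompt wording is an assumption.

    import cv2  # pip install opencv-python

    def sample_frames(video_path: str, n_frames: int = 5):
        """Extract n_frames at fixed intervals from a video, as described above."""
        cap = cv2.VideoCapture(video_path)
        total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        frames = []
        for i in range(n_frames):
            cap.set(cv2.CAP_PROP_POS_FRAMES, int(i * total / n_frames))
            ok, frame = cap.read()
            if ok:
                frames.append(frame)
        cap.release()
        return frames

    def describe_demonstration(video_path, query_vlm):
        """Ask a vision-language model to describe the demonstrated action in instruction style."""
        frames = sample_frames(video_path)
        prompt = ("Describe, as an instruction to another person, "
                  "what the human in these frames is doing.")
        return query_vlm(images=frames, prompt=prompt)  # hypothetical VLM helper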

Figure 17: Example output from a scene analyzer utilizing the GPT-4V. We describe our entire process in Section 6.2.2.

Next, the scene analyzer compiles the expected work environment into textual information based on the instructions and the first frame of the video data (or an image of the environment). This environment information includes a list of object names recognized by GPT-4V, the graspable properties of the objects, and the spatial relationships between them. Although these computational processes are a black box within GPT-4V, they are grounded in GPT-4V's knowledge and the image/text input and output. Figure 17 shows example output from our scene analyzer. As shown, GPT-4V successfully selects objects that are relevant to the manipulation. For example, when a human repositions a Spam container on a table, the table is included in the output, whereas it is ignored for the refrigerator-opening task. These results suggest that the scene analyzer encodes scene information conditioned on the human's actions. We prompt GPT-4V to explain its object selections and the reasons behind them; in practice, we found that this approach produces reasonable output. Finally, based on the given textual instructions and environment information, the task planner outputs a sequence of tasks Wake et al. (2023c).

Embodied intelligences for robot navigation.

The task of Visual Language Navigation (VLN) is to navigate an embodied intelligence to execute natural language commands in a realistic 3D environment. Navigating in 3D environments Zhu et al. (2017a); Mirowski et al. (2016); Mousavian et al. (2018); Hemachandra et al. (2015) is an essential capability for mobile intelligent systems operating in the physical world. In recent years, a large number of tasks and evaluation protocols have been proposed Savva et al. (2017); Kolve et al. (2017); Song et al. (2017); Xia et al. (2018); Anderson et al. (2018a), as summarized in Anderson et al. (2018b). Visual Language Navigation (VLN) Anderson et al. (2018a) focuses on language-based navigation in realistic 3D environments. To address the VLN task, (Anderson et al., 2018a) developed an attention-based sequence-to-sequence baseline model. (Wang et al., 2018) then introduced a hybrid approach that combines model-free and model-based reinforcement learning (RL) to improve the model's generalization ability. Finally, (Fried et al., 2018) proposed a speaker-follower model that employs data augmentation, a panoramic action space, and modified beam search for VLN, establishing the then state-of-the-art performance on the Room-to-Room dataset. Building on this work, we presented Reinforced Cross-Modal Matching (RCM) for VLN in (Wang et al., 2019). The RCM model builds on Fried et al. (2018) but differs in several important ways: (1) RCM combines a novel multi-reward reinforcement learning (RL) objective with imitation learning for VLN, whereas the speaker-follower model Fried et al. (2018) uses only supervised learning, as in (Anderson et al., 2018a). (2) The RCM reasoning navigator performs cross-modal grounding rather than temporal attention over unimodal inputs. (3) The RCM matching critic is similar to the speaker in architectural design, but the former provides a cycle-reconstruction intrinsic reward for reinforcement learning (RL) and self-supervised imitation learning (SIL) training, while the latter is used to augment training data for supervised learning. In (Wang et al., 2019), we investigate how to address three key challenges of this task: cross-modal grounding, ill-posed feedback, and generalization. As shown in Fig. 18, we propose a novel reinforced cross-modal matching method that enforces cross-modal grounding both locally and globally via reinforcement learning (RL). In particular, a matching critic provides an intrinsic reward to encourage global matching between instructions and trajectories, and a reasoning navigator performs cross-modal grounding in the local visual scene. Evaluation on the VLN benchmark dataset shows that our RCM model significantly outperforms the previous approach by 10% on SPL and achieves new state-of-the-art performance. To improve the generalization ability of the learned policy, we further introduce a self-supervised imitation learning (SIL) method that explores unseen environments by imitating the agent's own past good decisions. We demonstrate that SIL can approximate better and more efficient policies, greatly narrowing the success-rate performance gap between seen and unseen environments (from 30.7% to 11.7%).
Furthermore, in Wang et al. (2019), we introduce a self-supervised imitation learning method for exploration to explicitly address the generalization problem, which had not been fully investigated in prior work. Concurrently with this work, Thomason et al. (2018); Ke et al. (2019); Ma et al. (2019a, b) studied the Visual Language Navigation (VLN) task from various angles, and Nguyen et al. (2018) introduced a variant of the VLN task in which the agent must request language assistance to find objects. Note that we were the first to propose exploring unseen environments for the VLN task.

Figure 18: Demonstration of visual linguistic navigation (VLN) task embodied intelligences from Wang et al. (2019). Instructions, local visual scenes, and global trajectories in the top view are shown. The intelligent body does not have access to the top view. Path A is the demonstration path that follows the instruction. Paths B and C are two different paths executed by the intelligent body.

6.3 Health care

In healthcare, Large Language Models (LLMs) and Visual Language Models (VLMs) can serve as diagnostic intelligences, patient care assistants, and even therapeutic aids, but they also come with unique challenges and responsibilities. While AI intelligences have enormous potential to improve patient care and save lives, the possibility that they could be misused or hastily deployed, and thereby endanger thousands or even millions of people worldwide, is equally real. We discuss some of the promising avenues for AI intelligences in healthcare and address some of the key challenges they face.

Diagnostic intelligences.

Medical chatbots that use large language models (LLMs) for patient diagnosis have recently attracted a great deal of attention, owing to the high demand for medical expertise and the potential of LLMs to help triage and diagnose patients Lee et al. (2023). Conversational intelligences, especially those that can effectively communicate important medical information to a wide range of people from diverse patient populations, have the potential to provide equitable access to healthcare for historically disadvantaged or marginalized groups. In addition, physicians and healthcare systems around the world are largely overburdened and under-resourced, leaving hundreds of millions of people worldwide without adequate access to healthcare World Health Organization and World Bank (2015). Diagnostic intelligences offer a particularly favorable pathway to improving healthcare for millions of people, since they can be built to understand a variety of languages, cultures, and health conditions. Preliminary results suggest that large language models (LLMs) with healthcare knowledge can be trained by utilizing large-scale web data Li et al. (2023f). While this is an exciting direction, the prospect of diagnostic intelligences is not without risk. In the following sections, we focus on the risk of hallucination in healthcare settings, as well as potential avenues for mitigation.

Knowledge retrieval intelligences.

In the medical context, model hallucinations are particularly dangerous and, depending on the severity of the error, can even lead to serious patient injury or death. For example, if a patient mistakenly receives a diagnosis suggesting a condition they do not actually have, the consequences can be catastrophic: delayed or inappropriate treatment or, in some cases, a complete lack of necessary medical intervention. The severity of an undiagnosed or misdiagnosed condition can lead to increased healthcare costs, prolonged treatment with further physical strain, and, in extreme cases, serious injury or death. Therefore, methods that allow for more reliable knowledge retrieval using intelligences Peng et al. (2023), or methods that generate text in a retrieval-grounded manner Guu et al. (2020), are promising directions. Pairing diagnostic intelligences with medical knowledge-retrieval intelligences has the potential to significantly reduce hallucinations while improving the quality and accuracy of the responses of diagnostic dialog intelligences.
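The retrieval-grounded pattern described here can be sketched as follows; retrieve and call_llm are hypothetical helpers over a curated medical knowledge base, and this is an illustration of the pattern, not a validated clinical system.

    def grounded_answer(question, retrieve, call_llm, top_k=5):
        """Answer only from retrieved, vetted passages, and refuse when coverage is missing."""
        passages = retrieve(question, top_k=top_k)  # e.g. from a curated medical corpus
        context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages))
        prompt = (
            "Answer the question using ONLY the numbered passages below. "
            "Cite the passage numbers you rely on, and reply 'insufficient information' "
            "if the passages do not cover the question.\n"
            f"{context}\n"
            f"Question: {question}"
        )
        return call_llm(prompt)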

Telemedicine and remote monitoring.

Intelligent body-based AI also has great potential in telemedicine and remote monitoring to improve access to healthcare, improve communication between healthcare providers and patients, and increase efficiency and reduce the cost of frequent doctor-patient interactions Amjad et al. (2023). Primary care clinicians spend a great deal of time sifting through patient messages, reports, and emails that are often irrelevant or unnecessary for them to view. Supporting intelligences that can help categorize messages from physicians, patients, and other healthcare providers and help highlight important messages from all parties has tremendous potential. The telemedicine and digital health industry has tremendous transformative potential by enabling intelligent body AI systems to coordinate with patients, clinicians, and other AI intelligences.

6.3.1 Current healthcare capabilities

Figure 19: Examples of prompts and responses when using GPT-4V in the healthcare image-understanding domain. From left to right: (1) an image of a nurse and a doctor performing a CT scan, (2) a synthesized image of an irregular ECG scan, and (3) an image from the ISIC Codella et al. (2018) skin lesion dataset. We can see that GPT-4V possesses substantial medical knowledge and is able to reason about medical images. However, it will not diagnose certain medical images due to its safety training.

Figure 20: Example prompts and responses when using GPT-4V in the healthcare video-understanding domain. We input the example videos as 2x2 grids of frames with overlaid text indicating the frame order. In the first two examples, we prompted GPT-4V to examine the frames and detect clinical bedside activities performed on a volunteer patient. In the last example, we attempted to prompt GPT-4V to evaluate an echocardiogram video, but due to GPT-4V's safety training it did not provide a detailed response. For clarity, we highlight the text describing the activity of interest in bold and abbreviate unnecessary model responses. We gray out individuals' faces to protect their privacy.

Image Understanding.

In Figure 19 we show the current capabilities and limitations of modern multimodal intelligences such as GPT-4V in the healthcare context. We can see that while GPT-4V has substantial internal knowledge of the equipment and procedures involved in hospital care, it does not always respond to more prescriptive or diagnostic user queries.

Video comprehension.

We investigated the performance of visual language modeling (VLM) intelligences for medical video comprehension in two contexts. First, we investigated the ability of Visual Language Model (VLM) intelligences to recognize important patient care activities in clinical space. Second, we explored the application of Visual Language Modeling (VLM) to more specialized videos such as ultrasound. Specifically, in Figure 20, we demonstrate some of the current capabilities and limitations of GPT-4V for hospital care and medical video analysis.

6.4 Multimodal intelligences

Figure 21: An interactive multimodal intelligence consists of four pillars: interaction, speech, vision, and language. A Co-Pilot intelligent body is composed of different services: 1) interaction services help develop a unified platform for automated operations, cognition, and decision-making; 2) audio services integrate audio and speech processing into applications and services; 3) vision services recognize and analyze content in images, videos, and digital ink; 4) language services extract meaning from structured and unstructured text.

The integration of vision and language understanding is critical to the development of complex multimodal AI intelligences. This includes tasks such as image captioning, visual question answering, video language generation, and video understanding. We aim to delve into these visual-language tasks and explore the challenges and opportunities they present in the context of AI intelligences.

6.4.1 Image-Language Understanding and Generation

Image-language understanding is a task that involves interpreting the visual content of a given image and generating an associated linguistic description. This task is crucial for developing AI intelligences that can interact with the world in a more human-like manner. Some of the most popular tasks are image captioning Lin et al. (2014); Sharma et al. (2018); Young et al. (2014); Krishna et al. (2016), referring expression comprehension Yu et al. (2016); Karpathy et al. (2014), and visual question answering Antol et al. (2015); Ren et al. (2015); Singh et al. (2019).

Recently, knowledge-intensive visual question answering tasks such as OKVQA Marino et al. (2019), KB-VQA Wang et al. (2015), FVQA Wang et al. (2017) and WebQA Chang et al. (2021) have been introduced. Multimodal intelligences should be able to recognize objects in an image, understand their spatial relationships, generate accurate descriptive sentences about the scene, and utilize reasoning skills to handle knowledge-intensive visual inference. This requires not only object recognition skills, but also an in-depth understanding of spatial relations, visual semantics, and the ability to map these visual elements to linguistic structures with world knowledge.

6.4.2 Video and Language Understanding and Generation

Video language generation.

Video captioning or video storytelling is the task of generating coherent sentences for a series of video frames. Inspired by the success of recurrent large-scale base models used in video and language tasks, variants of intelligence-driven augmentation models have shown promising results on video language generation tasks. A fundamental challenge is that neural encoder-decoder models, however robust, do not perform well on visual storytelling, because the task requires a comprehensive understanding of the content of each frame as well as the relationships between frames. An important goal in this area is to create an intelligent-body-aware text synthesis model that can efficiently encode a sequence of frames and generate a thematically coherent multi-sentence paragraph.

Video comprehension.

Video understanding extends the scope of image understanding to dynamic visual content. It involves interpreting and reasoning about sequences of frames in a video, often in conjunction with accompanying audio or textual information. Intelligences should be able to handle visual, textual, and audio modalities to demonstrate advanced understanding of video content. Tasks in this area include video captioning, video question answering, and activity recognition. The key challenges in video understanding are multifaceted: they include temporal alignment of visual and verbal content, processing long frame sequences, and interpreting complex activities that unfold over time. With respect to audio, intelligences can process spoken words, background noise, music, and intonation to understand the mood, setting, and nuances of video content.

Previous work has focused on using existing video language training data available online to build video base models Li et al. (2020, 2021b); Fu et al. (2022); Bain et al. (2021); Zellers et al. (2021, 2022); Fu et al. (2023). However, supporting such training processes and functions is difficult due to the limited and often inconsistent nature of these datasets. Video base models are designed using mask-and-contrast pre-training objectives and then adapted on their respective tasks. Despite showing excellent results in multimodal benchmarks, these models struggle with video-only tasks (e.g., action recognition) because they rely on limited video text data constructed from noisy audio transcriptions. This limitation also leads to a lack of robustness and fine-grained reasoning skills that are typically possessed by large language models.

Other approaches, similar to those used in image-language understanding, leverage the powerful reasoning skills and extensive knowledge of large language models to improve different aspects of video interpretation. The video understanding task is simplified by language-only models such as ChatGPT and GPT-4, or image-language models such as GPT-4V, which treat the audio, video, and language modalities as separately interpretable input types and position intelligences as powerful open-source models. For example, Huang et al. (2023c); Li et al. (2023g) convert video understanding into a natural language processing (NLP) question-answering formulation by textualizing video content with open-source visual classification/detection/captioning models. Lin et al. (2023) integrate GPT-4V with specialized tools for vision, audio, and speech to facilitate complex video understanding tasks, such as scripting character actions and maneuvers in long-form videos.

Parallel studies have explored generating scaled datasets from large models and then applying visual instruction tuning Liu et al. (2023c); Li et al. (2023c); Zhu et al. (2023) on the generated data. Videos are then represented using an array of audio, speech, and visual expert perception models: speech is transcribed with automatic speech recognition tools, and video descriptions and related data are generated with various labeling, grounding, and captioning models Li et al. (2023g); Maaz et al. (2023); Chen et al. (2023); Wang et al. (2023f). These techniques demonstrate how instruction tuning of video-language models on generated datasets can enhance video reasoning and communication capabilities.

6.4.3 Experiments and results

  • Knowledge-intensive models: intensive neural knowledge tasks that incorporate human-annotated knowledge to support knowledge-intensive retrieval, as introduced in INK Park et al. (2022) and KAT Gui et al. (2022a).
  • Multimodal intelligences: There is a growing interest in multimodal language models such as Chameleon Lu et al. (2023) and MM-React Yang et al. (2023c).
  • Visual instruction tuning: VCL Gui et al. (2022b), Mini-GPT4 Zhu et al. (2023), MPLUG-OWL Ye et al. (2023b), and LSKD Park et al. (2023c) generate image-level instruction-tuning datasets.

Figure 22: Example of the Intensive Neural Knowledge (INK) Park et al. (2022) task, which uses knowledge to recognize image-related text from a set of text candidates. Our task involves utilizing visual and textual knowledge retrieved from the Web and manually annotated knowledge.

Knowledge-intensive intelligences.

As shown in Figs. 22 and 23, knowledge-based visual question answering and visual language retrieval are challenging multimodal machine learning tasks that require external knowledge beyond the image content. Recent research on large-scale Transformers has focused primarily on maximizing the efficiency of storing information in model parameters. This line of research explores a different aspect: whether multimodal Transformers can use explicit knowledge in their decision-making process. Transformer-based pre-training methods have shown great success in implicitly learning knowledge representations across multiple modalities. However, traditional (mainly unimodal) approaches have mostly investigated knowledge retrieval followed by answer prediction, raising questions about the quality and relevance of the retrieved knowledge and about how reasoning processes that use both implicit and explicit knowledge should be integrated. To address these issues, we introduce the Knowledge Augmented Transformer (KAT), which outperforms other models by 6% on the 2022 OK-VQA open-domain multimodal task. KAT combines implicit knowledge from GPT-3 with explicit knowledge retrieved from the web using an encoder-decoder structure, and allows both types of knowledge to be reasoned over concurrently during answer generation. Moreover, incorporating explicit knowledge enhances the interpretability of model predictions. The code and pre-trained models can be found at https://github.com/guilk/KAT.
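At a very high level, the KAT idea of pairing explicit (retrieved) and implicit (LLM-generated) knowledge can be pictured as below. This is a deliberately simplified sketch with assumed helper functions, not KAT's actual encoder-decoder implementation (see the linked repository for that).

    def answer_with_knowledge(question, image_caption, retrieve_explicit, query_llm, generate):
        """Combine retrieved knowledge entries with LLM-generated tentative facts, then decode."""
        explicit = retrieve_explicit(image_caption, question)   # entries from a knowledge base
        implicit = query_llm(                                   # tentative facts from a large LM
            f"Image: {image_caption}\nQuestion: {question}\nList relevant facts:"
        )
        # The generator reasons jointly over both knowledge sources while producing the answer.
        return generate(question=question, context=explicit + [implicit])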

Figure 23: The KAT model Gui et al. (2022a) retrieves knowledge entries from an explicit knowledge base using a contrastive-learning-based module and retrieves implicit knowledge with supporting evidence using GPT-3. The integration of the two knowledge types is handled by the respective encoder Transformers, and answer generation is performed jointly by the reasoning module and the decoder Transformer with end-to-end training.

Figure 24: General architecture of the VLC model Gui et al. (2022b). Our model consists of three modules: (1) modality-specific projections, where we use a simple linear projection to embed patched images and a word-embedding layer to embed tokenized text; (2) a multimodal encoder, where we use a 12-layer ViT Dosovitskiy et al. (2021) initialized from MAE He et al. (2022) (ImageNet-1K without labels) as our backbone; (3) task-specific decoders, where we learn our multimodal representation through masked image/language modeling and image-text matching used only during pre-training. We use a 2-layer MLP to fine-tune our multimodal encoder for downstream tasks. Importantly, we find that the masked image modeling objective is important throughout the second stage of pre-training, not just for initializing the visual Transformer.

Visual language Transformer intelligences.

Next, we introduce the "Training Visual Language Transformers from Captions" (VLC) model Gui et al. (2022b), a Transformer pre-trained using only image-caption pairs. Although VLC uses only a simple linear projection layer for image embedding, it achieves competitive results on a variety of visual language tasks compared to methods that rely on object detectors or supervised CNN/ViT networks. Through extensive analysis, we explore the potential of VLC as a visual language Transformer intelligence. For example, we show that VLC's visual representations are highly effective for ImageNet-1K classification, and our visualizations confirm that VLC can accurately match image patches with corresponding text tokens. Performance scaling with additional training data highlights the great potential for developing large-scale, weakly supervised, open-domain visual language models.

6.5 Video-Language Experiments

To understand the utility of using pre-trained image Large Language Models (LLMs) for video understanding, we temporally extend and fine-tune InstructBLIP Dai et al. (2023) for video caption generation. Specifically, we extend InstructBLIP's visual encoder (EVA-CLIP-G Sun et al. (2023b)) with the same divided spatio-temporal attention scheme as Frozen in Time Bain et al. (2021), and keep the Q-Former and the large language model (Flan-T5-XL Chung et al. (2022)) frozen during training. We freeze all spatial layers of the visual encoder while keeping the temporal layers unfrozen during caption training. This allows our model to take both images and videos as input (matching InstructBLIP's image-level performance). We train on a 5-million-video-caption subset of WebVid10M Bain et al. (2021). Two example outputs are visualized in Figure 25. However, existing intelligences still fail to fully understand precise, fine-grained visual details in video content. Visual instruction fine-tuning methods suffer from similar limitations: they lack the general human-level perceptual capabilities that remain to be addressed by multimodal models and intelligences.
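A minimal PyTorch-style sketch of the freezing scheme described above is given below. It assumes temporal-attention parameters can be identified by a "temporal" substring in their names, which is a naming assumption rather than InstructBLIP's actual module layout.

    import torch

    def set_trainable_temporal_only(model: torch.nn.Module):
        """Freeze everything except temporal-attention layers inside the visual encoder.

        Assumes temporal layers contain 'temporal' in their parameter names; the Q-Former
        and the language model therefore stay frozen, as in the setup described above.
        """
        for name, param in model.named_parameters():
            param.requires_grad = ("visual_encoder" in name) and ("temporal" in name)
        return [n for n, p in model.named_parameters() if p.requires_grad]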

Figure 25: Example cues and responses when using InstructBLIP's video fine-tuning variant (the method described in Section 6.5). Our model is able to generate long text responses describing the scene and is able to answer questions related to the temporality of events in the video.

The instruction fine-tuned model shows promise in accurately summarizing the actions visible in the video, effectively recognizing actions such as "the person sitting on the bench", as shown in Figure 25. However, it sometimes adds incorrect details, such as "the person smiling at the camera", revealing a shortcoming in capturing the topic of conversation or the atmosphere of the video, elements that would be obvious to a human observer. This shortcoming highlights another key limitation: the omission of the audio and speech modalities that would enrich video understanding with context, helping to produce more accurate interpretations and prevent such misrepresentations. Bridging this gap requires a holistic integration of the available modalities, enabling multimodal intelligences to reach a level of understanding closer to human perception and ensuring a fully multimodal approach to video interpretation.

GPT-4V-based audio-video-language intelligences.

Figure 26: The audio-video-language intelligences described in Section 6.5. Hallucinated content is highlighted in red. We use GPT-4V to generate 1) video chat summaries with video frames, 2) video summaries with frame captions, and 3) video summaries with frame captions and audio information.

Figure 27: An interactive multimodal intelligence that combines visual, audio, and textual modalities for video understanding. Our pipeline mines hallucinated hard negatives to generate difficult queries for the VideoAnalytica challenge. See Section 9.2 for more details about the interactive audio-video-language intelligence dataset.

We then follow the approach outlined in Lin et al. (2023) to evaluate the ability of GPT-4V to act as a multimodal intelligence that integrates vision, audio, and speech for a detailed and accurate understanding of video. The results shown in Figure 26 compare the performance of various video intelligences on a video summarization task. The video instruction fine-tuned model (Li et al., 2023g) provides accurate content but lacks comprehensiveness and detail, leaving out specific actions such as systematically using a broom to measure the height of a tree.

To improve the accuracy of the video descriptions, we use GPT-4V to caption the frames, while the audio and its transcription come from OpenAI's Whisper model. We then prompt GPT-4V to create video summaries, first using the frame captions only, and then using the frame captions together with the audio transcription. Initially, we observed that frame-only captioning produced fictitious events, such as someone biting a stick in the third segment. These inaccuracies persisted in the video summaries, with descriptions like "in an interesting twist, he bit it while holding it horizontally". Without audio input, the intelligence cannot correct these captioning errors, resulting in descriptions that are semantically plausible but visually misleading.

However, when we provide the audio transcription to the intelligence, it can accurately describe the content and even capture detailed body movements, such as "rotate the broom perpendicular to the body and downward". This level of detail is much more informative and gives the viewer a clearer understanding of the purpose and key details of the video. These findings highlight the importance of integrating audio, video, and language interactions to develop high-quality multimodal intelligences, with GPT-4V serving as a promising foundation for such advanced multimodal understanding and interaction.
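The caption-plus-transcript summarization pipeline discussed above can be sketched as follows; caption_frame, transcribe_audio, and call_llm stand in for GPT-4V, Whisper, and a GPT-4-class model respectively, and the prompt wording is an assumption.

    def summarize_video(frames, audio_path, caption_frame, transcribe_audio, call_llm):
        """Fuse per-frame captions with the audio transcript into a single video summary."""
        captions = [f"[frame {i}] {caption_frame(f)}" for i, f in enumerate(frames)]
        transcript = transcribe_audio(audio_path)
        prompt = (
            "Summarize the video. When a frame caption conflicts with the audio transcript, "
            "prefer the transcript, since captions of isolated frames may hallucinate actions.\n"
            "Frame captions:\n" + "\n".join(captions) + "\n"
            "Audio transcript:\n" + transcript
        )
        return call_llm(prompt)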

GPT-4V-based embodied multimodal intelligences.

As shown in Figure 27, we mainly use StackOverflow to obtain initial questions, then use the "Bing search" API to retrieve videos and audio related to each question. Next, we primarily use GPT-4V to obtain relevant text information and high-level video descriptions. We use ASR to convert the audio of key frames into low-level segment descriptions. Finally, we use GPT-4V to generate convincing "hallucinations" that serve as hard negative queries for video question-answering tasks. We support interaction and question answering over the current frame of the video, as well as summarization of the overall high-level video description. During reasoning, we also incorporate external knowledge obtained through web searches to improve the answers.

The following describes the main GPT-4V prompts. The entire prompt is indented for clarity; it runs over one page.

GPT-4V is an assistant designed to provide descriptive, informative, and comprehensive details of videos for visually impaired people who can hear the video but not see it. This is done by synthesizing the given annotations and outputting them as JSON to create high-quality, dense video descriptions. Specifically, GPT-4V is provided with the original query used to search for the video, the video title, description, audio transcription, and potentially noisy descriptions at specific times in the video. Different segments of the same video are annotated as "[time start - time end (in seconds)] 'text'". A combination of transcripts and descriptions is used to infer the exact details and visual presentation of what may be occurring in the video. GPT-4V will combine or split the timestamps as needed to provide the optimal segmentation of the video.

Expectations for the GPT-4V output:

  1. Action-oriented descriptions: prioritize audio-cued plausible actions, movements, and physical demonstrations that enrich your narrative with dynamic visual cues.
  2. Complete Video Coverage: Provides a continuous and consistent audio description experience that covers every moment of the video, ensuring that no content is left out.
  3. Succinct segmentation: Build your descriptions in focused, succinct segments of 1-2 sentences each to effectively communicate the visual action without going into too much detail.
  4. Contextual audio-visual synthesis: seamlessly blending verbal audio content with inferred visual elements to form a narrative that reflects underlying screen activity.
  5. Imaginative and plausible speculation: Inject your descriptions with creative but plausible visual details that correspond to the audio to enhance scene comprehension.
  6. Accurate Timecode Correspondence: align your descriptive clips with the corresponding timecode to ensure that the presumed visual details are synchronized with the timeline of the audio narrative.
  7. Confident Narrative Delivery: confidently presenting descriptions as if the presumed visualization is happening to build confidence in the listener.
  8. Omit Unreasonable Details: exclude descriptions of objects or events that do not reasonably fit into the context established by the audio and visual information.

The final output should be structured in JSON format with a list of dictionaries, each detailing a segment of the video.


[{'start': <start time in seconds>, 'end': <end time in seconds>, 'text': '<segment description>'}, ...]

For MC creation: our task is to create multiple-choice questions for a video-to-text retrieval task that cannot be easily solved merely by looking at the title and reading through the audio transcription. To do this, we provide the original query used to obtain the video, the description, the audio transcription, and potentially noisy descriptions of specific times in the video.

  • Format of the audio transcription: - [start-end time in seconds] "transcript"
  • Format of the noisy description: - [time in seconds] "description"

We prompt GPT-4V to generate four queries, where the primary query is aligned with the video content and the other three negative queries differ subtly from the primary query. Selecting the primary query should not be possible simply by listening to the audio transcription, e.g., when the original text query is contained verbatim in the transcription. Negative queries should be closely related to, but not fully aligned with, the video content, and should require visual understanding of the video to distinguish them. For example, modify the semantics in a subtle way so that one needs to watch the video, rather than just listen, to select the original query. Compile the four queries as subtitle-like statements, with the first query being a reformulation of the original query.

Think step-by-step about how to use the information in the video to formulate the negative statements. Explain why the negative queries are incorrect yet remain compelling choices that require a careful understanding of the video, and why a human would not accidentally choose a negative query over the original query.

Finally, we present the output in the form of the following analysis and four queries. There is no need to explain how the original query was rephrased.

  • Video analysis: xxx
  • Queries: [query1, query2, query3, query4]
  • Reason: xxx

6.6 Intelligences for Natural Language Processing (NLP)

6.6.1 Large Language Model (LLM) Intelligences

Recognizing and acting on task instructions has been a fundamental challenge in interactive AI and natural language processing for decades. With recent advances in deep learning, there is growing interest in studying these areas together to improve human-computer collaboration. We identify three specific directions, among others, for improving language-based intelligences:

  • Tool use and querying from the knowledge base. This direction emphasizes the importance of integrating external knowledge bases, web searches, or other useful tools into the reasoning process of AI intelligences. By leveraging structured and unstructured data from a variety of sources, intelligences can enhance their understanding and provide more accurate and contextually relevant responses. Additionally, it fosters the ability of intelligences to proactively seek information in the face of unfamiliar situations or queries, thus ensuring more comprehensive and informed responses. Examples include Toolformer Schick et al. (2023) and Retrieve What You Need Wang et al. (2023g).
  • Improved Intelligent Body Reasoning and Planning. Enhancing the reasoning and planning capabilities of intelligences is essential for effective human-computer collaboration. This involves developing models that can understand complex commands, infer user intent, and predict potential future scenarios. This can be accomplished by asking intelligences to reflect on past actions and failures, as in ReAct Yao et al. (2023a), or by structuring the intelligences' thought processes as a form of search Yao et al. (2023b); a schematic of such a reasoning loop is sketched after this list. By modeling different outcomes and evaluating the consequences of various actions, intelligences can make more informed, contextually relevant decisions.
  • Integrating system and human feedback. AI intelligences can typically operate in two main settings: one that provides explicit signals about the effectiveness of their actions (system feedback), and another in which they work with humans who can provide verbal criticism (human feedback). This direction emphasizes the need for adaptive learning mechanisms that allow intelligences to improve their strategies and correct errors, for example, as in AutoGen Wu et al. (2023). The ability to continuously learn and adapt from a variety of feedback sources ensures that intelligences remain helpful and aligned with user needs.
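A schematic thought-action-observation loop in the spirit of ReAct is sketched below. The action syntax, the tool registry, and the stopping convention are assumptions made for illustration, not the ReAct authors' implementation.

    def react_loop(task, call_llm, tools, max_steps=8):
        """Alternate LLM reasoning steps with tool calls until the model emits FINISH[answer]."""
        transcript = f"Task: {task}\n"
        for _ in range(max_steps):
            step = call_llm(transcript +
                            "Write a Thought, then either Action: tool[input] or FINISH[answer].")
            transcript += step + "\n"
            if "FINISH[" in step:
                return step.split("FINISH[", 1)[1].split("]", 1)[0]
            for name, fn in tools.items():            # e.g. {"search": web_search}
                if f"{name}[" in step:
                    arg = step.split(f"{name}[", 1)[1].split("]", 1)[0]
                    transcript += f"Observation: {fn(arg)}\n"
                    break
        return None  # no answer within the step budget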

6.6.2 Generic Large Language Model (LLM) Intelligences

Recognizing and understanding intelligent-body content and natural language has been a fundamental challenge in interactive AI and natural language processing for decades. With recent advances in deep learning, there is increasing interest in studying these two areas together to gain a deeper understanding of intelligent-body planning and of human feedback for knowledge reasoning and natural language generation. These are key components of many human-computer interaction intelligences, such as "AutoGen" Wu et al. (2023) and "Retrieve What You Need" Wang et al. (2023g).

6.6.3 Instruction-following Large Language Model (LLM) intelligences

Furthermore, creating LLM intelligences that can be trained to effectively follow human instructions has become an important area of research. Initial models used human feedback to train a reward model that mimics human preferences, through a process called Reinforcement Learning from Human Feedback (RLHF) Ouyang et al. (2022); this process produced models such as InstructGPT and ChatGPT. To train instruction-following LLM intelligences more efficiently without human labeling, researchers have developed instruction fine-tuning methods that train LLM intelligences directly on instruction/response pairs, either human-generated, e.g., Dolly 2.0 (see the Dolly 2.0 blog post), or automatically derived from LLMs, e.g., Alpaca Taori et al. (2023). We show Alpaca's overall training process in Figure 28.
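For reference, instruction/response pairs are usually serialized into a single training string before supervised fine-tuning. The template below is a common convention and an assumption on our part; the exact Alpaca template is documented in its repository.

    def format_example(example: dict) -> str:
        """Serialize one {'instruction', 'input', 'output'} record for supervised fine-tuning."""
        header = "Below is an instruction that describes a task."
        if example.get("input"):
            return (f"{header}\n\n### Instruction:\n{example['instruction']}\n\n"
                    f"### Input:\n{example['input']}\n\n### Response:\n{example['output']}")
        return (f"{header}\n\n### Instruction:\n{example['instruction']}\n\n"
                f"### Response:\n{example['output']}")

    # Fine-tuning then minimizes the language-modeling loss on the response portion of each string.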

Figure 28: Training scheme used for the Alpaca model Taori et al. (2023). At a high level, an existing LLM is used to generate a large number of instruction-following examples from a small set of seed tasks. The generated instruction-following examples are then used to instruction-fine-tune an LLM whose base model weights are available.

Figure 29: The Logic Transformer intelligence model Wang et al. (2023e). We integrate a logical reasoning module into the Transformer-based abstractive summarization model to give the logical intelligences the ability to reason over text and dialog logic, enabling them to generate higher-quality abstractive summaries with fewer factual errors.

6.6.4 Experiments and results

Despite the growing popularity of conversational and self-feedback systems, these forms of AI still perform poorly at generating factually correct responses from their own implicit knowledge, and therefore often rely on external tools such as web search and knowledge retrieval mechanisms to augment their responses during reasoning. Solving this problem would help create more engaging experiences for users in many real-life applications. In social conversations (e.g., on social media platforms such as Instagram and Facebook), or on question-and-answer sites (e.g., Ask or Quora), people typically interact with others through a series of comments and by searching the web for information and knowledge relevant to the discussion. Therefore, the task of generating conversation turns in this context is not guided solely by traditional NLP models and tasks, but uses intelligences to generate dialog through behaviors that reflect knowledge search and acquisition Peng et al. (2023). In this way, intelligences for NLP tasks extend the task description and improve the interpretability of responses by adding explicit knowledge search and retrieval steps during the dialog. Integrating such web-search and retrieval intelligences as feedback during conversations helps facilitate deeper social interactions between humans and intelligences Wang et al. (2023e). As shown in Fig. 29, we introduce a new modeling paradigm for Transformer language models that detects and extracts important logical structures and information from the input text, and then integrates them into the input embeddings through carefully designed multi-layer hierarchical logical projections, injecting the logical structures into the pre-trained language model as an NLP intelligence. Wang et al. (2023e) propose a novel approach that constructs logic-aware input embeddings for the Transformer language model through a combination of logic detection, logic mapping, and hierarchical logical projections, so that any existing Transformer language model can be upgraded into a Logic Transformer to consistently improve its performance. The proposed Logic Transformer intelligences consistently achieve better performance than their baseline Transformer models through a deeper understanding of the logical structure of text. For human users, these aspects are often what matters most for meaningful and interesting conversations, achieved through the coordination between dialog and information retrieval. Delving deeper into natural language processing, this topic covers advances and key challenges in making LLMs more intelligent and better suited to a variety of language-centered tasks.

Open-domain question answering (QA) systems typically follow a retrieve-then-read paradigm, in which a retriever fetches relevant passages from a large corpus and a reader then generates answers based on the retrieved passages and the original question. In Wang et al. (2023g), we propose a simple and novel mutual learning framework that improves the performance of retrieve-then-read models through an intermediate intelligence module called the knowledge selector, which we train with reinforcement learning. The fine-grained knowledge selector is incorporated into the retrieve-then-read paradigm with the goal of constructing a small subset of passages that retain information relevant to the question. As shown in Figure 30, the knowledge selector intelligence is trained as a component of our novel mutual learning framework, which iteratively trains the knowledge selector and the reader. We take a simple and novel approach that uses policy gradients to optimize the knowledge selector intelligence, using feedback from the reader to train it to select a small set of information-rich passages. This approach avoids brute-force search or manually designed heuristics, without requiring any annotated query-document pairs for supervision. We show that iteratively training the reader and the knowledge selector intelligence leads to better prediction performance on a number of public open-domain QA benchmarks.
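A toy REINFORCE-style update for such a knowledge selector is sketched below; the tensor shapes, reward definition, and sampling scheme are assumptions rather than the paper's exact formulation.

    import torch

    def selector_policy_gradient_step(selector_logits, selected_idx, reader_reward, optimizer):
        """One REINFORCE-style update rewarding passage subsets that help the reader.

        selector_logits: tensor of shape (num_passages,) for one question
        selected_idx:    LongTensor with the indices of the k sampled passages
        reader_reward:   scalar, e.g. 1.0 if the reader answered correctly from this subset
        """
        log_probs = torch.log_softmax(selector_logits, dim=-1)[selected_idx].sum()
        loss = -reader_reward * log_probs          # gradient ascent on expected reward
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()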

Fig. 30: Architecture of the proposed mutual learning framework for NLP intelligences Wang et al. (2023g). In each epoch, phase 1 and phase 2 are executed alternately. During phase 1, the parameters of the reader model are kept fixed and only the weights of the knowledge selector are updated. Conversely, during phase 2, the parameters of the reader model are tuned while the weights of the knowledge selector are kept frozen.

7 Agent AI Across Modalities, Domains, and Realities

7.1 Agent for cross-modal understanding

Multimodal understanding is a major challenge in creating a general-purpose AI agent because of the lack of large-scale datasets containing vision, language, and agent behavior. More generally, the training data for an AI agent is usually modality-specific. This leads most modern multimodal systems to use combinations of frozen submodules. Notable examples are Flamingo Alayrac et al. (2022), BLIP-2 Li et al. (2023c), and LLaVA Liu et al. (2023c), all of which use a frozen Large Language Model (LLM) and a frozen visual encoder. These submodules are trained separately on separate datasets, and an adaptation layer is then trained to map the visual encoder's outputs into the LLM embedding space. For further progress in cross-modal understanding for AI agents, the strategy of using frozen LLMs and visual encoders may need to change. Indeed, the recent visual-language model RT-2 exhibits significantly better performance on manipulation tasks within the robotics domain when the visual encoder and the Large Language Model (LLM) are jointly fine-tuned on robotics and visual-language tasks Brohan et al. (2023).
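The frozen-submodule pattern is essentially a trainable projection between two frozen models. A minimal PyTorch sketch is given below; the feature dimensions and the assumption that the encoder returns a (batch, tokens, dim) tensor are placeholders, not the configuration of any cited system.

    import torch
    import torch.nn as nn

    class VisualAdapter(nn.Module):
        """Trainable projection from a frozen vision encoder into a frozen LLM's embedding space."""
        def __init__(self, vision_encoder: nn.Module, vis_dim: int = 1024, llm_dim: int = 4096):
            super().__init__()
            self.vision_encoder = vision_encoder.eval()
            for p in self.vision_encoder.parameters():
                p.requires_grad = False                 # keep the visual encoder frozen
            self.proj = nn.Linear(vis_dim, llm_dim)     # the only trainable component here

        def forward(self, images: torch.Tensor) -> torch.Tensor:
            with torch.no_grad():
                feats = self.vision_encoder(images)     # assumed shape: (batch, tokens, vis_dim)
            return self.proj(feats)                     # soft visual tokens for the frozen LLM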

7.2 Agent for cross-domain understanding

A key challenge in creating a generalized agent is the unique visual appearance and different action spaces in different domains. Humans are capable of interpreting images and videos from a variety of sources, including the real world, video games, and specialized domains such as robotics and healthcare, once they are familiar with the specifics of these domains. However, existing Large Language Models (LLMs) and Visual Language Models (VLMs) often exhibit significant discrepancies between the data they are trained on and the various domains in which they are applied. Notably, training agent models to predict specific actions poses considerable challenges when trying to develop a single strategy that can effectively learn multiple control systems across domains. In general, the approach taken in most modern works when applying systems within a specific domain is to start with a pre-trained base model and then fine-tune a separate model for each specific domain. This fails to capture any commonalities between domains and results in a smaller total set of data for training rather than utilizing data from each domain.

7.3 Interactive agents for cross-modality and cross-reality

Developing AI agents that can successfully understand and perform tasks across different realities is an ongoing challenge, with some recent success in image and scene generation Huang et al. (2023a). In particular, it is difficult for agents to understand both real-world and virtual-reality environments, because the two differ visually and have independent physical properties. In the cross-reality setting, simulation-to-reality transfer is a particularly important issue when simulation-trained policies are applied to real-world data, as we discuss in the next section.

7.4 Simulation-to-Reality Transfer

Simulation-to-reality transfer refers to techniques that enable models trained in simulation to be deployed in the real world. Embodied agents, especially those based on Reinforcement Learning (RL) policies, are often trained in simulated environments. These simulations do not exactly replicate real-world characteristics (e.g., disturbances, lighting, gravity, and other physical properties). Because of this gap between simulation and reality, models trained in simulation often struggle to perform well when applied to the real world. This is known as the "sim-to-real" problem. Several approaches can be taken to address it:

  • Domain randomization: Domain randomization is a technique that trains a model while randomly varying parameters of the simulated environment (e.g., object appearance, sensor noise, and optical properties) so that it can cope with real-world uncertainty and variation Tobin et al. (2017); a minimal sketch is given after this list. For example, when training Reinforcement Learning (RL)-based grasping skills, introducing randomness in object shape can enable the policy to adapt to objects with slightly different shapes Saito et al. (2022).
  • Domain adaptation: Domain adaptation, or domain transfer, bridges the gap between the simulated and real-world domains by training a model on a large number of simulated images together with a small number of real-world images. In practical settings, unpaired image-to-image translation methods such as CycleGAN Zhu et al. (2017b) are used, because paired images across domains are difficult to prepare. Several enhanced versions exist for reinforcement learning, such as RL-CycleGAN Rao et al. (2020), and for imitation learning, such as RetinaGAN Ho et al. (2021).
  • Improved simulation: Realistic simulation is key to simulation-to-reality transfer. Part of this is achieved through system identification techniques Zhu et al. (2017c); Allevato et al. (2020), which aim to identify simulation parameters that match the real-world environment. Furthermore, photo-realistic simulators are effective for image-based reinforcement learning Martinez-Gonzalez et al. (2020); Müller et al. (2018); Shah et al. (2018); Sasabuchi et al. (2023).
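The following is a minimal sketch of the domain randomization idea from the first bullet: before each training episode, a fresh set of simulator parameters is sampled so that the learned policy does not overfit to any single configuration. The parameter names and their ranges are illustrative assumptions, not values from any specific simulator.

```python
import random
from dataclasses import dataclass

@dataclass
class SimParams:
    light_intensity: float = 1.0
    object_scale: float = 1.0
    sensor_noise_std: float = 0.0
    friction: float = 0.8

def randomize(rng: random.Random) -> SimParams:
    """Sample a fresh set of simulator parameters for each training episode."""
    return SimParams(
        light_intensity=rng.uniform(0.3, 1.5),    # lighting conditions
        object_scale=rng.uniform(0.9, 1.1),       # slight variation in object shape/size
        sensor_noise_std=rng.uniform(0.0, 0.05),  # sensor noise
        friction=rng.uniform(0.5, 1.2),           # physical properties
    )

rng = random.Random(0)
for episode in range(3):
    params = randomize(rng)
    # A real pipeline would reset the simulator with `params` and collect a rollout here.
    print(episode, params)
```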

As methods continue to evolve, simulation-to-reality transfer remains a central challenge in embodied agent research. Further theoretical and empirical work is essential to advance these techniques.

8 Continuous and Self-Improvement of Agent AI

Currently, AI agents built on base models can learn from many different data sources, which allows for more flexible training data. Two key consequences are that (1) user and human interaction data can be used to further improve the agent, and (2) existing base models and model artifacts can be used to generate training data. We discuss each of these in more detail in the following sections, but we note that because current AI agents are mostly tied to existing pre-trained base models, they typically do not learn from ongoing interaction with their environment. We believe this is an exciting future direction; preliminary work by Bousmalis et al. (2023) suggests that self-improving agents for robot control can continuously learn and improve through environmental interaction without the need for supervision.

8.1 Human-based interaction data

The core idea of using human-based interaction data is to leverage a large number of agent-human interactions to train and improve future iterations of the agent. There are several strategies that can be used to improve an agent through agent-human interaction.

  • Additional training data: Perhaps the simplest use of agent-human interactions is to treat the interaction examples themselves as training data for future iterations of the agent. This typically requires a filtering strategy to distinguish successful agent interactions from unsuccessful ones. Filtering can be rule-based (e.g., reaching some desired goal state), model-based (e.g., classifying successful vs. unsuccessful interactions), or done manually after post-hoc inspection and/or modification of the interaction examples.
  • Human preference learning: During interaction with the user, the agent system can present several candidate model outputs and let the user choose the best one. This is commonly used for Large Language Models (LLMs) such as ChatGPT and GPT-4, where the user can select the output (out of several) that best matches their preferences; a sketch of a pairwise preference objective is given after this list.
  • Safety training (red teaming): Red teaming in the context of Agent AI refers to having a dedicated adversarial team (human or computational) attempt to exploit and expose weaknesses and vulnerabilities in an Agent AI system. Although adversarial in nature, red teaming is commonly used to understand how to improve AI safety measures and reduce harmful outputs. The core principle is to discover reliable ways of inducing unwanted agent outputs so that the model can be trained on data that explicitly corrects this behavior.
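As referenced in the human preference learning bullet, the sketch below shows one common way (not necessarily the approach used by any specific system) to turn pairwise preference choices into a training signal: a reward model is fit with a Bradley-Terry style loss so that preferred outputs score higher than rejected ones. The embedding dimension and placeholder tensors are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Scores an (input, response) embedding; higher means more preferred."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

def preference_loss(rm: RewardModel, chosen: torch.Tensor, rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style objective: the chosen output should out-score the rejected one.
    return -F.logsigmoid(rm(chosen) - rm(rejected)).mean()

rm = RewardModel()
opt = torch.optim.Adam(rm.parameters(), lr=1e-4)
chosen, rejected = torch.randn(8, 768), torch.randn(8, 768)  # placeholder embeddings
loss = preference_loss(rm, chosen, rejected)
opt.zero_grad()
loss.backward()
opt.step()
```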

8.2 Data generated by the base model

As academia and industry have produced powerful base model artifacts, methods have been developed to extract and generate meaningful training data from these artifacts using a variety of prompting and data-pairing techniques.

  • Large Language Model (LLM) instruction tuning: Methods for generating instruction-following training data from Large Language Models (LLMs) have made it possible to fine-tune smaller open-source models on the outputs of larger proprietary LLMs Wang et al. (2022b). For example, Alpaca Taori et al. (2023) and Vicuna Zheng et al. (2023) are based on the open-source LLaMA family of LLMs Touvron et al. (2023) and are tuned on various outputs from ChatGPT and human participants; a sketch of this kind of teacher-to-student data generation is given after this list. This instruction-tuning approach can be viewed as a form of knowledge distillation in which the larger LLM acts as a teacher for the smaller student model. Importantly, although instruction tuning has been shown to transfer the teacher model's writing style and some of its instruction-following abilities to the student model, a significant gap remains between the factuality and capability of the teacher and the student Gudibande et al. (2023).
  • Visual-language pairs: Some recent work increases the amount and diversity of pre-training data available for Visual Language Models (VLMs) by automatically generating captions and other text for visual content. For example, LLaVA Liu et al. (2023c) uses 150,000 examples of instruction-following behavior over textual and visual inputs, primarily generated by an LLM. Other work shows that re-captioning images with VLMs can improve the quality of training data and of the image generation models subsequently trained on it Segalis et al. In video understanding, re-captioning videos with VLMs and LLMs has been shown to improve the performance and quality of VLMs subsequently trained on the re-captioned videos Wang et al. (2023f); Zhao et al.
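As referenced in the instruction-tuning bullet, the sketch below illustrates the general teacher-to-student data generation recipe, not the actual Alpaca or Vicuna pipeline: a proprietary teacher model is prompted with seed instructions and its responses are saved as instruction-response pairs for fine-tuning a smaller open-source model. It assumes an OpenAI-style chat API with the API key set in the environment; the seed tasks, model choice, and file name are made up.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

seed_instructions = [
    "Explain photosynthesis to a child.",
    "Write a haiku about robots.",
]  # made-up seed tasks

pairs = []
for instruction in seed_instructions:
    resp = client.chat.completions.create(
        model="gpt-4",  # teacher model; the actual choice is an assumption
        messages=[{"role": "user", "content": instruction}],
    )
    pairs.append({"instruction": instruction,
                  "output": resp.choices[0].message.content})

# The collected pairs would then be used to fine-tune a smaller open-source student model.
with open("distilled_instructions.json", "w") as f:
    json.dump(pairs, f, indent=2)
```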

9 Agent datasets and leaderboards

To accelerate research in this area, we propose two benchmarks: one for multi-agent gaming and one for agent-based visual-language tasks. We will release two new datasets, "CuisineWorld" and "VideoAnalytica", together with a set of baseline models, and we encourage participants to explore new models and systems and to submit their results to the test set on our leaderboard.

9.1 The "CuisineWorld" dataset for multi-agent games

CuisineWorld is a text-based game similar to Overcooked! It provides a platform for AI-driven agents to cooperate and play together. This dataset will test the collaborative efficiency of multi-agent systems, providing insight into the ability of the Large Language Model (LLM) and other systems to work together in dynamic scenarios. In particular, the dataset will focus on how agents understand goals and how agents coordinate with each other. This dataset supports two modes: a centralized scheduler mode and a decentralized mode. Participants can choose a game mode and submit their results to our leaderboard.

9.1.1 Benchmarks

For our competition, we will publish a benchmark, the CuisineWorld benchmark, which consists of a textual interface that includes an extensible task definition file, and an interface for multi-agent interaction and human-computer interaction. We introduce game interaction tasks whose goal is to generate relevant and appropriate multi-agent collaboration strategies that maximize collaboration efficiency. We evaluate the collaboration efficiency using the proposed evaluation metric: CoS.

The "CuisineWorld" dataset was collected by Microsoft, UCLA, and Stanford University. The goal of the competition is to explore how different, existing, and novel interaction techniques based on the underlying Large Language Model (LLM) operate on this benchmark, and to establish a robust baseline for the task of multi-agent gaming infrastructure.

CuisineWorld's dataset includes:

  • A well-defined set of multi-agent collaboration tasks.
  • An API system for facilitating agent interactions.
  • An automated assessment system.

(The link for downloading the dataset will be provided soon and included here.)

9.1.2 Tasks

  • We provide a dataset and associated benchmarks, called Microsoft MindAgent, and accordingly release a dataset "CuisineWorld" to the research community.
  • We will provide benchmarks to evaluate and rank the submitted "MindAgent" algorithms. We will also provide baseline results generated using popular infrastructure.

9.1.3 Metrics and judging

The quality of multi-agent collaboration is measured by the new "CoS" automated metric (from MindAgent Gong et al. (2023a)). The final score is computed as the average of the collaboration-efficiency metric evaluated for the multi-agent system across all tasks. Human evaluators will also be asked to rate individual responses and provide subjective judgments of the engagement, breadth, and overall quality of the user-agent interactions.

9.1.4 Evaluation

  • Automated evaluation: We plan to start publishing the leaderboard on the release date (TBD), after which registered participants will be asked to submit their results on tasks related to the "CuisineWorld" dataset (the dataset we are publicly releasing for the leaderboard). Submissions will close on the end date (TBD). Each team will be required to submit the results they generated on the test set, on which the "CoS" metric is evaluated automatically.
  • Manual evaluation on our leaderboard: Leaderboard participants will need to provide a submission file generated by a local evaluation script. We will use the EvalAI system to check the submission file, and we reserve the option to re-run the code of the top challenge contenders. Teams must therefore also submit their code with a self-describing file on how to run it. The manual evaluation will be performed by the organizing team.
  • Winners announcement: We will announce the winners and post the final ratings of the submissions on our leaderboard.

9.2 Audio-video-language pre-training dataset.

We introduce VideoAnalytica, a new benchmark for analytical video comprehension. VideoAnalytica focuses on using video demonstrations as an aid to better understand the complex, high-level reasoning embedded in long-form instructional videos. The goal is to assess the cognitive reasoning capabilities of video-language models, pushing them beyond mere recognition tasks and basic comprehension toward a more complex and nuanced understanding of video. Critically, VideoAnalytica emphasizes the integration of multiple modalities (audio, video, and language) and the ability of models to apply domain-specific knowledge to understand and interpret the information presented in the video. Specifically, VideoAnalytica addresses two main tasks:

  1. Video text retrieval: This task involves accurately retrieving relevant text for instructional videos. The challenge is to distinguish relevant from irrelevant information, which requires a deep understanding of the video content in order to retrieve the correct text for a query. To further increase the difficulty, we add hard negative examples generated by a Large Language Model (LLM) to the dataset; we manually validate the generated negatives and remove instances that would make the task invalid or unfair (e.g., negatives that are actually valid answers). A sketch of a standard retrieval metric (recall@k) is given after this list.
  2. Video-assisted informative question answering: This task requires the model to answer questions based on information extracted from the video. The focus is on complex questions that require analytical reasoning and a thorough understanding of the video demonstration.
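As referenced in the video text retrieval task above, retrieval quality is commonly reported with recall@k; the sketch below shows one way such a metric could be computed from a video-by-candidate similarity matrix. The assumption that the correct text for video i sits at column i is purely for illustration and is not a statement about how VideoAnalytica is scored.

```python
import numpy as np

def recall_at_k(sim_matrix: np.ndarray, k: int = 5) -> float:
    """sim_matrix[i, j] = similarity between video i and candidate text j.
    For this sketch, the correct text for video i is assumed to sit at column i."""
    ranks = (-sim_matrix).argsort(axis=1)          # best candidates first
    correct = np.arange(sim_matrix.shape[0])[:, None]
    hits = (ranks[:, :k] == correct).any(axis=1)
    return float(hits.mean())

print(recall_at_k(np.random.rand(100, 100), k=5))  # random baseline, roughly 0.05
```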

To facilitate the development of audio-video-language agents for analytical video understanding, we introduce a benchmark leaderboard for the two VideoAnalytica tasks.

  • Leaderboard participants will be required to submit their solutions for evaluation. The evaluation will be based on the performance of the model on two tasks and the results will be displayed on the leaderboard. Participants will be required to submit their code, along with a detailed explanation of their approach and methodology.
  • Ethical considerations: The challenge focuses on understanding and interpreting video content, which could be used in surveillance or other privacy-invasive applications. It is therefore critical to consider the ethical implications and potential misuse of the technology. We encourage participants to consider these aspects in their submissions and to promote the ethical use of AI.

10 Broader impact statement

This paper and our associated forum (https://multimodalagentai.github.io) aim to be a catalyst for innovative research, fostering collaboration and driving the next wave of AI adoption. By focusing on multimodal agents, we emphasize the future direction of human-computer interaction, the associated challenges, and possible solutions. We detail three ways in which we make a significant contribution to the wider community.

First and foremost, we hope that our forum will lead AI researchers to develop solutions driven by real-world problems such as gaming, robotics, healthcare, and long-form video comprehension. Specifically, the development of multimodal intelligences in games could transform the gaming industry by leading to more immersive and personalized gaming experiences. In robotics, the development of adaptive robotic systems could revolutionize industries from manufacturing to agriculture, potentially solving labor shortages and increasing efficiency. In healthcare, the use of Large Language Models (LLMs) and Visual Language Models (VLMs) as diagnostic intelligences or patient care assistants could lead to more accurate diagnoses, improved patient care, and increased access to healthcare services, especially in underserved areas. In addition, the ability of these models to interpret lengthy videos can have a wide range of applications, from enhancing online learning to improving technical support services. Overall, the topics covered in our forums will have significant downstream impacts on industries and humans on a global scale.

Second, we hope that our forum will be a valuable resource for AI practitioners and researchers, serving as a platform for exploring and understanding the varied and complex challenges posed by deploying AI agents in different environments and contexts. Such exploration includes, for example, understanding the specific limitations and potential hazards of Agent AI systems developed for particular domains such as healthcare diagnostics. In that domain, issues such as dangerous hallucinations in AI behavior can pose significant risks, highlighting the critical importance of careful design and testing. However, these specific challenges may be less relevant or obvious for AI agents built for the gaming industry. In such entertainment domains, developers may prioritize different obstacles, such as the need for the AI to perform more open-ended generation and exhibit creativity, adapting dynamically to unpredictable game scenarios and player interactions. By attending the forum, participants will gain insight into how these different environments dictate the focus and direction of AI development, and how best to tailor AI solutions to meet these different needs and overcome the relevant challenges.

Third, the various elements of our event, including expert presentations, informative posters, and the winners of our two leaderboards, are designed to provide a substantive yet concise overview of the latest significant trends, research directions, and innovative concepts in the field of multimodal agent AI. These presentations will summarize key findings and developments, highlighting new systems, ideas, and technologies in the field. This collection of knowledge is not only beneficial to forum attendees who wish to deepen their understanding and expertise, but also serves as a dynamic and rich repository of resources. Visitors to our forum website can use this repository to discover and understand the cutting-edge advances and innovative ideas driving the future of multimodal agent AI. We strive to be a useful source of knowledge for both newcomers and veterans in the field. By engaging with these resources, we hope that participants and online visitors alike will learn about the transformative changes and novel approaches shaping the exciting landscape of multimodal agent AI.

11 Ethical considerations

Multimodal AI systems have many applications. Beyond interactive AI, foundational multimodal models can help drive content generation for robots and AI agents, and assist productivity applications that help with replaying, paraphrasing, action prediction, or synthesizing 3D or 2D scenes. Fundamental advances in Agent AI help achieve these goals, and many applications would benefit from a better understanding of how to model embodiment and empathy in simulated or real-world settings. Arguably, many of these applications are likely to have positive benefits.

However, this technology could also be exploited by bad actors. Agent AI systems that generate content can be used to manipulate or deceive people. It is therefore important that this technology be developed in accordance with responsible AI guidelines. For example, users should be clearly informed when content is generated by an AI system, and should be given controls to customize such systems. It is also possible that Agent AI could be used to develop new ways of detecting manipulative content, in part because it builds on the rich hallucination-prone generative capabilities of large base models, and could thus help address another real-world problem.

For example: (1) It is critical to deploy Large Language Model (LLM) and Visual Language Model (VLM) agents ethically in health-related topics, especially in sensitive areas such as healthcare. AI agents trained on biased data may exacerbate health disparities by providing inaccurate diagnoses for vulnerable populations. In addition, the handling of sensitive patient data by AI agents raises serious privacy and confidentiality concerns. (2) In the gaming industry, AI agents may change the role of developers, shifting their focus from scripting non-player characters to improving the agents' learning processes. Similarly, adaptive robotic systems may redefine roles in manufacturing, requiring new skill sets rather than replacing human workers. Responding responsibly to these shifts is critical to minimizing potential socioeconomic disruption.

In addition, Agent AI focuses on learning collaborative policies in simulation, and directly applying those policies to the real world is risky due to distribution shift. Robust testing and continuous safety monitoring mechanisms should be established to minimize the risk of unpredictable behavior in real-world scenarios. Our "VideoAnalytica" dataset was collected from the Internet and, given that this is not a fully representative source, we have completed ethical reviews and legal processes with Microsoft and the University of Washington. Even so, we need to be aware of possible biases in the corpus. Data distributions can be characterized in many ways. In this workshop, we have captured how the distribution of agent levels in our dataset differs from that of other existing datasets. However, there is much more that could be covered in a dataset or workshop. We believe that more methods and discussion are needed around the actual tasks and topics, enabled by making these data and systems available.

We will dedicate a portion of our project to discussing these ethical questions, exploring potential mitigation strategies, and deploying responsible multimodal AI intelligences. With this paper, we hope to help more researchers work together to answer these questions.

12 Diversity statement

By investigating the adaptability of AI agent models across various domains, we naturally embrace a wide variety of challenges, perspectives, and solutions. In this spirit, our program aims to build a diverse community by exploring a wide range of topics in multimodal and agentic AI.

With these principles in mind, this project focuses on advanced multimodal systems that interact effectively in physical and virtual environments and facilitate effective interaction with humans. As such, we intend to attract experts and practitioners from a wide range of technical specialties, cultures, countries, and academic fields to discuss important topics, including but not limited to:

  • Application of the base model: development of intelligences with integrated modalities (audio, image, text, sensor inputs) aimed at enhancing their ability to recognize and respond to various applications.
  • Generic end-to-end systems: developing end-to-end models trained on large-scale data, with the aim of creating general-purpose and adaptable AI solutions.
  • Principle of modal grounding: integrating information from various modalities to enhance the coherence and effectiveness of data processing.
  • Intuitive human-computer interfaces: developing effective and meaningful interactions between humans and intelligences.
  • Taming Large Language Models (LLMs)/Visual Language Models (VLMs): exploring new ways to address common problems in large-scale models, such as hallucinations and biases in their outputs.

We aspire to broaden our collective understanding of the potential and limitations of agent AI by leveraging these unique and diverse perspectives. We strongly believe that this approach will not only enrich individual perspectives, but also enhance the community's collective knowledge and foster a more inclusive view of the broader challenges facing multimodal agent AI.

Appendix A GPT-4V Agent Prompt Details

We use GPT-4V's default system prompts unless otherwise noted. We show in detail the process of prompting GPT-4V for a Minecraft game, and how we generated GPT-4V's responses in Figure 31.

Figure 31: We provide code snippets showing the complete prompting process for calling GPT-4V on Minecraft videos. First, we encode the frames of the video, and then we call the GPT-4V endpoint. The model's response is shown at the bottom of the figure.
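For readers without access to Figure 31, the sketch below illustrates the general shape of such a call, not our exact code: video frames are base64-encoded and passed as image content to a vision-capable chat completion endpoint. The OpenAI-style client, the model name, and the frame file names are assumptions that may differ from the deployment used in the figure.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

def encode_frame(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

frame_paths = ["frame_000.jpg", "frame_001.jpg", "frame_002.jpg"]  # placeholder files
content = [{"type": "text",
            "text": "Describe the player's actions in these Minecraft frames."}]
for path in frame_paths:
    content.append({"type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{encode_frame(path)}"}})

resp = client.chat.completions.create(
    model="gpt-4-vision-preview",   # vision-capable model; the exact name is an assumption
    messages=[{"role": "user", "content": content}],
    max_tokens=512,
)
print(resp.choices[0].message.content)
```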

Appendix B GPT-4V for Bleeding Edge

Bleeding Edge is a third-person team-based combat game in which players attempt to capture objective points or collect more resources than the enemy team. We show examples of inputs and outputs when prompting GPT-4V with Bleeding Edge gameplay in Figure 32. Compared to Minecraft, we qualitatively found that GPT-4V has a less thorough understanding of the visual content and game rules. This may be due to (1) the large amount of Minecraft data in GPT-4V's training data, and (2) the higher visual complexity of Bleeding Edge compared to Minecraft.

Figure 32: GPT-4V is capable of generating meaningful content for games like Bleeding Edge, which have a third-person perspective and visually complex scenes. In order to feed a large number of frames (48) into GPT-4V, we arranged the frames into a grid and superimposed the frame number on each frame (as shown above).
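A grid like the one described in the caption could be assembled with a few lines of image processing; the sketch below (an illustration, not our exact preprocessing code) tiles frames into a fixed-column grid and overlays each frame's index using PIL. The thumbnail size, column count, and file names are assumptions.

```python
from PIL import Image, ImageDraw

def make_frame_grid(frame_paths, cols=8, thumb=(160, 90)):
    """Tile frames into a single grid image and overlay each frame's index (sketch)."""
    rows = (len(frame_paths) + cols - 1) // cols
    grid = Image.new("RGB", (cols * thumb[0], rows * thumb[1]))
    for i, path in enumerate(frame_paths):
        tile = Image.open(path).convert("RGB").resize(thumb)
        ImageDraw.Draw(tile).text((5, 5), str(i), fill="white")   # frame number overlay
        grid.paste(tile, ((i % cols) * thumb[0], (i // cols) * thumb[1]))
    return grid

# e.g., make_frame_grid([f"frame_{i:03d}.jpg" for i in range(48)]).save("frame_grid.jpg")
```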

Appendix C GPT-4V for Microsoft Flight Simulator

Figure 33: GPT-4V can generate meaningful content descriptions for Microsoft Flight Simulator. While the player pilots the aircraft in the flight simulator, the GPT-4V agent can provide a high-level description of the action, shown through the cockpit view and the external view of the aircraft, as the player manages the various flight controls and gauges to maintain proper airspeed and altitude while navigating the virtual airspace. In order to input a long video, we select a number of keyframes (6 frames) to send to GPT-4V. We input each keyframe individually, without using a grid (as shown above).

As shown in Figure 33, the GPT-4V-based agent provides a high-level description of the actions of the player in Microsoft Flight Simulator. It describes how the player, through the cockpit view and the external view of the aircraft, is flying the aircraft, managing the various flight controls and instruments to maintain proper airspeed and altitude, while navigating through the virtual airspace.

Appendix D GPT-4V for Assassin's Creed: Odyssey

Figure 34: GPT-4V is able to generate meaningful content descriptions for Assassin's Creed: Odyssey. The GPT-4V-based agent can also provide high-level action descriptions, e.g., "the character in the image is engaging in nighttime combat in a historical campaign, wielding a spear and utilizing special abilities such as 'bull rush', fighting multiple enemies with level indicators, while managing their health and ability values in a third-person action role-playing game". In order to input a long video, we select a number of keyframes (8 frames) to send to GPT-4V. We input each keyframe individually, without using a grid (as shown above).

As shown in Figure 34, the GPT-4V agent provides a high-level action description of the character in the image: the character is engaging in nighttime combat in a historical campaign, wielding a spear and utilizing special abilities such as "bull rush", fighting multiple enemies with level indicators, and managing their health and ability values in a third-person action role-playing game. GPT-4V also describes the player engaging in a large-scale melee combat scenario, controlling a character with red accents on their armor, and being prompted to use the "bull rush" ability based on on-screen instructions. The environment is dark and full of flames, suggesting a battlefield, and the player's character is facing off against multiple enemies with level indicators and overhead health bars. The action appears to be part of a tutorial or combat sequence in which the player learns and performs combat abilities.

 

Appendix E GPT-4V for Gears of War 4

Figure 35: GPT-4V is capable of generating meaningful content descriptions for Gears of War 4. The GPT-4V agent provides a high-level description of the action: the player is engaging in a combat scenario in a visually rich, futuristic environment reminiscent of an industrial or military complex. The character, wearing armor and equipped with firearms, navigates the area, engaging the enemy using cover and targeting systems. Captured actions include the player character aiming from behind cover, shooting at enemies, taking damage (as indicated by on-screen visual cues), and engaging in close combat. The environment is marked by nighttime illumination, with artificial light sources highlighting areas or actions of interest. At certain moments, the player's character appears as a transparent outline, which may indicate a special ability or a temporary state such as invincibility or stealth. In order to input a long video, we select a number of keyframes (10 frames) to send to GPT-4V. We input each keyframe individually, without using a grid (as shown above).

As shown in Figure 35, based on the provided 10-frame video sequence, the GPT-4V agent can provide a general description of the player's actions in the video game:

"The player is navigating the character through an industrial environment at night, strategically hiding behind various objects such as crates and walls. The character is armed with weapons and engages enemy entities. The player uses firearms and may also use melee attacks, as shown in Close Quarters Combat. They appear to be managing their ammo and life values, as indicated by on-screen visual cues (such as flashing red indicators, which usually indicate that the character is taking damage).

Additionally, the player appears to be moving tactically, switching between different cover positions to advance or defend against enemies. At some point, shields appear to be activated, suggesting that the player is responding to incoming fire by using defensive capabilities. Characters also seem to be reloading their weapons, which suggests an element of resource management during combat. Blood and impact effects indicate successful hits on enemies.

The player's character cooperates with allies as other characters can be seen in the vicinity who are involved in similar combat maneuvers. There are interactable elements in the environment, such as exploding barrels, which the player may utilize to their advantage.

Scenarios depict fast-paced action with a focus on strategic movement, shooting and situational awareness to overcome opponents and potentially achieve specific goals in the game."

 

Appendix F GPT-4V for Starfield

Figure 36: GPT-4V is able to generate meaningful content descriptions for the Xbox console game Starfield. The GPT-4V agent provides a high-level description of the action in this space-themed exploration game. In order to input a long video, we selected a number of keyframes (10 frames) to send to GPT-4V. We input each keyframe individually, without using a grid (as shown above).

As shown in Figure 36, based on the 10-frame video sequence provided for Starfield, the GPT-4V agent provides a general depiction of the player's actions in the video game. According to GPT-4V, "The images depict a sequence of events in Starfield, showing the player's interactions in a mining environment. The following is a description of the sequence:

  1. The player stands in a dimly lit underground cavern with stalactites next to another character in space-like gear. They face a highlighted waypoint indicating a goal or location of interest.
  2. The player walks through an industrial interior, possibly a mining station, and approaches another character. The heads-up display (HUD) shows a dialog line from "Supervisor Lin", which indicates NPC (non-player character) interaction and possibly a mission briefing or update.
  3. The player appears to be following the NPC Supervisor Lin through the caves, which hints at the guided portion of the game in which NPCs direct the player to specific locations or quests.
  4. Supervisor Lin's dialog hints at a decision point or quest update, possibly guiding the player toward new objectives related to the "large vein" resource.
  5. The player encounters Supervisor Lin again, who provides instructions or information about metal deposits, suggesting the start of a mining mission or the exploration of a new area.
  6. In this frame, the player appears to be observing or interacting with a machine, and another character, Isabel, indicates actions or conversations related to the operation of equipment or technical work.
  7. The player looks down and sees rugged terrain and two glowing objects, which may represent resources to be collected or points of interest for the mission.
  8. The player examines a specific resource, which is labeled "Beryllium Deposit" on the heads-up display, indicating that resource collection or mining activity is underway.
  9. The player walks through the cavern, possibly looking for other resources or navigating to the next objective, as shown earlier by another character in space gear.
  10. The player is located in a darker area of the cave with two glowing markers indicating the path or objective point to reach.

Throughout the imagery, the player appears to be engaged in a mission that involves exploration, NPC interaction, resource gathering, and possibly operating machinery in a mining environment. Heads-up display elements such as waypoints, dialog text, and mission updates provide context for the player's actions and gameplay goals."

 

Author Profiles

Zane Durante

Ph.D. Candidate, Computer Science, Stanford University
E-mail: durante@stanford.edu
Website: https://zanedurante.github.io/
Zane Durante is a PhD student in the Stanford Vision Lab, supervised by Dr. Fei-Fei Li. His research interests include video understanding, linking computer vision to natural language, and applications of artificial intelligence in hospital care. He is leading an ongoing collaboration between the Stanford School of Medicine and the Stanford Vision Lab to develop a dataset of natural language descriptions of video clips of hospital scenes. His doctoral work was funded by a Graduate Research Fellowship from the National Science Foundation.

Qiuyuan Huang

Principal Researcher at Microsoft Research, Redmond, WA, USA.
E-mail: qihua@microsoft.com
Website: https://www.microsoft.com/en-us/research/people/qihua/
Qiuyuan Huang is a Principal Investigator in the Deep Learning group at Microsoft Research (MSR) in Redmond, WA. Her current research interests focus on deep learning, multimodality, and natural language processing, especially agent AI for gaming, robotics, and healthcare; knowledge-based reasoning intelligence for interactive AI; neural symbolic computation for inference; and large-scale base models for NLP and multimodality.

Naoki Wake

A researcher at Microsoft in Redmond, Wash.
E-mail: naoki.wake@microsoft.com
Website: https://www.microsoft.com/en-us/research/people/nawake/
Naoki Wake is a researcher in the Applied Robotics Research Group at Microsoft. His current research involves developing multimodal perception systems and collaborative speech-gesture systems for robots. His past research covers auditory neuroscience, neurorehabilitation, and speech processing. Naoki received his B.S. in Engineering from the University of Tokyo in 2014 and his Ph.D. in Information Science and Technology in 2019.

Ran Gong

Ph.D. student in computer science at UCLA.
E-mail: nikepupu@ucla.edu
Website: https://nikepupu.github.io
Ran Gong is a PhD student in the VCLA Lab at UCLA. His research area lies at the intersection of robotics, computer vision, computer graphics, and machine learning. His research focuses on embodied simulation and interaction with the goal of creating intelligent behaviors that can solve different tasks in different environments and have the ability to collaborate with humans. He received a Bachelor of Science degree in Computer Science and Engineering from UCLA.

Jae Sung Park

Ph.D. Candidate, University of Washington
E-mail: jspark96@cs.washington.edu
Website: https://homes.cs.washington.edu/~jspark96/
Jae Sung is a PhD student supervised by Yejin Choi and Ali Farhadi. His research focuses on developing models with multimodal commonsense reasoning. He is interested in equipping models with grounding that connects linguistic concepts to the visual modality, allowing them to understand multimedia content in the same way that humans process visual information. Jae Sung received his Bachelor of Science degree in Computer Science from the University of California, Berkeley.

Bidipta Sarkar

Undergraduate, Stanford University
E-mail: bidiptas@stanford.edu
Website: https://bsarkar321.github.io/
Bidipta Sarkar is a senior undergraduate student at Stanford University and a member of the Stanford ILIAD Lab. His research focuses on creating artificially intelligent agents that can interact with their environment and safely work in concert with humans and other autonomous agents.

Rohan Taori

Ph.D. Candidate, Computer Science, Stanford University
E-mail: rtaori@cs.stanford.edu
Website: https://www.rohantaori.com/
Rohan Taori is a PhD student in the Artificial Intelligence Lab at Stanford University. His research focuses on the foundations of machine learning in the context of real-world systems. Recently, Rohan has pushed the frontiers of open-source large language models, fine-tuning them to be useful general-purpose assistants. Rohan is also interested in augmenting large language models with multimodal capabilities, enabling them to reason about images and videos. Rohan received his B.S. in Computer Science from the University of California, Berkeley.

Yusuke Noda

Principal Software Engineer. Microsoft Gaming, Redmond, WA.
E-mail: yusuke.noda@microsoft.com
Website: https://www.linkedin.com/in/yusuke-noda-908797/
Yusuke Noda is a Principal Software Engineer for the Microsoft Gaming Platform. He led the development of the gaming infrastructure for Xbox One and Xbox Cloud Gaming and has over 16 years of experience in developing efficient infrastructure for cloud and gaming technologies.

Demetri Terzopoulos

UCLA Distinguished Professor, Academy Award-winning computer scientist and entrepreneur.
E-mail: dt@cs.ucla.edu
Website: https://web.cs.ucla.edu/~dt/
Dr. Demetri Terzopoulos is a Distinguished Professor and Chancellor's Professor of Computer Science at UCLA, where he directs the UCLA Computer Graphics and Vision Laboratory. He is also co-founder and Chief Scientist of VoxelCloud, Inc., a multinational healthcare AI company. He is or has been a Guggenheim Fellow; a Fellow of the Royal Society of London (FRS) and of the Royal Society of Canada (FRSC); a Fellow of the Association for Computing Machinery (ACM); a Life Member of the Institute of Electrical and Electronics Engineers (IEEE); a Fellow of the Asia Pacific Association for Artificial Intelligence (AAIA); a Distinguished Fellow of the International Institute of Engineering and Technology (IETI); a Fellow of the European Academy of Sciences (EAS) and of the New York Academy of Sciences (NYAS); and a Life Member of Sigma Xi.

Yejin Choi

Wissner-Slivka Chair and Brett Helsel Professor at the University of Washington; Senior Research Manager and MacArthur Fellow at the Allen Institute for Artificial Intelligence.
E-mail: yejin@cs.washington.edu
Website: https://homes.cs.washington.edu/~yejin/
Yejin is the Wissner-Slivka Chair and Brett Helsel Professor at the University of Washington and a Senior Research Manager at the Allen Institute for Artificial Intelligence. She received the Anita Borg Early Career Award in 2018 and a MacArthur Fellowship in 2020. She received Distinguished Paper Awards at AAAI 2020, NeurIPS 2021, ICML 2022, and ACL 2023, and Best Paper Awards at NAACL 2022 and ACL 2023. She is one of the main organizers of COLM 2024, an academic venue focused on language modeling research. Her main research interests are natural language processing, machine learning, and commonsense reasoning in artificial intelligence, with broader interests in computer vision and the digital humanities.

Fei-Fei Li

Professor of Computer Science, Stanford University
E-mail: feifeili@stanford.edu
Website: https://profiles.stanford.edu/fei-fei-li/
Fei-Fei Li is the inaugural Sequoia Professor in the Department of Computer Science at Stanford University and co-director of Stanford's Institute for Human-Centered Artificial Intelligence. She served as Director of the Stanford Artificial Intelligence Laboratory from 2013 to 2018. During her sabbatical from Stanford from January 2017 to September 2018, Dr. Li served as a Vice President at Google and as Chief Scientist for Artificial Intelligence/Machine Learning at Google Cloud. Since then, she has served as a board member or advisor to several public and private companies.

Katsushi Ikeuchi

Senior Principal Research Manager, Microsoft Research, Redmond, Washington, USA; IEEE Life Member.
E-mail: katsuike@microsoft.com
Website: https://www.microsoft.com/en-us/research/people/katsuike/
Dr. Katsushi Ikeuchi received his B.S. in Mechanical Engineering from Kyoto University in 1973 and his Ph.D. in Information Engineering from the University of Tokyo in 1978. After three years at the MIT Artificial Intelligence Laboratory, five years at the Electronics Technology Laboratory of the Japanese Government's Ministry of International Trade and Industry, 10 years at the Robotics Institute at Carnegie Mellon University, and 19 years at the Institute of Industrial Science at the University of Tokyo, he joined Microsoft as a Principal Researcher in 2015. During his tenure at CMU and the University of Tokyo, he mentored more than 50 PhD students.

His research interests span computer vision, robotics, and computer graphics. He has received several best paper awards in these research areas, including the David Marr Award in Computational Vision and the IEEE Robotics and Automation Society K. S. Fu Memorial Best Paper Award.

His community service includes: General Chair of IROS95, ITSC99, IV01, ICCV05, ACCV07, and ICCV17; Program Chair of CVPR96, ICCV03, ICRA09, ICPR12, and ICCV15; Editor-in-Chief of IJCV (2000-2017) and of IJ ITS (2012-2014); Associate Editor of IEEE Trans. RA and IEEE Trans. PAMI; and Distinguished Lecturer of the IEEE Signal Processing Society (2000-2002), the IEEE Robotics and Automation Society (2004-2006), and the IEEE Computer Society (2008-2010).

Through this research and community service, he has received (life) fellowships from IEEE, IEICE, IPSJ, and RSJ. He received the Distinguished Researcher Award from IEEE-PAMI, the Medal with Purple Ribbon from the Emperor of Japan, the Okawa Prize from the Okawa Foundation, and several awards for research accomplishments from Japanese professional associations.

Hoi Vo

Technical Fellow. Microsoft Gaming and Xbox Emerging Technologies, Redmond, WA, USA.
E-mail: hoiv@microsoft.com
Website: https://www.linkedin.com/in/hoi-vo-193420/
Hoi Vo is a Technical Fellow in Microsoft's Gaming Division focused on integrating AGI with new gaming IP. He played a key role in building the Xbox Cloud Gaming service and led cloud and gaming efforts across Microsoft teams, including Microsoft's Windows Azure platform. His focus is on optimizing AI models to run efficiently at the edge while leveraging the cloud to handle scenarios beyond the capabilities of the hardware, including fine-tuning models to make them more compelling for a variety of gaming experiences.

Jianfeng Gao

Distinguished Scientist and Vice President, Microsoft Research, Redmond, WA, USA; IEEE Fellow.
E-mail: jfgao@microsoft.com
Website: https://www.microsoft.com/en-us/research/people/jfgao/
Dr. Jianfeng Gao is a Distinguished Scientist and Vice President at Microsoft Research, a Fellow of the IEEE, a Distinguished Member of the ACM, and the current head of the Deep Learning Group at Microsoft Research.From 2014 to 2017, he served as Business AI Partner Research Manager. He led the development of predictive sales and marketing AI solutions. He has also worked on deep learning for text and image processing (see ACL/SIGIR 2018 tutorials, Deep Learning 2017 tutorials, and IJCAI 2016 tutorials or Microsoft's intranet site), and led the development of AI systems for dialog, machine reading comprehension (MRC), and question answering (QA). Since 2022, he has led research on self-improving AI that enhances and adapts LLM (e.g. ChatGPT/GPT4) for developing commercial AI systems.

 

Acknowledgments

We are particularly grateful to Peter Lee, Doug Burger, Desney Tan, Johannes Gehrke, Ryen White, Ece Kamar, Subhojit Som, and Kareem Choudhry for their advice, strong support, and encouragement. We thank Haiyan Zhang, Spencer Perreault, Dave Bignell, Katja Hofmann, Sam Devlin, Shanzheng Tan, Raluca Georgescu, Bill Dolan, Nebojsa Jojic, Sudha Rao, Adrian Brown, and Andrzej Banburski-Fahey for insightful early discussions and help with gameplay. We thank Kiran Muthabatulla, Antonio Criminisi, Tom Cashman, Nguyen Bach, Jennifer Marsman, and Jaron Lanier of the Microsoft Mesh team, Microsoft OCTO, and the Microsoft Office team for their contributions to the mixed-reality and dataset work, and for their generous help and feedback on the project. We are especially grateful to Paul Bennett, Corby Rosset, Michel Galley, Chenglong Wang, Baolin Peng, Hao Chen, Silviu Cucerzan, Ahmed Awadallah, and Saleema Amershi for their suggestions and comments on the NLP section. The authors would like to thank Paul Smolensky, Yonatan Bisk, Kezhen Chen, Borui Wang, Liangke Gui, Dingmin Wang, Xin (Eric) Wang, Zhe Gan, Xiaojian Ma, Zilong Zheng, Song-chun Zhu, Dragomir R. Radev, Daniel McDuff, and Harry Shum for their previous work, comments, suggestions, painstaking reviews of this paper, and pointers to the literature. Finally, we sincerely thank the Microsoft HoloLens team, the Microsoft Xbox team, and the Meta Quest team for generously providing devices; the MSR Central Engineering (CE) team and the Microsoft 343 team for data collection and sharing; and the Microsoft AOAI and GCR teams for their support of the Azure-OpenAI endpoint.

We would also like to thank our colleagues from the Stanford AI-Assisted Care Partnership who helped us understand the medical applications explored in this work. We are particularly grateful to Ehsan Adeli, Paul Tang, Amit Kaushal, Roger Bohn, Kevin Schulman, and Arnold Milstein for their clinical expertise and guidance.

This research was supported by Microsoft Research Program Fair 2023, Microsoft HackBox 2023 and the OCTO team.
