GLM-PC is the world's first publicly available, ready-to-use computer agent based on the CogAgent multimodal model. It can "observe" and "operate" a computer much as a person does, and helps users complete a wide range of computer tasks efficiently.
Since releasing GLM-PC v1.0 on November 29, 2024 and opening the internal beta, we have continued to optimize and upgrade it. The latest version introduces a "Deep Thinking" mode and adds features dedicated to logical reasoning and code generation. We also now offer support for Windows systems.
Download & Experience: https://cogagent.aminer.cn
GLM-PC Architecture
In recent years, interest in the modeling and architecture of Agents has grown steadily.
The tool-invocation capabilities of Large Language Models (LLMs) showed for the first time how LLMs can act as agents integrated organically into human production, with good generalization and few-shot learning ability. Their scope of application, however, is limited to publicly accessible tools that can be interacted with in textual form.
Graphical interface agents (GUI Agents) built on Visual Language Models (VLMs), with CogAgent as a representative, propose a new path: interaction across the full GUI space through multimodal perception. Like humans, these GUI Agents visually perceive interface elements and layouts and simulate human meta-operations such as clicking and keyboard input, which greatly expands the boundaries of Agent applications in virtual interaction spaces.
At the same time, multi-agent systems such as SWE-agent demonstrate the potential of multi-agent collaboration, which combines the strengths of different models to explore multi-model planning, reflection, and self-iteration.
We believe that progress in Agents comes down to two things: stronger model capabilities and better collaboration architectures.
A complete Agent needs to fulfill the following conditions:
- At the perceptual level, the ability to receive multiple signals such as text, images, video, and audio;
- At the thinking level, the ability to reason logically and plan tasks (similar to the left brain), together with the ability to perceive efficiently and operate flexibly (similar to the right brain);
- At the execution level, the ability to perform full GUI spatial operations, receive environmental feedback, and self-correct.
Based on this thinking, we introduced the open-source CogAgent model in 2023, which filled the gap in multimodal perception for GUI Agents. In November 2024, GLM-PC v1.0 further strengthened perception, planning, and creation capabilities and achieved limited self-correction.
Now, the new version of GLM-PC draws on the division of labor between the human "left brain" and "right brain". It deeply combines logical reasoning with perceptual cognition through code generation and graphical interface comprehension, which lets it balance logic and creativity and assist people in accomplishing complex tasks.
Behind it are the multimodal model CogAgent and the code model CodeGeeX, both developed by Zhipu. The new version of GLM-PC directs workflows and tool invocations in code form and strengthens its planning, reasoning, and reflection in Deep Thinking mode, so that it can respond stably and efficiently to complex scenarios and tasks. During execution, GLM-PC senses multiple layers of environmental feedback and uses them to support reflection, enabling effective self-correction and optimization.
It is worth mentioning that in December 2024 we open-sourced the fully enhanced model CogAgent-9B-20241220 to facilitate research on pre-trained GUI Agents.
Agent Left Brain: Code Generation and Logic Execution
The "left brain" of GLM-PC is responsible for rigorous logical reasoning and task execution. Its main functions include:
1. Planning
GLM-PC can quickly draw up a detailed task plan based on the user's requirements. It analyzes the objectives and the available resources, generates an execution roadmap, and automatically breaks large tasks down into manageable sub-tasks to build a clear execution path.
2. Looping Execution
Once the planning phase ends, GLM-PC launches the code generation module and executes a logical loop that progresses step by step toward completion of the task. This looping mechanism ensures precise execution and a high degree of automation, closing the loop from input to output without human intervention.
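The plan-then-loop pattern described in the two items above can be illustrated with a minimal sketch. Everything here is hypothetical: `Subtask`, `make_plan`, and `run_plan` are illustrative names, the plan is hard-coded rather than generated by a model, and the product names are placeholder data. It is not GLM-PC's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Subtask:
    """One step of a plan (hypothetical structure, not GLM-PC's own)."""
    name: str
    action: Callable[[dict], dict]  # receives the shared state, returns updates
    done: bool = False


def make_plan() -> List[Subtask]:
    # Hard-coded stand-in for the planner; a real agent would generate this.
    return [
        Subtask("extract_products", lambda s: {"products": ["tea", "mug"]}),
        Subtask("write_spreadsheet", lambda s: {"rows": len(s["products"])}),
        Subtask("add_to_cart", lambda s: {"cart": list(s["products"])}),
    ]


def run_plan(plan: List[Subtask], max_steps: int = 10) -> dict:
    """Run pending subtasks in order until none remain or the budget is spent."""
    state: dict = {}
    for _ in range(max_steps):
        pending = [task for task in plan if not task.done]
        if not pending:
            break
        task = pending[0]
        state.update(task.action(state))  # each step builds on earlier results
        task.done = True
    return state


final_state = run_plan(make_plan())
print(final_state)
```

The step budget (`max_steps`) is one simple way to keep such a loop from running forever if a sub-task never completes.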
Case Study: One-Stop Shopping Process
Taking product information as an example, GLM-PC can automatically extract product data from a picture, store it in Excel, and add the products to the Taobao shopping cart, completing a one-stop shopping process.
Operation Instruction: Get the product information in the picture, create a new Excel file on the desktop to store the information, and add the products to the Taobao shopping cart.
(Parts of the video are sped up.)
3. Long-Thinking Capability: Dynamic Reflection, Error Correction, and Optimization
The "left brain" of GLM-PC does more than generate static plans. During execution it adjusts in real time, reflects, and corrects itself based on new information from the environment, continuously optimizing the solution. Specifically:
- Flexibility to cope with interruptions: When the process is interrupted by external factors, GLM-PC quickly reconfigures the logical path so that the task keeps running smoothly.
- Proactive Information Refinement: When missing information is encountered, GLM-PC will proactively interact with the user to refine the task execution plan by asking questions.
Case Study: Efficient Information Processing and Social Interaction
For example, when helping a user process information about "Spring Festival New Year's Movie" on Xiaohongshu, GLM-PC can quickly find and extract the relevant data and write code to store the information on the computer. If the generated code contains errors, it corrects the code itself based on the error message.
Instructions: Search for "Spring Festival New Year's Movie" on Xiaohongshu, quote the image from the first image-and-text post, send the image to the {GGG} group chat on WeChat, and ask the members which movie they would like to see.
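The self-correction described in this case study, where generated code is fixed based on its own error message, can be sketched as a run-and-repair loop. This is only an illustration under stated assumptions: `run_snippet`, `execute_with_repair`, and `toy_repair` are hypothetical names, and `toy_repair` is a hard-coded stand-in for the model that would rewrite the code in a real agent.

```python
import traceback
from typing import Callable, Optional, Tuple


def run_snippet(source: str) -> Tuple[Optional[dict], Optional[str]]:
    """Run source in a fresh namespace; return (namespace, None) or (None, error)."""
    namespace: dict = {}
    try:
        exec(source, namespace)
        return namespace, None
    except Exception:
        return None, traceback.format_exc()


def execute_with_repair(
    source: str,
    repair: Callable[[str, str], str],
    max_attempts: int = 3,
) -> Tuple[dict, int]:
    """Retry the snippet, passing each error message to `repair` for a new version."""
    error = ""
    for attempt in range(1, max_attempts + 1):
        namespace, error = run_snippet(source)
        if error is None:
            return namespace, attempt
        source = repair(source, error)
    raise RuntimeError(f"still failing after {max_attempts} attempts:\n{error}")


def toy_repair(source: str, error: str) -> str:
    # Stand-in for the model: fixes one known typo when the error names it.
    if "NameError" in error and "titels" in error:
        return source.replace("titels", "titles")
    return source


buggy = "titles = ['A', 'B']\ncount = len(titels)\n"
namespace, attempts = execute_with_repair(buggy, toy_repair)
print(namespace["count"], attempts)
```

Note that `exec` runs arbitrary code; a real system would need sandboxing, which this sketch leaves out.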
Agent Right Brain: Images and GUI Cognition
GLM-PC's "right brain" focuses on in-depth perception and interactive experience. Its core functions cover:
- GUI Image Understanding: Accurately recognize graphical interface elements (e.g., buttons, icons, layouts, etc.) and understand their function and interaction logic.
- User Behavior Cognition: Combine knowledge of the user interface with historical operation information to recommend suitable operations for the current interface.
- Image Semantic Parsing: Perform in-depth semantic analysis of complex images to extract key information such as text, identifiers, and the trends and indicators in data visualization charts.
- Multimodal Information Fusion: Fuse image and text information into a comprehensive perceptual result. For example, recognize both button positions and text labels in the user interface, helping the "left brain" make precise operation plans.
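As a rough illustration of the last item in the list above, the sketch below fuses two perception outputs: detected interface elements and recognized text spans. It attaches a text label to an element when the center of the text's bounding box falls inside the element's box. The data structures, the matching rule, and the sample coordinates are all assumptions made for this example; the article does not describe how GLM-PC performs fusion internally.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class Element:
    kind: str  # e.g. "button" or "icon"
    box: Box
    label: Optional[str] = None


@dataclass
class TextSpan:
    text: str
    box: Box


def center(box: Box) -> Tuple[float, float]:
    left, top, right, bottom = box
    return ((left + right) / 2, (top + bottom) / 2)


def contains(box: Box, point: Tuple[float, float]) -> bool:
    left, top, right, bottom = box
    x, y = point
    return left <= x <= right and top <= y <= bottom


def fuse(elements: List[Element], spans: List[TextSpan]) -> List[Element]:
    """Label each element with the text whose center lies inside its box."""
    fused = []
    for element in elements:
        texts = [s.text for s in spans if contains(element.box, center(s.box))]
        fused.append(Element(element.kind, element.box, " ".join(texts) or None))
    return fused


# Made-up sample detections for illustration only.
elements = [Element("button", (10, 10, 110, 50)), Element("icon", (200, 10, 240, 50))]
spans = [TextSpan("Add to cart", (20, 20, 100, 40))]
fused = fuse(elements, spans)
for element in fused:
    print(element.kind, element.box, element.label)
```

A planner can then target "the button labeled Add to cart" and click the center of its box, instead of working from raw pixels.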
Demonstration: Efficient Data Organization and Archiving
For example, GLM-PC can search Xiaohongshu for image-and-text content related to "AI Ranking" and extract it. Then, using code it writes itself, it stores the company information in a newly created Excel file on the desktop and saves the text of the post in a specified Word document, keeping user data organized and archived efficiently.
Operation instructions: Search for "new energy vehicle list" on Xiaohongshu, quote the image and text content of the first image-and-text post, get the list information in the image and store it in a new Excel file on the desktop, and put the text of the post into a new Word document on the desktop called new-energy.
Agent of Agents: Left and Right Brain Collaboration
This design, which draws on the collaboration between the left and right brains, enables GLM-PC not only to handle complex logical tasks but also to show greater adaptability, creativity, and generalization on open-ended problems. Through dynamic optimization and context awareness, GLM-PC can help users find more efficient solutions, especially in cyclic task processing, multi-step reasoning and execution, and long-chain task management.
Case Study: Grade 6 English Vocabulary Study Aid
As a Grade 6 English vocabulary learning assistant, GLM-PC can automatically extract Grade 6 vocabulary words from a specified website, write a sentence for each word, and save the words and their sentences to a new Word document named "Grade 6 English Vocabulary Learning".
Instruction: Find 3 vocabulary words in the Grade 6 vocabulary list at "https://www.dxsbb.com/news/277.html", then make a sentence for each word, paste the vocabulary words and the corresponding sentences into a new Word document, and save it as "Grade 6 English Vocabulary Learning".
Demonstration: Personalized WeChat Greetings and Bulk Sending of New Year Pictures
GLM-PC can automatically create personalized Chinese New Year wishes and congratulatory pictures/videos for the members of a WeChat group and send them to everyone in one operation, completing holiday greetings efficiently.
Instruction: Quote the list of "GGG" group members on WeChat, and send each of them a 2025 Chinese New Year greeting message and a picture with the theme of the Year of the Snake.
Demonstration: Intelligent Flight Inquiry and Scheduling
GLM-PC can quickly look up flight information for users, pick out the cheapest tickets, and set a Feishu calendar reminder at the same time, providing a one-stop service from flight inquiry and ticket screening to scheduling.
Instructions: Help me find the cheapest air ticket from Shanghai to Beijing on January 21st on Ctrip; then help me set up a Feishu calendar event that starts 6 hours before the plane takes off, with the topic "departure to the airport" and a duration of half an hour.
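The calendar part of this instruction is simple date arithmetic: the event starts 6 hours before takeoff and lasts half an hour. A minimal sketch follows; `reminder_window` is a hypothetical helper, and the departure time (including the year) is a made-up value for illustration, not taken from the demonstration.

```python
from datetime import datetime, timedelta
from typing import Tuple


def reminder_window(
    departure: datetime,
    lead: timedelta = timedelta(hours=6),
    duration: timedelta = timedelta(minutes=30),
) -> Tuple[datetime, datetime]:
    """Return (start, end) of an event that begins `lead` before departure."""
    start = departure - lead
    return start, start + duration


# Hypothetical departure time; a real agent would read it from the chosen flight.
departure = datetime(2025, 1, 21, 14, 30)
start, end = reminder_window(departure)
print(start.isoformat(), end.isoformat())
```

With this made-up 14:30 departure, the event would run from 08:30 to 09:00 on the same day.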
Showcase: PDF Math Problem Extraction and Organization
GLM-PC automatically opens PDF files, extracts the specified content, and organizes and stores the information in a Word document.
Operation Instruction: Help me open the Permutations and Binomial Theorem Exercise.pdf file on the desktop, quote the current interface, summarize the first few math problems, and put them into a new Word document on the desktop.
Cooperation
We are exploring in-depth cooperation with Lenovo, Asus, and other well-known PC manufacturers to jointly promote the innovation and development of AIPC (AI Personal Computer).
AIPC is not only a computer but also a new application of AI Agents in personal computing, one that can give users a more efficient and smarter experience in work and life.