General Introduction
OmniParser is a tool developed by Microsoft that parses user interface screenshots into structured, easy-to-understand elements. This significantly improves the ability of GPT-4V to generate actions that are accurately grounded in the corresponding regions of the interface. OmniParser works with a range of large language models and can also be combined with a Windows 11 virtual machine to provide interface control. The latest version, OmniParser V2.0, is presented as a leading interface-parsing tool, with significant performance and latency improvements over earlier versions.
Function List
- Parsing User Interface Screenshots: Transform screenshots into structured elements that are easy to understand and manipulate.
- Support for multiple large language models, including those from OpenAI, DeepSeek, Qwen, and Anthropic.
- Controlling Windows 11 VMs: Combined with a vision model, OmniParser enables full control of a VM.
- Detailed icon detection and functional descriptions: Supports finer-grained icon detection and prediction of interactive elements.
- High performance and low latency: the latest version offers significant improvements in performance and latency.
Usage Guide
Installation Process
- Create and activate a virtual environment:
conda create -n "omni" python==3.12
conda activate omni
- Install the necessary dependencies:
pip install -r requirements.txt
- Download the V2 weights files and place them in the weights folder (the commands below first delete any existing weights in those subfolders):
rm -rf weights/icon_detect weights/icon_caption weights/icon_caption_florence
for f in icon_detect/{train_args.yaml,model.pt,model.yaml} icon_caption/{config.json,generation_config.json,model.safetensors}; do huggingface-cli download microsoft/OmniParser-v2.0 "$f" --local-dir weights; done
mv weights/icon_caption weights/icon_caption_florence
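After the download and rename steps above, it can be useful to confirm that every file landed where expected. The following is a minimal sketch, not part of OmniParser itself: the helper name `missing_weight_files` is hypothetical, and the file list is simply copied from the download loop above, with `icon_caption` replaced by `icon_caption_florence` to reflect the final `mv` step.

```python
from pathlib import Path

# Files named in the download loop, with icon_caption renamed to
# icon_caption_florence as done by the final mv command.
EXPECTED_FILES = [
    "icon_detect/train_args.yaml",
    "icon_detect/model.pt",
    "icon_detect/model.yaml",
    "icon_caption_florence/config.json",
    "icon_caption_florence/generation_config.json",
    "icon_caption_florence/model.safetensors",
]


def missing_weight_files(weights_dir="weights"):
    """Return the expected files that are absent under weights_dir.

    Hypothetical helper for illustration; it only checks that the files
    exist, not that their contents are valid.
    """
    root = Path(weights_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_weight_files()
    if missing:
        print("Missing weight files:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All expected weight files are present.")
```

Running this from the repository root before starting the demo gives a quick indication of whether a download was interrupted or the rename step was skipped.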
Usage Process
- Run the Gradio Demo:
python gradio_demo.py
- Parsing user interface screenshots:
  - Upload or take a screenshot of the user interface.
  - Use OmniParser to parse the screenshot and generate structured interface elements.
- Controlling Windows 11 virtual machines:
  - Combined with a vision model, OmniParser enables full control of the virtual machine.
  - A variety of large language models are supported, to improve the accuracy and efficiency of operations.
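The control workflow described above can be pictured as a repeated observe, decide, act cycle: parse the current screen, let a model pick an action, then carry it out. The sketch below shows one such cycle under loud assumptions: `control_step` and its three callables are hypothetical placeholders supplied by the caller, and none of them correspond to actual OmniParser functions or to any specific model API.

```python
from typing import Any, Callable, List, Optional


def control_step(
    screenshot: Any,
    parse_screen: Callable[[Any], List[dict]],
    choose_action: Callable[[List[dict]], Optional[dict]],
    perform_action: Callable[[dict], None],
) -> Optional[dict]:
    """Run one observe -> decide -> act cycle.

    All three callables are hypothetical stand-ins:
    parse_screen   turns a screenshot into a list of element records,
    choose_action  picks an action (or None when nothing should be done),
    perform_action executes the chosen action, e.g. inside a VM.
    """
    elements = parse_screen(screenshot)
    action = choose_action(elements)
    if action is not None:
        perform_action(action)
    return action


if __name__ == "__main__":
    # Stub implementations, used only to show how the pieces fit together.
    def fake_parse(_screenshot):
        return [{"description": "Save button", "interactive": True}]

    def fake_choose(elements):
        for element in elements:
            if element["interactive"]:
                return {"type": "click", "target": element["description"]}
        return None

    def fake_perform(action):
        print("Would perform:", action)

    control_step("screenshot.png", fake_parse, fake_choose, fake_perform)
```

Passing the three stages in as callables keeps the loop independent of which parser, language model, or VM backend is actually used.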
Detailed Function Operation
- Icon Detection: OmniParser detects icons in the interface and describes their functions, helping users quickly understand and operate them.
- Interaction Element Prediction: Predicts which elements of the interface are interactive, so that actions can be directed at them.
- High-Performance Parsing: The latest version offers significant performance and latency improvements, keeping the parsing process fast and efficient.
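To make the idea of structured elements more concrete, the sketch below shows how detected elements might be filtered down to the interactive ones and turned into click coordinates. It is an illustration under stated assumptions: the `UIElement` record, its field names, and the use of bounding boxes expressed as fractions of the image size are all hypothetical and are not taken from OmniParser's actual output schema.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class UIElement:
    # Hypothetical record; field names are illustrative only.
    kind: str                                 # e.g. "icon" or "text"
    bbox: Tuple[float, float, float, float]   # (x1, y1, x2, y2) as fractions of width/height
    interactive: bool                         # predicted to respond to user input
    description: str                          # functional description of the element


def interactive_elements(elements: List[UIElement]) -> List[UIElement]:
    """Keep only the elements predicted to be interactive."""
    return [e for e in elements if e.interactive]


def click_point(element: UIElement, image_width: int, image_height: int) -> Tuple[int, int]:
    """Return the pixel coordinates of the centre of the element's box."""
    x1, y1, x2, y2 = element.bbox
    cx = (x1 + x2) / 2 * image_width
    cy = (y1 + y2) / 2 * image_height
    return round(cx), round(cy)


if __name__ == "__main__":
    parsed = [
        UIElement("text", (0.0, 0.0, 0.5, 0.25), False, "Window title"),
        UIElement("icon", (0.25, 0.5, 0.75, 1.0), True, "Save button"),
    ]
    for element in interactive_elements(parsed):
        print(element.description, click_point(element, 800, 600))
```

Storing boxes as fractions rather than pixels is a design choice made for this sketch; it lets the same record be mapped onto screenshots of different resolutions.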