YOLOE is an open source project developed by the Multimedia Intelligence Group (THU-MIG) at the Tsinghua University School of Software; its full name is "You Only Look Once Eye". Built on the PyTorch framework as an extension of the YOLO series, it can detect and segment arbitrary objects in real time. The project is hosted on GitHub, ...
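As a rough illustration of the text-prompted workflow YOLOE targets, the sketch below assumes an Ultralytics-style Python API and an example checkpoint name; the exact class names and weights should be checked against the THU-MIG/yoloe repository rather than taken from this snippet.

```python
# Minimal sketch of text-prompted detection/segmentation with a YOLOE checkpoint.
# Assumes an Ultralytics-style API and the "yoloe-11s-seg.pt" weights name (assumption);
# consult the THU-MIG/yoloe repository for the exact interface and checkpoints.
from ultralytics import YOLOE

model = YOLOE("yoloe-11s-seg.pt")          # load a small segmentation checkpoint (assumed name)

# Open-vocabulary prompting: restrict the model to the classes you care about.
names = ["person", "bicycle", "traffic light"]
model.set_classes(names, model.get_text_pe(names))

results = model.predict("street.jpg")      # detection + segmentation in one pass
results[0].show()                          # visualize boxes and masks
```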
General Introduction SegAnyMo is an open source project developed by a team of researchers at UC Berkeley and Peking University, including members such as Nan Huang. This tool focuses on video processing and can automatically recognize and segment arbitrary moving objects in a video, such as people, animals or vehicles. It combines TAP...
Comprehensive Introduction RF-DETR is an open source object detection model developed by the Roboflow team. It is based on the Transformer architecture, and its core feature is real-time efficiency. It is the first real-time model to exceed 60 AP on the Microsoft COCO dataset, and it also performs strongly on the RF100-VL benchmark,...
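Roboflow distributes RF-DETR as a pip-installable package; the sketch below assumes the `rfdetr` package exposes an `RFDETRBase` class with a `predict()` method, so the names and the `threshold` parameter should be verified against Roboflow's documentation.

```python
# Minimal RF-DETR inference sketch, assuming `pip install rfdetr` provides RFDETRBase
# with a predict() method that returns detections; verify names against Roboflow's docs.
from PIL import Image
from rfdetr import RFDETRBase

model = RFDETRBase()                               # pretrained COCO weights (assumed default)
image = Image.open("example.jpg")                  # hypothetical input image

detections = model.predict(image, threshold=0.5)   # confidence threshold (assumed parameter)
print(detections)                                  # boxes, class ids, and scores
```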
General Introduction HumanOmni is an open source large multimodal model developed by the HumanMLLM team and hosted on GitHub. It focuses on analyzing human-centered video and can process both visuals and audio to help understand emotion, movement, and conversational content. The project used 2.4 million human-centered video clips and...
General Introduction Vision Agent is an open-source project developed by LandingAI (Andrew Ng's team) and hosted on GitHub, designed to help users quickly generate code that solves computer vision tasks. It uses an advanced agent framework and a multimodal model to generate efficient code from simple prompts...
General Introduction Make Sense is a free online image annotation tool designed to help users quickly prepare datasets for computer vision projects. It requires no installation: it runs directly in the browser, works across operating systems, and is well suited to small deep learning projects. Users can use it to...
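A common workflow is to draw boxes in Make Sense and export them in YOLO format (one .txt file per image with normalized coordinates). The sketch below, with hypothetical file names, shows how such an export can be read back into pixel coordinates for training or inspection.

```python
# Read a YOLO-format label file, as exported by Make Sense and many other tools.
# Each line is: <class_id> <x_center> <y_center> <width> <height>, normalized to [0, 1].
# File names here are hypothetical examples.

def load_yolo_labels(label_path: str, img_w: int, img_h: int):
    """Return a list of (class_id, x_min, y_min, x_max, y_max) in pixel coordinates."""
    boxes = []
    with open(label_path) as f:
        for line in f:
            cls, xc, yc, w, h = line.split()
            xc, yc = float(xc) * img_w, float(yc) * img_h
            w, h = float(w) * img_w, float(h) * img_h
            boxes.append((int(cls), xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2))
    return boxes

print(load_yolo_labels("labels/cat_001.txt", img_w=1280, img_h=720))
```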
Comprehensive Introduction YOLOv12 is an open source project developed by GitHub user sunsmarterjie, focused on real-time object detection. The project builds on the YOLO (You Only Look Once) family of frameworks and introduces an attention mechanism to improve on traditional convolutional neural networks (CNNs), not only in the detection of ...
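Because the repository builds on the Ultralytics codebase, inference likely follows the familiar YOLO API; the snippet below is a sketch under that assumption, and the checkpoint name "yolov12n.pt" is an assumed example rather than a confirmed file.

```python
# Sketch of YOLOv12 inference, assuming the sunsmarterjie/yolov12 repo keeps the
# Ultralytics-style API; the checkpoint name "yolov12n.pt" is an assumed example.
from ultralytics import YOLO

model = YOLO("yolov12n.pt")                # nano variant of the attention-centric detector
results = model("bus.jpg")                 # single-image inference

for box in results[0].boxes:               # iterate over detected objects
    cls_id = int(box.cls)
    conf = float(box.conf)
    print(model.names[cls_id], f"{conf:.2f}", box.xyxy.tolist())
```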
Comprehensive Introduction VLM-R1 is an open source vision-language model project developed by Om AI Lab and hosted on GitHub. The project builds on DeepSeek's R1 approach combined with the Qwen2.5-VL model, using reinforcement learning (R1) and supervised fine-tuning (SFT) to significantly improve the model's visual...
Comprehensive Introduction HealthGPT is a state-of-the-art medical large vision-language model designed to enable unified medical visual understanding and generation through heterogeneous knowledge adaptation. The goal of the project is to integrate medical visual understanding and generation capabilities into a unified autoregressive framework, significantly enhancing medical image processing...
Comprehensive Introduction MedRAX is a state-of-the-art AI agent designed specifically for chest X-ray (CXR) analysis. It integrates advanced CXR analysis tools with a multimodal large language model to dynamically handle complex medical queries without additional training. MedRAX, through its modular design and strong technological base,...
Comprehensive Introduction Agentic Object Detection is an advanced object detection tool from Landing AI. It greatly simplifies the traditional detection workflow by using text prompts to drive detection, with no data labeling or model training required. Users simply upload an image and enter a detection prompt, and the AI ...
General Introduction CogVLM2 is an open source multimodal model developed by the Tsinghua University Data Mining Research Group (THUDM), based on the Llama3-8B architecture and designed to deliver performance comparable to, or even better than, GPT-4V. The model supports image understanding, multi-turn dialogue, and video understanding, and can handle content up to 8K long...
Comprehensive Introduction Gaze-LLE is a gaze target prediction tool based on large-scale learned encoders. Developed by Fiona Ryan, Ajay Bati, Sangmin Lee, Daniel Bolya, Judy Hoffman, and James M. Rehg, it is designed to use pretrained visual foundation models (e.g., DINOv2) to realize ...
Comprehensive Introduction Video Analyzer is a comprehensive video analysis tool that combines computer vision, audio transcription, and natural language processing techniques to generate detailed video content descriptions. The tool does this by extracting key frames from the video, transcribing audio content, and generating natural language...
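The pipeline described above (sample key frames, transcribe the audio, then describe the result) can be illustrated with a generic sketch. This is not the project's own code: opencv-python and openai-whisper are used here as stand-ins, the sampling interval is arbitrary, and the file names are examples.

```python
# Generic illustration of a frame-sampling + transcription pipeline, not the project's code.
import cv2
import whisper

def sample_frames(video_path: str, every_n: int = 90):
    """Grab one frame every `every_n` frames as crude key-frame candidates."""
    cap = cv2.VideoCapture(video_path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames

frames = sample_frames("clip.mp4")                                   # hypothetical input video
transcript = whisper.load_model("base").transcribe("clip.mp4")["text"]

# A real implementation would now pass the frames and transcript to a vision/language
# model to produce the final natural-language description of the video.
print(f"{len(frames)} sampled frames, transcript preview: {transcript[:80]}")
```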
General Introduction Twelve Labs is a multimodal AI company focused on video understanding, dedicated to helping users understand and process large amounts of video content through advanced AI technologies. Its core technologies include video search, generation, and embeddings, which can extract key features from video such as actions, objects, on-screen text,...