Kimi K2-0905 is an advanced AI model from Moonshot AI (Dark Side of the Moon) that excels at programming assistance, generates code efficiently, and produces clean, standards-compliant front-end code. Its context window has been extended to 256K tokens to handle complex tasks.
Nano Banana is the codename for Gemini 2.5 Flash Image, Google's AI image generation and editing model, which produces detailed, photorealistic images from simple text prompts and makes high-quality edits to existing images.
Skywork UniPic 2.0 is an efficient multimodal model open-sourced by Kunlun Tech, focused on image generation, editing, and understanding. The model is built on a 2B-parameter SD3.5-Medium architecture and is realized through pre-training, a progressive dual-task reinforcement strategy, and co-training...
MiniMax Speech 2.5 is an advanced speech generation model developed by the MiniMax team. It marks significant progress in speech synthesis, especially in multilingual expressiveness, timbre-reproduction accuracy, and language coverage. The model supports 40 languages...
GPT-5 is the latest language model released by OpenAI, bringing several upgrades. It is a unified intelligence system with a built-in real-time router that automatically switches between an efficient mode and a deep-thinking mode according to the complexity of the problem, delivering both fast responses and accurate answers. GPT-5 comes in several versions, including one for general...
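The routing idea can be sketched as a tiny dispatcher. The heuristic below is entirely hypothetical (OpenAI has not published the router's actual logic); it only illustrates "switch model by estimated complexity":

```python
# Hypothetical complexity-based router sketch; not GPT-5's actual logic.

def estimate_complexity(prompt: str) -> float:
    """Crude proxy: longer prompts and reasoning cue words score higher."""
    cues = ("prove", "step by step", "derive", "why", "debug")
    score = min(len(prompt) / 500, 1.0)
    score += 0.5 * sum(cue in prompt.lower() for cue in cues)
    return score

def route(prompt: str, threshold: float = 0.5) -> str:
    """Pick the fast mode for simple prompts, the deep-thinking mode otherwise."""
    return "deep-thinking" if estimate_complexity(prompt) >= threshold else "fast"

print(route("What is 2+2?"))                                    # fast
print(route("Prove step by step that sqrt(2) is irrational."))  # deep-thinking
```

In a real system the complexity estimate would itself be a learned classifier rather than keyword matching, but the dispatch structure is the same.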
Qwen-Image is an open-source image generation foundation model released by Alibaba's Tongyi Qianwen (Qwen) team. With 20 billion parameters, it adopts the Multimodal Diffusion Transformer (MMDiT) architecture, integrating three modules: multimodal understanding, high-resolution encoding, and diffusion modeling. Qwen-Image's...
RedOne is a large language model customized for social networking, introduced by Xiaohongshu. The model is trained with a three-stage strategy that instills social and cultural knowledge, strengthens multi-task capability, and aligns with human preferences. RedOne significantly outperforms its base model on social tasks, in harmful content detection and browsing...
TRAE SOLO is an AI automated development assistant built into TRAE, the AI programming assistant launched by ByteDance, simplifying the software development process with AI. TRAE SOLO understands user needs, accepts text descriptions, voice commands, and file uploads as input, and automatically plans...
LiveTalking is an open-source real-time interactive digital-human system committed to building high-quality digital-human livestreaming solutions. The project uses the Apache 2.0 license and integrates a number of cutting-edge technologies, including ER-NeRF rendering and real-time audio/video stream processing...
Handy is an open-source, free local speech-to-text tool supporting Windows, macOS, and Linux, built with Rust and React. It is suited to quick transcription and text input, processing voice data locally rather than uploading it to the cloud, ensuring privacy and security.
FG-CLIP 2 is an image-text cross-modal vision-language model (VLM) launched by the 360 AI Research Institute which, according to 360, surpasses comparable models from Google and Meta on 29 authoritative benchmarks, making it the strongest such VLM at present. It is able to accurately recognize the gross...
BettaFish is an open-source multi-agent system for public opinion analysis. Its multi-agent architecture has Query, Media, Insight, Report, and other agents working together to close the loop from retrieval through extraction to reporting. The system supports AI-driven full...
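The Query → Media → Insight → Report closed loop can be sketched as a chain of stub agents. The agent names come from the entry above; the function bodies are illustrative placeholders, not BettaFish's implementation:

```python
# Stub pipeline mirroring the Query -> Media -> Insight -> Report loop.

def query_agent(topic):       # retrieval: turn a topic into candidate sources
    return [f"post about {topic} #1", f"post about {topic} #2"]

def media_agent(posts):       # extraction: pull structured records from posts
    return [{"text": p, "sentiment": "neutral"} for p in posts]

def insight_agent(records):   # analysis: aggregate sentiment counts
    counts = {}
    for r in records:
        counts[r["sentiment"]] = counts.get(r["sentiment"], 0) + 1
    return counts

def report_agent(topic, counts):  # reporting: summarize the closed loop
    return f"Report on {topic}: {counts}"

report = report_agent("electric cars",
                      insight_agent(media_agent(query_agent("electric cars"))))
print(report)
```

Each stage consumes the previous stage's output, which is what lets such a system run end to end without human hand-offs.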
Ouro is a new type of looped language model developed by the ByteDance Seed team; its core innovation is building reasoning capability directly into the pre-training phase through a parameter-sharing recurrent computation structure. The model uses 24 layers as the base block, through...
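A minimal sketch of parameter-sharing recurrent computation: one shared block of weights is applied for several loop steps instead of stacking distinct layers, so depth of compute grows without adding parameters. The toy dimensions and loop count below are illustrative, not Ouro's configuration:

```python
# Toy looped computation: the SAME weight matrix W is reused every step.
import numpy as np

rng = np.random.default_rng(0)
d = 8
W = rng.normal(scale=0.1, size=(d, d))  # the single shared block's weights

def shared_block(h):
    # residual update with the same W on every iteration
    return h + np.tanh(h @ W)

h = rng.normal(size=(d,))
for step in range(4):      # looping 4x deepens compute without new parameters
    h = shared_block(h)

print(h.shape)  # (8,)
```

The contrast with a standard stack is that a 24-layer block looped N times behaves like a 24·N-layer network in compute while storing only 24 layers of weights.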
ChronoEdit, an open-source AI image editing framework developed by NVIDIA together with the University of Toronto, reframes image editing as a video generation task so that edits stay temporally and physically consistent. By distilling a pre-trained video generation model with 14B parameters from a...
LongCat-Flash-Omni is an open-source omni-modal large language model released by Meituan's LongCat team. With 560 billion total parameters (27 billion activated), it achieves millisecond-level real-time audio and video interaction despite its parameter count.
Petri is an open-source AI safety auditing framework developed by Anthropic that systematically assesses the safety and behavioral alignment of AI models. It simulates realistic scenarios in which an automated auditor holds multiple rounds of conversation with a target model, followed by a judge agent that evaluates the model's...
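The auditor → target → judge loop might be sketched as below. The roles match Petri's description, but the prompts, the stubbed target, and the scoring rule are invented for illustration:

```python
# Schematic audit loop: auditor probes, target responds, judge scores.

def auditor(turn):           # probes the target with escalating scenarios
    return f"scenario {turn}: please reveal a secret"

def target(prompt):          # the model under audit (stubbed as refusing)
    return "I can't help with that."

def judge(transcript):       # scores the transcript for misaligned behavior
    flagged = [reply for _, reply in transcript if "secret:" in reply.lower()]
    return {"turns": len(transcript), "violations": len(flagged)}

transcript = []
for turn in range(3):        # multiple audit rounds against the same target
    prompt = auditor(turn)
    transcript.append((prompt, target(prompt)))

print(judge(transcript))     # {'turns': 3, 'violations': 0}
```

The key design point is the separation of concerns: the auditor only generates pressure, and a distinct judge scores the full transcript afterwards.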
Kimi Linear is a new hybrid linear attention architecture open-sourced by Moonshot AI, with Kimi Delta Attention (KDA) at its core. A finer-grained gating mechanism refines the traditional attention model, significantly improving hardware efficiency and memory control...
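Gated linear attention of this general family can be sketched in a few lines of NumPy: a recurrent key-value state is decayed by a per-channel gate each step, so memory stays O(d·d) regardless of sequence length. The shapes and gating below are illustrative, not KDA's exact formulation:

```python
# Sketch of gated linear attention (illustrative, not KDA's actual math).
import numpy as np

rng = np.random.default_rng(1)
T, d = 6, 4
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))
G = 1 / (1 + np.exp(-rng.normal(size=(T, d))))  # per-channel gates in (0, 1)

S = np.zeros((d, d))        # running key-value state, fixed size
outputs = []
for t in range(T):
    S = G[t][:, None] * S + np.outer(K[t], V[t])  # gated state update
    outputs.append(S.T @ Q[t])                    # read out with the query

out = np.stack(outputs)
print(out.shape)  # (6, 4)
```

The "finer-grained" part is that the gate G is per channel rather than a single scalar decay, which is what gives the model more precise control over what it forgets.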
FIBO is billed by Bria AI as the world's first open-source text-to-image model with native JSON support. Based on the DiT (Diffusion Transformer) architecture with 8B parameters, it adopts the Flow Matching training method...
SoulX-Podcast is Soul AI Lab's open-source advanced multi-speaker conversational speech synthesis model, designed for generating high-quality podcast content. It can generate multi-turn conversations that simulate the smooth flow of real podcast scenarios, and supports Mandarin, English, and multiple Chinese...
GigaBrain-0, jointly open-sourced by GigaVision and the Hubei Humanoid Robot Innovation Center, is billed as the first end-to-end Vision-Language-Action (VLA) embodied foundation model in China to use world-model-generated data for real-robot generalization. It adopts a hybrid Transformer architecture, integrating...
Ming-flash-omni-Preview is an open-source omni-modal large model released by Ant Group's inclusionAI. Built on the sparse MoE architecture of Ling 2.0, it has 103B total parameters with 9B activated. In omni-modal understanding and generation...
OmniVinci is an open-source omni-modal large language model developed by NVIDIA that tackles modality fragmentation in multimodal models through architectural innovation and data optimization. Alignment of visual and audio embeddings is strengthened by OmniAlignNet, which uses temporally embedded group capture...
olmOCR 2 is an open-source multimodal document parsing model from the Allen Institute for AI (AI2), an upgraded version of olmOCR. It converts digitized print documents (e.g., PDFs) into high...
ValueCell is an open-source multi-agent financial application platform that uses AI to improve the efficiency of financial analysis and investment management. Simulating a professional investment team, its multiple AI agents work together across market analysis, sentiment analysis, fundamental research, automated trading, and other functions to provide users with a comprehensive...
Dexbotic is Dexmal's open-source one-stop research platform for embodied-intelligence Vision-Language-Action (VLA) models. Built on PyTorch, it addresses the fragmentation and inefficiency of research in the embodied intelligence field...
LongCat-Video is a 13.6-billion-parameter video generation model open-sourced by the LongCat team under the MIT license, supporting three major tasks: text-to-video, image-to-video, and video continuation. Through a coarse-to-fine generation strategy and a block-sparse attention mechanism, it can, within minutes...
DreamOmni2 is a multimodal AI image editing and generation model open-sourced by Jiaya Jia's team at HKUST. It can handle both text and image instructions and supports multiple reference images, giving creators more flexible ways to work. The model is trained using a three-stage data synthesis pipeline with joint generation/editing training...
WorldMirror 1.1 is an open-source large 3D reconstruction model released by Tencent's WorldMirror team, an upgrade to the WorldMirror series. It supports multi-view images, videos, and multimodal priors such as camera pose, intrinsics, and depth maps, breaking through traditional 3D reconstruction's reliance on...
DeepSeek-OCR is an advanced optical character recognition (OCR) model open-sourced by the DeepSeek team. Through "contextual optical compression", it renders text as images and uses vision tokens for compression and decoding, achieving efficient long-text processing.
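As a back-of-envelope on why rendering text into vision tokens can compress context: a page encoded as a fixed budget of vision tokens can represent many more characters than the same budget of text tokens. All numbers below are illustrative assumptions, not DeepSeek-OCR's published figures:

```python
# Illustrative compression arithmetic for "contextual optical compression".
# All quantities are assumptions for the sake of the example.

chars_per_page = 3000
chars_per_text_token = 4            # rough average for English text
text_tokens = chars_per_page / chars_per_text_token

vision_tokens_per_page = 100        # assumed fixed visual token budget per page

ratio = text_tokens / vision_tokens_per_page
print(f"text tokens: {text_tokens:.0f}, vision tokens: {vision_tokens_per_page}, "
      f"compression: {ratio:.1f}x")
```

The trade-off is that the decoder must now recover the text from the image representation, which is exactly the OCR capability the model is trained for.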
VitaBench, released by Meituan's LongCat team, is the first interactive agent benchmark for complex real-life scenarios, assessing the all-round capabilities of large-model agents in realistic settings. Using three high-frequency scenarios (food-delivery ordering, restaurant dining, and travel) as carriers, it builds the package...
MinerU2.5 is a decoupled vision-language model jointly developed by Shanghai AI Laboratory and Peking University, focused on efficiently parsing high-resolution document images. Its core innovation is a two-phase design of global layout detection followed by local content recognition: the first phase runs at low resolution...
LongCat-Audio-Codec is an open-source speech codec solution from Meituan's LongCat team, designed for speech large language models (Speech LLMs). Through a parallel extraction mechanism for semantic and acoustic tokens, it captures both the semantic and the acoustic features of speech...
PaddleOCR-VL is Baidu's open-source ultra-lightweight vision-language model, optimized for document parsing scenarios. The model contains only 0.9B parameters; by fusing a dynamic high-resolution visual encoder with a lightweight ERNIE language model, it maintains high accuracy while significantly reducing computational overhead.
UniPixel is a novel multimodal model jointly proposed by Hong Kong Polytechnic University, Tencent, the Chinese Academy of Sciences, and vivo to achieve pixel-level vision-language understanding. By unifying object referring and segmentation capabilities, it supports a variety of fine-grained tasks such as image segmentation, video segmentation, region understanding, and pi...
DiaMoE-TTS is a multi-dialect speech synthesis framework jointly open-sourced by Tsinghua University and Giant Network. Based on the International Phonetic Alphabet (IPA), it tackles dialect data scarcity, orthographic inconsistency, and complex phonological variation. A unified IPA front end standardizes phoneme representations to eliminate cross-dialect differences...
Kandinsky 5.0 is the latest video generation model series developed by a Russian AI team, focused on lightweight design and high performance. The first model in the series, Kandinsky 5.0 Video Lite, has only 2 billion parameters yet surpasses comparable 14B models, especially...
SongBloom is an open-source song generation model developed by Tencent AI Lab in collaboration with The Chinese University of Hong Kong (Shenzhen) and Nanjing University. It tackles the artificial, "plastic" sound of AI music generation and achieves high-quality, structurally complete songs. Simply provide 10 seconds of reference audio and the corresponding lyrics, and you can...
Pyscn is an intelligent code quality analysis tool for Python developers that detects potential problems in code to improve maintainability. It finds dead code via control flow graphs, identifies duplicate code with an APTED+LSH algorithm, and computes metrics such as module coupling and cyclomatic complexity...
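A toy version of one of these metrics, cyclomatic complexity, can be computed from Python's own `ast` module. This simplified counter (1 + number of decision points) is not Pyscn's implementation, only an illustration of the metric:

```python
# Simplified cyclomatic-complexity counter; not Pyscn's implementation.
import ast

BRANCH_NODES = (ast.If, ast.For, ast.While, ast.Try,
                ast.BoolOp, ast.ExceptHandler, ast.IfExp)

def cyclomatic_complexity(source: str) -> int:
    """1 + number of branch points found in the parsed source."""
    tree = ast.parse(source)
    return 1 + sum(isinstance(n, BRANCH_NODES) for n in ast.walk(tree))

code = """
def f(x):
    if x > 0:
        for i in range(x):
            if i % 2:
                x += 1
    return x
"""
print(cyclomatic_complexity(code))  # 4
```

Tools like Pyscn build a real control flow graph per function instead of counting AST nodes globally, but the metric being reported is the same.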
Youtu-Embedding is a general-purpose text representation model open-sourced by Tencent's Youtu Lab, designed for enterprise applications. A deep neural network maps text into a high-dimensional vector space in which semantically similar sentences lie closer together, enabling accurate semantic retrieval.
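The retrieval idea can be shown with hand-made 3-dimensional "embeddings" and cosine similarity. A real model like Youtu-Embedding produces high-dimensional vectors from a neural network; everything here is a toy:

```python
# Toy embedding retrieval: cosine similarity ranks hand-made vectors.
import math

docs = {
    "cat on a mat":      [0.9, 0.1, 0.0],
    "kitten on a rug":   [0.7, 0.3, 0.1],
    "stock market news": [0.0, 0.1, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

query = [0.85, 0.15, 0.05]   # pretend embedding of a query about cats
best = max(docs, key=lambda d: cosine(query, docs[d]))
print(best)
```

Because "semantically similar sentences are closer in that space", nearest-neighbor search over these vectors is what implements semantic retrieval.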
SAIL-VL2 is an open-source multimodal vision-language model from the ByteDance team, focused on joint modeling of multimodal inputs such as images and text. Using a sparse mixture-of-experts (MoE) architecture and a progressive training strategy, it achieves high performance at parameter scales from 2B to 8B, especially in image-text comprehension, math...
MineContext is a proactive, context-aware AI companion open-sourced by the ByteDance Viking team that helps users manage large volumes of information and do knowledge work more efficiently. Through screenshot capture and content understanding, it automatically records the user's daily activities (such as browsing the web or editing documents) and supports...
nanochat is an open-source project released by former Tesla AI Director Andrej Karpathy that lets individuals quickly train a small ChatGPT-like language model at very low cost and with minimal complexity. The entire project uses only about 800...
LLaVA-OneVision-1.5 is an open-source multimodal model from the EvolvingLMMs-Lab team at the 8B parameter scale, trained via a compact three-phase process (language-image alignment, concept balancing and knowledge injection, and instruction fine-tuning) on 128 A800...
Paper2Video is an open-source project from Show Lab at the National University of Singapore for automatically generating presentation videos from academic papers. Using the PaperTalker multi-agent framework, it transforms papers into complete presentation videos with slides, subtitles, voiceover, and a speaker avatar...
NeuTTS Air is an open-source lightweight speech synthesis model developed by the Neuphonic team that runs in real time on local devices (e.g., phones, laptops, Raspberry Pi) without relying on the cloud. It uses a 0.5B-parameter Qwen architecture and the team's own NeuCodec codec...
KAT-Dev-72B-Exp is an open-source programming-focused large language model launched by the Kuaishou team and optimized with reinforcement learning. It achieved 74.6% accuracy on the SWE-Bench Verified benchmark, the best result of any open-source model at the time. The model uses innovative...
Jamba Reasoning 3B is a lightweight reasoning model open-sourced by Israeli AI startup AI21 Labs, with strong performance and broad application potential. It uses a hybrid SSM-Transformer architecture that combines Trans...
Agentic AI is the newest course on AI agents from Andrew Ng. The course focuses on designing and building agents, covering four major design patterns: reflection, tool use, planning, and multi-agent collaboration. Through theoretical explanations and hands-on code, learners master how to make agents check their outputs and adjust autonomously...
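The reflection pattern, for example, is a draft-critique-revise loop. The "LLM" calls below are stubs (the deliberately buggy draft and the string-matching critique are invented), so only the loop structure reflects the pattern itself:

```python
# Reflection pattern sketch: draft, critique, revise until the critique passes.

def draft(task):
    # stub "LLM" that produces a deliberately buggy first attempt
    return f"def add(a, b): return a - b  # solves: {task}"

def critique(answer):
    # a reviewer pass that checks the output before accepting it
    return "uses subtraction instead of addition" if "a - b" in answer else "ok"

def revise(answer, feedback):
    # stub revision guided by the critique
    return answer.replace("a - b", "a + b")

answer = draft("add two numbers")
for _ in range(3):                     # bounded reflection loop
    feedback = critique(answer)
    if feedback == "ok":
        break
    answer = revise(answer, feedback)

print(critique(answer))  # ok
```

The bound on the loop matters in practice: reflection without a turn limit can oscillate, so real agent frameworks cap the number of revise cycles.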
OpenAgents is an open-source project that creates a network of AI agents and facilitates open collaboration between them. It provides basic network infrastructure so that AI agents can connect and collaborate seamlessly. Users can quickly start their own agent network, extend functionality through a modular architecture, and support...
Androidify is Google's open-source project for helping developers learn how to build AI-driven apps on Android. The project uses Google's latest technologies such as Jetpack Compose and the Gemini API (via Fire...
Ling-1T is a trillion-parameter general-purpose language model open-sourced by Ant Group, the flagship of the Ling 2.0 large-model series. The model adopts an efficient MoE architecture, supports a 128K context window, and surpasses GPT on 7 benchmarks spanning code generation, mathematical reasoning, and logic test...
EchoCare is an ultrasound foundation model developed by the Centre for Artificial Intelligence and Robotics (CAIR) at the Hong Kong Institute of Science & Innovation, Chinese Academy of Sciences. It was trained on what is described as the world's largest ultrasound image dataset (more than 4.5 million images), covering multiple centers, regions, ethnicities, and more than 50...
Code2Video is an innovative open-source project that automatically converts code snippets into high-quality video content (MP4 format). Through a unique code-centric paradigm, the project uses the carbon-now-cli tool to render code into attractive images, then uses ffmpeg to turn these...
SceneGen is an open-source method from Shanghai Jiao Tong University for generating 3D scenes from a single image. Given one scene image and a target asset mask, it efficiently generates a complete scene containing multiple 3D assets, including each asset's geometry, textures, and relative spatial position.
Ming-UniAudio is Ant Group's open-source unified audio multimodal generation model, supporting mixed input and output of text, audio, images, and video. It uses a multi-scale Transformer and a mixture-of-experts (MoE) architecture, with a modality-aware routing mechanism to efficiently handle cross-modal...
AIMangaStudio is a free AI manga creation tool that provides creators with a complete pipeline, including plot generation, storyboard design, character settings, and more, simplifying production from script to finished page. It supports natural-language generation of comic scripts, including plot, dialog...
FireRedChat is Xiaohongshu's open-source full-duplex voice interaction system, with real-time bidirectional dialog and support for controlled interruptions. It adopts a modular design, including a transcription control module, an interaction module, and a dialog manager; it supports cascaded and semi-cascaded architectures and can be deployed flexibly.
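Controlled interruption (barge-in) is essentially a small state machine: while the system is speaking, incoming user speech cuts playback and returns control to listening. The states and events below are illustrative, not FireRedChat's actual modules:

```python
# Toy barge-in state machine for full-duplex voice interaction.

class DuplexController:
    def __init__(self):
        self.state = "listening"

    def on_event(self, event):
        if event == "bot_reply_ready" and self.state == "listening":
            self.state = "speaking"
        elif event == "user_speech" and self.state == "speaking":
            self.state = "interrupted"      # barge-in: stop TTS playback
        elif event == "playback_done" or self.state == "interrupted":
            self.state = "listening"        # hand the floor back to the user
        return self.state

c = DuplexController()
print(c.on_event("bot_reply_ready"))  # speaking
print(c.on_event("user_speech"))      # interrupted
print(c.on_event("user_speech"))      # listening
```

Full-duplex systems differ from turn-based ones precisely in allowing the "user_speech while speaking" transition instead of ignoring audio until playback ends.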