AI Personal Learning
and practical guidance
讯飞绘镜
Total 60 articles

Tags: document extraction and cleaning Page 3

Outlines:通过正则表达式、JSON或Pydantic模型生成结构化文本输出-首席AI分享圈

Outlines: Generate structured text output via regular expressions, JSON or Pydantic models

Comprehensive Introduction Outlines is an open source library developed by dottxt-ai to enhance the application of Large Language Models (LLMs) through structured text generation. The library supports a wide range of model integrations, including OpenAI, transformers, llama.cpp, etc. It provides simple but powerful cue primitives,...

Chunkr:使用视觉模型进行文档摄取以及根据文本段落层级智能分块的一体化服务-首席AI分享圈

Chunkr: An All-in-One Service for Document Ingestion and Intelligent Chunking Based on Text Paragraph Hierarchy Using Visual Models

Comprehensive Introduction Chunkr is a self-hosted API specialized in converting PDF, PPTX, DOCX, and Excel files into data suitable for use in RAG (Retrieval Augmented Generation) and LLM (Large Language Modeling). It was developed by Lumina AI Inc. and utilizes advanced visual models for document ingest...

MegaParse:解析各类型文档为LLM可用数据,完整保留文档中的表格、图片等所有信息-首席AI分享圈

MegaParse: parses all types of documents into LLM-available data, preserving all information in the document such as tables, pictures, etc. in its entirety

Comprehensive Introduction MegaParse is a powerful and versatile document parsing tool designed to optimize data processing for the Large Language Model (LLM). Whether you are working with text, PDF, PowerPoint presentations or Word documents, MegaParse makes it easy and ensures that the parsing process is not...

ViTLP:排版复杂PDF文档提取结构化数据,视觉引导生成文本布局预训练模型-首席AI分享圈

ViTLP: Extracting Structured Data from Typographically Complex PDF Documents and Visually Guided Generation of Text Layout Pre-training Models

Comprehensive Introduction ViTLP (Visually Guided Generative Text-Layout Pre-training for Document Intelligence) is an open source project that aims to enhance document intelligence processing through visually guided generative text layout pre-training models. The project was developed by Veason-silverbul...

pdf2htmlEX:PDF无损转换为HTML,保持文本格式,适用于学术论文和杂志排版-首席AI分享圈

pdf2htmlEX: PDF lossless conversion to HTML, maintaining text formatting, suitable for academic papers and magazine layout

Comprehensive introduction pdf2htmlEX is an open source tool designed to convert PDF files to HTML format , by analyzing the content of PDF files and use HTML + CSS to accurately restore its visual effect , PDF documents into a browser can be viewed directly on the web page . The tool is particularly suitable for containing a large number of ...

Maxun:开源无代码平台,自动抓取网页数据并转换为API或电子表格-首席AI分享圈

Maxun: open source no-code platform that automatically crawls web data and converts it to APIs or spreadsheets

Comprehensive Introduction Maxun is an open source no-code web data extraction platform that allows users to train robots in minutes to automatically crawl web data and convert it into APIs or spreadsheets. The platform supports paging and scrolling, can adapt to changes in website layout, provides powerful data crawling features for...

OmniParse:从文档/多媒体中提取任何非结构化数据解析为结构化数据-首席AI分享圈

OmniParse: extract any unstructured data from documents/multimedia and parse it into structured data

Comprehensive Introduction OmniParse is a powerful data parsing and optimization platform designed to transform any unstructured data into structured, actionable data optimized for the GenAI (Generative Artificial Intelligence) framework. Whether you are working with documents, tables, images, videos, audio files or web content,...

文本提取API(text-extract-api):视觉提取文本信息,匿名化的PDF提取工具-首席AI分享圈

Text Extraction API (text-extract-api): visual extraction of text information, anonymized PDF extraction tool

General Description Text Extraction API (text-extract-api) is a powerful tool designed to extract and parse content from a variety of document formats (e.g. PDF, Word, PPTX, etc.). The API utilizes state-of-the-art Optical Character Recognition (OCR) technology and Ollama-supported models to be able to take any document or image...

MinerU:PDF文档提取转换为多模态Markdown格式,支持电子书OCR扫描-首席AI分享圈

MinerU: PDF document extraction and conversion to multimodal Markdown format, support e-book OCR scanning

Comprehensive Introduction MinerU is an open source data extraction tool developed by the OpenDataLab team at the Shanghai Artificial Intelligence Laboratory, focusing on efficiently extracting content from complex PDF documents, web pages, and eBooks. It can convert multimodal PDF documents containing images, formulas, tables and other elements into easy-to-analyze M...

en_USEnglish