
Total 66 articles

Tag: Document Extraction and Cleaning (page 3)

llms.txt Generator: Quickly crawls website content and generates text datasets for LLM training

llmstxt-generator is a web content extraction and consolidation tool for preparing high-quality text datasets for training and inference with Large Language Models (LLMs). Developed by Mendable AI, the tool uses web crawling technology provided by @firecrawl_dev and GPT-4o-mini ...

2025-01-05 | AI tools, AI open source project, Document Extraction and Cleaning

Doc2X: Document, image, and formula recognition and conversion tool with multi-format conversion and high-precision translation

Doc2X is a document, image, and formula recognition and conversion tool that aims to provide efficient, intelligent document processing. Whether the input is an academic paper, a textbook, a corporate document, or a financial report, Doc2X can accurately recognize the tables and formulas in PDFs and convert them with one click...

2025-01-02 | AI tools, AI Open Services, AI translation, Document Extraction and Cleaning


ExtractThinker: Extracts and classifies documents into structured data to streamline document processing

ExtractThinker is a flexible document intelligence tool that uses Large Language Models (LLMs) to extract and classify structured data from documents, providing a seamless ORM-like document processing workflow. It supports multiple document loaders, including Tesseract OCR, Azure Form Recog...

2025-01-02 | AI tools, AI open source project, Document Extraction and Cleaning

HtmlRAG: Building an efficient HTML retrieval-augmented generation system and improving HTML document retrieval and processing in RAG systems

HtmlRAG is an open source project focused on improving the processing of HTML documents in Retrieval-Augmented Generation (RAG) systems. The project argues that using HTML in RAG systems is more effective than using plain text. It encompasses a complete data processing flow from the cha...

2025-01-02 | AI tools, Document Extraction and Cleaning, Knowledge Retrieval and the RAG Framework

ScrapeGraphAI: Web scraping with a single prompt, an intelligent web content extraction tool that needs no hand-written rules

ScrapeGraphAI is a Python web scraping library that combines Large Language Models (LLMs) and directed graph logic to create scraping pipelines for websites and local documents. Its distinguishing feature is its balance of simplicity and power: the user simply describes what they want to mention...

2025-01-01 | AI tools, AI open source project, Document Extraction and Cleaning

Vision Parse: Intelligent conversion of PDF documents to Markdown using vision language models

Vision Parse is a document processing tool that uses state-of-the-art Vision Language Models to convert PDF documents into high-quality Markdown content. The tool supports a wide range of leading vision language models, including o...

2024-12-26 | AI tools, AI open source project, Document Extraction and Cleaning

Outlines: Generate structured text output via regular expressions, JSON, or Pydantic models

Outlines is an open source library developed by dottxt-ai that makes Large Language Models (LLMs) more useful in applications through structured text generation. The library supports a wide range of model integrations, including OpenAI, transformers, llama.cpp, etc. It provides simple but powerful prompting primitives,...

2024-12-19 | AI tools, AI open source project, Document Extraction and Cleaning

MarkItDown: Microsoft's document conversion tool for turning various files into Markdown

MarkItDown is a Python tool developed by Microsoft for converting various files and office documents to Markdown. The tool supports a wide range of file types, including PDF, PowerPoint, Word, Excel, images (EXIF metadata and OCR), audio (EXIF metadata and language...

2024-12-14 | AI tools, AI open source project, Document Extraction and Cleaning

Chunkr: An all-in-one service for document ingestion with vision models and intelligent chunking based on paragraph hierarchy

Chunkr is a self-hosted API that converts PDF, PPTX, DOCX, and Excel files into data suitable for RAG (Retrieval-Augmented Generation) and LLMs (Large Language Models). It was developed by Lumina AI Inc. and utilizes advanced visual models for document ingest...

2024-12-13 | AI tools, AI open source project, OCR, Document Extraction and Cleaning

GitIngest: Quickly convert GitHub code repositories into text suitable for LLMs

GitIngest is an open source tool that transforms GitHub code repositories into text suitable for Large Language Model (LLM) prompts. With a simple operation, users can extract and format the content of any GitHub repository into text suitable for LLM use. The tool provides one-click analysis...

2024-12-12 | AI tools, AI open source project, Document Extraction and Cleaning

E2M: Convert multiple file formats to Markdown to unify document formats

E2M (Everything to Markdown) is an open source Python library that converts multiple file formats to Markdown. The tool supports a wide range of file types including doc, docx, epub, html, htm, url, pdf, ppt, pptx, mp3 and m4a. E2M uses...

2024-12-11 | AI tools, AI open source project, Document Extraction and Cleaning

Docling: Parses documents in many formats and exports them as Markdown and JSON, with OCR support for PDFs

Docling is a document parsing and exporting tool that supports a wide range of document formats including PDF, DOCX, PPTX, XLSX, Image, HTML, AsciiDoc and Markdown. It parses and exports these documents to HTML, Markdown, and JSON formats, with support for embedding and...

2024-12-09 | AI tools, AI open source project, OCR, Document Extraction and Cleaning

MegaParse: Parses all types of documents into LLM-ready data while fully preserving tables, images, and other information

MegaParse is a versatile document parsing tool designed to prepare data for Large Language Models (LLMs). Whether you are working with text, PDF, PowerPoint presentations or Word documents, MegaParse makes it easy and ensures that the parsing process is not...

2024-12-04 | AI tools, AI open source project, Document Extraction and Cleaning

ViTLP: Extracting structured data from PDFs with complex layouts using a visually guided generative text-layout pre-trained model

ViTLP (Visually Guided Generative Text-Layout Pre-training for Document Intelligence) is an open source project that aims to improve document intelligence through a visually guided generative text-layout pre-trained model. The project was developed by Veason-silverbul...

2024-12-03 | AI tools, OCR, Document Extraction and Cleaning

Trieve: a full-service RAG cloud infrastructure for search, recommendations and analytics

Trieve is an all-in-one infrastructure developed by Devflow, Inc. for search, recommendations, RAG (retrieval-augmented generation) and analytics. The platform is served via an API, supports self-hosting, and is available for environments such as AWS, GCP, Kubernetes, and Docker Compose....

2024-12-03 | AI tools, AI Open Services, Document Extraction and Cleaning

pdf2htmlEX: Lossless PDF-to-HTML conversion that preserves text formatting, suitable for academic papers and magazine layouts

pdf2htmlEX is an open source tool that converts PDF files to HTML. It analyzes the content of a PDF and uses HTML and CSS to reproduce its visual appearance accurately, so the document can be viewed directly in a browser. The tool is particularly suited to PDFs containing a large number of ...

2024-11-26 | AI tools, AI open source project, Document Extraction and Cleaning

Maxun: Open source no-code platform that automatically scrapes web data and converts it into APIs or spreadsheets

Maxun is an open source no-code web data extraction platform that lets users train robots in minutes to automatically scrape web data and convert it into APIs or spreadsheets. The platform supports pagination and scrolling, can adapt to changes in website layout, provides powerful data crawling features for...

2024-11-22 | AI tools, AI open source project, Document Extraction and Cleaning

OmniParse: Extract any unstructured data from documents and multimedia and parse it into structured data

OmniParse is a data parsing and optimization platform designed to transform any unstructured data into structured, actionable data optimized for GenAI (generative AI) frameworks. Whether you are working with documents, tables, images, videos, audio files or web content,...

2024-11-15 | AI tools, AI open source project, Document Extraction and Cleaning

Parsio: Automatically extract key structured data from PDFs, emails, and other documents

Parsio is an AI-based document and email data extraction tool that automatically extracts structured data from PDFs, emails and other documents. The platform provides a PDF parser and OCR functionality, and supports a wide range of document types, including invoices, business cards and IDs...

2024-11-14 | AI tools, Document Extraction and Cleaning
