Total 67 articles

Tags: document extraction and cleaning Page 4

Parsio: Automatically Extract Key Structured Data from PDFs, Emails and Other Documents

General Introduction: Parsio is an AI-based document and email data extraction tool that automatically extracts structured data from PDFs, emails and other documents. The platform provides a powerful PDF parser and OCR functionality, and supports a wide range of document types, including invoices, business cards and IDs...

2024-11-14 | AI tools | Document Extraction and Cleaning

Chonkie: a lightweight RAG text chunking library

Comprehensive Introduction: Chonkie is a lightweight and efficient RAG (Retrieval-Augmented Generation) text chunking library designed to help developers chunk text quickly and easily. The library supports a variety of chunking methods, including chunking based on tokens, words, sentences and semantic similarity...

2024-11-13 | AI tools | AI open source project | Document Extraction and Cleaning
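
To illustrate the word-based chunking mentioned in the Chonkie entry above, here is a minimal standard-library sketch. It does not use Chonkie's API; the function name `chunk_by_words` and the parameters `chunk_size` and `overlap` are hypothetical and chosen only for this illustration.

```python
def chunk_by_words(text: str, chunk_size: int = 5, overlap: int = 1) -> list[str]:
    """Split text into chunks of at most `chunk_size` words.

    Consecutive chunks share `overlap` words. This is an illustrative
    sketch only, not Chonkie's implementation.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    words = text.split()          # whitespace tokenization (an assumption)
    step = chunk_size - overlap   # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


if __name__ == "__main__":
    for chunk in chunk_by_words("a b c d e f g", chunk_size=3, overlap=1):
        print(chunk)
```

The overlap keeps some shared context between neighboring chunks, which is a common choice when chunks are later embedded and retrieved independently.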

TextIn: Universal Document Conversion, PDF to Markdown Tool

Comprehensive Introduction: TextIn is a professional PDF to Markdown tool designed to help users efficiently convert PDF documents to Markdown format. The tool supports a variety of file formats, is easy to operate, converts quickly, and retains the original PDF format and content, which makes document processing more efficient. Whether it is a ...

2024-11-07 | AI tools | Document Extraction and Cleaning

Text Extraction API (text-extract-api): visual extraction of text information, anonymized PDF extraction tool

General Description: Text Extraction API (text-extract-api) is a powerful tool designed to extract and parse content from a variety of document formats (e.g. PDF, Word, PPTX, etc.). The API uses state-of-the-art Optical Character Recognition (OCR) technology and Ollama-supported models so that it can take any document or image...

2024-11-05 | AI tools | AI open source project | OCR | Document Extraction and Cleaning

Datalab: dedicated OCR recognition AI model, PDF to Markdown (open source/API)

Comprehensive Introduction: Datalab offers a range of advanced AI models focused on OCR, layout analysis, PDF to Markdown, and more. These models are not only high performing, but also easy to use and open source. The Marker models on the platform can quickly and accurately convert PDF to Markdown, including tables...

2024-10-21 | AI tools | AI Open Services | AI open source project | OCR | Document Extraction and Cleaning

MinerU: PDF document extraction and conversion to multimodal Markdown format, with support for e-book OCR scanning

Comprehensive Introduction: MinerU is an open source data extraction tool developed by the OpenDataLab team at the Shanghai Artificial Intelligence Laboratory, focusing on efficiently extracting content from complex PDF documents, web pages, and eBooks. It can convert multimodal PDF documents containing images, formulas, tables and other elements into easy-to-analyze M...

2024-09-30 | AI tools | AI open source project | OCR | Document Extraction and Cleaning

Marker: an open source tool for quickly converting PDF to Markdown

General Introduction: Marker is a deep learning based document processing tool designed to convert PDF files to Markdown format quickly and accurately. It supports a wide range of document types and is especially optimized for conversion of books and scientific papers. Marker is able to remove redundant content such as headers and footers, format tables and...

2024-09-03 | AI tools | AI open source project | Document Extraction and Cleaning

Mathpix: structured conversion software for PDF and image documents, with multi-device support

General Description: Mathpix is a powerful AI-driven document automation tool designed for researchers, developers, and businesses. It quickly and accurately converts PDFs and images into searchable, exportable, and machine-readable text. Mathpix offers a wide range of features, including mathematical formula recognition, LaT...

2024-09-03 | AI tools | AI Open Services | Document Extraction and Cleaning

Unstructured: open source preprocessing for unstructured documents, a tool for unstructured data processing

Comprehensive Introduction: Unstructured-IO provides a range of open source components for processing and preprocessing images and text documents such as PDF, HTML, Word documents, etc. Its main goal is to simplify and optimize data processing workflows, especially in support of large language model (LLM) applications. Unstructured...

2024-09-01 | AI tools | AI open source project | Document Extraction and Cleaning

Reader API: Web page content extraction tool, HTML to Markdown format conversion

Comprehensive Introduction: Jina AI's Reader project is an open source tool (Reader open source address) that converts any URL into an input format suitable for Large Language Models (LLMs) by adding the prefix https://r.jina.ai/ to it, with support for dynamic streaming mode and image reading...

2024-08-10 | AI tools | AI open source project | Document Extraction and Cleaning
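
The URL-prefix convention described in the Reader entry above can be sketched as a small helper. The function name `to_reader_url` is hypothetical; the only detail taken from the entry is the `https://r.jina.ai/` prefix. The sketch only builds the URL and makes no network request.

```python
READER_PREFIX = "https://r.jina.ai/"  # prefix quoted in the Reader entry


def to_reader_url(url: str) -> str:
    """Return the Reader URL for a page by prepending the prefix.

    Hypothetical helper for illustration; it does not fetch anything.
    """
    url = url.strip()
    if not url:
        raise ValueError("url must not be empty")
    # Avoid prefixing a URL that already points at the Reader service.
    if url.startswith(READER_PREFIX):
        return url
    return READER_PREFIX + url


if __name__ == "__main__":
    print(to_reader_url("https://example.com/post"))
```

The resulting URL can then be opened in a browser or passed to any HTTP client to retrieve the converted content.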
