Document Extraction and Cleaning

67 posts in total
An Open Source Service That Automatically Parses PDF Content and Extracts Text and Tables

Comprehensive Introduction: The service automatically analyzes the layout of PDF documents, identifies text, titles, images, tables, formulas, and other elements on each page, and determines their correct reading order. It supports OCR and can convert scanned PDFs into searchable text. It runs on Docker and provides two models...
4mos ago
Cloudsquid: Upload Documents and Describe Your Requirements to Extract Structured Data Intelligently

General Introduction: Cloudsquid is a company founded in 2023 in Berlin, Germany, focused on simplifying document processing with artificial intelligence. Its core product is an online data extraction platform that lets users upload documents such as PDFs, images, audio, and video and simply state that they need to extract...
5mos ago
PaddlePaddle PP-TableMagic: Structured Information Extraction for Complex Tables

The goal of table recognition is to parse tables in images, accurately identify table structure and cell locations, and convert them into structured table formats (e.g., HTML). Even in today's information age, a large amount of important tabular data still exists in unstructured form (e.g., scanned documents with pictures of statistical tables...).
5mos ago
Mistral OCR: 94.89% Overall Accuracy, 1,000 Pages in 30 Seconds, for Only $1

In the long history of human civilization, every leap in the way information is acquired and parsed has profoundly driven social progress. From ancient hieroglyphs to portable papyrus, to the later emergence of the printing press and today's wave of digitization, each technological innovation has greatly expanded the paradigm of human knowledge dissemination...
5mos ago
Crawl4LLM: An Efficient Web Crawling Tool for LLM Pretraining

Comprehensive Introduction: Crawl4LLM is an open source project jointly developed by Tsinghua University and Carnegie Mellon University that focuses on making web crawling more efficient for pre-training large language models (LLMs). It significantly reduces ineffective crawling by intelligently selecting high-quality web pages, claiming that work which originally required crawling 1...
6mos ago
zChunk: A General-Purpose Semantic Chunking Strategy Based on Llama-70B

Comprehensive Introduction: zChunk is a novel chunking strategy developed by ZeroEntropy that aims to provide a general-purpose solution for semantic chunking. The strategy is built on the Llama-70B model, which is prompted to generate the chunks, optimizing the document chunking process and ensuring that information retrieval is maintained at a high...
6mos ago
Pulse: A Commercial Solution for Document Processing and Data Extraction

Comprehensive Introduction: Pulse is an intelligent platform focused on document processing and data extraction, designed to help organizations and developers efficiently parse and process a wide range of complex documents. Using computer vision and multimodal processing, Pulse can accurately extract data from text, images, tables, and many other...
6mos ago
Rowfill: Batch Extraction of Structured Information from Documents with Automated Analysis

General Introduction: Rowfill is an open source document processing platform designed for knowledge workers. It uses artificial intelligence techniques to extract, analyze, and process data from complex documents, images, and PDFs. Rowfill supports native large language models (LLMs) and Ope...
6mos ago
UnDatas.IO: An API Service for Accurate Parsing of Unstructured Data (Paid)

Comprehensive Introduction: UnDatas.IO is a platform focused on parsing and processing unstructured data. It automatically recognizes document layouts and categorizes tables, images, formulas, and text, greatly simplifying data processing. The platform not only saves a lot of time in organizing data...
7mos ago
Doc2X: A Document, Image, and Formula Recognition and Conversion Tool with Multi-Format Conversion and High-Precision Translation

Comprehensive Introduction: Doc2X is a powerful tool for recognizing and converting documents, images, and formulas, committed to providing efficient and intelligent document processing. Whether it is an academic research paper, a textbook, a corporate document, or a financial report, Doc2X can accurately recognize PDF tables and...
6mos ago
ExtractThinker: Extract and Classify Documents into Structured Data to Streamline Document Processing

Comprehensive Introduction: ExtractThinker is a flexible document intelligence tool that uses Large Language Models (LLMs) to extract and classify structured data from documents, providing a seamless ORM-like document processing workflow. It supports a variety of document loaders, including Tess...
7mos ago
HtmlRAG: Building an Efficient HTML Retrieval-Augmented Generation System and Optimizing HTML Document Retrieval and Processing in RAG Systems

Comprehensive Introduction: HtmlRAG is an open source project focused on improving the processing of HTML documents in Retrieval-Augmented Generation (RAG) systems. The project presents a novel approach, arguing that using HTML formatting in RAG systems is more efficient than using plain text. The project contains a complete ...
7mos ago
ScrapeGraphAI: Web Scraping from a Single Prompt, an Intelligent Web Content Extraction Tool That Needs No Hand-Written Rules

Comprehensive Introduction: ScrapeGraphAI is a Python web scraping library that combines Large Language Models (LLMs) with direct graph logic to create scraping pipelines for websites and local documents. The uniqueness of this tool lies in its perfect level of simplicity and power...
7mos ago
MegaParse: Parses All Types of Documents into LLM-Ready Data, Fully Preserving Tables, Images, and Other Information

Comprehensive Introduction: MegaParse is a powerful and versatile document parsing tool designed to optimize data processing for Large Language Models (LLMs). Whether you are working with text, PDF, PowerPoint presentations or Word documents, MegaParse...
8mos ago
Maxun: An Open Source No-Code Platform That Automatically Scrapes Web Data and Converts It into APIs or Spreadsheets

Comprehensive Introduction: Maxun is an open source no-code web data extraction platform that lets users train robots in minutes to automatically scrape web data and convert it into APIs or spreadsheets. The platform supports pagination and scrolling, can adapt to changes in website layout, provides powerful data crawling...
7mos ago
OmniParse: Extract Any Unstructured Data from Documents and Multimedia and Parse It into Structured Data

Comprehensive Introduction: OmniParse is a powerful data parsing and optimization platform designed to convert any unstructured data into structured, actionable data optimized for GenAI (Generative Artificial Intelligence) frameworks. Whether you are working with documents, tables, images, videos, audio files or...
9mos ago
MinerU: Extracts PDF Documents and Converts Them to Multimodal Markdown, with OCR Support for Scanned E-Books

Comprehensive Introduction: MinerU is an open source data extraction tool developed by the OpenDataLab team at the Shanghai Artificial Intelligence Laboratory, focused on efficiently extracting content from complex PDF documents, web pages, and e-books. It can take multimodal PDFs containing images, formulas, tables and other elements...
10mos ago