AI Personal Learning
and practical guidance
Total 66 articles

Tag: document extraction and cleaning (page 3)

Doc2X: A document, image, and formula recognition and conversion tool with multi-format conversion and high-precision translation

Doc2X is a document, image, and formula recognition and conversion tool that aims to provide efficient, intelligent document processing. Whether the input is an academic paper, a textbook, a corporate document, or a financial report, Doc2X can accurately recognize the tables and formulas in a PDF and convert them with one click...

ExtractThinker: Extracting and classifying documents into structured data to streamline document processing

ExtractThinker is a flexible document intelligence tool that uses Large Language Models (LLMs) to extract and classify structured data from documents, providing a seamless ORM-like document processing workflow. It supports multiple document loaders, including Tesseract OCR, Azure Form Recog...

HtmlRAG: Building an efficient HTML retrieval-augmented generation system and optimizing HTML document retrieval and processing in RAG systems

HtmlRAG is an open source project focused on improving how HTML documents are processed in Retrieval-Augmented Generation (RAG) systems. It argues that keeping retrieved content in HTML format is more effective in RAG systems than converting it to plain text. The project encompasses a complete data processing flow from the cha...

ScrapeGraphAI: Web scraping from a single prompt, an intelligent web content extraction tool that needs no hand-written rules

ScrapeGraphAI is a Python web scraping library that combines Large Language Models (LLMs) with direct graph logic to create scraping pipelines for websites and local documents. What sets the tool apart is its balance of simplicity and power: the user simply describes what he/she wants to mention...

Vision Parse: Intelligent conversion of PDF documents to Markdown using vision language models

Vision Parse is a document processing tool that uses state-of-the-art Vision Language Models (VLMs) to convert PDF documents into high-quality Markdown content. The tool supports a wide range of leading vision language models, including o...

Outlines: Generating structured text output via regular expressions, JSON, or Pydantic models

Outlines is an open source library developed by dottxt-ai that makes Large Language Models (LLMs) more useful in applications through structured text generation. The library supports a wide range of model integrations, including OpenAI, transformers, llama.cpp, etc. It provides simple but powerful prompting primitives,...

Chunkr: An all-in-one service for document ingestion with vision models and intelligent chunking based on paragraph hierarchy

Chunkr is a self-hosted API that specializes in converting PDF, PPTX, DOCX, and Excel files into data suitable for RAG (Retrieval-Augmented Generation) and LLM (Large Language Model) applications. It was developed by Lumina AI Inc. and uses advanced vision models for document ingest...

MegaParse: Parsing all types of documents into LLM-ready data while preserving tables, images, and all other information in the document

MegaParse is a versatile document parsing tool designed to prepare data for Large Language Models (LLMs). Whether you are working with text, PDF, PowerPoint presentations, or Word documents, MegaParse makes it easy and ensures that the parsing process is not...

ViTLP: Extracting structured data from PDF documents with complex layouts using a visually guided generative text-layout pre-trained model

ViTLP (Visually Guided Generative Text-Layout Pre-training for Document Intelligence) is an open source project that aims to improve document intelligence through a visually guided generative text-layout pre-trained model. The project was developed by Veason-silverbul...

pdf2htmlEX: Lossless PDF-to-HTML conversion that preserves text formatting, suited to academic papers and magazine layouts

pdf2htmlEX is an open source tool that converts PDF files to HTML. It analyzes the content of a PDF file and uses HTML and CSS to accurately reproduce its visual appearance, so that the PDF document can be viewed directly as a web page in a browser. The tool is particularly suitable for documents containing a large number of ...

Maxun: An open source no-code platform that automatically scrapes web data and converts it to APIs or spreadsheets

Maxun is an open source no-code web data extraction platform that lets users train robots in minutes to automatically scrape web data and convert it into APIs or spreadsheets. The platform handles pagination and scrolling, can adapt to changes in website layout, and provides powerful data scraping features for...

OmniParse: Extracting any unstructured data from documents and multimedia and parsing it into structured data

OmniParse is a data parsing and optimization platform designed to transform any unstructured data into structured, actionable data optimized for GenAI (generative AI) frameworks. Whether you are working with documents, tables, images, videos, audio files, or web content,...
