Document Extraction and Cleaning

Total: 67 articles

OneFileLLM: Integrating Multiple Data Sources into a Single Text File

Comprehensive Introduction OneFileLLM is an open source command line tool designed to consolidate multiple data sources into a single text file for easy input into Large Language Models (LLMs). It supports processing GitHub repositories, ArXiv papers, YouTube video transcriptions, web...

Latest AI Resources # AI Java Open Source Project # Document Extraction and Cleaning

12 mos ago

Chatlog: an open source tool for extracting and querying WeChat chat logs

General Introduction Chatlog is an open source tool that focuses on extracting and querying chat logs from WeChat's local database. It supports WeChat versions 3.x and 4.0, covering both Windows and macOS systems. Users can use the command line, terminal interface or H...

Latest AI Resources # AI Java Open Source Project # MCP services # Document Extraction and Cleaning

12 mos ago

VOP: OCR Tool for Extracting Complex Diagrams and Math Formulas

Comprehensive Introduction Versatile OCR Program is an open source Optical Character Recognition (OCR) tool designed specifically for working with complex academic and educational documents. It can extract text, tables, mathematical formulas, charts and diagrams from PDFs, images and other documents and generate...

Latest AI Resources # AI Java Open Source Project # OCR # Document Extraction and Cleaning

12 mos ago

DevDocs: an MCP service for quickly crawling and organizing technical documentation

General Introduction DevDocs is a completely free open source tool developed by the CyberAGI team and hosted on GitHub. Designed for programmers and software developers, it starts with the URL of a technical document, automatically crawls the relevant pages and organizes them into a concise Ma...

Latest AI Resources # AI Java Open Source Project # MCP services # Document Extraction and Cleaning

1 yr ago

An open source service that automatically parses PDF content and extracts text and tables

Comprehensive Introduction It can automatically analyze the layout of PDF documents, identify text, titles, images, tables, formulas and other elements in the page, and determine their correct order. The tool supports OCR functionality and can convert scanned PDF to searchable text. It runs on Docker and provides two models...

Latest AI Resources # AI Java Open Source Project # OCR # Document Extraction and Cleaning

1 yr ago

Free Conversion of Multiple Files to Markdown Format Based on Workers AI

General Introduction serverless-markdown-convertor is a free and open source tool, based on Cloudflare Worker and Workers AI, that converts a wide range of files to Markdow...

Latest AI Resources # AI Java Open Source Project # Document Extraction and Cleaning

1 yr ago

GPT-Crawler: Automatically Crawling Website Content to Generate Knowledge Base Documents

General Introduction GPT-Crawler is an open source tool developed by the BuilderIO team and hosted on GitHub. It crawls page content by inputting one or more website URLs, generating structured knowledge files (output.jso...

Latest AI Resources # AI Java Open Source Project # Document Extraction and Cleaning

10 mos ago

pure.md: insert "pure.md/" in front of the URL to extract clean text.

General Introduction pure.md is a tool for AI agents and developers that focuses on quickly converting web content or files to Markdown format. It bypasses anti-crawler restrictions through proxy services, extracts the core data of a web page, and outputs a concise Markdown ...

Latest AI Resources # AI Open Services # Document Extraction and Cleaning

1 yr ago

Cloudsquid: upload documents and describe requirements for intelligent extraction of structured data

General Introduction Cloudsquid is a company founded in 2023 in Berlin, Germany, focused on simplifying document processing with artificial intelligence. Its core product is an online data extraction platform that allows users to simply upload documents such as PDFs, images, audio, video, etc. and simply state that they need to extract...

Latest AI Resources # Document Extraction and Cleaning

1 yr ago

PDF Craft: an open source tool for converting scanned PDF documents to Markdown

General Introduction PDF Craft is an open source tool designed for converting scanned PDFs of books to Markdown format. It was developed by oomol-lab and is hosted on GitHub, and is aimed at users who like to organize their eBooks. The tool works through this ...

Latest AI Resources # AI Java Open Source Project # OCR # Document Extraction and Cleaning

1 yr ago

Supametas.AI: Converting Unstructured Data into LLM-Ready Data

Comprehensive Introduction Supametas.AI is a data processing platform that specializes in organizing web pages, documents, audio and video, and other messy information into structured data that AI can use. It supports collecting data from multiple sources, including web links, APIs, local files, etc., and then outputting it as JSON ...

Latest AI Resources # AI Open Services # Document Extraction and Cleaning

1 yr ago

MarkPDFDown: converting PDF files to Markdown with a multimodal model

General Introduction MarkPDFDown is an open source tool that uses a multimodal large language model to convert PDF files into Markdown format. The developer is GitHub user jorben. The goal of this tool is simple: to make PDF documents ...

Latest AI Resources # AI Java Open Source Project # Document Extraction and Cleaning

1 yr ago

SmolDocling: a small visual language model for efficient document processing

Comprehensive Introduction SmolDocling is a Visual Language Model (VLM) developed by the ds4sd team in collaboration with IBM, built on SmolVLM-256M and hosted on the Hugging Face platform. It is small in size, only ...

Latest AI Resources # AI Java Open Source Project # OCR # Document Extraction and Cleaning

1 yr ago

PaddlePaddle PP-TableMagic: Structured Information Extraction for Complex Tables

The goal of table recognition is to parse tables in images, accurately identify table structures and cell locations, and restore them to structured table formats (e.g., HTML). In today's information age, a large amount of important tabular data still exists in an unstructured state (e.g., scanned documents with pictures of statistical tables...).

Latest AI Resources # AI Java Open Source Project # Document Extraction and Cleaning

1 yr ago

Mistral OCR: 94.89% Overall Accuracy, 1000 Pages/30 Seconds, Only $1

In the long history of human civilization, every leap in the way information is acquired and parsed has profoundly driven social progress. From the ancient hieroglyphics, to the portable papyrus, to the later emergence of the printing press and today's wave of digitization, each technological innovation has greatly expanded the paradigm of human knowledge dissemination...

Latest AI Resources # AI Open Services # OCR # Document Extraction and Cleaning

1 yr ago

Firecrawl MCP Server: a Firecrawl-based web crawling MCP service

General Introduction Firecrawl MCP Server is an open source tool developed by MendableAI, implemented on the Model Context Protocol (MCP), with Firecrawl A...

Latest AI Resources # AI Java Open Source Project # MCP services # Document Extraction and Cleaning

1 yr ago

olmOCR: converting PDF documents to text, with support for recognizing tables, formulas and handwritten content

General Introduction olmOCR is an open source tool developed by the AllenNLP team at the Allen Institute for Artificial Intelligence (AI2) that focuses on converting PDF files...

Latest AI Resources # AI Java Open Source Project # Document Extraction and Cleaning

1 yr ago

par_scrape: a crawler tool to intelligently extract data from web pages

General Introduction par_scrape is a Python-based open source web crawler tool, launched on GitHub by developer Paul Robello, designed to help users intelligently extract data from web pages. It integrates Selenium...

Latest AI Resources # AI Java Open Source Project # Document Extraction and Cleaning

1 yr ago

PDF-Extract-Kit: an open source tool for extracting content from PDFs with complex structure

Comprehensive Introduction PDF-Extract-Kit is an open source project developed by the OpenDataLab team, focusing on the efficient extraction of high-quality content from complex and diverse PDF documents. It integrates advanced document parsing technology to support layout detection, formula recognition ...

Latest AI Resources # AI Java Open Source Project # Document Extraction and Cleaning

1 yr ago

Crawl4LLM: An Efficient Web Crawling Tool for LLM Pretraining

Comprehensive Introduction Crawl4LLM is an open source project jointly developed by Tsinghua University and Carnegie Mellon University, focusing on optimizing the efficiency of web crawling for pre-training large language models (LLMs). It significantly reduces ineffective crawling by intelligently selecting high-quality web page data, claiming that what originally required crawling 1...

Latest AI Resources # AI Java Open Source Project # Document Extraction and Cleaning

1 yr ago

Markdownify MCP Server: converting various content to Markdown format based on the MCP protocol

General Introduction Markdownify MCP Server is an open source tool based on the Model Context Protocol, hosted on GitHub by developer Zach Caceres ...

Latest AI Resources # AI Java Open Source Project # MCP services # Document Extraction and Cleaning

1 yr ago

CodeWeaver: Automatically generate Markdown documents from code structure and content.

General Introduction CodeWeaver is a command-line tool designed to weave code libraries into single, easy-to-navigate Markdown documents. It generates a structured representation of a project's file hierarchy by recursively scanning directories and embedding the contents of each file in code blocks. This tool ...

Latest AI Resources # AI Java Open Source Project # Document Extraction and Cleaning

1 yr ago
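
The approach described for CodeWeaver (recursively scan a directory, list the file hierarchy, then embed each file's contents in a code block) can be illustrated with a minimal Python sketch. This is not CodeWeaver's implementation or its command-line interface; the function name `weave`, the heading layout and the flat file list are assumptions made for illustration only.

```python
from pathlib import Path

# Built from single backticks so the sketch nests safely inside Markdown.
FENCE = "`" * 3


def weave(root: str) -> str:
    """Return one Markdown document: a file list, then each file in a code block.

    Hypothetical helper: it applies no ignore rules, so hidden directories
    and build output are included as well.
    """
    root_path = Path(root).resolve()
    # Recursively collect every regular file under the root, in stable order.
    files = sorted(p for p in root_path.rglob("*") if p.is_file())

    lines = [f"# {root_path.name}", "", "## File tree", ""]
    for path in files:
        lines.append(f"- {path.relative_to(root_path).as_posix()}")

    lines += ["", "## File contents", ""]
    for path in files:
        rel = path.relative_to(root_path).as_posix()
        # Undecodable bytes are replaced rather than raising an error.
        text = path.read_text(encoding="utf-8", errors="replace")
        lines += [f"### {rel}", "", FENCE, text.rstrip("\n"), FENCE, ""]
    return "\n".join(lines)
```

Calling `weave("path/to/project")` returns the combined Markdown as a string, which can then be written to a file or pasted into a prompt.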

Kreuzberg: open source tool to extract text from any document

General Introduction Kreuzberg is a library for simplifying text extraction from PDF files, designed to provide a simple, hassle-free text extraction solution. The library is particularly suitable for RAG (Retrieval-Augmented Generatio...

Latest AI Resources # AI Java Open Source Project # Document Extraction and Cleaning

1 yr ago

Instructor: a Python library to simplify structured output workflows for large language models

Comprehensive Introduction Instructor is a popular Python library designed for processing structured output from Large Language Models (LLMs). Built on Pydantic, it provides a simple, transparent and user-friendly API for managing data...

Latest AI Resources # AI Java Open Source Project # Document Extraction and Cleaning

1 yr ago

zChunk: a generic semantic chunking strategy based on Llama-70B

Comprehensive Introduction zChunk is a novel chunking strategy developed by ZeroEntropy that aims to provide a solution for generic semantic chunking. The strategy is based on the Llama-70B model, which optimizes the chunking process of documents by prompting for chunks to be generated, ensuring that information retrieval is maintained at a high...

Latest AI Resources # AI Java Open Source Project # Document Extraction and Cleaning

1 yr ago

Pulse: Business Solutions for Document Processing and Data Extraction

Comprehensive Introduction Pulse is an intelligent platform focused on document processing and data extraction, designed to help organizations and developers efficiently parse and process a wide range of complex documents. Through its advanced computer vision and multimodal processing technology, Pulse is able to accurately extract data from text, images, tables, and many other...

Latest AI Resources # Document Extraction and Cleaning

1 yr ago

Rowfill: Batch Extraction of Structured Information from Documents and Automated Analysis

General Introduction Rowfill is an open source document processing platform designed for knowledge workers. It uses advanced artificial intelligence techniques to extract, analyze and process data from complex documents, images and PDFs. Rowfill supports Native Large Language Model (LLM) and Ope...

Latest AI Resources # AI Java Open Source Project # AI data analysis # Document Extraction and Cleaning

1 yr ago

PPTX2MD: Specialized tool for converting PPTX files to Markdown

General Introduction PPTX2MD is an open source tool designed to convert PowerPoint PPTX files to Markdown format. Developed by GitHub user ssine, the tool supports preserving headings, lists, text formatting (e.g., bold, italic, color, and super...

Latest AI Resources # AI Java Open Source Project # Document Extraction and Cleaning

1 yr ago

Repomix: packaging the code base into a text file for large model retrieval

General Introduction Repomix (formerly known as Repopack) is an open source tool designed to package an entire codebase into a single, AI-friendly file. This tool allows developers to easily make their codebase available to large language models (such as Claude, Chat...

Latest AI Resources # AI Java Open Source Project # Document Extraction and Cleaning

1 yr ago

Yek: reading git repository text files and quickly chunking them for use in large models

General Introduction Yek is a fast Rust-based tool for reading text files from repositories or directories, chunking them, and serializing them for use in Large Language Models (LLMs). The tool uses the .gitignore rule by default to skip unwanted files and utilizes...

Latest AI Resources # AI Java Open Source Project # Document Extraction and Cleaning

1 yr ago

LlamaParse: high-quality document parsing and data extraction service from LlamaIndex (1,000 free pages per day)

Comprehensive Introduction LlamaParse is a powerful document parsing tool that can process complex documents such as PDF, PowerPoint, Word documents and spreadsheets and convert them into structured data. LlamaParse offers a variety of ways to use...

Latest AI Resources # AI Open Services # Document Extraction and Cleaning

1 yr ago

UnDatas.IO: API service for accurate parsing of various types of unstructured data (paid)

Comprehensive Introduction UnDatas.IO is a platform focused on parsing and processing unstructured data. It utilizes advanced technology to automatically recognize document layouts and categorize tables, images, formulas and text, greatly simplifying the data processing process. The platform not only saves a lot of time in organizing data...

Latest AI Resources # AI Open Services # Document Extraction and Cleaning

1 yr ago

Zerox: converting PDF, DOCX and images to Markdown with high-precision OCR based on vision models

Comprehensive Introduction Zerox is an open source project designed to convert PDF, DOCX, images and other documents to Markdown format using vision models. The project is developed by the getomni-ai team and provides a simple and efficient OCR (Optical Character Recognition) solution. Ze...

Latest AI Resources # AI Java Open Source Project # Document Extraction and Cleaning

1 yr ago

SemHash: Fast implementation of semantic text de-duplication to improve data cleaning efficiency

Comprehensive Introduction SemHash is a lightweight and flexible tool for de-duplicating datasets by semantic similarity. It combines the fast embedding generation of Model2Vec with the efficient ANN (approximate nearest neighbor) similarity search of Vicinity. SemHa...

Latest AI Resources # AI Java Open Source Project # Document Extraction and Cleaning

1 yr ago

Parseur: automated extraction of structured text from all types of documents

General Introduction Parseur is a leading AI data extraction software designed to help users automatically extract text data from PDFs, emails and other documents. With Parseur, users can easily convert unstructured data into structured data and send it to various applications...

Latest AI Resources # Document Extraction and Cleaning

1 yr ago

AI Functions: (API) services that convert input content into structured outputs

Comprehensive Introduction Weco AI Functions is a powerful platform designed to help users rapidly build and deploy AI functions. By simply describing tasks, users can generate structured output patterns with A/B testing and observational monitoring. The platform supports no-code prototyping...

Latest AI Resources # AI Open Services # Document Extraction and Cleaning

1 yr ago

NV Ingest: Parsing complex format documents and extracting multimodal data into metadata and text

Comprehensive Introduction NV Ingest (NVIDIA Ingest) is a suite of early access microservices designed for parsing hundreds of thousands of complex, messy unstructured PDFs and other enterprise documents. It can convert these documents into metadata and text for embedding into retrieval...

Latest AI Resources # AI Java Open Source Project # Document Extraction and Cleaning

1 yr ago

Trellis: converting unstructured documents into structured Excel data, with fast PDF-to-table conversion (paid)

General Introduction Trellis is a data platform focused on converting complex unstructured data sources into structured SQL formats. Through its powerful AI engine, Trellis is able to process a wide range of data sources such as financial documents, voice calls, and emails and convert them into data ready and...

Latest AI Resources # Document Extraction and Cleaning

1 yr ago

Ollama OCR: Extracting Text from Images Using Visual Models in Ollama

Comprehensive Introduction Ollama OCR is a powerful Optical Character Recognition (OCR) toolkit that utilizes the state-of-the-art visual language model provided by the Ollama platform to extract text from images. The project is available both as a Python package and as a user-friendly Strea...

Latest AI Resources # AI Java Open Source Project # OCR # Document Extraction and Cleaning

1 yr ago

llms.txt Generator: quickly crawling website content and generating LLM training text datasets

Comprehensive Introduction llmstxt-generator is a professional web content extraction and integration tool specialized in preparing high-quality text datasets for training and inference in Large Language Models (LLM). The tool was developed by Mendable AI using @firec...

Latest AI Resources # AI Java Open Source Project # Document Extraction and Cleaning

1 yr ago

Doc2X: a document, image and formula recognition and conversion tool with multi-format conversion and high-precision translation

Comprehensive Introduction Doc2X is a powerful document, image and formula recognition and conversion tool, committed to providing efficient and intelligent document processing solutions. Whether it is an academic research paper, a textbook, a corporate document or a financial report, Doc2X can accurately recognize PDF tables and...

Latest AI Resources # AI Open Services # AI Translation # Document Extraction and Cleaning

1 yr ago

ExtractThinker: extracting and classifying documents into structured data to optimize the document processing flow

Comprehensive Introduction ExtractThinker is a flexible document intelligence tool that utilizes Large Language Models (LLMs) to extract and classify structured data from documents, providing a seamless ORM-like document processing workflow. It supports a variety of document loaders, including Tess...

Latest AI Resources # AI Java Open Source Project # Document Extraction and Cleaning

1 yr ago

HtmlRAG: building an efficient HTML retrieval-augmented generation system and optimizing HTML document retrieval and processing in RAG systems

Comprehensive Introduction HtmlRAG is an innovative open source project focused on improving the processing of HTML documents in Retrieval Augmented Generation (RAG) systems. The project presents a novel approach that argues that using HTML formatting in RAG systems is more efficient than plain text. The project contains a complete ...

Latest AI Resources # Document Extraction and Cleaning # Knowledge Retrieval with RAG Framework

1 yr ago

ScrapeGraphAI: an intelligent web content extraction tool that scrapes pages from a single prompt, with no rules to write

Comprehensive Introduction ScrapeGraphAI is an innovative Python web crawling library that combines Large Language Models (LLMs) and direct graph logic to create crawling pipelines for websites and local documents. The uniqueness of this tool lies in its perfect level of simplicity and power...

Latest AI Resources # AI Java Open Source Project # Document Extraction and Cleaning

1 yr ago

Vision Parse: Intelligent Conversion of PDF Documents to Markdown Format Using Visual Language Models

Comprehensive Introduction Vision Parse is a document processing tool that combines state-of-the-art Vision Language Models to intelligently convert PDF documents into high-quality Markdown format ...

Latest AI Resources # AI Java Open Source Project # Document Extraction and Cleaning

1 yr ago

Outlines: generating structured text output via regular expressions, JSON or Pydantic models

Comprehensive Introduction Outlines is an open source library developed by dottxt-ai to enhance the application of Large Language Models (LLMs) through structured text generation. The library supports a variety of model integrations, including OpenAI, transformers...

Latest AI Resources # AI Java Open Source Project # Document Extraction and Cleaning

1 yr ago

MarkItDown: Microsoft's document conversion tool for converting various files to Markdown format

General Introduction MarkItDown is a Python tool developed by Microsoft designed to convert various files and office documents to Markdown format. The tool supports a wide range of file types, including PDF, PowerPoint, Word, Excel, diagrams...

Latest AI Resources # AI Java Open Source Project # Document Extraction and Cleaning

1 yr ago

Chunkr: An All-in-One Service for Document Ingestion and Intelligent Chunking Based on Text Paragraph Hierarchy Using Visual Models

General Introduction Chunkr is a self-hosted API specialized in converting PDF, PPTX, DOCX and Excel files into data suitable for use in RAG (Retrieval Augmented Generation) and LLM (Large Language Model). The project was developed by Lumina...

Latest AI Resources # AI Java Open Source Project # OCR # Document Extraction and Cleaning

1 yr ago

GitIngest: Quickly Convert GitHub Code Repositories to Text Suitable for LLM Understanding

General Introduction GitIngest is an open source tool designed to transform GitHub code repositories into text suitable for Large Language Model (LLM) prompts. With a simple operation, users can extract and format the content of any GitHub repository to fit the LLM ...

Latest AI Resources # AI Java Open Source Project # Document Extraction and Cleaning

1 yr ago

E2M: Convert multiple file formats to Markdown for easy document formatting unification

General Introduction E2M (Everything to Markdown) is an open source Python library designed to convert a wide range of file formats to Markdown format. The tool supports formats including doc, docx, epub, html, htm, u...

Latest AI Resources # AI Java Open Source Project # Document Extraction and Cleaning

1 yr ago

Docling: parsing documents in a variety of formats and exporting them as Markdown and JSON, with OCR support for PDFs

Comprehensive Introduction Docling is a powerful document parsing and exporting tool that supports a wide range of document formats, including PDF, DOCX, PPTX, XLSX, Image, HTML, AsciiDoc and Markdown. It can parse and export these documents...

Latest AI Resources # AI Java Open Source Project # OCR # Document Extraction and Cleaning

1 yr ago

MegaParse: parsing all types of documents into LLM-ready data while fully preserving tables, images and other information

Comprehensive Introduction MegaParse is a powerful and versatile document parsing tool designed to optimize data processing for the Large Language Model (LLM). Whether you are working with text, PDF, PowerPoint presentations or Word documents, MegaParse...

Latest AI Resources # AI Java Open Source Project # Document Extraction and Cleaning

1 yr ago

ViTLP: a visually guided generative text-layout pre-training model for extracting structured data from PDF documents with complex layouts

Comprehensive Introduction ViTLP (Visually Guided Generative Text-Layout Pre-training for Document Intelligence) is an open source project designed to pass...

Latest AI Resources # OCR # Document Extraction and Cleaning

1 yr ago

Trieve: a full-service RAG cloud infrastructure for search, recommendations and analytics

Comprehensive Introduction Trieve is an all-in-one infrastructure developed by Devflow, Inc., designed for search, recommendations, RAG (retrieval-augmented generation), and analytics. The platform is served via an API and supports self-hosting on AWS, GCP, K...

Latest AI Resources # AI Open Services # Document Extraction and Cleaning

1 yr ago

pdf2htmlEX: lossless PDF-to-HTML conversion that preserves text formatting, suitable for academic papers and magazine layouts

Comprehensive Introduction pdf2htmlEX is an open source tool designed to convert PDF files to HTML format. By analyzing the content of the PDF file and using HTML + CSS to accurately reproduce its visual appearance, the PDF document is converted to a browser ...

Latest AI Resources # AI Java Open Source Project # Document Extraction and Cleaning

1 yr ago

Maxun: open source no-code platform that automatically crawls web data and converts it to APIs or spreadsheets

Comprehensive Introduction Maxun is an open source no-code web data extraction platform that allows users to train robots in minutes to automatically crawl web data and convert it into APIs or spreadsheets. The platform supports paging and scrolling, can adapt to changes in website layout, provides powerful data crawling...

Latest AI Resources # AI Java Open Source Project # Document Extraction and Cleaning

1 yr ago

OmniParse: extract any unstructured data from documents/multimedia and parse it into structured data

Comprehensive Introduction OmniParse is a powerful data parsing and optimization platform designed to convert any unstructured data into structured, actionable data optimized for GenAI (Generative Artificial Intelligence) framework. Whether you are working with documents, tables, images, videos, audio files or...

Latest AI Resources # AI Java Open Source Project # Document Extraction and Cleaning

1 yr ago

Parsio: Automatically Extract Key Structured Data from PDFs, Emails and Other Documents

General Introduction Parsio is an AI-based document and email data extraction tool that automatically extracts structured data from PDFs, emails and other documents. The platform provides a powerful PDF parser and OCR functionality and supports a wide range of document types, including...

Latest AI Resources # Document Extraction and Cleaning

1 yr ago

Chonkie: a lightweight RAG text chunking library

Comprehensive Introduction Chonkie is a lightweight and efficient RAG (Retrieval-Augmented Generation) text chunking library designed to help developers quickly and easily chunk text. The library supports a variety of chunking methods, including ...

Latest AI Resources # AI Java Open Source Project # Document Extraction and Cleaning

1 yr ago

TextIn: Universal Document Conversion, PDF to Markdown Tool

Comprehensive Introduction TextIn is a professional PDF to Markdown tool designed to help users efficiently convert PDF documents to Markdown format. The tool supports a variety of file formats, is easy to operate, converts quickly, and can retain the format and content of the original PDF...

Latest AI Resources # Document Extraction and Cleaning

1 yr ago

Text Extraction API (text-extract-api): a PDF extraction tool with vision-based text extraction and anonymization

Comprehensive Introduction The Text Extraction API (text-extract-api) is a powerful tool designed to extract and parse content from a variety of document formats (e.g. PDF, Word, PPTX, etc.). The API utilizes state-of-the-art Optical Character Recognition (OCR) technology and Ol...

Latest AI Resources # AI Java Open Source Project # OCR # Document Extraction and Cleaning

1 yr ago

Datalab: dedicated OCR AI models and PDF-to-Markdown conversion (open source/API)

Comprehensive Introduction Datalab offers a range of advanced AI models focused on OCR, layout analysis, PDF to Markdown, and more. These models are not only high performing, but also easy to use and open source. The Marker models on the platform can quickly and accurately...

Latest AI Resources # AI Open Services # AI Java Open Source Project # OCR

1 yr ago

MinerU: extracting PDF documents and converting them to multimodal Markdown, with OCR support for scanned e-books

Comprehensive Introduction MinerU is an open source data extraction tool developed by the OpenDataLab team at the Shanghai Artificial Intelligence Laboratory, focusing on efficiently extracting content from complex PDF documents, web pages, and eBooks. It can take multimodal PDFs containing images, formulas, tables and other elements...

Latest AI Resources # AI Java Open Source Project # OCR # Document Extraction and Cleaning

2 yrs ago

Marker: an open source tool for quickly converting PDF to Markdown

General Introduction Marker is a deep learning based document processing tool designed to convert PDF files to Markdown format quickly and accurately. It supports a wide range of document types and is especially optimized for conversion of books and scientific papers. Marker is able to remove headers...

Latest AI Resources # AI Java Open Source Project # Document Extraction and Cleaning

1 yr ago

Mathpix: software for structured conversion of PDF and image documents, with multi-platform support

General Description Mathpix is a powerful AI-driven document automation tool designed for researchers, developers and enterprises. It quickly and accurately converts PDFs and images into searchable, exportable and machine-readable text. Mathpix offers a wide range of features...

Latest AI Resources # AI Open Services # Document Extraction and Cleaning

2 yrs ago

Unstructured: open source tools for preprocessing unstructured documents and data

Comprehensive Introduction Unstructured-IO provides a set of open source components for processing and pre-processing images and text documents such as PDF, HTML, Word documents, etc. Its main goal is to simplify and optimize the data processing workflow, especially for large language models (LL...

Latest AI Resources # AI Java Open Source Project # Document Extraction and Cleaning

2 yrs ago

Reader API: Web page content extraction tool, HTML to Markdown format conversion

General Introduction Jina AI's Reader project is an open source tool that, by adding the prefix https://r.jina.ai/ to any URL, converts it into a format suitable for large language models (Large Languag...

Latest AI Resources # AI Java Open Source Project # Document Extraction and Cleaning

2 yrs ago
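
The prefix convention mentioned for Reader can be sketched as a small Python helper. The function name `reader_url` is hypothetical; only the prefix https://r.jina.ai/ comes from the text above. The sketch builds the URL string and makes no network request.

```python
READER_PREFIX = "https://r.jina.ai/"  # prefix quoted in the Reader description


def reader_url(target_url: str, prefix: str = READER_PREFIX) -> str:
    """Return the target URL with the Reader prefix placed in front of it.

    Hypothetical helper for illustration; fetching the result is left to
    whatever HTTP client the caller already uses.
    """
    target = target_url.strip()
    if not target.startswith(("http://", "https://")):
        raise ValueError("expected an absolute http(s) URL")
    return prefix + target


print(reader_url("https://example.com/docs"))
# prints "https://r.jina.ai/https://example.com/docs"
```

The `prefix` parameter is left overridable so the same helper can be pointed at another prefix-style service if needed.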
