Total 60 articles

Tags: document extraction and cleaning Page 2

Rowfill: Batch Extraction of Structured Information from Documents and Automated Analysis

Rowfill is an open source document processing platform designed for knowledge workers. It uses AI to extract, analyze, and process data from complex documents, images, and PDFs. Rowfill supports native Large Language Models (LLMs) and OpenAI vision models to ensure that data is hidden...

PPTX2MD: Specialized tool for converting PPTX files to Markdown

PPTX2MD is an open source tool that converts PowerPoint PPTX files to Markdown. Developed by GitHub user ssine, the tool retains headings, lists, text formatting (such as bold, italic, color, and hyperlinks), images, and tables in a variety of formats. PPTX2MD...

2025-02-03 AI tools, AI open source project, Document Extraction and Cleaning

Repomix: packaging the code base into a text file for large model retrieval

Repomix (formerly known as Repopack) is an open source tool designed to package an entire codebase into a single, AI-friendly file. This tool makes it easy for developers to make their codebase available to large language models (such as Claude, ChatGPT, and Gemini) for analysis and processing...

2025-01-21 AI tools, AI open source project, Document Extraction and Cleaning

Yek: reading git repository text files and quickly chunking them for use in large models

Yek is a fast Rust-based tool for reading text files from repositories or directories, chunking them, and serializing them for use in Large Language Models (LLMs). By default the tool follows .gitignore rules to skip unwanted files and uses Git history to infer important files....

2025-01-21 AI tools, AI open source project, Document Extraction and Cleaning

LlamaParse: High-quality document parsing and data extraction service by Llamaindex (1000 free pages per day).

LlamaParse is a document parsing tool that processes complex documents such as PDFs, PowerPoint files, Word documents, and spreadsheets and converts them to structured data. LlamaParse can be used in several ways, including a standalone REST API, Python packages, TypeScr...

2025-01-20 AI tools, AI Open Services, Document Extraction and Cleaning

UnDatas.IO: API service for accurate parsing of various types of unstructured data (paid)

UnDatas.IO is a platform focused on parsing and processing unstructured data. It automatically recognizes document layouts and categorizes tables, images, formulas, and text, greatly simplifying data processing. The platform not only saves a lot of time in organizing data, but also helps...

2025-01-20 AI tools, AI Open Services, Document Extraction and Cleaning

Zerox: Convert PDF, DOCX, and images to Markdown with high-precision OCR from vision models

Zerox is an open source project that converts PDF, DOCX, images, and other documents to Markdown using vision models. Developed by the getomni-ai team, it provides a simple and efficient OCR (Optical Character Recognition) solution. Zerox supports Node and Python programming languages, ...

2025-01-19 AI tools, AI open source project, Document Extraction and Cleaning

SemHash: Fast implementation of semantic text de-duplication to improve data cleaning efficiency

SemHash is a lightweight and flexible tool for dataset de-duplication by semantic similarity. It combines the fast embedding generation of Model2Vec with the efficient ANN (Approximate Nearest Neighbor) similarity search of Vicinity. SemHash supports single dataset de-duplication (e.g., cleaning the training...

2025-01-17 AI tools, AI open source project, Document Extraction and Cleaning

Parseur: Automated document data extraction, pulling structured text from all types of documents

Parseur is an AI data extraction tool that helps users automatically extract text data from PDFs, emails, and other documents. With Parseur, users can convert unstructured data into structured data and send it to various applications. The software is widely ...

2025-01-17 AI tools, Document Extraction and Cleaning

AI Functions: (API) services that convert input content into structured outputs

Weco AI Functions is a platform designed to help users rapidly build and deploy AI functions. By simply describing tasks, users can generate structured output patterns with A/B testing and observational monitoring. The platform supports code-free prototyping, enabling even non-technical users to...

2025-01-16 AI tools, AI Open Services, Document Extraction and Cleaning

NV Ingest: Parsing complex format documents and extracting multimodal data into metadata and text

NV Ingest (NVIDIA Ingest) is a suite of early access microservices designed for parsing hundreds of thousands of complex, messy unstructured PDFs and other enterprise documents. It can convert these documents into metadata and text for embedding into retrieval systems. NVIDIA Ingest supports...

2025-01-14 AI tools, AI open source project, Document Extraction and Cleaning

Trellis: Convert unstructured documents into structured Excel-format data, with fast PDF-to-table conversion (paid)

Trellis is a data platform focused on converting complex unstructured data sources into a structured SQL format. Its AI engine processes a wide range of data sources such as financial documents, voice calls, and emails and converts them into SQL that can be used by data and operations teams...

2025-01-13 AI tools, Document Extraction and Cleaning

Ollama OCR: Extracting Text from Images Using Visual Models in Ollama

Ollama OCR is an Optical Character Recognition (OCR) toolkit that uses the vision language models provided by the Ollama platform to extract text from images. The project is available as a Python package and also provides a user-friendly Streamlit web application interface. It supports multiple ...

2025-01-10 AI tools, AI open source project, OCR, Document Extraction and Cleaning

llms.txt Generator: Rapidly crawls website content and generates LLM training text datasets.

llmstxt-generator is a web content extraction and integration tool specialized in preparing high-quality text datasets for training and inference in Large Language Models (LLMs). Developed by Mendable AI, the tool uses web crawling technology provided by @firecrawl_dev and GPT-4-mini ...

2025-01-05 AI tools, AI open source project, Document Extraction and Cleaning

Doc2X: Document, image, and formula recognition and conversion tool with multi-format conversion and high-precision translation

Doc2X is a document, image, and formula recognition and conversion tool that aims to provide efficient and intelligent document processing. Whether the source is an academic research paper, a textbook, a corporate document, or a financial report, Doc2X can accurately recognize the tables and formulas in PDFs and convert them with one click...

2025-01-02 AI tools, AI Open Services, AI translation, Document Extraction and Cleaning

ExtractThinker: extracting and classifying documents into structured data to optimize the document processing flow

ExtractThinker is a flexible document intelligence tool that uses Large Language Models (LLMs) to extract and classify structured data from documents, providing a seamless ORM-like document processing workflow. It supports multiple document loaders, including Tesseract OCR, Azure Form Recog...

2025-01-02 AI tools, AI open source project, Document Extraction and Cleaning

HtmlRAG: Building an Efficient HTML Retrieval-Augmented Generation System, Optimizing HTML Document Retrieval and Processing in RAG Systems

HtmlRAG is an open source project focused on improving the processing of HTML documents in Retrieval-Augmented Generation (RAG) systems. The project argues that using HTML formatting in RAG systems is more efficient than plain text. The project encompasses a complete data processing flow from the cha...

2025-01-02 AI tools, Document Extraction and Cleaning, Knowledge Retrieval and the RAG Framework

ScrapeGraphAI: Web scraping from a single prompt, an intelligent web content extraction tool that needs no hand-written rules

ScrapeGraphAI is a Python web scraping library that combines Large Language Models (LLMs) and directed graph logic to create a scraping pipeline for websites and local documents. The tool balances simplicity and power: the user simply describes what they want to mention...

2025-01-01 AI tools, AI open source project, Document Extraction and Cleaning

Vision Parse: Intelligent Conversion of PDF Documents to Markdown Format Using Visual Language Models

Vision Parse is a document processing tool that uses state-of-the-art Vision Language Models to convert PDF documents into high-quality Markdown content. The tool supports a wide range of vision language models, including o...

2024-12-26 AI tools, AI open source project, Document Extraction and Cleaning
