Total 67 articles

Tag: Document Extraction and Cleaning (Page 2)

Crawl4LLM: An Efficient Web Crawling Tool for LLM Pretraining

Comprehensive Introduction: Crawl4LLM is an open source project jointly developed by Tsinghua University and Carnegie Mellon University that focuses on making web crawling for large language model (LLM) pre-training more efficient. It significantly reduces wasted crawling by intelligently selecting high-quality web pages, and claims that work which originally required crawling 100 web pages...

2025-02-23 | AI tools | AI open source project | Document Extraction and Cleaning

Markdownify MCP Server: Converts various content to Markdown format based on the MCP protocol.

General Introduction: Markdownify MCP Server is an open source tool based on the Model Context Protocol (MCP), hosted on GitHub and created by developer Zach Caceres. It specializes in combining multiple file types (e.g., PDF, images, audio, and office documents) with...

CodeWeaver: Automatically generate Markdown documents from code structure and content.

General Introduction: CodeWeaver is a command-line tool that weaves a codebase into a single, easy-to-navigate Markdown document. It generates a structured representation of a project's file hierarchy by recursively scanning directories and embedding the contents of each file in code blocks. The tool is designed with the goal of simplifying...

2025-02-16 | AI tools | AI open source project | Document Extraction and Cleaning

Kreuzberg: open source tool to extract text from any document

Comprehensive Introduction: Kreuzberg is a library that simplifies text extraction from PDF files, designed to provide a simple, hassle-free text extraction solution. The library is especially suited to RAG (Retrieval-Augmented Generation) services that require text extraction. Kreuzberg supports local operation, easy control and...

2025-02-15 | AI tools | AI open source project | Document Extraction and Cleaning

Instructor: a Python library to simplify structured output workflows for large language models

Comprehensive Introduction: Instructor is a popular Python library for handling structured output from large language models (LLMs). Built on Pydantic, it provides a simple, transparent, and user-friendly API for managing data validation, retries, and streaming responses. Instructor every...

2025-02-10 | AI tools | AI open source project | Document Extraction and Cleaning

zChunk: a generic semantic chunking strategy based on Llama-70B

Comprehensive Introduction: zChunk is a novel chunking strategy developed by ZeroEntropy as a solution for generic semantic chunking. The strategy is based on the Llama-70B model and optimizes document chunking by prompting the model to generate the chunks, so that a high signal-to-noise ratio is maintained during information retrieval. zChunk is particularly suited for...

2025-02-10 | AI tools | AI open source project | Document Extraction and Cleaning

Pulse: Business Solutions for Document Processing and Data Extraction

Comprehensive Introduction: Pulse is an intelligent platform focused on document processing and data extraction, designed to help organizations and developers efficiently parse and process a wide range of complex documents. Using advanced computer vision and multimodal processing, Pulse can accurately process documents in text, image, table and other formats...

2025-02-09 | AI tools | Document Extraction and Cleaning

Rowfill: Batch Extraction of Structured Information from Documents and Automated Analysis

Comprehensive Introduction: Rowfill is an open source document processing platform designed for knowledge workers. It uses advanced AI technologies to extract, analyze and process data from complex documents, images and PDFs. Rowfill supports local large language models (LLMs) and OpenAI vision models to ensure that data is hidden...

2025-02-06 | AI tools | AI open source project | AI data analysis | Document Extraction and Cleaning

PPTX2MD: Specialized tool for converting PPTX files to Markdown

General Introduction: PPTX2MD is an open source tool designed to convert PowerPoint PPTX files to Markdown format. Developed by GitHub user ssine, the tool supports retaining headings, lists, text formatting (such as bold, italic, color, and hyperlinks), images, and tables in a variety of formats. PPTX2MD...

2025-02-03 | AI tools | AI open source project | Document Extraction and Cleaning

Repomix: packaging the code base into a text file for large model retrieval

General Introduction: Repomix (formerly known as Repopack) is an open source tool designed to package an entire codebase into a single, AI-friendly file. This tool makes it easy for developers to make their codebase available to large language models (such as Claude, ChatGPT, and Gemini) for analysis and processing...

2025-01-21 | AI tools | AI open source project | Document Extraction and Cleaning

Yek: reading git repository text files and quickly chunking them for use in large models

General Introduction: Yek is a fast Rust-based tool for reading text files from repositories or directories, chunking them, and serializing them for use in large language models (LLMs). By default, the tool follows .gitignore rules to skip unwanted files and uses Git history to infer which files are important...

2025-01-21 | AI tools | AI open source project | Document Extraction and Cleaning

LlamaParse: High-quality document parsing and data extraction service by LlamaIndex (1000 free pages per day)

Comprehensive Introduction: LlamaParse is a powerful document parsing tool that can process complex documents such as PDF, PowerPoint, Word documents and spreadsheets and convert them to structured data. LlamaParse offers multiple ways to use it, including a standalone REST API, Python packages, TypeScr...

2025-01-20 | AI tools | AI Open Services | Document Extraction and Cleaning

UnDatas.IO: API service for accurate parsing of various types of unstructured data (paid)

Comprehensive Introduction: UnDatas.IO is a platform focused on parsing and processing unstructured data. It uses advanced technology to automatically recognize document layouts and categorize tables, images, formulas and text, greatly simplifying data processing. The platform not only saves a lot of time in organizing data, but also helps...

2025-01-20 | AI tools | AI Open Services | Document Extraction and Cleaning

Zerox: Convert PDF, DOCX, and images to Markdown with high-precision OCR from vision models

Comprehensive Introduction: Zerox is an open source project designed to convert PDF, DOCX, images and other documents to Markdown format using vision models. Developed by the getomni-ai team, the project provides a simple and efficient OCR (Optical Character Recognition) solution. Zerox supports Node and Python programming languages, ...

2025-01-19 | AI tools | AI open source project | Document Extraction and Cleaning

SemHash: Fast implementation of semantic text de-duplication to improve data cleaning efficiency

General Introduction: SemHash is a lightweight and flexible tool for dataset de-duplication by semantic similarity. It combines the fast embedding generation of Model2Vec with the efficient ANN (Approximate Nearest Neighbor) similarity search of Vicinity. SemHash supports single dataset de-duplication (e.g., cleaning the training...

2025-01-17 | AI tools | AI open source project | Document Extraction and Cleaning

Parseur: Automated extraction of structured text from all types of documents

General Introduction: Parseur is a leading AI data extraction software designed to help users automatically extract text data from PDFs, emails and other documents. With Parseur, users can easily convert unstructured data into structured data and send it to various applications. The software is widely ...

2025-01-17 | AI tools | Document Extraction and Cleaning

AI Functions: (API) services that convert input content into structured outputs

Comprehensive Introduction: Weco AI Functions is a powerful platform designed to help users rapidly build and deploy AI functions. By simply describing tasks, users can generate structured output patterns with A/B testing and observability monitoring. The platform supports no-code prototyping, enabling even non-technical users to...

2025-01-16 | AI tools | AI Open Services | Document Extraction and Cleaning

NV Ingest: Parsing complex format documents and extracting multimodal data into metadata and text

Comprehensive Introduction: NV Ingest (NVIDIA Ingest) is a suite of early access microservices designed for parsing hundreds of thousands of complex, messy unstructured PDFs and other enterprise documents. It can convert these documents into metadata and text for embedding into retrieval systems. NVIDIA Ingest supports...

2025-01-14 | AI tools | AI open source project | Document Extraction and Cleaning

Trellis: Convert unstructured documents into structured Excel data and quickly turn PDFs into tables (paid)

General Introduction: Trellis is a data platform focused on converting complex unstructured data sources into a structured SQL format. Through its powerful AI engine, Trellis is able to process a wide range of data sources such as financial documents, voice calls, and emails and convert them into SQL that can be used by data and operations teams...

2025-01-13 | AI tools | Document Extraction and Cleaning
