AI Personal Learning
and practical guidance
Total 62 articles

Tags: document extraction and cleaning Page 4

MinerU: PDF document extraction and conversion to multimodal Markdown, with OCR support for scanned e-books

Comprehensive Introduction: MinerU is an open-source data extraction tool developed by the OpenDataLab team at the Shanghai Artificial Intelligence Laboratory. It focuses on efficiently extracting content from complex PDF documents, web pages, and e-books. It can convert multimodal PDF documents containing images, formulas, tables, and other elements into easy-to-analyze M...

Unstructured: open-source preprocessing for unstructured documents, a toolkit for unstructured data processing

Comprehensive Introduction: Unstructured-IO provides a range of open-source components for processing and preprocessing images and text documents such as PDF, HTML, and Word files. Its main goal is to simplify and optimize data processing workflows, especially in support of large language model (LLM) applications. Unstructured...
