AI Personal Learning
and practical guidance
Total 62 articles

Tags: document extraction and cleaning

Supametas.AI: Turning Unstructured Data into Highly Usable Data for LLMs

Comprehensive Introduction: Supametas.AI is a data processing platform that specializes in organizing web pages, documents, audio, video, and other unstructured information into structured data that AI can use. It supports collecting data from multiple sources, including web links, APIs, and local files, and then exporting it to JSON or Markdown format. Platform...

PaddlePaddle PP-TableMagic: Structured Information Extraction for Complex Tables

The goal of table recognition is to parse tables in images, accurately identify table structures and cell locations, and restore them to structured table formats (e.g., HTML). Today, a large amount of important tabular data still exists in unstructured form (e.g., images of statistical tables in scanned documents, pd...

Mistral OCR: 94.89% Overall Accuracy, 1000 Pages/30 Seconds, Only $1

In the long history of human civilization, every leap in the way information is acquired and parsed has profoundly driven social progress. From the ancient hieroglyphics, to the portable papyrus, to the later emergence of the printing press and today's wave of digitization, each technological innovation has greatly expanded the transmission of human knowledge...

PDF-Extract-Kit: An Open Source Tool for Extracting Content from Complex PDFs

Comprehensive Introduction: PDF-Extract-Kit is an open source project developed by the OpenDataLab team, focusing on the efficient extraction of high-quality content from complex and diverse PDF documents. It integrates advanced document parsing technology, supporting layout detection, formula recognition, table extraction, and OCR functions for ...

Crawl4LLM: An Efficient Web Crawling Tool for LLM Pretraining

Comprehensive Introduction: Crawl4LLM is an open source project jointly developed by Tsinghua University and Carnegie Mellon University that focuses on making web crawling more efficient for pre-training large language models (LLMs). It significantly reduces ineffective crawling by intelligently selecting high-quality web pages, claiming that work which originally required crawling 100 web pages...

CodeWeaver: Automatically Generate Markdown Documents from Code Structure and Content

General Introduction: CodeWeaver is a command-line tool that weaves a codebase into a single, easy-to-navigate Markdown document. It generates a structured representation of a project's file hierarchy by recursively scanning directories and embedding the contents of each file in code blocks. The tool is designed with the goal of simplifying...
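
To make the idea in the CodeWeaver summary concrete, here is a minimal Python sketch of the general technique it describes: walk a directory tree, list the hierarchy, and embed each file's contents in a code block. This is not CodeWeaver's own code or command-line interface; the function name `weave` and the output layout are assumptions chosen for illustration.

```python
import os

FENCE = "`" * 3  # Markdown code fence, built here to keep this example readable


def weave(root: str) -> str:
    """Hypothetical helper: return one Markdown string for the tree under root."""
    tree_lines = []  # indented bullet list mirroring the directory hierarchy
    sections = []    # one "## path" section per file, with its contents fenced
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # in-place sort makes os.walk descend in a stable order
        rel_dir = os.path.relpath(dirpath, root)
        depth = 0 if rel_dir == "." else rel_dir.count(os.sep) + 1
        if rel_dir != ".":
            tree_lines.append("  " * (depth - 1) + "- " + os.path.basename(dirpath) + "/")
        for name in sorted(filenames):
            tree_lines.append("  " * depth + "- " + name)
            path = os.path.join(dirpath, name)
            # errors="replace" keeps the sketch from failing on non-UTF-8 files
            with open(path, "r", encoding="utf-8", errors="replace") as fh:
                body = fh.read()
            rel_path = os.path.relpath(path, root).replace(os.sep, "/")
            sections.append(f"## {rel_path}\n\n{FENCE}\n{body}\n{FENCE}\n")
    return "# Tree\n\n" + "\n".join(tree_lines) + "\n\n" + "\n".join(sections)
```

Calling `weave("path/to/project")` returns the Markdown as a string that can be written to a file. A real tool would presumably also skip binary files and ignored paths, which this sketch leaves out.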
