Total 67 articles

Tag: document extraction and cleaning (page 2)

Crawl4LLM: An Efficient Web Crawling Tool for LLM Pretraining

General Introduction: Crawl4LLM is an open source project jointly developed by Tsinghua University and Carnegie Mellon University that focuses on making web crawling for large language model (LLM) pretraining more efficient. It significantly reduces wasted crawling by intelligently selecting high-quality web pages, and claims that work which originally required crawling 100 web pages...

CodeWeaver: Automatically generate Markdown documents from code structure and content.

General Introduction: CodeWeaver is a command-line tool that weaves a codebase into a single, easy-to-navigate Markdown document. It recursively scans directories, generates a structured representation of the project's file hierarchy, and embeds the contents of each file in code blocks. The tool is designed with the goal of simplifying...
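
The scan-and-embed idea in this summary can be sketched in a few lines of Python. This is only an illustration of the general technique, not CodeWeaver's actual implementation: the function name `weave`, the output layout, and the decision to skip unreadable files are all assumptions made for the example.

```python
import os

FENCE = "`" * 3  # Markdown code fence, built this way to avoid nesting fences here


def weave(root: str) -> str:
    """Return one Markdown string: a file tree, then each file's contents.

    Hypothetical helper for illustration; not part of CodeWeaver.
    """
    tree_lines = []
    sections = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the traversal order deterministic
        rel_dir = os.path.relpath(dirpath, root)
        depth = 0 if rel_dir == "." else rel_dir.count(os.sep) + 1
        if rel_dir != ".":
            tree_lines.append("  " * (depth - 1) + "- " + os.path.basename(dirpath) + "/")
        for name in sorted(filenames):
            tree_lines.append("  " * depth + "- " + name)
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8") as fh:
                    body = fh.read()
            except (UnicodeDecodeError, OSError):
                continue  # listed in the tree, but binary/unreadable content is skipped
            rel_path = os.path.relpath(path, root)
            sections.append(f"## {rel_path}\n\n{FENCE}\n{body}\n{FENCE}\n")
    return "# Project tree\n\n" + "\n".join(tree_lines) + "\n\n" + "\n".join(sections)
```

Calling `weave("path/to/project")` returns the combined document as a string, which can then be written to a single `.md` file.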

zChunk: a generic semantic chunking strategy based on Llama-70B

General Introduction: zChunk is a novel chunking strategy developed by ZeroEntropy as a solution for general-purpose semantic chunking. The strategy is based on the Llama-70B model and optimizes document chunking by prompting the model to generate the chunks, so that a high signal-to-noise ratio is maintained during information retrieval. zChunk is particularly suited for...

Pulse: Business Solutions for Document Processing and Data Extraction

General Introduction: Pulse is an intelligent platform for document processing and data extraction, designed to help organizations and developers efficiently parse and process a wide range of complex documents. Using computer vision and multimodal processing, Pulse can accurately process documents in text, image, table and other formats...

Rowfill: Batch Extraction of Structured Information from Documents and Automated Analysis

General Introduction: Rowfill is an open source document processing platform designed for knowledge workers. It uses AI to extract, analyze and process data from complex documents, images and PDFs. Rowfill supports local large language models (LLMs) and OpenAI vision models to ensure that data is hidden...

UnDatas.IO: API service for accurate parsing of various types of unstructured data (paid)

General Introduction: UnDatas.IO is a platform focused on parsing and processing unstructured data. It automatically recognizes document layouts and categorizes tables, images, formulas and text, greatly simplifying data processing. The platform not only saves a lot of time in organizing data, but also helps...

SemHash: Fast implementation of semantic text de-duplication to improve data cleaning efficiency

General Introduction: SemHash is a lightweight and flexible tool for de-duplicating datasets by semantic similarity. It combines the fast embedding generation of Model2Vec with the efficient ANN (Approximate Nearest Neighbor) similarity search of Vicinity. SemHash supports single dataset de-duplication (e.g., cleaning the training...
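
The embed-then-search idea behind similarity-based de-duplication can be sketched with the Python standard library alone. This is not SemHash's API: the bag-of-words `embed` function is a toy stand-in for a real embedding model such as Model2Vec, the brute-force comparison replaces the ANN index that Vicinity would provide, and the names `embed`, `cosine`, `deduplicate` and the `threshold` value are assumptions made for the example.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def deduplicate(texts, threshold=0.8):
    """Keep a text only if it is not too similar to any text already kept.

    Brute-force O(n^2) comparison; a real tool would query an ANN index instead.
    """
    kept, kept_vecs = [], []
    for text in texts:
        vec = embed(text)
        if any(cosine(vec, other) >= threshold for other in kept_vecs):
            continue  # near-duplicate of something already kept
        kept.append(text)
        kept_vecs.append(vec)
    return kept
```

With word-count vectors this only catches lexical near-duplicates; swapping `embed` for a real sentence-embedding model is what makes the comparison semantic.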

Trellis: convert unstructured documents into structured Excel-format data and quickly turn PDFs into tables (paid)

General Introduction: Trellis is a data platform focused on converting complex unstructured data sources into a structured SQL format. Through its AI engine, Trellis can process a wide range of data sources such as financial documents, voice calls, and emails and convert them into SQL that can be used by data and operations teams...
