
Unstructured: an open source toolkit for preprocessing unstructured documents and data

General Introduction

Unstructured-IO provides a range of open source components for ingesting and preprocessing images and text documents such as PDF, HTML, and Word files. Its main goal is to simplify and optimize data processing workflows, especially for Large Language Model (LLM) applications. Its modular functions and connectors form a unified system that makes data ingestion and preprocessing efficient and adaptable to different platforms.


Function List

  • Data ingestion and pre-processing
  • Support for multiple document types (PDF, HTML, Word, etc.)
  • Modular functions and connectors
  • Open source APIs and client libraries
  • Support for Docker containerized deployment
  • Serverless APIs for improved performance
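
Support for several document types can be pictured as routing each file to a format-specific partitioner. The sketch below is only an illustration under stated assumptions: `PARTITIONERS`, `partitioner_for`, and `load_partitioner` are hypothetical names introduced here, and the mapping covers just three formats. The partitioner functions themselves are imported lazily, so the lookup works even where unstructured is not installed.

```python
import importlib
from pathlib import Path
from typing import Callable, Tuple

# Hypothetical lookup table: extension -> (module, function) of a partitioner.
PARTITIONERS = {
    ".pdf": ("unstructured.partition.pdf", "partition_pdf"),
    ".html": ("unstructured.partition.html", "partition_html"),
    ".docx": ("unstructured.partition.docx", "partition_docx"),
}


def partitioner_for(path: str) -> Tuple[str, str]:
    """Return the (module, function) pair for a file, based on its extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in PARTITIONERS:
        raise ValueError(f"no partitioner registered for {suffix!r}")
    return PARTITIONERS[suffix]


def load_partitioner(path: str) -> Callable:
    """Import the partitioner lazily; this step requires unstructured."""
    module_name, function_name = partitioner_for(path)
    module = importlib.import_module(module_name)
    return getattr(module, function_name)
```

With the library installed, `load_partitioner("example.pdf")(filename="example.pdf")` would return the document's elements. The library also offers a `partition` function in `unstructured.partition.auto` that detects the file type on its own, which is likely the simpler choice in practice.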

 

 

Usage Help

Installation process

  1. Running the library in a Docker container
    • Ensure that Docker is installed.
    • Run the following command to download and run the appropriate Docker image:
      docker pull downloads.unstructured.io/unstructured-io/unstructured:latest
      docker run -it --rm downloads.unstructured.io/unstructured-io/unstructured:latest
      
  2. Installing the library from PyPI
    • Use pip to install:
      pip install unstructured
      
  3. Local Development Installation
    • Clone the GitHub repository and install it in editable mode:
      git clone https://github.com/Unstructured-IO/unstructured.git
      cd unstructured
      pip install -e .
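
After any of the installation routes above, it can help to confirm that the package is visible in the current Python environment. This is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name and not part of unstructured.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("unstructured")
    if version is None:
        print("unstructured is not installed in this environment")
    else:
        print(f"unstructured {version} is installed")
```

Returning `None` instead of raising keeps the check easy to use in setup scripts that want to decide whether to run `pip install unstructured`.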
      

 

Guidelines for use

  1. Data ingestion
    • Use the unstructured library to ingest documents:
      from unstructured.partition.pdf import partition_pdf
      document = partition_pdf("example.pdf")
      
  2. Data preprocessing
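    • Clean up the extracted text before chunking:
      from unstructured.cleaners.core import clean
      # clean() takes a string, so apply it to each element's text
      cleaned_document = [clean(el.text, extra_whitespace=True) for el in document]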
      
  3. Connecting to data sources and targets
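    • Use a connector to transfer data to the target location. The helper below is illustrative rather than a documented function; the project ships its connectors in the separate unstructured-ingest package:
      # send_to_destination is a placeholder name, not a confirmed unstructured API
      from unstructured.connectors import send_to_destination
      send_to_destination(cleaned_document, destination="s3://bucket-name")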
      
  4. Serverless API
    • Register and get the API key:
      • Visit the Unstructured API registration page.
      • Get the API key and start using it:
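        import requests
        # Header name and endpoint are assumed from the hosted API; confirm both for your account
        headers = {"unstructured-api-key": "YOUR_API_KEY"}
        with open("example.pdf", "rb") as f:  # the file is uploaded as multipart form data
            response = requests.post("https://api.unstructured.io/general/v0/general", headers=headers, files={"files": f})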
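
The cleaning and chunking steps in the guidelines above can be pictured with a small standalone sketch. It does not call unstructured: `normalize_whitespace` and `chunk_texts` are hypothetical helpers written for illustration, and the 500-character default is an arbitrary assumption. The library's own cleaners and chunkers may behave differently.

```python
import re
from typing import Iterable, List


def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def chunk_texts(texts: Iterable[str], max_chars: int = 500) -> List[str]:
    """Greedily pack cleaned texts into chunks of at most max_chars characters.

    A single text longer than max_chars is kept whole as its own chunk.
    """
    chunks: List[str] = []
    current = ""
    for raw in texts:
        text = normalize_whitespace(raw)
        if not text:
            continue  # skip elements that are empty after cleaning
        candidate = f"{current} {text}" if current else text
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = text
    if current:
        chunks.append(current)
    return chunks
```

With unstructured installed, the input could be the `text` attribute of each element returned by `partition_pdf`, for example `chunk_texts(el.text for el in document)`. Greedy packing is used here only because it is easy to follow; it ignores section boundaries, which a structure-aware chunker would respect.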
        