Unstructured: Open Source Tools for Preprocessing Unstructured Documents and Data

Latest AI Resources | Updated 11 months ago | AI Sharing Circle

General Introduction

Unstructured-IO provides a range of open source components for processing and preprocessing images and text documents such as PDF, HTML, and Word files. Its main goal is to simplify and optimize data processing workflows, especially for Large Language Model (LLM) applications. Unstructured-IO's modular functions and connectors form a unified system that makes data ingestion and preprocessing efficient and adaptable to different platforms.

Feature List

Data ingestion and preprocessing
Support for multiple document types (PDF, HTML, Word, etc.)
Modular functions and connectors
Open source APIs and client libraries
Support for Docker containerized deployment
A serverless API for improved performance

Usage Guide

Installation process

Running the library in a Docker container

Ensure that Docker is installed.

Run the following command to download and run the appropriate Docker image:

docker pull downloads.unstructured.io/unstructured-io/unstructured:latest
docker run -it --rm downloads.unstructured.io/unstructured-io/unstructured:latest

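Installing the library from PyPI
- Use pip to install the base package (format-specific extras such as unstructured[pdf] add the dependencies needed for particular document types):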
```
pip install unstructured
```

Local Development Installation

Clone the GitHub repository and install it in editable mode:

git clone https://github.com/Unstructured-IO/unstructured.git
cd unstructured
pip install -e .
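
Whichever installation route is used, it can help to confirm that the package is visible to the current Python interpreter. The following is a minimal sketch using only the standard library; the helper name installed_version is hypothetical and chosen for illustration.

```python
import importlib.util
from importlib import metadata


def installed_version(package: str):
    """Return the installed version of `package`, or None if it cannot be imported.

    `installed_version` is a hypothetical helper name, not part of any library.
    """
    if importlib.util.find_spec(package) is None:
        return None
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        # Importable but without distribution metadata (e.g. a stdlib module).
        return "unknown"


# Prints a version string if the library is installed, otherwise None.
print(installed_version("unstructured"))
```

A result of None usually means the install went into a different environment than the one running the script.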

Guidelines for use

Data ingestion

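Use the unstructured library to ingest documents. partition_pdf returns a list of document elements (titles, narrative text, list items, and so on):

from unstructured.partition.pdf import partition_pdf
# Parse the PDF into a list of elements
document = partition_pdf("example.pdf")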
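
To give a feel for working with partitioned output, here is a small sketch that groups element text by element type. It assumes elements have been serialized to dictionaries with "type" and "text" keys; the sample data and the helper name group_text_by_type are made up for illustration and do not come from a real document.

```python
from collections import defaultdict

# Hypothetical sample, shaped like serialized elements ("type" and "text" keys).
sample_elements = [
    {"type": "Title", "text": "Quarterly Report"},
    {"type": "NarrativeText", "text": "Revenue grew in the third quarter."},
    {"type": "ListItem", "text": "Hire two engineers"},
    {"type": "NarrativeText", "text": "Costs stayed flat."},
]


def group_text_by_type(elements):
    """Collect the text of each element under its element type, keeping input order."""
    grouped = defaultdict(list)
    for element in elements:
        grouped[element["type"]].append(element["text"])
    return dict(grouped)


grouped = group_text_by_type(sample_elements)
for element_type, texts in grouped.items():
    print(element_type, len(texts))
```

Grouping like this makes it easy to, for example, keep only narrative text for an LLM prompt while using titles as section labels.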

Data preprocessing

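Clean up and chunk documents. clean operates on a string, so it is applied to the text of each element; chunk_by_title groups elements into section-sized chunks:

from unstructured.cleaners.core import clean
from unstructured.chunking.title import chunk_by_title
cleaned_document = [clean(el.text, extra_whitespace=True) for el in document]  # list of cleaned strings
chunks = chunk_by_title(document)  # elements grouped under their titles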
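
The idea behind cleaning and chunking can be illustrated without the library. The sketch below is a simplified stand-in, not the library's algorithm: collapse_whitespace and chunk_texts are hypothetical helpers, and the greedy character-budget strategy is an assumption made for illustration.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Replace runs of whitespace with a single space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def chunk_texts(texts, max_chars=80):
    """Greedily pack texts into chunks of at most `max_chars` characters.

    A single text longer than `max_chars` is kept whole as its own chunk.
    """
    chunks, current = [], ""
    for text in texts:
        candidate = f"{current} {text}".strip()
        if current and len(candidate) > max_chars:
            # Adding this text would overflow the budget: close the current chunk.
            chunks.append(current)
            current = text
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


raw = ["  First   paragraph.\n", "Second\tparagraph.  "]
cleaned = [collapse_whitespace(t) for t in raw]
print(chunk_texts(cleaned, max_chars=25))
```

Chunking after cleaning keeps the size budget meaningful, since stray whitespace no longer counts toward the limit.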

Connecting to data sources and targets

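Use a connector to transfer processed data to a target location. Note that the snippet below is illustrative pseudocode: send_to_destination is a placeholder name rather than a confirmed library function. Source and destination connectors are distributed in the separate unstructured-ingest package, whose documentation describes the actual calls:

# Pseudocode: send_to_destination is a hypothetical helper, not a verified API
from unstructured.connectors import send_to_destination
send_to_destination(cleaned_document, destination="s3://bucket-name")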
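
As a local stand-in for a destination, the following sketch writes element dictionaries to a JSON Lines file and reads them back. It does not use any connector; write_jsonl, read_jsonl, and the file name elements.jsonl are hypothetical choices made for illustration.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(records, destination_dir):
    """Write one JSON object per line to <destination_dir>/elements.jsonl."""
    destination = Path(destination_dir)
    destination.mkdir(parents=True, exist_ok=True)
    out_path = destination / "elements.jsonl"
    with out_path.open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return out_path


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dictionaries."""
    with Path(path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Hypothetical records standing in for processed elements.
records = [{"type": "Title", "text": "Report"}, {"type": "NarrativeText", "text": "Body"}]
with tempfile.TemporaryDirectory() as tmp:
    path = write_jsonl(records, tmp)
    round_trip = read_jsonl(path)
print(round_trip == records)
```

Writing to a local file first is a convenient way to inspect output before pointing a real connector at remote storage.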

Serverless API

Visit the Unstructured API registration page to sign up.

Get an API key and start using it:

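import requests
# The hosted API expects the key in the "unstructured-api-key" header and the file as multipart form data.
# Use the endpoint URL issued with your key if it differs from the one below.
headers = {"accept": "application/json", "unstructured-api-key": "YOUR_API_KEY"}
with open("example.pdf", "rb") as f:
    response = requests.post("https://api.unstructured.io/general/v0/general", headers=headers, files={"files": f})
print(response.json())  # list of element dictionaries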
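
Rather than hard-coding the key in source files, it can be read from the environment. This is a minimal sketch: the variable name UNSTRUCTURED_API_KEY and the helpers load_api_key and mask_key are assumptions made for illustration.

```python
import os

API_KEY_ENV_VAR = "UNSTRUCTURED_API_KEY"  # assumed environment variable name


def load_api_key(env=None):
    """Return the API key from the environment, or raise if it is not set."""
    env = os.environ if env is None else env
    api_key = env.get(API_KEY_ENV_VAR)
    if not api_key:
        raise RuntimeError(f"Set {API_KEY_ENV_VAR} before calling the API")
    return api_key


def mask_key(key: str) -> str:
    """Hide most of a key so it can appear in logs without being leaked."""
    visible = key[:4] if len(key) > 8 else ""
    return visible + "*" * (len(key) - len(visible))


# Example with an explicit mapping instead of the real environment.
key = load_api_key({"UNSTRUCTURED_API_KEY": "abcdef123456"})
print(mask_key(key))
```

The returned key can then be placed in the request headers in place of the YOUR_API_KEY placeholder.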