General Introduction
Unstructured-IO provides a range of open source components for processing and preprocessing images and text documents such as PDF, HTML, and Word files. Its main goal is to simplify and optimize data processing workflows, especially for Large Language Model (LLM) applications. Its modular functions and connectors form a unified system that makes data ingestion and preprocessing efficient and adaptable to different platforms.
Function List
- Data ingestion and pre-processing
- Support for multiple document types (PDF, HTML, Word, etc.)
- Modular functions and connectors
- Open source APIs and client libraries
- Support for Docker containerized deployment
- Serverless APIs for improved performance
Usage Guide
Installation process
- Running the library in a Docker container
  - Ensure that Docker is installed.
  - Run the following commands to download and run the Docker image:
      docker pull downloads.unstructured.io/unstructured-io/unstructured:latest
      docker run -it --rm downloads.unstructured.io/unstructured-io/unstructured:latest
- Installing the library from PyPI
  - Use pip to install:
      pip install unstructured
- Local development installation
  - Clone the GitHub repository and install it in editable mode:
      git clone https://github.com/Unstructured-IO/unstructured.git
      cd unstructured
      pip install -e .
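After any of the installation routes above, it can help to confirm that the package is visible to the Python interpreter you intend to use. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of Unstructured-IO.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report whether the "unstructured" distribution is visible to this interpreter.
version = installed_version("unstructured")
if version is None:
    print("unstructured is not installed; run: pip install unstructured")
else:
    print(f"unstructured {version} is installed")
```

Running this inside the same virtual environment or container as your pipeline avoids the common case where pip installed the package for a different interpreter.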
Guidelines for use
- Data ingestion
  - Use the unstructured library to ingest documents:
      from unstructured.partition.pdf import partition_pdf

      # Returns a list of document elements (titles, paragraphs, and so on)
      document = partition_pdf("example.pdf")
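The ingestion step returns a list of elements rather than a single string, so a common next move is to inspect what was extracted. The sketch below assumes each element exposes `category` and `text` attributes; the helper `summarize_elements` and the sample data are hypothetical, and stand-in objects are used so the example runs without the library or a PDF file.

```python
from collections import Counter
from types import SimpleNamespace


def summarize_elements(elements):
    """Count elements per category and collect the non-empty text."""
    counts = Counter(getattr(el, "category", type(el).__name__) for el in elements)
    texts = [el.text for el in elements if getattr(el, "text", "").strip()]
    return counts, texts


# Stand-in objects (an assumption for illustration). With the library installed,
# pass the list returned by partition_pdf("example.pdf") instead.
sample = [
    SimpleNamespace(category="Title", text="Quarterly Report"),
    SimpleNamespace(category="NarrativeText", text="Revenue grew in Q3."),
    SimpleNamespace(category="NarrativeText", text="   "),
]
counts, texts = summarize_elements(sample)
print(dict(counts))
print(texts)
```

A summary like this makes it easy to spot a document that was parsed into unexpectedly few elements before it reaches later pipeline stages.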
- Data preprocessing
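  - Clean up the document text before chunking:
      from unstructured.cleaners.core import clean

      # clean() operates on a text string, so apply it to each element's text
      cleaned_document = [clean(element.text, extra_whitespace=True) for element in document]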
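The preprocessing step mentions chunking, which can be illustrated independently of the library. The sketch below is a plain-Python stand-in, not the library's own chunker (the library ships its own chunking functions, such as `chunk_by_title`); the names `normalize_whitespace`, `chunk_texts`, and the `max_chars` parameter are assumptions made for this example.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def chunk_texts(texts, max_chars=500):
    """Greedily pack cleaned texts into chunks of at most max_chars characters.

    A single text longer than max_chars is kept whole as its own oversized chunk.
    """
    chunks, current = [], ""
    for text in texts:
        text = normalize_whitespace(text)
        if not text:
            continue  # skip texts that are empty after cleaning
        candidate = f"{current} {text}" if current else text
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = text
    if current:
        chunks.append(current)
    return chunks


print(chunk_texts(["aaa  bbb", "ccc", "", "ddddddddddddd"], max_chars=10))
```

Greedy packing keeps neighbouring passages together, which tends to suit retrieval for LLM applications; splitting on section titles is an alternative when the document structure is reliable.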
- Connecting to data sources and targets
  - Use a connector to transfer data to the target location:
      # Import path and function name are as given in this guide;
      # verify them against the connector documentation for your installed version
      from unstructured.connectors import send_to_destination

      send_to_destination(cleaned_document, destination="s3://bucket-name")
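Destinations in the connector step are written as URIs such as `s3://bucket-name`. Validating that string before any data is transferred catches typos early. This is a minimal sketch using the standard library; `Destination` and `parse_destination` are hypothetical names, not part of any Unstructured-IO connector.

```python
from typing import NamedTuple
from urllib.parse import urlparse


class Destination(NamedTuple):
    scheme: str  # storage type, e.g. "s3"
    bucket: str  # bucket or host name
    prefix: str  # optional key prefix inside the bucket


def parse_destination(uri: str) -> Destination:
    """Split a destination URI into scheme, bucket, and prefix."""
    parsed = urlparse(uri)
    if not parsed.scheme or not parsed.netloc:
        raise ValueError(f"expected a URI like s3://bucket-name/prefix, got {uri!r}")
    return Destination(parsed.scheme, parsed.netloc, parsed.path.lstrip("/"))


print(parse_destination("s3://bucket-name"))
print(parse_destination("s3://bucket-name/cleaned/docs"))
```

Failing fast on a malformed destination is cheaper than discovering the problem after a long partitioning run has finished.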
- Serverless API
  - Register and get the API key:
    - Visit the Unstructured API registration page.
    - Get the API key and start using it:
        import requests

        # Endpoint and payload are as given in this guide;
        # check the current API reference before relying on them
        headers = {"Authorization": "Bearer YOUR_API_KEY"}
        response = requests.post(
            "https://api.unstructured.io/process",
            headers=headers,
            json={"document": "example.pdf"},
        )
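Hard-coding an API key in source files is easy to leak, so a common pattern is to read it from the environment and build the request separately from sending it. The sketch below uses only the standard library and does not contact the network. The environment variable name `UNSTRUCTURED_API_KEY` and the helper `build_request` are assumptions, and the endpoint and payload shape are copied from the example above rather than confirmed against the API reference.

```python
import json
import os
import urllib.request

# Endpoint as written in this guide (an assumption to verify against the API docs).
API_URL = "https://api.unstructured.io/process"


def build_request(document: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request carrying the API key."""
    body = json.dumps({"document": document}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical variable name; falls back to the placeholder used in this guide.
api_key = os.environ.get("UNSTRUCTURED_API_KEY", "YOUR_API_KEY")
req = build_request("example.pdf", api_key)
print(req.get_method(), req.full_url)
# To send it: urllib.request.urlopen(req, timeout=30)
```

Keeping construction separate from sending makes the request easy to inspect in tests and lets you add timeouts or retries in one place.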