Chunkr: An All-in-One Service for Document Ingestion and Intelligent Chunking Based on Text Paragraph Hierarchy Using Visual Models

General Introduction

Chunkr is a self-hostable API that converts PDF, PPTX, DOCX, and Excel files into data suitable for RAG (Retrieval-Augmented Generation) and LLM (Large Language Model) use. Developed by Lumina AI Inc., it uses vision models for document ingestion, supports OCR (Optical Character Recognition) and bounding box detection, and generates structured output in HTML and Markdown. It is aimed at both enterprise and individual developer document-processing needs.

Features

- Document conversion: Converts PDF, PPTX, DOCX, and Excel files into data ready for RAG and LLM pipelines.
- OCR support: Integrates optical character recognition to extract the text content of documents automatically.
- Bounding box detection: Uses vision models to detect the document layout and generate bounding boxes for the detected regions.
- Structured output: Generates structured HTML and Markdown for downstream processing.
- Self-hosting: Supports Docker and Kubernetes deployments, so the service can run locally or in the cloud.
- High availability and scalability: Provides high-availability configurations and scaling guides for enterprise deployments.

Usage

Installation

Docker Compose Quick Start

Prerequisites: Ensure that Docker and Docker Compose are installed.

Clone the repository::

   git clone https://github.com/lumina-ai-inc/chunkr
   cd chunkr

Copy the environment configuration file::

   cp .env.example .env

Start the services::

   docker compose up -d

Access the services:

- Web UI: http://localhost:5173
- API: http://localhost:8000
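
Once the containers are up, it can be handy to confirm that both endpoints answer before going further. The following is a minimal sketch using only the Python standard library. The helper names `is_reachable` and `check_services` are hypothetical, and the check only tests that an HTTP server responds on each port; it makes no assumption about any Chunkr health endpoint.

```python
import urllib.error
import urllib.request

# Local URLs taken from the list above.
DEFAULT_SERVICES = {
    "Web UI": "http://localhost:5173",
    "API": "http://localhost:8000",
}

# Bypass any configured HTTP proxy: these checks target the local machine.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def is_reachable(url: str, timeout: float = 3.0) -> bool:
    """Return True if an HTTP server answers at `url`, whatever the status code."""
    try:
        with _OPENER.open(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server replied with an error status (e.g. 404), so it is running.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout.
        return False


def check_services(services: dict) -> dict:
    """Map each service name to True (responding) or False (not reachable)."""
    return {name: is_reachable(url) for name, url in services.items()}
```

Calling `check_services(DEFAULT_SERVICES)` right after `docker compose up -d` may report `False` for a few seconds while the containers finish starting.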

Kubernetes Production Environment Deployment

Prerequisites: Ensure that you have access to a Kubernetes cluster and that kubectl is installed.

Deploy the services::

   kubectl apply -f kubernetes-manifests/

Configuring high availability and scaling: See the self-deployment.md documentation for high-availability configuration and scaling guidance.

Usage Guide

Create an account and get an API key:

- Visit chunkr.ai and register for an account.
- Log in and copy your API key.

Create a task::

   curl -X POST https://api.chunkr.ai/api/v1/task \
     -H "Content-Type: multipart/form-data" \
     -H "Authorization: ${YOUR_API_KEY}" \
     -F "file=@/path/to/your/file" \
     -F "model=HighQuality" \
     -F "target_chunk_length=512" \
     -F "ocr_strategy=Auto"
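
For readers who would rather call the API from code, the curl command above can be sketched in Python with only the standard library. The URL, header, and form field names are copied from the curl example. The helper names `build_multipart` and `create_task_request` are hypothetical, and the function only builds the request without sending it, so it can be inspected first.

```python
import mimetypes
import urllib.request
import uuid
from pathlib import Path
from typing import Dict, Tuple


def build_multipart(fields: Dict[str, str], file_path: str) -> Tuple[bytes, str]:
    """Encode text fields plus one file (form name "file") as multipart/form-data.

    Returns the request body and the matching Content-Type header value.
    """
    boundary = uuid.uuid4().hex
    path = Path(file_path)
    mime = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
    parts = []
    for name, value in fields.items():
        parts.append(
            (f"--{boundary}\r\n"
             f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
             f"{value}\r\n").encode()
        )
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{path.name}"\r\n'
        f"Content-Type: {mime}\r\n\r\n"
    ).encode()
    parts.append(file_header + path.read_bytes() + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())  # closing delimiter
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def create_task_request(api_key: str, file_path: str) -> urllib.request.Request:
    """Build (but do not send) the POST request from the curl example."""
    body, content_type = build_multipart(
        {"model": "HighQuality", "target_chunk_length": "512", "ocr_strategy": "Auto"},
        file_path,
    )
    return urllib.request.Request(
        "https://api.chunkr.ai/api/v1/task",
        data=body,
        headers={"Content-Type": content_type, "Authorization": api_key},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` sends the request. The shape of the JSON response is not described here, so check the Chunkr API documentation before relying on specific fields.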

Poll the task status::

   curl -X GET https://api.chunkr.ai/api/v1/task/${TASK_ID} \
     -H "Authorization: ${YOUR_API_KEY}"
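
Polling by hand gets tedious, so a small loop with a growing delay is a common pattern. The sketch below rests on assumptions that are not stated in this article: that the task JSON carries a `status` field, and that `Succeeded`, `Failed`, and `Cancelled` are the terminal values. Verify both against the API reference. `fetch_task` and `wait_for_task` are hypothetical helper names.

```python
import json
import time
import urllib.request
from typing import Callable

# ASSUMPTION: names of the terminal states; confirm against the API reference.
TERMINAL_STATUSES = {"Succeeded", "Failed", "Cancelled"}


def fetch_task(task_id: str, api_key: str) -> dict:
    """GET the task, mirroring the curl command above."""
    req = urllib.request.Request(
        f"https://api.chunkr.ai/api/v1/task/{task_id}",
        headers={"Authorization": api_key},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def wait_for_task(
    fetch: Callable[[], dict],
    interval: float = 2.0,
    max_interval: float = 30.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call `fetch` until the task reaches a terminal status.

    The delay doubles after every attempt, capped at `max_interval`.
    Raises TimeoutError if the next wait would pass the `timeout` deadline.
    """
    deadline = time.monotonic() + timeout
    delay = interval
    while True:
        task = fetch()
        if task.get("status") in TERMINAL_STATUSES:  # ASSUMPTION: field name
            return task
        if time.monotonic() + delay > deadline:
            raise TimeoutError(f"task still {task.get('status')!r} after {timeout}s")
        sleep(delay)
        delay = min(delay * 2, max_interval)
```

A typical call would be `wait_for_task(lambda: fetch_task(task_id, api_key))`. Passing the fetch step as a callable keeps the loop independent of the HTTP client, which also makes it easy to exercise without network access.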

Main workflow

- Document conversion: Upload a file and choose the conversion model and target chunk length; the service processes the file and returns structured data.
- OCR: Choose an OCR strategy at upload time; the service then recognizes the text in the document and generates bounding boxes.
- Viewing results: Retrieve the converted data through the API or the Web UI, in HTML or Markdown format.
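
To feed the result into a RAG pipeline, the Markdown for each chunk usually needs to be pulled out of the task JSON. The response schema is not described in this article, so the sketch below assumes a hypothetical layout in which `output.chunks` is a list, each chunk holds a `segments` list, and each segment carries a `markdown` string. Adjust the field names to the actual response; `chunk_markdown` is a hypothetical helper.

```python
from typing import Any, Dict, List


def chunk_markdown(task: Dict[str, Any]) -> List[str]:
    """Return one Markdown string per non-empty chunk.

    ASSUMED (hypothetical) layout:
        {"output": {"chunks": [{"segments": [{"markdown": "..."}]}]}}
    """
    chunks = (task.get("output") or {}).get("chunks") or []
    texts = []
    for chunk in chunks:
        parts = [seg.get("markdown") for seg in chunk.get("segments", [])]
        # Skip segments with missing or whitespace-only Markdown.
        text = "\n\n".join(p.strip() for p in parts if p and p.strip())
        if text:
            texts.append(text)
    return texts


# Hand-written sample in the assumed layout, not real API output.
sample_task = {
    "status": "Succeeded",
    "output": {
        "chunks": [
            {"segments": [{"markdown": "# Report"}, {"markdown": "First paragraph.\n"}]},
            {"segments": [{"markdown": None}, {"markdown": "  "}]},
            {"segments": [{"markdown": "| a | b |"}]},
        ]
    },
}

for text in chunk_markdown(sample_task):
    print(repr(text))
```

Each returned string can then be embedded and indexed as one retrieval unit. The same walk would work for an HTML field if the response provides one.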

Chunkr provides documentation and sample code to help users get started and integrate it into existing systems. Both individual developers and enterprise teams can use it to process and convert documents.