General Introduction
NV Ingest (NVIDIA Ingest) is a suite of early access microservices designed for parsing hundreds of thousands of complex, messy, unstructured PDFs and other enterprise documents. It transforms these documents into metadata and text for embedding in retrieval systems. NVIDIA Ingest supports parsing of PDF, Word, and PowerPoint documents, using NVIDIA NIM microservices to find, contextualize, and extract text, tables, charts, and images for use by downstream generative applications. The service parallelizes processing, splits documents into pages, categorizes content (e.g., tables, charts, images, text), and extracts it into well-defined JSON schemas using optical character recognition (OCR). Optionally, NVIDIA Ingest also computes embeddings for the extracted content and stores them in the Milvus vector database.
Function List
- Parses PDF, Word, and PowerPoint documents
- Finds, contextualizes, and extracts text, tables, charts, and images with NVIDIA NIM microservices
- Parallelizes document processing, splits documents into pages, and categorizes content
- Extracts content via OCR and converts it into well-defined JSON schemas
- Supports multiple extraction methods per document type to balance throughput and accuracy
- Supports a variety of pre-processing and post-processing operations, including text splitting and chunking, conversion and filtering, embedding generation, and image offloading to storage
- Optionally computes embeddings for the extracted content and stores them in the Milvus vector database
Usage Guide
Installation Process
- Clone the NVIDIA Ingest repository:
git clone https://github.com/NVIDIA/nv-ingest.git
- Go to the project directory:
cd nv-ingest
- Install dependencies:
pip install -r requirements.txt
- Configure environment variables:
source setup_env.sh
- Start the service:
docker-compose up
Usage Process
- Submit document parsing tasks:
- Submit a JSON job description containing the document payload and the parsing tasks via the API.
- Example JSON job description:
{ "document_payload": "base64_encoded_document", "document_tasks": ["parse_text", "extract_metadata"], "ingestion_tasks": ["parse_text", "extract_metadata"] }
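The job description above can be assembled programmatically. The following is a minimal sketch, assuming the field names shown in the example ("document_payload", "document_tasks", "ingestion_tasks") are what the API expects; the helper name build_job_description is hypothetical and not part of NVIDIA Ingest.

```python
import base64
import json


def build_job_description(document_bytes: bytes, tasks: list[str]) -> str:
    """Return a JSON job description with the document base64-encoded.

    The key names mirror the example in this article; the real service
    schema may differ, so treat them as assumptions.
    """
    payload = base64.b64encode(document_bytes).decode("ascii")
    job = {
        "document_payload": payload,
        "document_tasks": list(tasks),
        "ingestion_tasks": list(tasks),
    }
    return json.dumps(job)


# Placeholder bytes stand in for the contents of a real document file.
job_json = build_job_description(
    b"%PDF-1.4 placeholder bytes",
    ["parse_text", "extract_metadata"],
)
```

Base64 encoding keeps the binary document safe inside a JSON string, at the cost of roughly one third more bytes on the wire.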
- Retrieve parse results:
- Retrieve the results of the job via the API. The response is a JSON dictionary containing extracted object metadata, processing annotations, and timing/trace data.
- Sample API call:
curl -X GET "http://localhost:5000/api/results/{job_id}"
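The same request can be made from Python. This is a sketch using only the standard library, assuming the host, port, and path from the curl example and that the endpoint returns a JSON body; the function names are hypothetical.

```python
import json
import urllib.request
from typing import Callable
from urllib.parse import quote

BASE_URL = "http://localhost:5000"  # host and port taken from the curl example


def results_url(job_id: str, base_url: str = BASE_URL) -> str:
    """Build the results URL for a job, escaping the ID for use in a path."""
    return f"{base_url}/api/results/{quote(job_id, safe='')}"


def http_get(url: str) -> str:
    """Perform a plain GET and return the response body as text."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8")


def fetch_results(job_id: str, get: Callable[[str], str] = http_get) -> dict:
    """Fetch and decode the JSON result dictionary for one job.

    The `get` parameter allows a stand-in transport to be supplied,
    which makes the function usable without a running service.
    """
    return json.loads(get(results_url(job_id)))
```

Calling fetch_results("some-job-id") against a running service would return the result dictionary; how the service signals a job that is still in progress is not covered by this article, so retry logic is left out.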
- Supported document types and extraction methods:
- PDF documents: extraction via pdfium, Unstructured.io, and Adobe Content Extraction Services is supported.
- Word documents: extraction via the Microsoft Office API is supported.
- PowerPoint documents: extraction via the Microsoft Office API is supported.
- Images: extraction via OCR is supported.
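The mapping of document types to extraction methods listed above can be written down as a small lookup table. This sketch is illustrative only: the file extensions and the function name are assumptions, and the method names are labels copied from the list, not identifiers accepted by the service.

```python
from pathlib import Path

# Extraction methods per file extension, as listed in this article.
# The extensions themselves are an assumption made for illustration.
EXTRACTION_METHODS = {
    ".pdf": ["pdfium", "Unstructured.io", "Adobe Content Extraction Services"],
    ".docx": ["Microsoft Office API"],
    ".pptx": ["Microsoft Office API"],
    ".png": ["OCR"],
    ".jpg": ["OCR"],
    ".jpeg": ["OCR"],
}


def extraction_methods_for(filename: str) -> list[str]:
    """Return the extraction methods listed for a file, based on its extension."""
    suffix = Path(filename).suffix.lower()
    if suffix not in EXTRACTION_METHODS:
        raise ValueError(f"Unsupported document type: {suffix or filename}")
    # Return a copy so callers cannot mutate the shared table.
    return list(EXTRACTION_METHODS[suffix])
```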
- Pre- and post-processing operations:
- Text splitting and chunking: splits long text into smaller chunks for better processing and analysis.
- Conversion and filtering: converts and filters the extracted text to improve data quality.
- Embedding generation: computes embeddings of the extracted content for storage and retrieval in a vector database.
- Image offload to storage: offloads extracted images to external storage for further processing and analysis.
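To make the text splitting and chunking step concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a generic technique, not NVIDIA Ingest's own splitter; the function name, the character-based sizes, and the default values are assumptions.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into character chunks of at most chunk_size,
    where consecutive chunks share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk reaches the end of the text, so no
        # trailing chunk is emitted that only repeats the overlap.
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap keeps a sentence that straddles a chunk boundary visible in both neighbors, which tends to help retrieval at the cost of storing some text twice.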