NV Ingest: Parsing complex format documents and extracting multimodal data into metadata and text

General Introduction

NV Ingest (NVIDIA Ingest) is a suite of early access microservices designed for parsing hundreds of thousands of complex, messy, unstructured PDFs and other enterprise documents. It transforms these documents into metadata and text for embedding in retrieval systems. NVIDIA Ingest supports parsing of PDF, Word, and PowerPoint documents, using NVIDIA NIM microservices to find, contextualize, and extract text, tables, charts, and images for use by downstream generative applications. The service parallelizes processing: it splits documents into pages, categorizes content (e.g., tables, charts, images, text), and extracts it into well-defined JSON schemas using optical character recognition (OCR). NVIDIA Ingest can also optionally compute embeddings for the extracted content and store them in the vector database Milvus.

Documentation: https://docs.nvidia.com/nv-ingest/

Function List

Parses PDF, Word, and PowerPoint documents
Finds, contextualizes, and extracts text, tables, charts, and images with NVIDIA NIM microservices
Parallelizes processing, splits documents into pages, and categorizes content
Extracts content via OCR and converts it to a well-defined JSON schema
Supports multiple extraction methods per document type to balance throughput and accuracy
Supports a variety of pre-processing and post-processing operations, including text splitting and chunking, conversion and filtering, embedding generation, and image offloading to storage
Optionally computes embeddings for the extracted content and stores them in the vector database Milvus

Usage Guide

Installation process

Clone the NVIDIA Ingest repository:

   git clone https://github.com/NVIDIA/nv-ingest.git

Change to the project directory:

   cd nv-ingest

Install dependencies:

   pip install -r requirements.txt

Configure environment variables:

   source setup_env.sh

Start the service:

   docker-compose up

Usage Process

Submitting document parsing tasks:
- Submit a JSON job description containing the document payload and the ingestion tasks to run on it via the API.
- Example JSON job description:
```
{
  "document_payload": "base64_encoded_document",
  "ingestion_tasks": ["parse_text", "extract_metadata"]
}
```
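As a minimal sketch, a job description of this shape could be assembled in Python as follows. The field names `document_payload` and `ingestion_tasks` are taken from the example above; `build_job` is a hypothetical helper, not part of NV Ingest, and the schema the service actually accepts may differ, so check the official documentation before relying on it.

```python
import base64
import json


def build_job(document_bytes: bytes, tasks: list[str]) -> str:
    """Return a JSON job description shaped like the example above.

    Hypothetical helper for illustration; the real schema may differ.
    """
    job = {
        # Binary documents are base64-encoded so they can travel inside JSON.
        "document_payload": base64.b64encode(document_bytes).decode("ascii"),
        "ingestion_tasks": list(tasks),
    }
    return json.dumps(job)


# In practice the bytes would come from open(path, "rb").read().
job_json = build_job(b"%PDF-1.7 example bytes", ["parse_text", "extract_metadata"])
```

Encoding the file as base64 text keeps the whole job in a single JSON document, at the cost of roughly a one-third increase in payload size.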
Retrieving parse results:
- Retrieve the results of a job via the API. The response is a JSON dictionary containing metadata for the extracted objects, processing annotations, and timing/trace data.
- Sample API call:
```
curl -X GET "http://localhost:5000/api/results/{job_id}"
```
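The same request can be sketched in Python with only the standard library. The host, port, and path are copied from the curl example above and are not verified against a running service; `results_url`, `http_get`, and `fetch_results` are hypothetical names introduced for illustration.

```python
import json
import urllib.request
from typing import Callable
from urllib.parse import quote

BASE_URL = "http://localhost:5000"  # assumption: host and port from the curl example


def results_url(job_id: str, base_url: str = BASE_URL) -> str:
    """Build the results URL for a job, escaping the job ID for use in a path."""
    return f"{base_url}/api/results/{quote(job_id, safe='')}"


def http_get(url: str) -> str:
    """Perform a plain GET and return the response body as text."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8")


def fetch_results(job_id: str, get: Callable[[str], str] = http_get) -> dict:
    """Fetch a job's results and parse the JSON dictionary it returns."""
    return json.loads(get(results_url(job_id)))
```

Passing the HTTP function in as `get` lets the parsing logic be exercised with a stub instead of a live service.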
Supported document types and extraction methods:
- PDF documents: extraction via pdfium, Unstructured.io, and Adobe Content Extraction Services.
- Word documents: extraction via the Microsoft Office API.
- PowerPoint documents: extraction via the Microsoft Office API.
- Images: extraction via OCR.
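The mapping from document type to extraction method listed above can be pictured as a lookup table. The sketch below is purely illustrative: the method labels (`pdfium`, `unstructured_io`, `adobe`, `ms_office`, `ocr`), the file extensions, and the `choose_method` helper are assumptions made for this example, not identifiers from the NV Ingest API.

```python
from pathlib import Path
from typing import Optional

# Illustrative labels for the options listed above; not NV Ingest identifiers.
# The first entry for each type is treated as the default.
EXTRACTION_METHODS = {
    ".pdf": ["pdfium", "unstructured_io", "adobe"],
    ".docx": ["ms_office"],
    ".pptx": ["ms_office"],
    ".png": ["ocr"],
    ".jpg": ["ocr"],
}


def choose_method(filename: str, preferred: Optional[str] = None) -> str:
    """Pick an extraction method for a file based on its extension."""
    suffix = Path(filename).suffix.lower()
    methods = EXTRACTION_METHODS.get(suffix)
    if methods is None:
        raise ValueError(f"unsupported document type: {filename}")
    if preferred is None:
        return methods[0]
    if preferred not in methods:
        raise ValueError(f"{preferred!r} is not available for {suffix} files")
    return preferred
```

Offering several methods for one type, as the PDF entry does, is what allows a caller to trade throughput against accuracy per job.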
Pre- and post-processing operations:
- Text splitting and chunking: splits long text into smaller chunks for downstream processing and analysis.
- Conversion and filtering: converts and filters the extracted text to improve data quality.
- Embedding generation: computes embeddings of the extracted content for storage and retrieval in a vector database.
- Image offload to storage: offloads extracted images to external storage for further processing and analysis.
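To make the text splitting and chunking step concrete, here is a minimal character-based sketch with overlapping windows. It is not NV Ingest's implementation: `split_text` and its parameters are hypothetical, and production pipelines commonly split on tokens or sentence boundaries instead of raw characters.

```python
def split_text(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters so that content cut at a
    boundary still appears whole in one of the two neighbouring chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text, so no
        # trailing chunk is emitted that only repeats already-covered text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks
```

For example, `split_text("abcdefghij", chunk_size=4, overlap=1)` returns `["abcd", "defg", "ghij"]`, with each chunk starting on the last character of the previous one.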
