
NV Ingest: Parsing complex format documents and extracting multimodal data into metadata and text

General Introduction

NV Ingest (NVIDIA Ingest) is a suite of early-access microservices for parsing hundreds of thousands of complex, messy, unstructured PDFs and other enterprise documents, transforming them into metadata and text for embedding into retrieval systems. NVIDIA Ingest supports parsing PDF, Word, and PowerPoint documents, using NVIDIA NIM microservices to find, contextualize, and extract text, tables, charts, and images for use by downstream generative applications. The service parallelizes processing, splits documents into pages, categorizes content (e.g., tables, charts, images, text), and extracts it into well-defined JSON schemas using optical character recognition (OCR). NVIDIA Ingest can also optionally compute embeddings of the extracted content and store them in the Milvus vector database.


Documentation: https://docs.nvidia.com/nv-ingest/


 

Feature List

  • Parses PDF, Word, and PowerPoint documents
  • Finds, contextualizes, and extracts text, tables, charts, and images using NVIDIA NIM microservices
  • Parallelizes processing, splitting documents into pages and categorizing content
  • Extracts content via OCR and converts it into well-defined JSON schemas
  • Supports multiple extraction methods per document type to balance throughput and accuracy
  • Supports a variety of pre- and post-processing operations, including text splitting and chunking, transformation and filtering, embedding generation, and image offloading to storage
  • Optionally computes embeddings of extracted content and stores them in the Milvus vector database

 

Usage Guide

Installation process

  1. Clone the NVIDIA Ingest repository:
   git clone https://github.com/NVIDIA/nv-ingest.git
  2. Go to the project directory:
   cd nv-ingest
  3. Install dependencies:
   pip install -r requirements.txt
  4. Configure environment variables:
   source setup_env.sh
  5. Start the services:
   docker-compose up

Usage Process

  1. Submit a document parsing job:
    • Submit a JSON job description containing the document payload and ingestion tasks via the API.
    • Example JSON job description:
     {
       "document_payload": "base64_encoded_document",
       "ingestion_tasks": ["parse_text", "extract_metadata"]
     }
    
  2. Retrieve parsing results:
    • Retrieve the job's results via the API; the response is a JSON dictionary containing extracted object metadata, processing annotations, and timing/trace data.
    • Example API call:
     curl -X GET "http://localhost:5000/api/results/{job_id}"
    
  3. Supported document types and extraction methods:
    • PDF documents: extraction via pdfium, Unstructured.io, and Adobe Content Extraction Services.
    • Word documents: extraction via the Microsoft Office API.
    • PowerPoint documents: extraction via the Microsoft Office API.
    • Images: extraction via OCR.
  4. Pre- and post-processing operations:
    • Text splitting and chunking: split long text into smaller chunks for better processing and analysis.
    • Transformation and filtering: transform and filter the extracted text to improve data quality.
    • Embedding generation: compute embeddings of extracted content for storage and retrieval in a vector database.
    • Image offloading to storage: offload extracted images to external storage for further processing and analysis.
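The JSON job description from step 1 can be assembled programmatically. The sketch below is illustrative only: the field names are taken from the example above, and `build_job_description` is a hypothetical helper, not part of the NV Ingest client API; consult the official documentation for the actual submission endpoint and client library.

```python
import base64
import json

def build_job_description(doc_bytes: bytes, tasks: list[str]) -> str:
    """Build a JSON job description like the example above: the raw
    document is base64-encoded and paired with a list of ingestion tasks."""
    payload = {
        "document_payload": base64.b64encode(doc_bytes).decode("ascii"),
        "ingestion_tasks": tasks,
    }
    return json.dumps(payload)

# Example: describe a job that parses text and extracts metadata.
job = build_job_description(b"%PDF-1.7 ...", ["parse_text", "extract_metadata"])
print(json.loads(job)["ingestion_tasks"])  # ['parse_text', 'extract_metadata']
```

The resulting string can then be POSTed to the service; the retrieval call in step 2 polls for the finished result by job ID.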

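The text splitting and chunking step listed above can be sketched as follows. This is a generic, assumed illustration of overlapping fixed-size chunking for retrieval pipelines, not NV Ingest's actual splitter implementation:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks with overlap, so that
    content cut at a boundary still appears intact in a neighboring chunk."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

chunks = chunk_text("A" * 500, chunk_size=200, overlap=50)
print(len(chunks))  # 4
```

In practice, chunk sizes are often chosen in tokens rather than characters, tuned to the context window of the downstream embedding model.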
May not be reproduced without permission: Chief AI Sharing Circle, "NV Ingest: Parsing complex format documents and extracting multimodal data into metadata and text"
