
OmniParse: extract any unstructured data from documents/multimedia and parse it into structured data

General Introduction

OmniParse is a data parsing platform that ingests unstructured data and converts it into structured, actionable data optimized for GenAI (generative AI) frameworks. Whether you are working with documents, tables, images, videos, audio files, or web pages, OmniParse prepares your data to be clean, structured, and ready for AI applications such as RAG (Retrieval-Augmented Generation) and fine-tuning.

Open-source demo (Google Colab): https://colab.research.google.com/github/adithya-s-k/omniparse/blob/main/examples/OmniParse_GoogleColab.ipynb

 

Features

  • Runs completely locally, with no external APIs required
  • Fits on a T4 GPU
  • Supports about 20 file types
  • Converts documents, multimedia, and web pages into high-quality structured Markdown
  • Table extraction, image extraction/captioning, audio/video transcription, and web page crawling
  • Easily deployable with Docker and SkyPilot
  • Colab friendly
  • Interactive UI powered by Gradio

Usage Guide

Installation

  1. Clone the repository::
    git clone https://github.com/adithya-s-k/omniparse
    cd omniparse
    
  2. Create a virtual environment::
    conda create -n omniparse-venv python=3.10
    conda activate omniparse-venv
    
  3. Install the dependencies::
    poetry install
    # or
    pip install -e .
    # or
    pip install -r pyproject.toml
    

Using Docker

  1. Pull the OmniParse API image from Docker Hub::
    docker pull savatar101/omniparse:0.1
    
  2. Run the Docker container, exposing port 8000::
    # if you are running on a GPU
    docker run --gpus all -p 8000:8000 savatar101/omniparse:0.1
    # otherwise
    docker run -p 8000:8000 savatar101/omniparse:0.1
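
After the container starts, the API may not be reachable immediately. As a minimal sketch (not part of OmniParse itself), the hypothetical helper `wait_for_port` below polls the mapped port 8000 until a TCP connection succeeds. A successful connection only shows that something is listening on the port; it does not prove that every model has finished loading.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to (host, port) succeeds, False on timeout.

    Hypothetical helper for illustration; OmniParse does not ship this function.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError while nothing accepts connections yet.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Example call (port 8000 matches the "-p 8000:8000" mapping used above):
#     wait_for_port("127.0.0.1", 8000, timeout=120.0)
```

The function is intentionally not called at import time, so it can be dropped into a deployment script and invoked right after `docker run`.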
    

Running the Server

  1. Start the server::
    python server.py --host 0.0.0.0 --port 8000 --documents --media --web
    
    • --documents: Loads all the models that parse and ingest documents (for example, the Surya OCR family of models and Florence-2).
    • --media: Loads the Whisper model to transcribe audio and video files.
    • --web: Sets up the Selenium crawler.
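
The three flags can be combined freely, so a small wrapper can be handy when the server is launched from a script. The sketch below assembles the command line shown above from keyword arguments. `build_server_command` is a hypothetical helper name, and it uses only the flags listed in this section.

```python
def build_server_command(host="0.0.0.0", port=8000, documents=False, media=False, web=False):
    """Return the argument list for launching server.py with the selected model groups.

    Hypothetical helper for illustration; only the flags described above are used.
    """
    cmd = ["python", "server.py", "--host", host, "--port", str(port)]
    if documents:
        cmd.append("--documents")  # document parsing models (Surya OCR, Florence-2)
    if media:
        cmd.append("--media")      # Whisper for audio/video transcription
    if web:
        cmd.append("--web")        # Selenium crawler
    return cmd


# Example: load only the document models and the crawler.
print(build_server_command(documents=True, web=True))
# ['python', 'server.py', '--host', '0.0.0.0', '--port', '8000', '--documents', '--web']
```

The returned list can be passed to `subprocess.run(...)` from the directory that contains `server.py`. Omitting a flag skips loading the corresponding models.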

Supported Data Types

  • Documents: .doc, .docx, .pdf, .ppt, .pptx
  • Images: .png, .jpg, .jpeg, .tiff, .bmp, .heic
  • Video: .mp4, .mkv, .avi, .mov
  • Audio: .mp3, .wav, .aac
  • Web pages: dynamic web pages, http://.com
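
The list above can be turned into a small lookup, for example to reject unsupported files before they are sent to the server. This is an illustrative sketch only: `SUPPORTED_TYPES` and `classify_input` are hypothetical names, the extension sets are copied from the list above, and treating `https://` URLs as web pages is an assumption.

```python
from pathlib import Path
from typing import Optional

# Extension sets copied from the supported data types listed above.
SUPPORTED_TYPES = {
    "document": {".doc", ".docx", ".pdf", ".ppt", ".pptx"},
    "image": {".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".heic"},
    "video": {".mp4", ".mkv", ".avi", ".mov"},
    "audio": {".mp3", ".wav", ".aac"},
}


def classify_input(source: str) -> Optional[str]:
    """Return the data category for a file path or URL, or None if unsupported."""
    # Assumption: both http:// and https:// URLs count as web pages.
    if source.lower().startswith(("http://", "https://")):
        return "web"
    suffix = Path(source).suffix.lower()  # case-insensitive extension match
    for category, extensions in SUPPORTED_TYPES.items():
        if suffix in extensions:
            return category
    return None


for item in ["report.PDF", "talk.mp3", "http://example.com", "notes.txt"]:
    print(item, "->", classify_input(item))
```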

Usage Examples

  1. Document parsing::
    python server.py --host 0.0.0.0 --port 8000 --documents
    

    This loads all of the document parsing models, so the server is ready to process document files.

  2. Media parsing::
    python server.py --host 0.0.0.0 --port 8000 --media
    

    This loads the Whisper model, so the server is ready to transcribe audio and video files.

  3. Web crawling::
    python server.py --host 0.0.0.0 --port 8000 --web
    

    This sets up the Selenium crawler, so the server is ready to process web content.
