OmniParse: extract any unstructured data from documents/multimedia and parse it into structured data

Latest AI Resources · Updated 9 months ago · AI Sharing Circle

General Introduction

OmniParse is a data parsing and optimization platform that transforms unstructured data into structured, actionable data optimized for GenAI (Generative Artificial Intelligence) frameworks. Whether you are working with documents, tables, images, videos, audio files, or web content, OmniParse makes your data clean, structured, and ready for AI applications such as RAG (Retrieval-Augmented Generation) and fine-tuning.

Open-source demo (Google Colab): https://colab.research.google.com/github/adithya-s-k/omniparse/blob/main/examples/OmniParse_GoogleColab.ipynb

Feature List

- Runs entirely locally, with no external APIs required
- Runs on a T4 GPU
- Supports about 20 file types
- Converts documents, multimedia, and web pages into high-quality structured Markdown
- Table extraction, image extraction/captioning, audio/video transcription, web crawling
- Easy deployment with Docker and SkyPilot
- Colab-friendly
- Interactive UI powered by Gradio

Usage Guide

Installation

Clone the repository::

    git clone https://github.com/adithya-s-k/omniparse
    cd omniparse

Create a virtual environment::

    conda create -n omniparse-venv python=3.10
    conda activate omniparse-venv

Install the dependencies::

    poetry install
    # or
    pip install -e .
    # or
    pip install -r pyproject.toml

Using Docker

Pull the OmniParse API image from Docker Hub::
```
docker pull savatar101/omniparse:0.1
```

Run the Docker container, exposing port 8000::

    # With a GPU
    docker run --gpus all -p 8000:8000 savatar101/omniparse:0.1
    # Otherwise (CPU only)
    docker run -p 8000:8000 savatar101/omniparse:0.1
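
After the container starts, it may take some time before the API accepts connections. The following is a minimal sketch, assuming the `-p 8000:8000` mapping shown above, that polls until the TCP port is open. The helper name `wait_for_port` and its timeout values are illustrative and not part of OmniParse; an open port only shows that something is listening, not that every model has finished loading.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError while nothing is listening.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Example call (port 8000 matches the Docker port mapping above):
#     wait_for_port("127.0.0.1", 8000, timeout=120.0)
```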

Running the Server

Start the server::
```
python server.py --host 0.0.0.0 --port 8000 --documents --media --web
```
- --documents: Loads all the models that help parse and ingest documents (e.g., the Surya OCR family of models and Florence-2).
- --media: Loads the Whisper model to transcribe audio and video files.
- --web: Sets up the Selenium crawler.
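
Since each flag loads its own set of models, it can help to start only the components you need. Below is a small sketch that assembles the `server.py` command line from Python. The function name `build_server_command` is hypothetical; only the flags listed above are used.

```python
import shlex


def build_server_command(host="0.0.0.0", port=8000, documents=False, media=False, web=False):
    """Assemble the argument list for `python server.py` from the chosen components."""
    cmd = ["python", "server.py", "--host", host, "--port", str(port)]
    if documents:
        cmd.append("--documents")  # document models (Surya OCR family, Florence-2)
    if media:
        cmd.append("--media")      # Whisper for audio/video transcription
    if web:
        cmd.append("--web")        # Selenium crawler
    return cmd


# Show the command that would be run for documents + media only.
print(shlex.join(build_server_command(documents=True, media=True)))
```

The returned list can be passed to `subprocess.run(...)` from the repository root, which avoids shell quoting issues.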

Supported Data Types

- Documents: .doc, .docx, .pdf, .ppt, .pptx
- Images: .png, .jpg, .jpeg, .tiff, .bmp, .heic
- Video: .mp4, .mkv, .avi, .mov
- Audio: .mp3, .wav, .aac
- Web pages: dynamic web pages, http://.com
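
The list above can be turned into a simple lookup for deciding how to handle an input before sending it to the server. This is only a sketch: the category names and the `classify_input` helper are illustrative, and the extension sets are copied from the list above rather than read from OmniParse itself.

```python
from pathlib import Path
from urllib.parse import urlparse

# Extensions as listed in this article; not queried from OmniParse.
SUPPORTED_EXTENSIONS = {
    "document": {".doc", ".docx", ".pdf", ".ppt", ".pptx"},
    "image": {".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".heic"},
    "video": {".mp4", ".mkv", ".avi", ".mov"},
    "audio": {".mp3", ".wav", ".aac"},
}


def classify_input(source: str):
    """Return the data-type category for a file path or URL, or None if unsupported."""
    if urlparse(source).scheme in ("http", "https"):
        return "web"
    suffix = Path(source).suffix.lower()  # case-insensitive match
    for category, extensions in SUPPORTED_EXTENSIONS.items():
        if suffix in extensions:
            return category
    return None
```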

Usage Examples

Document parsing::
```
python server.py --host 0.0.0.0 --port 8000 --documents
```
This loads all the document-parsing models, ready to process document files.

Media parsing::
```
python server.py --host 0.0.0.0 --port 8000 --media
```
This loads the Whisper model, ready to process audio and video files.

Web crawling::
```
python server.py --host 0.0.0.0 --port 8000 --web
```
This sets up the Selenium crawler, ready to process web content.
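
Once the server is running, files are sent to it over HTTP. The article does not list the API routes, so the sketch below rests on assumptions: the endpoint path (for example `/parse_document`) and the form field name `file` are placeholders that should be checked against the project's API documentation. It uses only the Python standard library to build a multipart upload; the function names are hypothetical.

```python
import uuid
from pathlib import Path
from urllib.request import Request, urlopen


def build_upload_request(base_url: str, endpoint: str, file_path: str) -> Request:
    """Build (but do not send) a multipart/form-data POST carrying one file."""
    path = Path(file_path)
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        # Field name "file" is an assumption, not confirmed by the article.
        f'Content-Disposition: form-data; name="file"; filename="{path.name}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return Request(
        url=base_url.rstrip("/") + endpoint,
        data=head + path.read_bytes() + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def parse_file(base_url: str, endpoint: str, file_path: str) -> str:
    """Send the file and return the raw response body as text."""
    with urlopen(build_upload_request(base_url, endpoint, file_path)) as response:
        return response.read().decode("utf-8")


# Example call (endpoint path is assumed, see the note above):
#     parse_file("http://localhost:8000", "/parse_document", "report.pdf")
```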