General Introduction
Vision Parse is a document processing tool that uses state-of-the-art vision language models (VLMs) to convert PDF documents into high-quality Markdown. It supports a range of leading models, including those from OpenAI, Meta (Llama), and Google (Gemini); it accurately extracts the text and tables in a document while preserving the original hierarchy, styling, and indentation. Vision Parse not only handles multi-page PDFs but also offers a local model deployment option, so users can process documents offline while keeping them secure. Its simple API design lets developers accomplish complex document conversion tasks in just a few lines of code, greatly improving both the efficiency and the accuracy of document processing.
Feature List
- Intelligent content extraction: uses advanced vision language models to accurately recognize and extract text and table content
- Formatting integrity: fully preserves document hierarchy, styles, and indentation
- Multi-model support: compatible with OpenAI, Llama, Gemini, and other vision language model providers
- Multi-page PDF processing: converts each page of a multi-page PDF into a base64-encoded image for processing
- Local model deployment: supports running models locally through Ollama for document security and offline use
- Custom configuration: supports custom PDF processing parameters such as DPI and color space
- Flexible API: provides a simple and intuitive Python API
Usage Guide
1. Installation
Basic requirements:
- Python 3.9 or higher
- Ollama, if you want to use a local model (a quick connectivity check is sketched after this list)
- An appropriate API key, if you want to use OpenAI or Google Gemini
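Before pointing Vision Parse at a local model, it helps to confirm that the Ollama server is actually reachable. The following is a minimal, generic connectivity check, assuming Ollama's default address of http://localhost:11434; it is not part of the Vision Parse API.
import urllib.request

# Probe the default Ollama endpoint (assumes the default host/port and no auth).
try:
    with urllib.request.urlopen("http://localhost:11434", timeout=3) as resp:
        print(resp.read().decode())  # Ollama typically replies "Ollama is running"
except OSError:
    print("Ollama does not appear to be running locally")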
Installation Steps:
- Use pip to install the base package:
pip install vision-parse
- Install additional dependencies as needed:
- OpenAI Support:
pip install 'vision-parse[openai]'
- Gemini Support:
pip install 'vision-parse[gemini]'
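After installing an extra, the corresponding API key typically needs to be available to the process. A minimal sketch, assuming the standard environment variable names read by each provider's SDK (OPENAI_API_KEY, GOOGLE_API_KEY); whether Vision Parse picks these up automatically depends on the library, so check its documentation:
import os

# Make the keys visible to the provider SDKs (assumed standard variable names).
os.environ["OPENAI_API_KEY"] = "your-openai-api-key"  # placeholder, not a real key
os.environ["GOOGLE_API_KEY"] = "your-gemini-api-key"  # placeholder, not a real key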
2. Basic usage
Sample code:
from vision_parse import VisionParser

# Initialize the parser with a local model served by Ollama
parser = VisionParser(
    model_name="llama3.2-vision:11b",
    temperature=0.4,
    top_p=0.3,
    extraction_complexity=False  # set to True for more detailed extraction results
)

# Convert a PDF file
pdf_path = "your_document.pdf"
markdown_pages = parser.convert_pdf(pdf_path)

# Process the conversion results page by page
for i, page_content in enumerate(markdown_pages):
    print(f"\n--- Page {i+1} ---\n{page_content}")
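Because convert_pdf returns one Markdown string per page, a common follow-up is to join the pages and save them to disk. A small sketch (the output filename is arbitrary):
# Join the per-page Markdown strings and write them to a file.
with open("your_document.md", "w", encoding="utf-8") as f:
    f.write("\n\n".join(markdown_pages))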
3. Advanced configuration
PDF page configuration:
from vision_parse import VisionParser, PDFPageConfig

# Configure PDF processing settings
page_config = PDFPageConfig(
    dpi=400,  # the DPI parameter mentioned above; the value is illustrative
    color_space="RGB",
    include_annotations=True,
    preserve_transparency=False
)

# Initialize the parser with a custom configuration
parser = VisionParser(
    model_name="llama3.2-vision:11b",
    temperature=0.7,
    top_p=0.4,
    page_config=page_config
)
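With the custom configuration in place, conversion works exactly as in the basic example; a short usage sketch:
# Convert with the customized parser (the filename is illustrative).
markdown_pages = parser.convert_pdf("your_document.pdf")
print(f"Converted {len(markdown_pages)} pages")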
4. Supported models
Vision Parse supports a number of mainstream vision language models:
- OpenAI models: gpt-4o, gpt-4o-mini
- Google Gemini models: gemini-1.5-flash, gemini-2.0-flash-exp, gemini-1.5-pro
- Meta Llama and LLaVA (via Ollama): llava:13b, llava:34b, llama3.2-vision:11b, llama3.2-vision:90b
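Switching providers is largely a matter of changing model_name (plus installing the matching extra and supplying the right API key). A brief sketch, treating the api_key parameter as an assumption to verify against the library's documentation:
from vision_parse import VisionParser

# Sketch: the same API with a cloud Gemini model (assumes the 'gemini' extra is
# installed and that VisionParser accepts an api_key argument).
parser = VisionParser(
    model_name="gemini-1.5-flash",
    api_key="your-gemini-api-key",  # placeholder, not a real key
)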
5. Usage tips
- Choose the right model: pick a local model or a cloud service according to your needs
- Tune the parameters: the creativity and accuracy of the output are adjusted via the temperature and top_p parameters
- Extraction complexity: for complex documents, setting extraction_complexity=True is recommended
- Local deployment: for sensitive documents, consider using Ollama to deploy a model locally
- PDF configuration: adjust parameters such as DPI and color space to match the document's characteristics (several tips are combined in the sketch after this list)
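Putting several of these tips together, a configuration for a complex, sensitive document might look like the following sketch: a local Ollama model, detailed extraction enabled, and rendering parameters tuned up (the specific values are illustrative):
from vision_parse import VisionParser, PDFPageConfig

# Sketch: combining the tips above for a complex, sensitive document.
page_config = PDFPageConfig(
    dpi=400,  # illustrative; raise for fine print, lower for speed
    color_space="RGB"
)

parser = VisionParser(
    model_name="llama3.2-vision:11b",  # local model keeps the document offline
    temperature=0.2,  # a lower temperature favors faithful extraction
    top_p=0.3,
    extraction_complexity=True,  # more detailed extraction, as recommended above
    page_config=page_config
)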