General Introduction
Vision Parse is a document processing tool that uses state-of-the-art vision language models (VLMs) to convert PDF documents into high-quality Markdown. It supports a range of leading models, including those from OpenAI, Meta (Llama), and Google (Gemini); it accurately extracts the text and tables in a document while preserving the original hierarchy, styling, and indentation. Vision Parse not only handles multi-page PDFs but also offers a local model deployment option, so users can process documents offline while keeping them secure. Its simple API design lets developers accomplish complex document conversion tasks in just a few lines of code, greatly improving both the efficiency and the accuracy of document processing.
Feature List
- Intelligent content extraction: uses advanced vision language models to accurately recognize and extract text and table content
- Formatting integrity: fully preserves document hierarchy, styles, and indentation
- Multi-model support: compatible with OpenAI, Llama, Gemini, and other vision language model providers
- Multi-page PDF processing: converts each page of a multi-page PDF into a base64-encoded image for processing
- Local model deployment: supports running models locally through Ollama for document security and offline use
- Custom configuration: supports custom PDF processing parameters such as DPI and color space
- Flexible API: provides a simple and intuitive Python API
Usage Guide
1. Installation
Basic requirements:
- Python 3.9 or higher
- Ollama, if you want to use a local model (a quick connectivity check is sketched after this list)
- An appropriate API key, if you want to use OpenAI or Google Gemini
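Before pointing Vision Parse at a local model, it helps to confirm that the Ollama server is actually reachable. The following is a minimal, generic connectivity check, assuming Ollama's default address of http://localhost:11434; it is not part of the Vision Parse API.
import urllib.request

# Probe the default Ollama endpoint (assumes the default host/port and no auth).
try:
    with urllib.request.urlopen("http://localhost:11434", timeout=3) as resp:
        print(resp.read().decode())  # Ollama typically replies "Ollama is running"
except OSError:
    print("Ollama does not appear to be running locally")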
Installation Steps:
- Use pip to install the base package:
pip install vision-parse
- Install additional dependencies as needed:
- OpenAI Support:
pip install 'vision-parse[openai]'
- Gemini Support:
pip install 'vision-parse[gemini]'
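After installing an extra, the corresponding API key typically needs to be available to the process. A minimal sketch, assuming the standard environment variable names read by each provider's SDK (OPENAI_API_KEY, GOOGLE_API_KEY); whether Vision Parse picks these up automatically depends on the library, so check its documentation:
import os

# Make the keys visible to the provider SDKs (assumed standard variable names).
os.environ["OPENAI_API_KEY"] = "your-openai-api-key"  # placeholder, not a real key
os.environ["GOOGLE_API_KEY"] = "your-gemini-api-key"  # placeholder, not a real key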
2. Basic usage
Sample code:
from vision_parse import VisionParser

# Initialize the parser with a local model served by Ollama
parser = VisionParser(
    model_name="llama3.2-vision:11b",
    temperature=0.4,
    top_p=0.3,
    extraction_complexity=False  # set to True for more detailed extraction results
)

# Convert a PDF file
pdf_path = "your_document.pdf"
markdown_pages = parser.convert_pdf(pdf_path)

# Process the conversion results page by page
for i, page_content in enumerate(markdown_pages):
    print(f"\n--- Page {i+1} ---\n{page_content}")
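Because convert_pdf returns one Markdown string per page, a common follow-up is to join the pages and save them to disk. A small sketch (the output filename is arbitrary):
# Join the per-page Markdown strings and write them to a file.
with open("your_document.md", "w", encoding="utf-8") as f:
    f.write("\n\n".join(markdown_pages))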
3. Advanced configuration
PDF page configuration:
from vision_parse import VisionParser, PDFPageConfig

# Configure PDF processing settings
page_config = PDFPageConfig(
    dpi=400,  # the DPI parameter mentioned above; the value is illustrative
    color_space="RGB",
    include_annotations=True,
    preserve_transparency=False
)

# Initialize the parser with a custom configuration
parser = VisionParser(
    model_name="llama3.2-vision:11b",
    temperature=0.7,
    top_p=0.4,
    page_config=page_config
)
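With the custom configuration in place, conversion works exactly as in the basic example; a short usage sketch:
# Convert with the customized parser (the filename is illustrative).
markdown_pages = parser.convert_pdf("your_document.pdf")
print(f"Converted {len(markdown_pages)} pages")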
4. Supported models
Vision Parse supports a number of mainstream vision language models:
- OpenAI models: gpt-4o, gpt-4o-mini
- Google Gemini models: gemini-1.5-flash, gemini-2.0-flash-exp, gemini-1.5-pro
- Meta Llama and LLaVA (via Ollama): llava:13b, llava:34b, llama3.2-vision:11b, llama3.2-vision:90b
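Switching providers is largely a matter of changing model_name (plus installing the matching extra and supplying the right API key). A brief sketch, treating the api_key parameter as an assumption to verify against the library's documentation:
from vision_parse import VisionParser

# Sketch: the same API with a cloud Gemini model (assumes the 'gemini' extra is
# installed and that VisionParser accepts an api_key argument).
parser = VisionParser(
    model_name="gemini-1.5-flash",
    api_key="your-gemini-api-key",  # placeholder, not a real key
)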
5. Usage tips
- Choose the right model: pick a local model or a cloud service according to your needs
- Tune the parameters: the creativity and accuracy of the output are adjusted via the temperature and top_p parameters
- Extraction complexity: for complex documents, setting extraction_complexity=True is recommended
- Local deployment: for sensitive documents, consider using Ollama to deploy a model locally
- PDF configuration: adjust parameters such as DPI and color space to match the document's characteristics (several tips are combined in the sketch after this list)
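Putting several of these tips together, a configuration for a complex, sensitive document might look like the following sketch: a local Ollama model, detailed extraction enabled, and rendering parameters tuned up (the specific values are illustrative):
from vision_parse import VisionParser, PDFPageConfig

# Sketch: combining the tips above for a complex, sensitive document.
page_config = PDFPageConfig(
    dpi=400,  # illustrative; raise for fine print, lower for speed
    color_space="RGB"
)

parser = VisionParser(
    model_name="llama3.2-vision:11b",  # local model keeps the document offline
    temperature=0.2,  # a lower temperature favors faithful extraction
    top_p=0.3,
    extraction_complexity=True,  # more detailed extraction, as recommended above
    page_config=page_config
)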