Docling: Parse Documents in Many Formats and Export to Markdown and JSON, with OCR Support for PDFs



General Introduction

Docling is a document parsing and export tool that supports a wide range of input formats, including PDF, DOCX, PPTX, XLSX, images, HTML, AsciiDoc, and Markdown. It parses these documents and exports them to HTML, Markdown, and JSON, with support for embedding and referencing images. Docling provides advanced PDF understanding, including page layout, reading order, and table structure, and it supports OCR for scanned PDF documents. It integrates with LlamaIndex and LangChain for building RAG/QA applications, and it ships with a simple command line interface (CLI).


Feature List

Parse multiple document formats (PDF, DOCX, PPTX, XLSX, images, HTML, AsciiDoc, Markdown)
Export to HTML, Markdown and JSON formats
Advanced PDF document understanding (page layout, reading order, table structure)
OCR support for parsing scanned PDFs
Unified DoclingDocument representation format
Easy integration with LlamaIndex and LangChain
Simple and convenient command line interface (CLI)

Usage Guide

Installation

To use Docling, install the docling package with a package manager such as pip:

pip install docling

Docling is available for macOS, Linux and Windows environments and supports x86_64 and arm64 architectures. Detailed installation instructions can be found in the official documentation.
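
After installing, it can help to confirm that the package is visible to the Python interpreter you actually intend to use, especially when virtual environments are involved. The following is a minimal sketch that relies only on the standard library; the helper name installed_version is hypothetical and is not part of Docling.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("docling")
    if version is None:
        print("docling is not installed in this environment")
    else:
        print(f"docling {version} is installed")
```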

Guidelines for use

Converting a single document

To convert a single document, use the convert() method, for example:

from docling.document_converter import DocumentConverter

source = "path/to/document.pdf"  # local path or URL of the document
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown())  # print the converted document as Markdown
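
Building on the example above, the converted document can also be written to disk as Markdown or JSON. The sketch below is one possible way to do this, not the official recipe: the helper names export_document and convert_and_save are hypothetical, the source is assumed to be a local file path, and export_to_dict() is assumed to be available on the converted document alongside export_to_markdown().

```python
import json
from pathlib import Path

# Hypothetical mapping from a format name to the file suffix used on disk.
SUFFIXES = {"markdown": ".md", "json": ".json"}


def export_document(document, fmt: str) -> str:
    """Serialize a converted document as 'markdown' or 'json' text."""
    if fmt == "markdown":
        return document.export_to_markdown()
    if fmt == "json":
        # export_to_dict() is assumed to return plain, JSON-serializable data.
        return json.dumps(document.export_to_dict(), ensure_ascii=False, indent=2)
    raise ValueError(f"unsupported format: {fmt!r}")


def convert_and_save(source: str, out_dir: str, fmt: str = "markdown") -> Path:
    """Convert one local document and write the result into out_dir."""
    if fmt not in SUFFIXES:
        raise ValueError(f"unsupported format: {fmt!r}")

    # Imported lazily so export_document can be used without docling installed.
    from docling.document_converter import DocumentConverter

    result = DocumentConverter().convert(source)
    target = Path(out_dir) / (Path(source).stem + SUFFIXES[fmt])
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(export_document(result.document, fmt), encoding="utf-8")
    return target
```

For instance, convert_and_save("path/to/document.pdf", "out", fmt="json") would be expected to write out/document.json, provided the conversion succeeds.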

Advanced Usage

Docling offers a rich set of advanced usage options that can be configured and extended as needed. Detailed instructions and examples can be found in the official documentation.

Typical workflow

Document parsing: Load a document into Docling and let the built-in parser extract its content.
Format conversion: Choose the export format you need (HTML, Markdown, or JSON) and call the corresponding export function.
OCR parsing: For scanned PDF documents, enable OCR to extract the text content.
Integration: Combine Docling with LlamaIndex or LangChain to build RAG/QA applications.
Command line: Use the CLI provided by Docling to parse and export documents quickly.
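
The OCR step can be made explicit when constructing the converter. The sketch below assumes the PdfPipelineOptions, PdfFormatOption, and InputFormat names and import paths found in Docling releases known to the editor; they may differ in your installed version, so check the official documentation. OCR may already be on by default, and switching it off for PDFs that have a text layer is an illustrative choice here, not a recommendation from Docling. The helper names pdf_pipeline_settings and build_pdf_converter are hypothetical.

```python
def pdf_pipeline_settings(scanned: bool) -> dict:
    """Choose pipeline flags: run OCR only when the PDF is a scan."""
    return {"do_ocr": scanned, "do_table_structure": True}


def build_pdf_converter(scanned: bool = True):
    """Create a DocumentConverter whose PDF pipeline uses the flags above."""
    # Imported lazily so this module can be loaded without docling installed.
    # These import paths are assumptions based on known Docling releases.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    options = PdfPipelineOptions(**pdf_pipeline_settings(scanned))
    return DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=options)}
    )


# Example usage (placeholder path, requires docling to be installed):
#   converter = build_pdf_converter(scanned=True)
#   result = converter.convert("path/to/scanned.pdf")
#   print(result.document.export_to_markdown())
```

Keeping the flags in a plain dictionary makes the choice easy to inspect or log before any models are loaded.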

Docling's parsing and export features cover a wide range of document processing needs. The official documentation and examples help users get started quickly and make full use of them.