Extracting Valuable Information from PDFs: Structured Outputs with Gemini 2.0

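Last week, Google DeepMind released Gemini 2.0, which includes Gemini 2.0 Flash (generally available), Gemini 2.0 Flash-Lite (new and cost-efficient), and Gemini 2.0 Pro (experimental). All models support an input context window of at least 1 million tokens and accept text, images, and audio, along with function calling and structured outputs. This article also serves as companion reading for "The Limitations of LLM OCR: Document Parsing Challenges Behind the Polished Surface".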

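This opens up a great use case for PDF processing. Converting PDFs into structured or machine-readable text has long been a major pain point. What if you could turn a PDF from a document into structured data? This is where Gemini 2.0 comes in.

In this tutorial, you will learn how to use Gemini 2.0 to extract structured information, such as an invoice number and date, directly from PDF documents:

Set up the environment and create an inference client
Work with PDFs and other files
Structured outputs with Gemini 2.0 and Pydantic
Extract structured data from PDFs with Gemini 2.0

1. Set up the environment and create an inference client

The first step is to install the google-genai Python SDK and obtain an API key. If you do not have an API key yet, you can get one from Google AI Studio (Get a Gemini API key).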

%pip install "google-genai>=1"

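Once you have the SDK and an API key, you can create a client and define the model to use: the new Gemini 2.0 Flash model, which is available through the free tier at 1,500 requests per day (as of February 6, 2025).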

from google import genai
# Create a client
api_key = "XXXXX"
client = genai.Client(api_key=api_key)

# Define the model you are going to use
model_id =  "gemini-2.0-flash" # or "gemini-2.0-flash-lite-preview-02-05"  , "gemini-2.0-pro-exp-02-05"
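
Hardcoding the key as in the snippet above works in a notebook, but the key leaks easily once the code is shared. A minimal sketch of one alternative, assuming the key is stored in an environment variable; the variable name `GEMINI_API_KEY` and the helper `load_api_key` are illustrative choices for this example, not part of the SDK:

```python
import os


def load_api_key(env_var: str = "GEMINI_API_KEY") -> str:
    """Return the API key stored in the given environment variable.

    The variable name is an assumption made for this sketch; use
    whatever name your own setup already defines.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before creating the client")
    return key
```

The result can then be passed to the client in place of the literal string, for example `genai.Client(api_key=load_api_key())`.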

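Note: If you want to use Vertex AI, the client is created differently; see the SDK documentation for the Vertex AI setup.

2. Work with PDFs and other files

Gemini models can process files such as images and videos, passed either as base64 strings or through the Files API. After uploading a file, you can include its file URI directly in the call. The Python API includes upload and delete methods.

For this example, we have two sample PDFs: a basic invoice and a form with handwritten values.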

!wget -q -O handwriting_form.pdf https://storage.googleapis.com/generativeai-downloads/data/pdf_structured_outputs/handwriting_form.pdf
!wget -q -O invoice.pdf https://storage.googleapis.com/generativeai-downloads/data/pdf_structured_outputs/invoice.pdf

You can now upload the files using the client and its upload method. Let's try it with one of the files.

invoice_pdf = client.files.upload(file="invoice.pdf", config={'display_name': 'invoice'})

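Note: The File API lets you store up to 20 GB of files per project, with a maximum size of 2 GB per file. Files are stored for 48 hours. During that time they can be accessed with your API key, but they cannot be downloaded. File uploads are free.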
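
Because of the per-file limit, it can help to check a file's size locally before uploading. A minimal sketch, assuming the 2 GB limit means 2 * 1024**3 bytes; the helper name `check_upload_size` is hypothetical and not part of the SDK:

```python
from pathlib import Path

# Per-file limit quoted in the note above, interpreted here as 2 * 1024**3 bytes (an assumption).
MAX_FILE_BYTES = 2 * 1024**3


def check_upload_size(path: str, max_bytes: int = MAX_FILE_BYTES) -> int:
    """Return the file size in bytes, or raise ValueError if it exceeds the limit."""
    size = Path(path).stat().st_size
    if size > max_bytes:
        raise ValueError(f"{path} is {size} bytes, above the {max_bytes}-byte limit")
    return size
```

Calling `check_upload_size("invoice.pdf")` before `client.files.upload(...)` fails fast on oversized files instead of waiting for the upload to be rejected.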

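After a file is uploaded, you can check how many tokens it was converted into. This helps you understand how much context you are working with and also helps you track cost.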

file_size = client.models.count_tokens(model=model_id,contents=invoice_pdf)
print(f'File: {invoice_pdf.display_name} equals to {file_size.total_tokens} tokens')
# File: invoice equals to 821 tokens
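
To connect the token count to cost, the count can be multiplied by a per-token price. A rough sketch, assuming the price of $0.10 per million input tokens quoted in this article's conclusion; actual pricing may differ by model and change over time, and the helper name is hypothetical:

```python
def estimate_input_cost(total_tokens: int, usd_per_million: float = 0.10) -> float:
    """Estimate input cost in USD from a token count.

    The default price is the figure quoted in this article, not a
    value read from the API.
    """
    return total_tokens / 1_000_000 * usd_per_million


# The sample invoice above was counted at 821 tokens.
print(f"{estimate_input_cost(821):.7f} USD")
# 0.0000821 USD
```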

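3. Structured outputs with Gemini 2.0 and Pydantic

Structured output is a feature that ensures Gemini always generates responses conforming to a predefined format, such as a JSON Schema. This gives you more control over the output and how it integrates into your application, because the response is a valid JSON object that follows the schema you defined.

Gemini 2.0 currently supports three ways of defining a JSON schema:

A single Python type, as you would use in a typing annotation.
A Pydantic BaseModel
A dictionary equivalent of genai.types.Schema / a Pydantic BaseModel

Let's look at a quick text-based example.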

from pydantic import BaseModel, Field

# Define a Pydantic model
# Use the Field class to add a description and default value to provide more context to the model
class Topic(BaseModel):
    name: str = Field(description="The name of the topic")

class Person(BaseModel):
    first_name: str = Field(description="The first name of the person")
    last_name: str = Field(description="The last name of the person")
    age: int = Field(description="The age of the person, if not provided please return 0")
    work_topics: list[Topic] = Field(description="The fields of interest of the person, if not provided please return an empty list")

# Define the prompt
prompt = "Philipp Schmid is a Senior AI Developer Relations Engineer at Google DeepMind working on Gemini, Gemma with the mission to help every developer to build and benefit from AI in a responsible way.  "

# Generate a response using the Person model
response = client.models.generate_content(model=model_id, contents=prompt, config={'response_mime_type': 'application/json', 'response_schema': Person})

# print the response as a json string
print(response.text)

# sdk automatically converts the response to the pydantic model
philipp: Person = response.parsed

# access an attribute of the json response
print(f"First name is {philipp.first_name}")
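
For comparison, the Person schema can also be written in the dictionary form listed as the third option. The sketch below is hand-written for illustration and assumes the OpenAPI-style type names (OBJECT, STRING, INTEGER, ARRAY) used by the Gemini API; the helper `missing_required_keys` is hypothetical and only checks a JSON string against the `required` list:

```python
import json

# Dictionary form of the Person schema, written by hand as an illustration.
person_schema = {
    "type": "OBJECT",
    "properties": {
        "first_name": {"type": "STRING", "description": "The first name of the person"},
        "last_name": {"type": "STRING", "description": "The last name of the person"},
        "age": {"type": "INTEGER", "description": "The age of the person"},
        "work_topics": {
            "type": "ARRAY",
            "items": {"type": "OBJECT", "properties": {"name": {"type": "STRING"}}},
        },
    },
    "required": ["first_name", "last_name", "age"],
}


def missing_required_keys(schema: dict, response_text: str) -> list[str]:
    """Return the required top-level keys that are absent from a JSON response."""
    data = json.loads(response_text)
    return [key for key in schema.get("required", []) if key not in data]
```

A dictionary like this would be passed as `response_schema` in place of the Pydantic class. In that case it is safer to parse `response.text` yourself, since there is no Pydantic model for the SDK to convert the response into.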

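4. Extract structured data from PDFs with Gemini 2.0

Now let's combine the File API with structured outputs to extract information from PDFs. You can write a simple function that takes a local file path and a Pydantic model and returns the structured data. The function will:

Upload the file to the File API
Generate a structured response using the Gemini API
Convert the response to the Pydantic model and return it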

def extract_structured_data(file_path: str, model: type[BaseModel]):
    # Upload the file to the File API
    file = client.files.upload(file=file_path, config={'display_name': file_path.split('/')[-1].split('.')[0]})
    # Generate a structured response using the Gemini API
    prompt = "Extract the structured data from the following PDF file"
    response = client.models.generate_content(model=model_id, contents=[prompt, file], config={'response_mime_type': 'application/json', 'response_schema': model})
    # Convert the response to the pydantic model and return it
    return response.parsed

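In this example, the PDFs differ from one another, so each one needs its own Pydantic model to show what Gemini 2.0 can do. If you have very similar PDFs and want to extract the same information, you can use one model for all of them.

invoice.pdf: extract the invoice number, the date, and all line items with description, quantity, and gross worth, plus the total gross worth
handwriting_form.pdf: extract the form number, the plan start date, and the plan liabilities at the beginning and end of the year

Note: With Pydantic features you can give the model more context to make it more accurate and validate the data it returns. Adding comprehensive descriptions can significantly improve the model's performance. Libraries such as instructor add automatic retries based on validation errors, which can help a lot but adds the cost of the extra requests.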

invoice.pdf

from pydantic import BaseModel, Field

class Item(BaseModel):
    description: str = Field(description="The description of the item")
    quantity: float = Field(description="The Qty of the item")
    gross_worth: float = Field(description="The gross worth of the item")

class Invoice(BaseModel):
    """Extract the invoice number, date and all list items with description, quantity and gross worth and the total gross worth."""
    invoice_number: str = Field(description="The invoice number e.g. 1234567890")
    date: str = Field(description="The date of the invoice e.g. 2024-01-01")
    items: list[Item] = Field(description="The list of items with description, quantity and gross worth")
    total_gross_worth: float = Field(description="The total gross worth of the invoice")

result = extract_structured_data("invoice.pdf", Invoice)
print(type(result))
print(f"Extracted Invoice: {result.invoice_number} on {result.date} with total gross worth {result.total_gross_worth}")
for item in result.items:
    print(f"Item: {item.description} with quantity {item.quantity} and gross worth {item.gross_worth}")

Fantastic! The model did a great job extracting the information from the invoice.
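
Extracted numbers are still worth a sanity check. One inexpensive option is to confirm that the line items add up to the extracted total. A minimal sketch using plain floats; the helper name, the tolerance, and the sample figures in the usage note are illustrative and not taken from the sample invoice:

```python
import math


def totals_match(item_worths: list[float], total: float, tolerance: float = 0.01) -> bool:
    """Return True when the item values sum to the total within the tolerance."""
    return math.isclose(math.fsum(item_worths), total, abs_tol=tolerance)
```

With the result above, this could be called as `totals_match([item.gross_worth for item in result.items], result.total_gross_worth)`. For made-up values, `totals_match([10.5, 20.25], 30.75)` returns `True`, while `totals_match([10.0, 20.0], 35.0)` returns `False` and would flag the document for review.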

handwriting_form.pdf

class Form(BaseModel):
    """Extract the form number, fiscal start date, fiscal end date, and the plan liabilities beginning of the year and end of the year."""
    form_number: str = Field(description="The Form Number")
    start_date: str = Field(description="Effective Date")
    beginning_of_year: float = Field(description="The plan liabilities beginning of the year")
    end_of_year: float = Field(description="The plan liabilities end of the year")

result = extract_structured_data("handwriting_form.pdf", Form)

print(f'Extracted Form Number: {result.form_number} with start date {result.start_date}. \nPlan liabilities beginning of the year {result.beginning_of_year} and end of the year {result.end_of_year}')
# Extracted Form Number: CA530082 with start date 02/05/2022. 
# Plan liabilities beginning of the year 40000.0 and end of the year 55000.0
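
The date comes back as a free-form string, so downstream code may want to normalize it. A small sketch using only the standard library; it assumes the form uses month/day/year order, which the output above does not confirm (02/05/2022 could also be read as day/month), and the helper name is hypothetical:

```python
from datetime import datetime


def normalize_date(raw: str, input_format: str = "%m/%d/%Y") -> str:
    """Parse a date string in the given format and return it as ISO 8601 (YYYY-MM-DD).

    Raises ValueError when the string does not match the format.
    """
    return datetime.strptime(raw, input_format).date().isoformat()


print(normalize_date("02/05/2022"))
# 2022-02-05
```

Passing `input_format="%d/%m/%Y"` covers the day-first reading instead.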

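Best practices and limitations

When using Gemini 2.0 for PDF processing, keep the following considerations in mind:

File size management: Although the File API supports large files, it is best practice to optimize PDFs before uploading them.
Token limits: When processing large documents, check the token count to make sure you stay within the model's limits and your budget.
Structured output design: Design your Pydantic models carefully so they capture all the necessary information while staying clear. Adding descriptions and examples can improve the model's performance.
Error handling: Implement robust error handling for file uploads and processing states, including retries and handling of error messages returned by the model.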
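
The retry advice in the last item can be sketched with a small generic wrapper. This is an illustration using only the standard library, not a feature of the SDK; the name `with_retries` and its parameters are assumptions, and it retries on any exception, which real code would likely narrow to transient errors:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying with exponential backoff when it raises."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception:
            # Out of attempts: re-raise the last error to the caller.
            if attempt == max_attempts:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

The extraction function from section 4 could then be wrapped as `with_retries(lambda: extract_structured_data("invoice.pdf", Invoice))`. Keep in mind that each retry uploads the file again and issues another model request.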

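Conclusion

Combining Gemini 2.0's multimodal capabilities with structured outputs helps you process PDFs and other files and extract information from them. This can replace complex, time-consuming manual or semi-automated data extraction workflows. Whether you are building an invoice processing system, a document analysis tool, or any other document-centric application, Gemini 2.0 is worth trying: it is free to test initially, and afterwards costs only $0.10 per million input tokens. That pricing lowers the barrier to trying the technology, although its long-term cost-effectiveness remains to be seen.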
