Last week, Google DeepMind released Gemini 2.0. The release includes Gemini 2.0 Flash (generally available), Gemini 2.0 Flash-Lite (a new cost-efficient model), and Gemini 2.0 Pro (experimental). All models support an input context window of at least 1 million tokens and accept text, image, and audio input, as well as function calling and structured output. This article also serves as reference reading for "Limitations of LLM OCR: The Document Parsing Challenge Behind the Glossy Surface".
This opens up excellent use cases for PDF processing. Converting PDFs to structured or machine-readable text has always been a major challenge. Imagine being able to turn PDF documents directly into structured data. This is where Gemini 2.0 comes into play.
In this tutorial, you will learn how to use Gemini 2.0 to extract structured information, such as invoice numbers and dates, directly from PDF documents:
- Setting up the environment and creating the inference client
- Handling PDF and other documents
- Structured Output with Gemini 2.0 and Pydantic
- Extracting Structured Data from PDF with Gemini 2.0
1. Setting up the environment and creating the inference client
The first task is to install the google-genai Python SDK and get an API key. If you don't have an API key yet, you can get one from Google AI Studio: Get Gemini API key.
%pip install "google-genai>=1"
With the SDK and API key in place, you can create a client and define the model to use: the new Gemini 2.0 Flash model, which is available through a free tier with 1,500 requests per day (as of February 6, 2025).
from google import genai

# Create a client
api_key = "XXXXX"
client = genai.Client(api_key=api_key)

# Define the model you are going to use
model_id = "gemini-2.0-flash"  # or "gemini-2.0-flash-lite-preview-02-05", "gemini-2.0-pro-exp-02-05"
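Hardcoding the key as shown above is fine for a quick notebook test, but it is easy to leak. A minimal sketch of reading it from the environment instead follows. The variable name GEMINI_API_KEY and the helper load_api_key are assumptions made for this illustration, not something the SDK requires.

```python
import os


def load_api_key(env_var: str = "GEMINI_API_KEY") -> str:
    """Return the API key stored in an environment variable.

    The variable name is an assumption for this sketch; use whatever
    name your own setup defines.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Gemini API key"
        )
    return key


# Usage (commented out because it needs the SDK and a real key):
# client = genai.Client(api_key=load_api_key())
```

Failing early with a clear message tends to be easier to debug than an authentication error returned by the first API call.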
Note: If you want to use Vertex AI, see the linked guide on how to create a client.
2. Handling PDF and other documents
Gemini models can process images and videos, which can be passed either as base64 strings or through the File API. The Python API includes upload and delete methods. After uploading a file, you can include the file URI directly in the call.
This example uses two sample PDFs: a basic invoice and a form with handwritten values.
!wget -q -O handwriting_form.pdf https://storage.googleapis.com/generativeai-downloads/data/pdf_structured_outputs/handwriting_form.pdf
!wget -q -O invoice.pdf https://storage.googleapis.com/generativeai-downloads/data/pdf_structured_outputs/invoice.pdf
You can now use the client's upload method to upload files. Let's try it on one of the files.
invoice_pdf = client.files.upload(file="invoice.pdf", config={'display_name': 'invoice'})
Note: The File API allows up to 20 GB of files to be stored per project, with a maximum size of 2 GB per file. Files are stored for 48 hours. During this time, they can be accessed using your API key, but they cannot be downloaded. File uploads are free.
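Because of the per-file limit mentioned in the note, it can help to check a file's size locally before uploading. The following is a minimal sketch, not part of the SDK: check_upload_size is a hypothetical helper, and the limit is interpreted as decimal gigabytes, which is the more conservative reading and an assumption.

```python
import os

# 2 GB per-file limit from the note above, read as decimal gigabytes.
# This interpretation is an assumption chosen to stay on the safe side.
MAX_FILE_BYTES = 2 * 1000**3


def check_upload_size(path: str, limit: int = MAX_FILE_BYTES) -> int:
    """Return the file size in bytes, or raise if it exceeds the limit."""
    size = os.path.getsize(path)
    if size > limit:
        raise ValueError(f"{path} is {size} bytes, above the {limit}-byte limit")
    return size


# Usage (commented out because it needs the sample file on disk):
# check_upload_size("invoice.pdf")
```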
Once the file is uploaded, you can check how many tokens it was converted to. This not only helps you understand how much context you are working with, but also helps you track cost.
file_size = client.models.count_tokens(model=model_id, contents=invoice_pdf)
print(f'File: {invoice_pdf.display_name} equals to {file_size.total_tokens} tokens')
# File: invoice equals to 821 tokens
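To make the cost tracking concrete, the token count can be turned into a rough input-cost estimate. This sketch uses the price quoted at the end of this article ($0.10 per million input tokens). The helper estimate_input_cost is hypothetical, output tokens are ignored, and current pricing should be checked before relying on the figure.

```python
# Price per million input tokens as quoted in this article.
# Treat it as an assumption and verify it against current pricing.
PRICE_PER_MILLION_INPUT_TOKENS_USD = 0.10


def estimate_input_cost(
    total_tokens: int,
    price_per_million: float = PRICE_PER_MILLION_INPUT_TOKENS_USD,
) -> float:
    """Estimate the input cost in USD for a given number of tokens."""
    return total_tokens / 1_000_000 * price_per_million


# The invoice above was counted at 821 tokens.
cost = estimate_input_cost(821)
print(f"Estimated input cost: ${cost:.7f}")
# Estimated input cost: $0.0000821
```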
3. Structured output using Gemini 2.0 and Pydantic
Structured output is a feature that ensures Gemini always generates a response that conforms to a predefined format, such as a JSON schema. This gives you more control over the output and how it is integrated into your application, because the response is returned as a valid JSON object matching the schema you defined.
Gemini 2.0 currently supports 3 different types of JSON Schema definitions:
- A single Python type, as you would use in a typing annotation
- Pydantic BaseModel
- A dictionary equivalent of genai.types.Schema / Pydantic BaseModel
Let's look at a quick text-based example.
from pydantic import BaseModel, Field

# Define a Pydantic model
# Use the Field class to add a description and default value to provide more context to the model
class Topic(BaseModel):
    name: str = Field(description="The name of the topic")

class Person(BaseModel):
    first_name: str = Field(description="The first name of the person")
    last_name: str = Field(description="The last name of the person")
    age: int = Field(description="The age of the person")
    work_topics: list[Topic] = Field(description="The fields of interest of the person, if not provided please return an empty list")

# Define the prompt
prompt = "Philipp Schmid is a Senior AI Developer Relations Engineer at Google DeepMind working on Gemini, Gemma with the mission to help every developer to build and benefit from AI in a responsible way. "

# Generate a response using the Person model
response = client.models.generate_content(model=model_id, contents=prompt, config={'response_mime_type': 'application/json', 'response_schema': Person})

# Print the response as a JSON string
print(response.text)

# The SDK automatically converts the response to the Pydantic model
philipp: Person = response.parsed

# Access an attribute of the JSON response
print(f"First name is {philipp.first_name}")
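The example above uses the Pydantic option. For comparison, here is a sketch of the dictionary form of the same Person schema. The uppercase type names follow the style used by genai.types.Schema. The helper missing_required_fields is hypothetical and only illustrates a basic local check on the returned JSON; it is not part of the SDK.

```python
import json

# Dictionary form of the Person schema. Field names and descriptions
# mirror the Pydantic example; the layout is a sketch, not the only
# valid way to write it.
PERSON_SCHEMA = {
    "type": "OBJECT",
    "properties": {
        "first_name": {"type": "STRING", "description": "The first name of the person"},
        "last_name": {"type": "STRING", "description": "The last name of the person"},
        "age": {"type": "INTEGER", "description": "The age of the person"},
        "work_topics": {
            "type": "ARRAY",
            "description": "The fields of interest of the person",
            "items": {
                "type": "OBJECT",
                "properties": {
                    "name": {"type": "STRING", "description": "The name of the topic"},
                },
                "required": ["name"],
            },
        },
    },
    "required": ["first_name", "last_name", "age", "work_topics"],
}


def missing_required_fields(schema: dict, raw_json: str) -> list[str]:
    """Hypothetical helper: required top-level keys absent from a JSON string."""
    payload = json.loads(raw_json)
    return [key for key in schema.get("required", []) if key not in payload]


# Passing the dictionary to the SDK (commented out, needs a live client):
# response = client.models.generate_content(
#     model=model_id,
#     contents=prompt,
#     config={'response_mime_type': 'application/json',
#             'response_schema': PERSON_SCHEMA},
# )
```

With a dictionary schema there is no Pydantic class to parse into, so the JSON text in response.text would typically be loaded with json.loads.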
4. Extract structured data from PDF using Gemini 2.0
Now, let's combine the File API with structured output to extract information from a PDF. You can create a simple function that takes a local file path and a Pydantic model and returns the structured data. The function will:
- Upload the file to the File API
- Generate a structured response using the Gemini API
- Convert the response to the Pydantic model and return it
def extract_structured_data(file_path: str, model: BaseModel):
    # Upload the file to the File API
    file = client.files.upload(file=file_path, config={'display_name': file_path.split('/')[-1].split('.')[0]})
    # Generate a structured response using the Gemini API
    prompt = "Extract the structured data from the following PDF file"
    response = client.models.generate_content(model=model_id, contents=[prompt, file], config={'response_mime_type': 'application/json', 'response_schema': model})
    # Convert the response to the Pydantic model and return it
    return response.parsed
In this example, the two PDFs are different from each other, so each one gets its own Pydantic model to demonstrate the performance of Gemini 2.0. If you have very similar PDFs and want to extract the same information from each, you can use the same model for all of them.
- invoice.pdf: Extract the invoice number, the date, and all list items including description, quantity, and gross worth, plus the total gross worth
- handwriting_form.pdf: Extract the form number, the plan start date, and the plan liabilities at the beginning and end of the year
Note: Using Pydantic features, you can add more context to the model to make it more accurate and perform some validation of the data. Adding a comprehensive description can significantly improve the performance of the model. Libraries such as instructor add automatic retries based on validation errors, which can be very helpful but add request costs.
invoice.pdf
from pydantic import BaseModel, Field

class Item(BaseModel):
    description: str = Field(description="The description of the item")
    quantity: float = Field(description="The Qty of the item")
    gross_worth: float = Field(description="The gross worth of the item")

class Invoice(BaseModel):
    """Extract the invoice number, date and all list items with description, quantity and gross worth and the total gross worth."""
    invoice_number: str = Field(description="The invoice number e.g. 1234567890")
    date: str = Field(description="The date of the invoice e.g. 2024-01-01")
    items: list[Item] = Field(description="The list of items with description, quantity and gross worth")
    total_gross_worth: float = Field(description="The total gross worth of the invoice")

result = extract_structured_data("invoice.pdf", Invoice)
print(type(result))
print(f"Extracted Invoice: {result.invoice_number} on {result.date} with total gross worth {result.total_gross_worth}")
for item in result.items:
    print(f"Item: {item.description} with quantity {item.quantity} and gross worth {item.gross_worth}")
Fantastic! The model does an excellent job of extracting information from the invoice.
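Even when the output looks right, a cheap consistency check can catch misread digits. The sketch below assumes that the gross worth of the line items should add up to the invoice total, which may not hold for every invoice layout. totals_match is a hypothetical helper, and the sample values are made up for illustration, not taken from invoice.pdf.

```python
import math
from types import SimpleNamespace


def totals_match(invoice, tolerance: float = 0.01) -> bool:
    """Return True when the item gross worths add up to the invoice total.

    Works with any object exposing .items (each with .gross_worth) and
    .total_gross_worth, such as the Invoice model defined above.
    """
    items_sum = math.fsum(item.gross_worth for item in invoice.items)
    return math.isclose(items_sum, invoice.total_gross_worth, abs_tol=tolerance)


# Stand-in object with made-up values, so the sketch runs without an API call.
sample = SimpleNamespace(
    items=[SimpleNamespace(gross_worth=10.5), SimpleNamespace(gross_worth=4.5)],
    total_gross_worth=15.0,
)
print(totals_match(sample))  # True
```

A failed check does not say which value is wrong, only that the document is worth a second look or a retry.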
handwriting_form.pdf
class Form(BaseModel):
    """Extract the form number, fiscal start date, fiscal end date, and the plan liabilities beginning of the year and end of the year."""
    form_number: str = Field(description="The Form Number")
    start_date: str = Field(description="Effective Date")
    beginning_of_year: float = Field(description="The plan liabilities beginning of the year")
    end_of_year: float = Field(description="The plan liabilities end of the year")

result = extract_structured_data("handwriting_form.pdf", Form)
print(f'Extracted Form Number: {result.form_number} with start date {result.start_date}. \nPlan liabilities beginning of the year {result.beginning_of_year} and end of the year {result.end_of_year}')
# Extracted Form Number: CA530082 with start date 02/05/2022.
# Plan liabilities beginning of the year 40000.0 and end of the year 55000.0
Best practices and limitations
When using Gemini 2.0 for PDF processing, keep the following considerations in mind:
- File Size Management: While the File API supports large files, the best practice is to optimize the PDF before uploading.
- Token Limits: When working with large documents, check token counts to make sure you stay within model limits and budgets.
- Structured Output Design: Carefully design your Pydantic model to capture all the necessary information while maintaining clarity; adding descriptions and examples can improve the model's performance.
- Error Handling: Implement robust error handling for file uploads and processing states, including retrying and handling error messages from models.
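For the error-handling point in the list above, one common pattern is to retry a failed call with exponential backoff. The following is a minimal sketch under stated assumptions: with_retries is a hypothetical helper, it catches every exception for brevity, and production code would normally retry only on the transient error types the SDK raises.

```python
import time


def with_retries(func, attempts: int = 3, base_delay: float = 1.0, sleep=time.sleep):
    """Call func() and retry on failure with exponential backoff.

    Delays are base_delay, 2 * base_delay, 4 * base_delay, and so on.
    The sleep parameter exists so the wait can be stubbed out in tests.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            # Re-raise once the final attempt has failed.
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2**attempt)


# Usage (commented out because it needs a live client):
# invoice = with_retries(lambda: extract_structured_data("invoice.pdf", Invoice))
```

Keep in mind that each retry uploads the file and calls the model again, so retries add request costs.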
Conclusion
Gemini 2.0's multimodal capabilities combined with structured output help you process and extract information from PDFs and other documents. This can replace complex and time-consuming manual or semi-automated data extraction processes. Whether you are building an invoice processing system, a document analysis tool, or any other document-centric application, you should try Gemini 2.0: it is free for initial testing and then costs only $0.10 per million input tokens. That pricing certainly lowers the barrier to trying the new technology, but its long-term cost-effectiveness remains to be seen.