Extracting Valuable Information from PDFs: A Gemini 2.0 Structured Output Solution


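Last week, Google DeepMind released Gemini 2.0, which includes Gemini 2.0 Flash (generally available), Gemini 2.0 Flash-Lite (a new cost-efficient version), and Gemini 2.0 Pro (experimental). All models support an input context window of at least 1 million tokens covering text, images, and audio, as well as function calling and structured outputs. For a related discussion, see "The Limits of LLM OCR: The Document Parsing Challenges Behind the Hype."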

This opens up great use cases for PDF processing. Converting PDFs into structured or machine-readable text has always been a major challenge. What if you could go straight from a PDF document to structured data? This is where Gemini 2.0 comes in.

In this tutorial, you will learn how to use Gemini 2.0 to extract structured information, such as the invoice number and date, directly from PDF documents:

Set up the environment and create an inference client
Work with PDFs and other documents
Structured outputs with Gemini 2.0 and Pydantic
Extract structured data from PDFs using Gemini 2.0

1. Set up the environment and create an inference client

The first task is to install the google-genai Python SDK and obtain an API key. If you do not have an API key yet, you can get one from Google AI Studio: Get a Gemini API key.

%pip install "google-genai>=1"

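Once you have the SDK and an API key, you can create a client and define the model you are going to use: the new Gemini 2.0 Flash model, which is available through the free tier with 1,500 requests per day (as of February 6, 2025).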

from google import genai
# Create a client
api_key = "XXXXX"
client = genai.Client(api_key=api_key)
 
# Define the model you are going to use
model_id = "gemini-2.0-flash"  # or "gemini-2.0-flash-lite-preview-02-05", "gemini-2.0-pro-exp-02-05"

Note: If you want to use Vertex AI instead, see the SDK documentation to learn how to create the client.
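Hard-coding the key as in the snippet above is convenient in a notebook but easy to leak. A minimal sketch of an alternative, assuming the key is stored in an environment variable named `GEMINI_API_KEY` (both the variable name and the helper `load_api_key` are illustrative choices, not part of the original tutorial):

```python
import os
from collections.abc import Mapping


def load_api_key(env: Mapping[str, str] = os.environ, var_name: str = "GEMINI_API_KEY") -> str:
    """Return the API key stored in `var_name`, or raise a clear error if it is missing."""
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable before creating the client.")
    return key


# Usage (assumes the variable is set in the shell or notebook environment):
# client = genai.Client(api_key=load_api_key())
```

Passing the environment in as a parameter keeps the helper easy to test without touching the real process environment.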

2. Work with PDFs and other documents

Gemini models can process images and videos, which can be supplied either as base64 strings or through the Files API. After uploading a file, you can include the file URI directly in the call. The Python API includes upload and delete methods.
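For small files, the base64 route mentioned above can be sketched with the standard library alone. The dictionary layout below is assumed to mirror the `inline_data` part used by the Gemini REST API; treat the field names and the helper `pdf_to_inline_part` as assumptions and check them against the current API reference before relying on them.

```python
import base64


def pdf_to_inline_part(pdf_bytes: bytes) -> dict:
    """Wrap raw PDF bytes as a base64-encoded inline part (assumed REST-style layout)."""
    encoded = base64.b64encode(pdf_bytes).decode("ascii")
    return {"inline_data": {"mime_type": "application/pdf", "data": encoded}}


# Placeholder bytes stand in for the contents of a real PDF file.
part = pdf_to_inline_part(b"%PDF-1.4 example bytes")
print(part["inline_data"]["mime_type"])
```

Inline data travels with every request, so the Files API used in the rest of this tutorial is usually the better fit for larger or reused documents.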

For this example, you have 2 PDF samples: a basic invoice and a form with handwritten values.

!wget -q -O handwriting_form.pdf https://storage.googleapis.com/generativeai-downloads/data/pdf_structured_outputs/handwriting_form.pdf
!wget -q -O invoice.pdf https://storage.googleapis.com/generativeai-downloads/data/pdf_structured_outputs/invoice.pdf

You can now use the client and its upload method to upload the files. Let's try it on one of them.

invoice_pdf = client.files.upload(file="invoice.pdf", config={'display_name': 'invoice'})

Note: The File API allows up to 20 GB of files per project, with a maximum size of 2 GB per file. Files are stored for 48 hours. During that time they can be accessed with your API key, but they cannot be downloaded. File uploads are free.
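Since uploaded files count against the project quota until they expire, it can be tidy to delete them as soon as you are done. A minimal sketch, assuming the uploaded file object exposes a `name` attribute and that `client.files.delete(name=...)` removes it; the helper `temporary_upload` is hypothetical and not part of the SDK:

```python
import os
from contextlib import contextmanager


@contextmanager
def temporary_upload(client, path: str):
    """Upload `path`, yield the uploaded file handle, and delete it afterwards."""
    # Use the file name without its extension as the display name.
    display_name = os.path.splitext(os.path.basename(path))[0]
    uploaded = client.files.upload(file=path, config={"display_name": display_name})
    try:
        yield uploaded
    finally:
        # Remove the file right away instead of waiting for the 48-hour expiry.
        client.files.delete(name=uploaded.name)


# Usage with the client created earlier:
# with temporary_upload(client, "invoice.pdf") as invoice_pdf:
#     ...  # call the model while the file exists
```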

Once a file is uploaded, you can check how many tokens it was converted into. This helps you understand how much context you are working with and also helps you keep track of cost.

file_size = client.models.count_tokens(model=model_id, contents=invoice_pdf)
print(f'File: {invoice_pdf.display_name} equals to {file_size.total_tokens} tokens')
# File: invoice equals to 821 tokens
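To turn a token count into a rough budget figure, you can multiply it by the per-token price. The sketch below uses the $0.10 per million input tokens quoted in the conclusion of this article; the helper name `estimate_input_cost` is illustrative, and actual pricing should be checked against the current price list.

```python
PRICE_PER_MILLION_INPUT_TOKENS_USD = 0.10  # figure quoted in this article; verify current pricing


def estimate_input_cost(total_tokens: int,
                        price_per_million: float = PRICE_PER_MILLION_INPUT_TOKENS_USD) -> float:
    """Return the approximate input cost in USD for `total_tokens` tokens."""
    return total_tokens / 1_000_000 * price_per_million


# The invoice above was counted at 821 tokens.
print(f"{estimate_input_cost(821):.7f} USD")
```

This only covers input tokens for a single request; output tokens and repeated calls would add to the total.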

3. Structured outputs with Gemini 2.0 and Pydantic

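Structured outputs is a feature that ensures Gemini always generates responses that adhere to a predefined format, such as a JSON schema. Because the model returns a valid JSON object matching the schema you defined, you have more control over the output and over how it is integrated into your application.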

Gemini 2.0 currently supports 3 ways of defining a JSON schema:

A single Python type, as you would use in a typing annotation
A Pydantic BaseModel
A dict equivalent of genai.types.Schema / Pydantic BaseModel
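To make the first and third options more concrete, here is a rough sketch of what they might look like. The uppercase type names (`"OBJECT"`, `"STRING"`) follow the convention assumed for `genai.types.Schema`; treat the exact dictionary layout as an assumption and confirm it against the SDK documentation.

```python
# Dictionary form of a schema with a single required string field,
# roughly what the `Topic` Pydantic model in the next example expresses.
topic_schema = {
    "type": "OBJECT",
    "properties": {
        "name": {"type": "STRING", "description": "The name of the topic"},
    },
    "required": ["name"],
}

# A plain Python type can also serve as a schema, e.g. a list of strings.
list_of_names_schema = list[str]

# Either value would be passed as `response_schema` in the request config.
config = {"response_mime_type": "application/json", "response_schema": topic_schema}
```

The Pydantic route used in the rest of this tutorial tends to be easier to maintain, because the same class also validates and types the parsed result.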

Let's look at a quick text-based example.

from pydantic import BaseModel, Field
 
# Define a Pydantic model
# Use the Field class to add a description and default value to provide more context to the model
class Topic(BaseModel):
    name: str = Field(description="The name of the topic")
 
class Person(BaseModel):
    first_name: str = Field(description="The first name of the person")
    last_name: str = Field(description="The last name of the person")
    age: int = Field(description="The age of the person, if not provided please return 0")
    work_topics: list[Topic] = Field(description="The fields of interest of the person, if not provided please return an empty list")
 
 
# Define the prompt
prompt = "Philipp Schmid is a Senior AI Developer Relations Engineer at Google DeepMind working on Gemini, Gemma with the mission to help every developer to build and benefit from AI in a responsible way.  "
 
# Generate a response using the Person model
response = client.models.generate_content(model=model_id, contents=prompt, config={'response_mime_type': 'application/json', 'response_schema': Person})
 
# print the response as a json string
print(response.text)
 
# sdk automatically converts the response to the pydantic model
philipp: Person = response.parsed
 
# access an attribute of the json response
print(f"First name is {philipp.first_name}")
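Before accessing attributes, it can help to guard against a response that could not be converted. The sketch below assumes that `response.parsed` is left as `None` when the SDK cannot turn the JSON into the requested model; the helper `parsed_or_raise` is hypothetical.

```python
def parsed_or_raise(response):
    """Return `response.parsed`, or raise with the raw text if parsing produced nothing."""
    if response.parsed is None:
        raise ValueError(f"Could not parse structured output. Raw text: {response.text!r}")
    return response.parsed


# Usage with the response from above:
# philipp = parsed_or_raise(response)
```

Including the raw text in the error makes it easier to see whether the model returned malformed JSON or JSON that simply did not fit the schema.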

4. Extract structured data from PDFs using Gemini 2.0

Now let's combine the File API with structured outputs to extract information from the PDFs. You can create a simple method that takes a local file path and a Pydantic model and returns the structured data. The method will:

Upload the file to the File API
Generate a structured response using the Gemini API
Convert the response to the Pydantic model and return it

def extract_structured_data(file_path: str, model: type[BaseModel]):
    # Upload the file to the File API, using the file name without extension as display name
    file = client.files.upload(file=file_path, config={'display_name': file_path.split('/')[-1].split('.')[0]})
    # Generate a structured response using the Gemini API
    prompt = "Extract the structured data from the following PDF file"
    response = client.models.generate_content(model=model_id, contents=[prompt, file], config={'response_mime_type': 'application/json', 'response_schema': model})
    # Convert the response to the pydantic model and return it
    return response.parsed

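In this example, each PDF is different from the others, so you will define a unique Pydantic model for each PDF to show what Gemini 2.0 can do. If you have very similar PDFs and want to extract the same information from them, you can use the same model for all of them.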
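When many similar PDFs share one model, the extraction can be wrapped in a small batch loop. This is only a sketch: `extract_many` is a hypothetical helper, the file names in the usage comment are placeholders, and the extraction function is passed in so the loop itself stays independent of the API.

```python
from typing import Any, Callable


def extract_many(paths: list[str], schema: Any, extract: Callable[[str, Any], Any]) -> dict[str, Any]:
    """Run `extract(path, schema)` for each path and collect results keyed by path."""
    results: dict[str, Any] = {}
    for path in paths:
        try:
            results[path] = extract(path, schema)
        except Exception as exc:  # keep going so one bad file does not stop the batch
            results[path] = exc
    return results


# Usage with the function and an invoice model as defined in this tutorial:
# results = extract_many(["invoice_1.pdf", "invoice_2.pdf"], Invoice, extract_structured_data)
```

Storing the exception in place of the result is one possible design; logging and skipping the file, or collecting failures in a separate list, would work just as well.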

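Invoice.pdf: Extract the invoice number, the date, and all list items with description, quantity, and gross worth, plus the total gross worth.
handwriting_form.pdf: Extract the form number, the plan start date, and the plan liabilities at the beginning and end of the year.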

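Note: You can use Pydantic features to give the model more context, which makes it more accurate, and to validate the data. Adding comprehensive descriptions can significantly improve the model's performance. Libraries such as instructor add automatic retries based on validation errors, which can be very helpful, but they come at the cost of additional requests.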

Invoice.pdf

from pydantic import BaseModel, Field
 
class Item(BaseModel):
    description: str = Field(description="The description of the item")
    quantity: float = Field(description="The Qty of the item")
    gross_worth: float = Field(description="The gross worth of the item")
 
class Invoice(BaseModel):
    """Extract the invoice number, date and all list items with description, quantity and gross worth and the total gross worth."""
    invoice_number: str = Field(description="The invoice number e.g. 1234567890")
    date: str = Field(description="The date of the invoice e.g. 2024-01-01")
    items: list[Item] = Field(description="The list of items with description, quantity and gross worth")
    total_gross_worth: float = Field(description="The total gross worth of the invoice")
 
 
result = extract_structured_data("invoice.pdf", Invoice)
print(type(result))
print(f"Extracted Invoice: {result.invoice_number} on {result.date} with total gross worth {result.total_gross_worth}")
for item in result.items:
    print(f"Item: {item.description} with quantity {item.quantity} and gross worth {item.gross_worth}")

Fantastic! The model does an excellent job of extracting the information from the invoice.
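Even when the extraction looks right, a cheap arithmetic check can catch misread digits. The sketch below compares the sum of the item gross worths with the stated total; `LineItem` is a stand-in for the `Item` model above, `totals_match` is a hypothetical helper, and the sample values are made up rather than taken from invoice.pdf.

```python
import math
from dataclasses import dataclass


@dataclass
class LineItem:  # stand-in with the same fields as the `Item` model above
    description: str
    quantity: float
    gross_worth: float


def totals_match(items, total_gross_worth: float, tolerance: float = 0.01) -> bool:
    """Check that the item gross worths add up to the stated invoice total."""
    return math.isclose(sum(i.gross_worth for i in items), total_gross_worth, abs_tol=tolerance)


# Illustrative values only, not taken from invoice.pdf
items = [LineItem("Widget", 2, 20.0), LineItem("Gadget", 1, 5.5)]
print(totals_match(items, 25.5))  # True
print(totals_match(items, 30.0))  # False
```

With the extracted result, the call would be `totals_match(result.items, result.total_gross_worth)`. A mismatch does not say which field is wrong, only that the document deserves a second look.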

handwriting_form.pdf

class Form(BaseModel):
    """Extract the form number, fiscal start date, fiscal end date, and the plan liabilities beginning of the year and end of the year."""
    form_number: str = Field(description="The Form Number")
    start_date: str = Field(description="Effective Date")
    beginning_of_year: float = Field(description="The plan liabilities beginning of the year")
    end_of_year: float = Field(description="The plan liabilities end of the year")
 
result = extract_structured_data("handwriting_form.pdf", Form)
 
print(f'Extracted Form Number: {result.form_number} with start date {result.start_date}. \nPlan liabilities beginning of the year {result.beginning_of_year} and end of the year {result.end_of_year}')
# Extracted Form Number: CA530082 with start date 02/05/2022. 
# Plan liabilities beginning of the year 40000.0 and end of the year 55000.0
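The form returns its date as `02/05/2022`, while the invoice model asked for dates like `2024-01-01`. If you need one consistent format downstream, a small normalization step can help. The sketch assumes the form uses month-first dates, which the source does not confirm; `to_iso_date` is a hypothetical helper.

```python
from datetime import datetime


def to_iso_date(value: str, input_format: str = "%m/%d/%Y") -> str:
    """Convert a date string in `input_format` to ISO 8601 (YYYY-MM-DD)."""
    return datetime.strptime(value, input_format).date().isoformat()


print(to_iso_date("02/05/2022"))  # 2022-02-05 under the month-first assumption
```

If the documents are day-first, pass `input_format="%d/%m/%Y"` instead. Alternatively, stating the expected format in the field description may let the model return ISO dates directly.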

Best practices and limitations

When using Gemini 2.0 for PDF processing, keep the following in mind:

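File size management: The File API supports large files, but it is best to optimize your PDFs before uploading them.
Token limits: When working with large documents, check the token count to make sure you stay within the model's limits and your budget.
Structured output design: Design your Pydantic models carefully so that they capture all the information you need while staying clear. Adding descriptions and examples can improve the model's performance.
Error handling: Implement robust error handling for file uploads and processing status, including retries and handling of error messages from the model.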
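The retry advice in the last point can be sketched as a small wrapper with exponential backoff. This is a generic pattern rather than anything provided by the SDK; `with_retries` is a hypothetical helper, and catching every exception is a simplification you would likely narrow to transient errors in practice.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(call: Callable[[], T], attempts: int = 3, base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `call()` up to `attempts` times, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")


# Usage with the extraction function from this tutorial:
# result = with_retries(lambda: extract_structured_data("invoice.pdf", Invoice))
```

Each retry is a new request, so the retry count also multiplies the token cost of a failing document.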

Conclusion

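The multimodal capabilities of Gemini 2.0, combined with structured outputs, help you process and extract information from PDFs and other documents. This removes the need for complex, time-consuming manual or semi-automated data extraction processes. Whether you are building an invoice processing system, a document analysis tool, or another document-centric application, Gemini 2.0 is worth trying: it is free to test initially and then costs only $0.1 per million input tokens. Google emphasizes that this pricing lowers the barrier to trying the new technology, although its long-term cost efficiency remains to be seen.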