This guide describes how to leverage Claude's advanced natural language processing capabilities to efficiently summarize legal documents, extract key information and accelerate legal research. With Claude, you can streamline contract review, litigation preparation and compliance, saving time and ensuring accuracy in the legal process.
Visit our summarization cookbook to see an example implementation of legal summarization using Claude.
Before building with Claude
Deciding whether to use Claude for legal summarization
Here are some key indicators that you should use an LLM like Claude to summarize legal documents:
You want to review large volumes of documents efficiently and economically
Manual review of large volumes of legal documents can be time-consuming and costly. Claude can quickly process and summarize large volumes of legal documents, significantly reducing the time and cost required for review. This capability is particularly valuable in tasks such as due diligence, contract analysis, or litigation discovery, where efficiency is critical.
You need to automatically extract key metadata
Claude efficiently extracts and categorizes important metadata from legal documents, such as the parties involved, dates, contract terms, or specific clauses. This automated extraction can help organize information and make it easier to search, analyze and manage large document collections. It is particularly useful in contract management, compliance checking or creating searchable databases of legal information.
You want to generate clear, concise and standardized summaries
Claude generates structured summaries that follow a predetermined format, enabling legal professionals to quickly grasp the key points of various documents. These standardized summaries improve readability, facilitate comparisons between documents, and enhance overall understanding, especially when dealing with complex legal language or technical terminology.
You need to provide accurate citations for your summaries
When creating legal summaries, proper attribution and citation are critical to ensure credibility and compliance with legal standards. Claude can be prompted to provide accurate citations for all referenced points of law, making it easier for legal professionals to review and validate the summarized information.
You want to simplify and speed up the legal research process
Claude can assist with legal research by quickly analyzing large volumes of case law, statutes, and law reviews. It identifies relevant precedents, extracts key legal principles, and summarizes complex legal arguments. This capability can significantly speed up the research process, allowing legal professionals to focus on higher-level analysis and strategy development.
Determine the details you want the summary to extract
There is no single correct summary for any given document. Without clear direction, Claude may have difficulty determining what details to include. For best results, identify the specific information you wish to include in the summary.
For example, when summarizing a sublease agreement, you may want to extract the following key points:
details_to_extract = [
    'Parties involved (sublessor, sublessee, and original lessor)',
    'Property details (address, description, and permitted use)',
    'Term and rent (start date, end date, monthly rent, and security deposit)',
    'Responsibilities (utilities, maintenance, and repairs)',
    'Consent and notices (landlord consent and notification requirements)',
    'Special provisions (furniture, parking, and subletting restrictions)'
]
Establish success criteria
Assessing the quality of summaries is a notoriously challenging task. Unlike many other natural language processing tasks, summary evaluation often lacks clear-cut, objective metrics. The process tends to be highly subjective, and different readers may value different aspects of a summary. Here are some criteria you may want to consider when assessing how well Claude performs legal summarization.
Factual accuracy
The summary should accurately present the facts, legal concepts, and key points in the document.
Legal precision
Terminology and references to statutes, case law or regulations must be correct and comply with legal standards.
Conciseness
The summary should compress the legal document into its core points without leaving out important details.
Consistency
When summarizing multiple documents, the LLM should maintain a consistent structure and approach for each summary.
Readability
The text should be clear and easy to understand. If the audience is not a legal expert, the summary should not contain legal terms that may confuse the audience.
Bias and Impartiality
Summaries should present legal arguments and positions fairly and without bias.
Check out our guide on establishing success criteria to learn more.
How to use Claude to summarize legal documents
Selecting the right Claude model
When summarizing legal documents, model accuracy is critical. Claude 3.5 Sonnet is an excellent choice for use cases where high accuracy is required. If the size and number of documents make cost a concern, you can also try a smaller model such as Claude 3 Haiku.
To help estimate these costs, here is a comparison of the costs of summarizing 1,000 sublease agreements using Sonnet and Haiku:
- Scale of content
- Number of agreements: 1,000
- Characters per agreement: 300,000
- Total characters: 300M
- Estimated Tokens
- Input tokens: 86M (assuming 1 token per 3.5 characters)
- Output tokens per abstract: 350
- Total output tokens: 350,000
- Claude 3.5 Sonnet estimated costs
- Input token cost: 86 MTok * $3.00/MTok = $258
- Output token cost: 0.35 MTok * $15.00/MTok = $5.25
- Total cost: $258.00 + $5.25 = $263.25
- Claude 3 Haiku estimated costs
- Input token cost: 86 MTok * $0.25/MTok = $21.50
- Output token cost: 0.35 MTok * $1.25/MTok = $0.44
- Total cost: $21.50 + $0.44 = $21.94
Actual costs may differ from these estimates. The estimates above are based on the prompting example shown later in this guide.
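If it helps to sanity-check the arithmetic, here is a minimal sketch of the calculation; the characters-per-token ratio and the per-MTok prices are simply the assumptions stated in the estimate above:

# Rough cost estimate for summarizing 1,000 sublease agreements,
# using the assumptions stated above (1 token is roughly 3.5 characters).
num_agreements = 1_000
chars_per_agreement = 300_000
chars_per_token = 3.5
output_tokens_per_summary = 350

input_mtok = round(num_agreements * chars_per_agreement / chars_per_token / 1e6)  # ~86 MTok
output_mtok = num_agreements * output_tokens_per_summary / 1e6                    # 0.35 MTok

# Per-MTok prices used in the estimate above
prices = {
    "Claude 3.5 Sonnet": {"input": 3.00, "output": 15.00},
    "Claude 3 Haiku": {"input": 0.25, "output": 1.25},
}

for model_name, price in prices.items():
    total = input_mtok * price["input"] + output_mtok * price["output"]
    print(f"{model_name}: ~${total:,.2f}")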
Convert files to a format that Claude can handle
Before you can start summarizing the document, you need to prepare the data. This involves extracting the text from the PDF, cleaning up the text, and making sure it can be processed by Claude.
Below is a demonstration of this process on a sample PDF:
from io import BytesIO
import re

import pypdf
import requests

def get_llm_text(pdf_file):
    reader = pypdf.PdfReader(pdf_file)
    text = "\n".join([page.extract_text() for page in reader.pages])

    # Remove page numbers (lines that contain only a number),
    # before newlines are collapsed below
    text = re.sub(r'\n\s*\d+\s*\n', '\n', text)

    # Collapse extra whitespace
    text = re.sub(r'\s+', ' ', text)

    return text

# Create the full URL for the sample file in the GitHub repository
url = "https://raw.githubusercontent.com/anthropics/anthropic-cookbook/main/skills/summarization/data/Sample Sublease Agreement.pdf"
url = url.replace(" ", "%20")

# Download the PDF file into memory
response = requests.get(url)

# Load the PDF from memory
pdf_file = BytesIO(response.content)

document_text = get_llm_text(pdf_file)
print(document_text[:50000])
In this example, we first download a PDF of a sample sublease agreement from the summarization cookbook. This agreement is a publicly available sublease agreement sourced from sec.gov.
We use the pypdf library to extract the contents of the PDF and convert it to text. The text data is then cleaned up by removing extra spaces and page numbers.
Build a strong prompt
Claude can accommodate a variety of summarization styles. You can adjust the level of detail in the prompt to direct Claude to produce more or less detailed or concise output, include more or less technical terminology, or provide more or less background context.
Below is an example of how to create a prompt that ensures the generated summaries follow a consistent structure when analyzing sublease agreements:
import anthropic

# Initialize the Anthropic client
client = anthropic.Anthropic()

def summarize_document(text, details_to_extract, model="claude-3-5-sonnet-20240620", max_tokens=1000):

    # Format the details to extract to be placed within the prompt's context
    details_to_extract_str = '\n'.join(details_to_extract)

    # Prompt the model to summarize the sublease agreement
    prompt = f"""Summarize the following sublease agreement. Focus on these key aspects:

    {details_to_extract_str}

    Provide the summary in bullet points nested within the XML header for each section. For example:

    <parties involved>
    - Sublessor: [Name]
    // Add more details as needed
    </parties involved>

    If any information is not explicitly stated in the document, note it as "Not specified". Do not include a preamble.

    Sublease agreement content:
    {text}
    """

    response = client.messages.create(
        model=model,
        max_tokens=max_tokens,
        system="You are a legal analyst specializing in real estate law, known for highly accurate and detailed summaries of sublease agreements.",
        messages=[
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": "Here is the summary of the sublease agreement: <summary>"}
        ],
        stop_sequences=["</summary>"]
    )

    return response.content[0].text
sublease_summary = summarize_document(document_text, details_to_extract)
print(sublease_summary)
This code implements a summarize_document function that uses Claude to summarize the contents of a sublease agreement. The function accepts a text string and a list of details to extract as inputs. In this example, we call the function with the document_text and details_to_extract variables defined earlier.
Inside the function, a prompt is generated for Claude containing the document to be summarized, the details to extract, and specific instructions for summarizing the document. The prompt asks Claude to return the summary of each extracted detail nested within XML tags.
Since we decided to output each section of the summary within tags, each section can easily be parsed out in a post-processing step. This approach produces structured summaries that can be adapted to your use case, ensuring that every summary follows the same pattern.
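For example, here is a minimal sketch of that post-processing step; it assumes the section tags match the names requested in the prompt (such as <parties involved>), so adjust the lookups to the details you actually extract:

import re

def parse_summary_sections(summary_text):
    # Find every "<section name> ... </section name>" pair and return a dict
    # mapping section names to their bullet-point contents.
    pattern = r'<([^<>/]+)>(.*?)</\1>'
    return {name.strip(): body.strip() for name, body in re.findall(pattern, summary_text, re.DOTALL)}

sections = parse_summary_sections(sublease_summary)
print(sections.get("parties involved", "Not found"))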
Evaluate your prompt
Prompts often need testing and optimization before they are ready for production. To determine whether your solution is ready, evaluate the quality of the summaries with a systematic process that combines quantitative and qualitative methods. Building a strong empirical evaluation based on your defined success criteria will help you optimize your prompt. Here are some metrics you may want to include in your evaluation:
ROUGE score
BLEU score
Context embedding similarity
LLM-based scoring
Manual assessment
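Of these, LLM-based scoring is often the quickest to set up. Below is a minimal sketch in which Claude itself grades a summary against its source document; the rubric wording and the 1-to-5 scale are illustrative choices, not a prescribed method:

def grade_summary(document_text, summary, model="claude-3-5-sonnet-20240620"):
    # Ask Claude to grade a generated summary against the source document.
    grading_prompt = f"""Grade the following summary of a legal document.

    <document>
    {document_text}
    </document>

    <summary>
    {summary}
    </summary>

    Rate the summary from 1 to 5 on factual accuracy, legal precision, conciseness,
    and readability. Briefly justify each rating, then give an overall rating on the
    final line in the form "Overall: N".
    """
    response = client.messages.create(
        model=model,
        max_tokens=500,
        messages=[{"role": "user", "content": grading_prompt}]
    )
    return response.content[0].text

print(grade_summary(document_text, sublease_summary))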
Deployment Tips
Keep the following considerations in mind when deploying your solution to a production environment.
- Understand your liability: Understand the potential legal implications of errors in the summaries, which could result in legal liability for your organization or its clients. Provide a disclaimer or legal notice clarifying that the summaries were generated by AI and need to be reviewed by legal professionals.
- Handle multiple document types: In this guide, we discussed how to extract text from PDFs. In practice, documents may come in a variety of formats (PDFs, Word documents, text files, and so on). Make sure your data extraction pipeline can convert all of the file formats you expect to receive.
- Parallelize API calls to Claude: For long documents containing many tokens, Claude may take up to a minute to generate a summary. For large document collections, you may need to send API calls to Claude in parallel so that all summaries complete in a reasonable time frame; a minimal sketch follows this list. Refer to Anthropic's rate limits to determine the maximum number of API calls that can be executed in parallel.
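Here is a minimal sketch of parallelizing the earlier summarize_document calls with a thread pool; the worker count is an illustrative placeholder to set according to your rate limits:

from concurrent.futures import ThreadPoolExecutor

def summarize_documents_in_parallel(document_texts, details_to_extract, max_workers=5):
    # Fan the summarization calls out across a small pool of worker threads.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(
            lambda text: summarize_document(text, details_to_extract),
            document_texts
        ))

summaries = summarize_documents_in_parallel([document_text], details_to_extract)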
Improve performance
In complex scenarios, it may be beneficial to consider additional strategies beyond standard prompt engineering techniques to improve performance. Here are some advanced strategies:
Perform meta-summarization to summarize long documents
Legal summarization often involves working with long documents, or multiple related documents, that exceed Claude's context window. You can use a chunking technique known as meta-summarization to handle this situation. The technique involves splitting the documents into smaller, manageable chunks and processing each chunk separately. You can then combine the summaries of each chunk to produce a meta-summary of the entire document.
The following is an example of how to perform a meta-summary:
import anthropic

# Initialize the Anthropic client
client = anthropic.Anthropic()

def chunk_text(text, chunk_size=20000):
    return [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]

def summarize_long_document(text, details_to_extract, model="claude-3-5-sonnet-20240620", max_tokens=1000):

    # Format the details to extract to be placed within the prompt's context
    details_to_extract_str = '\n'.join(details_to_extract)

    # Iterate over the chunks and summarize each one individually
    chunk_summaries = [summarize_document(chunk, details_to_extract, model=model, max_tokens=max_tokens) for chunk in chunk_text(text)]

    final_summary_prompt = f"""
    You are looking at the chunked summaries of multiple documents that are all related.
    Combine the following summaries of the document from different truthful sources into a coherent overall summary:

    {"".join(chunk_summaries)}

    Focus on these key aspects:
    {details_to_extract_str}

    Provide the summary in bullet points nested within the XML header for each section. For example:

    <parties involved>
    - Sublessee: [Name]
    // Add more details as needed
    </parties involved>

    If any information is not explicitly stated in the document, note it as "Not specified". Do not include a preamble.
    """

    response = client.messages.create(
        model=model,
        max_tokens=max_tokens,
        system="You are a legal expert who specializes in summarizing document notes.",
        messages=[
            {"role": "user", "content": final_summary_prompt},
            {"role": "assistant", "content": "Here is the summary of the sublease agreement: <summary>"}
        ],
        stop_sequences=["</summary>"]
    )

    return response.content[0].text
long_summary = summarize_long_document(document_text, details_to_extract)
print(long_summary)
The summarize_long_document function builds on the earlier summarize_document function by splitting the document into smaller chunks and summarizing each chunk individually.
The code achieves this by applying summarize_document to each 20,000-character chunk of the original document. The individual chunk summaries are then combined to produce a final summary built from those chunk summaries.
Note that the summarize_long_document function is not strictly necessary for our example PDF, since the entire document fits within Claude's context window. However, it becomes essential for documents that exceed Claude's context window, or when summarizing multiple related documents together. In any case, this meta-summarization technique often captures additional important details in the final summary that a single-pass approach would miss.
Use summary indexed documents to explore large document collections
Searching a collection of documents with a Large Language Model (LLM) typically involves retrieval-augmented generation (RAG). However, in scenarios involving large documents or when precise information retrieval is critical, a basic RAG approach may be insufficient. Summary indexed documents is an advanced RAG approach that provides a more efficient way of ranking documents for retrieval, using less context than traditional RAG methods. In this approach, you first use Claude to generate a concise summary for each document in your corpus, and then use Claude to rank the relevance of each summary to the query being asked. For further details on this approach, including a code-based example, check out the summary indexed documents section of the summarization cookbook.
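As a rough illustration of the ranking step, here is a minimal sketch; it assumes you already have a doc_summaries dictionary mapping document IDs to their generated summaries, and that Claude is asked to reply with a bare relevance score, both of which are simplifications of the cookbook's approach:

def rank_summaries(query, doc_summaries, model="claude-3-5-sonnet-20240620"):
    # Score each document summary's relevance to the query, then sort descending.
    scored = []
    for doc_id, summary in doc_summaries.items():
        prompt = f"""On a scale of 0 to 10, how relevant is this document summary to the query?
        Respond with a single number and nothing else.

        Query: {query}

        Summary: {summary}
        """
        response = client.messages.create(
            model=model,
            max_tokens=5,
            messages=[{"role": "user", "content": prompt}]
        )
        scored.append((doc_id, float(response.content[0].text.strip())))
    return sorted(scored, key=lambda item: item[1], reverse=True)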
Fine-tuning Claude to learn your dataset
Another advanced technique for improving Claude's ability to generate summaries is fine-tuning. Fine-tuning involves training Claude on a customized dataset that is highly aligned with your legal summarization needs, ensuring that it adapts to your usage scenario. The following is an overview of performing fine-tuning:
- Identify shortcomings: Begin by collecting examples of Claude summaries that fall short of your requirements, such as omitting key legal details, misunderstanding context, or using inappropriate legal terminology.
- Curate a dataset: Once these issues are identified, compile a dataset containing examples of them. This dataset should pair the original legal documents with your corrected summaries, ensuring that Claude learns the desired behavior. A sketch of this step follows the list.
- Perform fine-tuning: Fine-tuning involves retraining the model on your curated dataset to adjust its weights and parameters. This retraining helps Claude better understand the specific requirements of your legal domain and improves its ability to summarize documents according to your criteria.
- Iterative improvement: Fine-tuning is not a one-time process. As Claude continues to generate summaries, you can iteratively add new examples where it underperformed, further refining its capabilities. Over time, this ongoing feedback loop produces a model that is highly specialized for your legal summarization tasks.
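To make the dataset-curation step concrete, here is a minimal sketch that writes (document, corrected summary) pairs to a JSONL file; the field names are purely illustrative, and the exact record format you need depends on your fine-tuning provider's requirements:

import json

# Hypothetical training examples: original documents paired with the corrected
# summaries you want the fine-tuned model to learn to produce.
training_examples = [
    {"document": document_text, "corrected_summary": sublease_summary},
    # ... add more (document, corrected summary) pairs here
]

with open("legal_summarization_finetuning.jsonl", "w") as f:
    for example in training_examples:
        # Illustrative schema only; consult your fine-tuning provider's
        # documentation for the required record format.
        record = {
            "prompt": f"Summarize the following sublease agreement:\n\n{example['document']}",
            "completion": example["corrected_summary"],
        }
        f.write(json.dumps(record) + "\n")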
Fine-tuning is currently only available through Amazon Bedrock. For more details, see the AWS launch blog.