Content moderation is a key aspect of maintaining a safe, respectful, and productive environment in digital applications. In this guide, we'll discuss how to use Claude to moderate content within your digital application.
Visit our content moderation cookbook to see an example of a content moderation implementation built with Claude.
This guide focuses on moderating user-generated content within your application. If you're looking for guidance on moderating your interactions with Claude, please refer to our guardrails guide instead.
Before using Claude
Decide whether to use Claude for content moderation
Here are some key indicators that you should choose an LLM like Claude over a traditional machine learning or rules-based approach for content moderation:
You want to achieve a cost-effective and fast implementation
Traditional machine learning methods require significant engineering resources, machine learning expertise, and infrastructure costs. Human moderation systems cost even more. With Claude, you can build sophisticated moderation systems in less time and at a lower cost.
You need semantic understanding and fast decision making
Traditional machine learning approaches, such as bag-of-words models or simple pattern matching, often struggle to understand the tone, intent, and context of content. While human moderation systems excel at understanding semantic meaning, they need time to review each piece of content. Claude bridges this gap by combining semantic understanding with the ability to deliver moderation decisions quickly.
You need consistent policy decisions
By leveraging its advanced reasoning capabilities, Claude can interpret and apply complex moderation guidelines consistently. This consistency helps ensure fair treatment of all content and reduces the risk that inconsistent or biased moderation decisions will undermine user trust.
Your moderation policies may change or evolve over time
Once a traditional machine learning approach has been established, changing it can be a time-consuming and data-intensive endeavor. By contrast, as your product or customer needs evolve, Claude can easily adapt to changes or additions to moderation policies without extensive relabeling of training data.
You need interpretable reasoning for your moderation decisions
If you want to provide users or regulators with clear explanations behind moderation decisions, Claude can generate detailed and coherent rationales. This transparency is critical for building trust and ensuring accountability in content moderation practices.
You need multi-language support without having to maintain separate models
Traditional machine learning approaches typically require separate models or extensive translation processes for each supported language. Human moderation requires hiring a workforce fluent in each supported language. Claude's multilingual capabilities simplify moderation for a global user base by classifying content in a variety of languages without the need for separate models or extensive translation pipelines.
You need multimodal support
Claude's multimodal capabilities allow it to analyze and interpret content across both text and images. This makes it a versatile tool for comprehensive content moderation in environments where different media types need to be evaluated together.
Anthropic has trained all Claude models to be helpful, honest, and harmless. As a result, Claude may moderate content it deems particularly dangerous (in line with our Acceptable Use Policy), regardless of the prompt used. For example, an adult website that wants to allow users to post explicit sexual content may find that Claude still flags explicit content as requiring moderation, even if the prompt explicitly states that explicit content should not be moderated. We recommend reviewing our AUP before building a moderation solution.
Generate examples of content to moderate
Before developing a content moderation solution, first create examples of content that should be flagged and content that should not be flagged. Be sure to include edge cases and difficult scenarios that may challenge your content moderation system's ability to handle them effectively. Afterwards, review your examples to create a well-defined list of moderation categories. For instance, the examples generated by a social media platform might include the following:
allowed_user_comments = [
    'The movie was fantastic, I really enjoyed it. The main character really killed it!',
    'I hate Mondays.',
    "It's a good time to invest in gold!"
]

disallowed_user_comments = [
    'Delete this post immediately or you better hide. I will catch up with you and your family.',
    'Stay away from 5G phones! They are controlling you with 5G.',
    'Congratulations! You have won a gift card worth $1,000. Click here to claim your prize!'
]

# Sample user comments used to test the content moderation
user_comments = allowed_user_comments + disallowed_user_comments

# List of categories considered unsafe for content moderation
unsafe_categories = [
    'Child Exploitation',
    'Conspiracy Theories',
    'Hate',
    'Indiscriminate Weapons',
    'Intellectual Property',
    'Non-Violent Crimes',
    'Privacy',
    'Self-Harm',
    'Sex Crimes',
    'Sexual Content',
    'Specialized Advice',
    'Violent Crimes'
]
Effectively moderating these examples requires a nuanced understanding of language. In the comment 'The movie was fantastic, I really enjoyed it. The main character really killed it!', the content moderation system needs to recognize that 'killed it' is a metaphor, not an indication of actual violence. Conversely, despite the lack of explicit mentions of violence, the comment 'Delete this post immediately or you better hide. I will catch up with you and your family.' should be flagged by the content moderation system.
The unsafe_categories list can be customized to fit your specific needs. For example, if you wish to prevent minors from creating content on your website, you could add 'Underage Posting' to the list (see the short sketch below).
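As a minimal sketch of that kind of customization, assuming the unsafe_categories list defined above (the 'Underage Posting' category is purely illustrative):
# A minimal sketch of tailoring the category list to your platform's policies;
# 'Underage Posting' is the illustrative category mentioned above.
custom_unsafe_categories = unsafe_categories + ['Underage Posting']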
How to moderate content using Claude
Selecting the right Claude model
When selecting a model, it's important to consider the size of your data. If costs are a concern, a smaller model like Claude 3 Haiku is an excellent choice due to its cost-effectiveness. Below is an estimate of the cost to moderate text for a social media platform that receives one billion posts per month:
- Content size
  - Posts per month: 1bn
  - Characters per post: 100
  - Total characters: 100bn
- Estimated tokens
  - Input tokens: 28.6bn (assuming 1 token per 3.5 characters)
  - Percentage of messages flagged: 3%
  - Output tokens per flagged message: 50
  - Total output tokens: 1.5bn
- Claude 3 Haiku estimated cost
  - Input token cost: 28,600 MTok * $0.25/MTok = $7,150
  - Output token cost: 1,500 MTok * $1.25/MTok = $1,875
  - Monthly cost: $7,150 + $1,875 = $9,025
- Claude 3.5 Sonnet estimated cost
  - Input token cost: 28,600 MTok * $3.00/MTok = $85,800
  - Output token cost: 1,500 MTok * $15.00/MTok = $22,500
  - Monthly cost: $85,800 + $22,500 = $108,300
Actual costs may differ from these estimates. These estimates are based on the prompt highlighted in the Consider batch processing section below. Output token counts can be reduced even further by removing the explanation field from the response.
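As a rough illustration, here is a minimal sketch of the arithmetic behind these estimates. The traffic figures, the 3% flag rate, and the per-MTok prices are the assumptions stated above; substitute your own numbers to re-run the estimate.
# A rough sketch of the cost arithmetic above; all inputs are assumptions,
# so substitute your own traffic figures and current per-MTok pricing.
posts_per_month = 1_000_000_000
chars_per_post = 100
chars_per_token = 3.5
flag_rate = 0.03                   # share of posts that get flagged
output_tokens_per_flagged = 50

input_tokens = posts_per_month * chars_per_post / chars_per_token   # ~28.6bn
output_tokens = posts_per_month * flag_rate * output_tokens_per_flagged  # ~1.5bn

def monthly_cost(input_price_per_mtok, output_price_per_mtok):
    # Prices are expressed in dollars per million tokens (MTok)
    return (input_tokens / 1e6) * input_price_per_mtok + (output_tokens / 1e6) * output_price_per_mtok

print(f"Claude 3 Haiku:    ${monthly_cost(0.25, 1.25):,.0f} per month")
print(f"Claude 3.5 Sonnet: ${monthly_cost(3.00, 15.00):,.0f} per month")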
Build a strong prompt
In order to use Claude for content moderation, Claude must understand your application's moderation requirements. Let's start by writing a prompt that allows you to define your moderation needs:
import anthropic
import json

# Initialize the Anthropic client
client = anthropic.Anthropic()

def moderate_message(message, unsafe_categories):
    # Convert the list of unsafe categories into a string, with one category per line
    unsafe_category_str = '\n'.join(unsafe_categories)

    # Construct the prompt for Claude, including the message and the unsafe categories
    assessment_prompt = f"""
    Determine whether the following message warrants moderation,
    based on the unsafe categories outlined below.

    Message:
    <message>{message}</message>

    Unsafe Categories:
    {unsafe_category_str}

    Respond with ONLY a JSON object, using the format below:
    {{
    "violation": <Boolean field denoting whether the message should be moderated>,
    "categories": [Comma-separated list of violated categories],
    "explanation": [Optional. Only include if there is a violation.]
    }}"""

    # Send the content moderation request to Claude
    response = client.messages.create(
        model="claude-3-haiku-20240307",  # Using the Haiku model for lower costs
        max_tokens=200,
        temperature=0,  # Use 0 temperature for increased consistency
        messages=[
            {"role": "user", "content": assessment_prompt}
        ]
    )

    # Parse the JSON response from Claude
    assessment = json.loads(response.content[0].text)

    # Extract the violation status from the assessment
    contains_violation = assessment['violation']

    # If there's a violation, get the categories and explanation; otherwise, use empty defaults
    violated_categories = assessment.get('categories', []) if contains_violation else []
    explanation = assessment.get('explanation') if contains_violation else None

    return contains_violation, violated_categories, explanation

# Process each comment and print the results
for comment in user_comments:
    print(f"\nComment: {comment}")
    violation, violated_categories, explanation = moderate_message(comment, unsafe_categories)

    if violation:
        print(f"Violated Categories: {', '.join(violated_categories)}")
        print(f"Explanation: {explanation}")
    else:
        print("No issues detected.")
In this example, the moderate_message function contains an assessment prompt that includes the unsafe content categories and the message we wish to evaluate. The prompt asks Claude to assess whether the message should be moderated, based on the unsafe categories we defined. The model's assessment is then parsed to determine whether a violation exists. If there is a violation, Claude also returns a list of the violated categories and an explanation of why the message is considered unsafe.
Evaluate your prompt
Content moderation is a classification problem. Therefore, you can use the same techniques outlined in our classification cookbook to determine the accuracy of your content moderation system.
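For instance, below is a minimal evaluation sketch, assuming the moderate_message function and the labeled allowed_user_comments / disallowed_user_comments lists defined earlier; it simply compares Claude's verdicts against those labels to compute precision and recall.
# A minimal evaluation sketch, assuming moderate_message, unsafe_categories,
# allowed_user_comments, and disallowed_user_comments from earlier in this guide.
def evaluate_moderation(allowed, disallowed):
    true_positives = false_positives = false_negatives = 0

    for comment in allowed:
        violation, _, _ = moderate_message(comment, unsafe_categories)
        if violation:
            false_positives += 1  # Safe content incorrectly flagged

    for comment in disallowed:
        violation, _, _ = moderate_message(comment, unsafe_categories)
        if violation:
            true_positives += 1   # Unsafe content correctly flagged
        else:
            false_negatives += 1  # Unsafe content missed

    precision = true_positives / (true_positives + false_positives) if (true_positives + false_positives) else 0.0
    recall = true_positives / (true_positives + false_negatives) if (true_positives + false_negatives) else 0.0
    return precision, recall

precision, recall = evaluate_moderation(allowed_user_comments, disallowed_user_comments)
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}")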
One additional consideration is that instead of treating content moderation as a binary classification problem, you can create multiple categories that represent different risk levels. Creating multiple risk levels allows you to adjust the aggressiveness of your moderation. For example, you might automatically block user queries deemed high risk, while flagging medium-risk queries for human review.
import anthropic
import json

# Initialize the Anthropic client
client = anthropic.Anthropic()

def assess_risk_level(message, unsafe_categories):
    # Convert the list of unsafe categories into a string, with one category per line
    unsafe_category_str = '\n'.join(unsafe_categories)

    # Construct the prompt for Claude, including the message to assess, the unsafe categories, and the risk level definitions
    assessment_prompt = f"""
    Assess the risk level of the following message warranting moderation,
    based on the unsafe categories listed below.

    Message:
    <message>{message}</message>

    Unsafe Categories:
    {unsafe_category_str}

    Assign a risk level based on your confidence that the user's message should be moderated
    according to the defined unsafe categories:

    0 - No risk
    1 - Low risk
    2 - Medium risk
    3 - High risk

    Respond with ONLY a JSON object, using the format below:
    {{
    "risk_level": <Numerical field denoting the risk level>,
    "categories": [Comma-separated list of violated categories],
    "explanation": <Optional. Only include if the risk level is greater than 0>
    }}"""

    # Send the risk assessment request to Claude
    response = client.messages.create(
        model="claude-3-haiku-20240307",  # Using the Haiku model for lower costs
        max_tokens=200,
        temperature=0,  # Use 0 temperature for increased consistency
        messages=[
            {"role": "user", "content": assessment_prompt}
        ]
    )

    # Parse the JSON response from Claude
    assessment = json.loads(response.content[0].text)

    # Extract the risk level, violated categories, and explanation from the assessment
    risk_level = assessment["risk_level"]
    violated_categories = assessment["categories"]
    explanation = assessment.get("explanation")

    return risk_level, violated_categories, explanation

# Process each comment and print the results
for comment in user_comments:
    print(f"\nComment: {comment}")
    risk_level, violated_categories, explanation = assess_risk_level(comment, unsafe_categories)

    print(f"Risk Level: {risk_level}")
    if violated_categories:
        print(f"Violated Categories: {', '.join(violated_categories)}")
    if explanation:
        print(f"Explanation: {explanation}")
This code implements an assess_risk_level function that uses Claude to evaluate a message's risk level. The function accepts a message and a list of unsafe categories as inputs. Within the function, a prompt is generated for Claude that includes the message to be assessed, the unsafe categories, and specific descriptions of the risk levels. The prompt instructs Claude to respond with a JSON object that includes the risk level, the violated categories, and an optional explanation.
This approach enables flexible content moderation by assigning risk levels. It can be seamlessly integrated into a larger system to automate content filtering or to flag comments for human review based on their assessed risk level. For example, when this code is executed, the comment 'Delete this post immediately or you better hide. I will catch up with you and your family.' is identified as high risk due to its dangerous threat, whereas the comment 'Stay away from 5G phones! They are controlling you with 5G.' is categorized as medium risk.
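As one possible integration, the sketch below shows how assessed risk levels might drive automated actions; the thresholds and the block/review/allow routing are illustrative assumptions, not part of the guide's own code.
# A hypothetical routing policy built on assess_risk_level; the thresholds and
# actions here are illustrative assumptions for how a larger system might use the scores.
def route_comment(comment, unsafe_categories):
    risk_level, categories, explanation = assess_risk_level(comment, unsafe_categories)

    if risk_level >= 3:
        return "block"         # High risk: block automatically
    elif risk_level == 2:
        return "human_review"  # Medium risk: queue for human review
    else:
        return "allow"         # No or low risk: publish as-is

for comment in user_comments:
    print(f"{route_comment(comment, unsafe_categories):>12}: {comment}")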
Deploy your prompt
Once you are confident in the quality of your solution, it's time to deploy it to production. Here are some best practices to follow when using content moderation in production:
- **Provide clear feedback to users:** When user input is blocked or a response is flagged due to content moderation, provide informative and constructive feedback that helps users understand why their message was flagged and how they can rephrase it appropriately. In the code examples above, this is done via the explanation field in Claude's response (a short sketch follows this list).
- **Analyze moderated content:** Keep track of the types of content being flagged by your moderation system to identify trends and potential areas for improvement.
- **Continuously evaluate and improve:** Regularly assess the performance of your content moderation system using metrics such as precision and recall tracking. Use this data to iteratively refine your moderation prompts, keywords, and assessment criteria.
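Here is a brief sketch of the first practice above, assuming the moderate_message function defined earlier; the wording of the user-facing message is purely illustrative.
# An illustrative way to surface moderation feedback to end users,
# assuming the moderate_message function defined earlier in this guide.
def build_user_feedback(comment, unsafe_categories):
    violation, categories, explanation = moderate_message(comment, unsafe_categories)
    if not violation:
        return None  # Nothing to report; the comment can be published
    return (
        "Your comment could not be posted because it appears to violate our "
        f"guidelines ({', '.join(categories)}). {explanation} "
        "Please rephrase your comment and try again."
    )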
Improve performance
In complex scenarios, it may be helpful to consider additional strategies beyond standard prompt engineering techniques to improve performance. Here are some advanced strategies:
Define topics and provide examples
In addition to listing the unsafe categories in the prompt, further improvements can be made by providing definitions of, and phrases related to, each category.
import anthropic
import json

# Initialize the Anthropic client
client = anthropic.Anthropic()

# Dictionary of unsafe categories and their definitions for content moderation
unsafe_category_definitions = {
    'Child Exploitation': 'Content that involves child nudity or promotes, encourages, justifies, or depicts child sexual abuse.',
    'Conspiracy Theories': """Content that promotes or endorses unsubstantiated, false, or misleading theories about events, situations, or individuals,
        which can incite harm or undermine public trust in institutions or facts.""",
    'Hate': """Content that expresses hatred toward groups of people based on protected characteristics
        (race, color, ethnicity, national origin, disability, religious affiliation, caste, sexual orientation, sex, gender identity, and serious disease),
        or that promotes negative stereotypes about them.""",
    'Indiscriminate Weapons': 'Content involving the creation of weapons of mass destruction.',
    'Intellectual Property': 'Content that infringes on the intellectual property rights of third parties.',
    'Non-Violent Crimes': 'Content that encourages, justifies, or promotes non-violent crimes.',
    'Privacy': 'Content that contains sensitive, personal information about private individuals.',
    'Self-Harm': 'Content that encourages, justifies, or promotes acts of self-harm.',
    'Sex Crimes': 'Content that encourages, justifies, or promotes sex crimes.',
    'Sexual Content': 'Content that contains erotic descriptions or explicit depictions of sex acts.',
    'Specialized Advice': """Content that contains financial, medical, or legal advice.
        Financial advice includes guidance on investments, stocks, bonds, or any financial planning.""",
    'Violent Crimes': 'Content that encourages, justifies, or promotes violent crimes.',
}

def moderate_message_with_definitions(message, unsafe_category_definitions):
    # Format the unsafe categories string, with each category and its definition on a new line
    unsafe_category_str = '\n'.join(f'{category}: {definition}'
        for category, definition in unsafe_category_definitions.items()
    )

    # Construct the prompt for Claude, including the message and the unsafe categories
    assessment_prompt = f"""Determine whether the following message warrants moderation, based on the unsafe categories outlined below.

Message:
<message>{message}</message>

Unsafe Categories and Their Definitions:
{unsafe_category_str}

It's important that you remember all unsafe categories and their definitions.

Respond with ONLY a JSON object, using the format below:
{{
"violation": <Boolean field denoting whether the message should be moderated>,
"categories": [Comma-separated list of violated categories],
"explanation": [Optional. Only include if there is a violation.]
}}"""

    # Send the content moderation request to Claude
    response = client.messages.create(
        model="claude-3-haiku-20240307",  # Using the Haiku model for lower costs
        max_tokens=200,
        temperature=0,  # Use 0 temperature for increased consistency
        messages=[
            {"role": "user", "content": assessment_prompt}
        ]
    )

    # Parse the JSON response from Claude
    assessment = json.loads(response.content[0].text)

    # Extract the violation status from the assessment
    contains_violation = assessment['violation']

    # If there's a violation, get the categories and explanation; otherwise, use empty defaults
    violated_categories = assessment.get('categories', []) if contains_violation else []
    explanation = assessment.get('explanation') if contains_violation else None

    return contains_violation, violated_categories, explanation

# Process each comment and print the results
for comment in user_comments:
    print(f"\nComment: {comment}")
    violation, violated_categories, explanation = moderate_message_with_definitions(comment, unsafe_category_definitions)

    if violation:
        print(f"Violated Categories: {', '.join(violated_categories)}")
        print(f"Explanation: {explanation}")
    else:
        print("No issues detected.")
The moderate_message_with_definitions function expands on the earlier moderate_message function by allowing each unsafe category to be paired with a detailed definition. In this code, the original function's unsafe_categories list is replaced with an unsafe_category_definitions dictionary, which maps each unsafe category to its corresponding definition. Both the category names and their definitions are included in the prompt.
Notably, the definition of the Specialized Advice category now specifies the types of financial advice that should be prohibited. As a result, the comment "It's a good time to invest in gold!", which previously passed the moderate_message check, now triggers a violation.
Consider batch processing
In cases where real-time moderation isn't required, consider moderating messages in batches to reduce costs. Include multiple messages within the prompt's context and ask Claude to assess which messages should be moderated.
import anthropic
import json

# Initialize the Anthropic client
client = anthropic.Anthropic()

def batch_moderate_messages(messages, unsafe_categories):
    # Convert the list of unsafe categories into a string, with one category per line
    unsafe_category_str = '\n'.join(unsafe_categories)

    # Format the messages string, wrapping each message in XML-like tags and giving it an ID
    messages_str = '\n'.join([f'<message id={idx}>{msg}</message>' for idx, msg in enumerate(messages)])

    # Construct the prompt for Claude, including the messages to assess and the unsafe categories
    assessment_prompt = f"""Determine which of the following messages warrant moderation, based on the unsafe categories outlined below.

Messages:
<messages>
{messages_str}
</messages>

Unsafe Categories:
{unsafe_category_str}

Respond with ONLY a JSON object, using the format below:
{{
"violations": [
    {{
    "id": <message id>,
    "categories": [list of violated categories],
    "explanation": <explanation of why there is a violation>
    }},
    ...
]
}}

Important Notes:
- Remember to analyze every message for a violation.
- Select any violation categories that reasonably apply."""

    # Send the moderation request to Claude
    response = client.messages.create(
        model="claude-3-haiku-20240307",  # Using the Haiku model for lower costs
        max_tokens=2048,  # Increased max token count to handle batches
        temperature=0,  # Use 0 temperature for increased consistency
        messages=[
            {"role": "user", "content": assessment_prompt}
        ]
    )

    # Parse the JSON response from Claude
    assessment = json.loads(response.content[0].text)
    return assessment

# Process the batch of comments and get the response
response_obj = batch_moderate_messages(user_comments, unsafe_categories)

# Print the details of each detected violation
for violation in response_obj['violations']:
    print(f"""Comment: {user_comments[violation['id']]}
Violated Categories: {', '.join(violation['categories'])}
Explanation: {violation['explanation']}
""")
In this example, the batch_moderate_messages function handles the moderation of an entire batch of messages with a single Claude API call. Inside the function, a prompt is created that includes the list of messages to evaluate and the unsafe content categories. The prompt asks Claude to return a JSON object listing all messages that contain violations. Each message in the response is identified by its id, which corresponds to the message's position in the input list. Keep in mind that finding the optimal batch size for your specific needs may require some experimentation. While larger batch sizes can lower costs, they may also lead to a slight decrease in quality. Additionally, you may need to increase the max_tokens parameter in the Claude API call to accommodate longer responses. For details on the maximum number of output tokens your chosen model supports, refer to the model comparison page.
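As a starting point for that experimentation, here is a minimal sketch of splitting a larger comment stream into fixed-size batches using the batch_moderate_messages function above; the batch size of 10 is an arbitrary illustrative value to tune against your own cost and quality targets.
# A minimal sketch of processing comments in fixed-size batches; the batch size
# of 10 is an arbitrary starting point to tune for your own cost/quality trade-off.
def moderate_in_batches(comments, unsafe_categories, batch_size=10):
    all_violations = []
    for start in range(0, len(comments), batch_size):
        batch = comments[start:start + batch_size]
        assessment = batch_moderate_messages(batch, unsafe_categories)
        # Assumes each violation's 'id' comes back as an integer index, as in the example above;
        # re-map batch-local ids to positions in the full comment list
        for violation in assessment['violations']:
            violation['id'] = start + violation['id']
            all_violations.append(violation)
    return all_violations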