Content moderation is a critical aspect of maintaining a safe, respectful, and productive environment in digital applications. In this guide, we'll discuss how you can use Claude to moderate content within your digital application.
Visit our content moderation cookbook to see an example content moderation implementation using Claude.
This guide focuses on moderating user-generated content within your application. If you're looking for guidance on moderating your interactions with Claude, refer to our guardrails guide.
Before using Claude
Decide whether to use Claude for content moderation
Here are some key indicators that you should choose an LLM like Claude instead of a traditional machine learning or rules-based approach to content moderation:
You want to achieve a cost-effective and fast implementation
Traditional machine learning methods require significant engineering resources, machine learning expertise, and infrastructure costs. Human moderation systems cost even more. With Claude, you can build a sophisticated moderation system in less time and at a lower cost.
You need semantic understanding and fast decision making
Traditional machine learning approaches, such as bag-of-words models or simple pattern matching, often struggle to understand the tone, intent, and context of content. While human moderation systems excel at understanding semantic meaning, they are slow to review content. Claude bridges this gap by combining semantic understanding with the ability to deliver moderation decisions quickly.
You need consistent policy decisions
By leveraging its advanced reasoning capabilities, Claude can interpret and apply complex moderation guidelines consistently. This consistency helps ensure fair treatment of all content and reduces the risk that inconsistent or biased moderation decisions will undermine user trust.
Your moderation policies may change or evolve over time
Once a traditional machine learning approach has been established, changing it can be a laborious and data-intensive undertaking. By contrast, as your product or customer needs evolve, Claude can easily adapt to changes or additions to moderation policies without extensive relabeling of training data.
You need interpretable reasoning for your moderation decisions
If you want to provide users or regulators with clear explanations behind moderation decisions, Claude can generate detailed and coherent justifications. This transparency is important for building trust and ensuring accountability in content moderation practices.
You need multilingual support without maintaining separate models
Traditional machine learning approaches typically require separate models or extensive translation processes for each supported language, and human moderation requires hiring a workforce fluent in every supported language. Claude's multilingual capabilities simplify moderation for a global customer base by classifying content in a variety of languages without the need for separate models or extensive translation pipelines.
You need multimodal support
Claude's multimodal capabilities enable it to analyze and interpret both text and images. This makes it a versatile tool for comprehensive content moderation in environments where different media types need to be evaluated together.
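As a minimal illustration of what this can look like (the base64-encoded image in image_b64 and the deliberately simple placeholder prompt are assumptions; the full moderation prompts are developed later in this guide), an image and its accompanying text can be assessed in a single request:

```python
import anthropic

client = anthropic.Anthropic()

# Assumed inputs: a base64-encoded user-uploaded image and its caption (placeholders)
image_b64 = "..."  # base64-encoded JPEG data
caption = "Check out my new setup!"

response = client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=200,
    messages=[{
        "role": "user",
        "content": [
            # Image content block, passed alongside the text to be evaluated together
            {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": image_b64}},
            {"type": "text", "text": f"Does this image or its caption violate our content policy? Caption: {caption}"},
        ],
    }],
)
print(response.content[0].text)
```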
Anthropic has trained all Claude models to be honest, helpful, and harmless. This may result in Claude moderating content deemed particularly dangerous (in line with our Acceptable Use Policy), regardless of the prompt used. For example, an adult website that wants to allow users to post explicit sexual content may find that Claude still flags explicit content as requiring moderation, even if they specify in their prompt that explicit content should not be moderated. We recommend reviewing our AUP before building a moderation solution.
Generate examples of content to moderate
Before developing a content moderation solution, first create examples of content that should be flagged and content that should not be flagged. Make sure to include edge cases and challenging scenarios that may be difficult for a content moderation system to handle effectively. Afterwards, review your examples to create a well-defined list of moderation categories. For instance, the examples generated by a social media platform might include the following:
allowed_user_comments = [
    'This movie was great, I really enjoyed it. The main actor really killed it!',
    'I hate Mondays.',
    'It is a great time to invest in gold!'
]

disallowed_user_comments = [
    'Delete this post now or you better hide. I am coming after you and your family.',
    'Stay away from the 5G cellphones!! They are using 5G to control you.',
    'Congratulations! You have won a $1,000 gift card. Click here to claim your prize!'
]

# Sample user comments to test the content moderation on
user_comments = allowed_user_comments + disallowed_user_comments

# List of categories considered unsafe for content moderation
unsafe_categories = [
    'Child Exploitation',
    'Conspiracy Theories',
    'Hate',
    'Indiscriminate Weapons',
    'Intellectual Property',
    'Non-Violent Crimes',
    'Privacy',
    'Self-Harm',
    'Sex Crimes',
    'Sexual Content',
    'Specialized Advice',
    'Violent Crimes'
]
Effectively moderating these examples requires a nuanced understanding of language. In the comment 'This movie was great, I really enjoyed it. The main actor really killed it!', the content moderation system needs to recognize that 'killed it' is a metaphor, not an indication of actual violence. Conversely, despite the lack of explicit mentions of violence, the comment 'Delete this post now or you better hide. I am coming after you and your family.' should be flagged by the content moderation system.
The unsafe_categories list can be customized to fit your specific needs. For example, if you wish to prevent minors from creating content on your website, you could add 'Underage Publishing' to the list.
How to moderate content using Claude
Selecting the right Claude model
When selecting a model, it's important to consider the size of your data. If costs are a concern, a smaller model like Claude 3 Haiku is an excellent choice due to its cost-effectiveness. Below is an estimate of the cost to moderate text for a social media platform that receives one billion posts per month:
- Content size
  - Posts per month: 1bn
  - Characters per post: 100
  - Total characters: 100bn
- Estimated tokens
  - Input tokens: 28.6bn (assuming 1 token per 3.5 characters)
  - Percentage of messages flagged: 3%
  - Output tokens per flagged message: 50
  - Total output tokens: 1.5bn
- Claude 3 Haiku estimated cost
  - Input token cost: 28,600 MTok * $0.25/MTok = $7,150
  - Output token cost: 1,500 MTok * $1.25/MTok = $1,875
  - Monthly cost: $7,150 + $1,875 = $9,025
- Claude 3.5 Sonnet estimated cost
  - Input token cost: 28,600 MTok * $3.00/MTok = $85,800
  - Output token cost: 1,500 MTok * $15.00/MTok = $22,500
  - Monthly cost: $85,800 + $22,500 = $108,300
Actual costs may differ from these estimates. These estimates are based on the prompt highlighted in the section on batch processing below. Output token counts can be reduced even further by removing the explanation field from the response.
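For reference, the figures above can be reproduced (up to rounding) with a few lines of arithmetic. The sketch below simply encodes the stated assumptions and the Claude 3 Haiku per-MTok prices used in the estimate; swap in your own volumes and prices as needed.

```python
# Back-of-the-envelope monthly cost estimate for Claude 3 Haiku, using the assumptions above
posts_per_month = 1_000_000_000
chars_per_post = 100
chars_per_token = 3.5
flag_rate = 0.03                 # 3% of posts are flagged
output_tokens_per_flagged = 50

input_price_per_mtok = 0.25      # Claude 3 Haiku input price, $ per million tokens
output_price_per_mtok = 1.25     # Claude 3 Haiku output price, $ per million tokens

# Convert volumes into millions of tokens (MTok)
input_mtok = posts_per_month * chars_per_post / chars_per_token / 1_000_000
output_mtok = posts_per_month * flag_rate * output_tokens_per_flagged / 1_000_000

monthly_cost = input_mtok * input_price_per_mtok + output_mtok * output_price_per_mtok
print(f"Input: {input_mtok:,.0f} MTok, Output: {output_mtok:,.0f} MTok, "
      f"Estimated monthly cost: ${monthly_cost:,.0f}")
```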
Build a strong prompt
In order for Claude to moderate your content, it must understand your application's moderation requirements. Let's start by writing a prompt that allows you to define those requirements:
import anthropic
import json

# Initialize the Anthropic client
client = anthropic.Anthropic()

def moderate_message(message, unsafe_categories):
    # Convert the list of unsafe categories into a string, with each category on a new line
    unsafe_category_str = '\n'.join(unsafe_categories)

    # Construct the prompt for Claude, including the message and the unsafe categories
    assessment_prompt = f"""
    Determine whether the following message warrants moderation,
    based on the unsafe categories outlined below.

    Message:
    <message>{message}</message>

    Unsafe Categories:
    <categories>
    {unsafe_category_str}
    </categories>

    Respond with ONLY a JSON object, using the format below:
    {{
      "violation": <Boolean field denoting whether the message should be moderated>,
      "categories": [Comma-separated list of violated categories],
      "explanation": [Optional. Only include if there is a violation.]
    }}"""

    # Send the content moderation request to Claude
    response = client.messages.create(
        model="claude-3-haiku-20240307",  # Using the Haiku model for lower costs
        max_tokens=200,
        temperature=0,  # Use 0 temperature for increased consistency
        messages=[
            {"role": "user", "content": assessment_prompt}
        ]
    )

    # Parse the JSON response from Claude
    assessment = json.loads(response.content[0].text)

    # Extract the violation status from the assessment
    contains_violation = assessment['violation']

    # If there's a violation, get the categories and explanation; otherwise, use empty defaults
    violated_categories = assessment.get('categories', []) if contains_violation else []
    explanation = assessment.get('explanation') if contains_violation else None

    return contains_violation, violated_categories, explanation

# Process each comment and print the results
for comment in user_comments:
    print(f"\nComment: {comment}")
    violation, violated_categories, explanation = moderate_message(comment, unsafe_categories)

    if violation:
        print(f"Violated Categories: {', '.join(violated_categories)}")
        print(f"Explanation: {explanation}")
    else:
        print("No issues detected.")
In this example, the moderate_message function contains an assessment prompt that includes the unsafe content categories and the message we wish to evaluate. The prompt asks Claude to assess whether the message should be moderated, based on the unsafe categories we defined.
The model's assessment is then parsed to determine whether a violation exists. If there is a violation, Claude also returns a list of violated categories, as well as an explanation of why the message is deemed unsafe.
Evaluate your prompt
Content moderation is a classification problem. You can therefore use the same techniques outlined in our classification cookbook to determine the accuracy of your content moderation system.
One additional consideration is that, instead of treating content moderation as a binary classification problem, you can create multiple categories to represent different risk levels. Creating multiple risk levels allows you to adjust the aggressiveness of your moderation. For example, you might automatically block user queries that are deemed high risk, while flagging the many medium-risk queries for human review.
import anthropic
import json

# Initialize the Anthropic client
client = anthropic.Anthropic()

def assess_risk_level(message, unsafe_categories):
    # Convert the list of unsafe categories into a string, with each category on a new line
    unsafe_category_str = '\n'.join(unsafe_categories)

    # Construct the prompt for Claude, including the message to assess, the unsafe categories, and the risk level definitions
    assessment_prompt = f"""
    Assess the risk level of the following message to determine whether it warrants moderation,
    based on the unsafe categories listed below.

    Message:
    <message>{message}</message>

    Unsafe Categories:
    <categories>
    {unsafe_category_str}
    </categories>

    Assign a risk level based on your confidence that the user's message should be moderated
    according to the defined unsafe categories:

    0 - No risk
    1 - Low risk
    2 - Medium risk
    3 - High risk

    Respond with ONLY a JSON object, using the format below:
    {{
      "risk_level": <Numeric field denoting the risk level>,
      "categories": [Comma-separated list of violated categories],
      "explanation": <Optional. Only include if the risk level is greater than 0>
    }}"""

    # Send the risk assessment request to Claude
    response = client.messages.create(
        model="claude-3-haiku-20240307",  # Using the Haiku model for lower costs
        max_tokens=200,
        temperature=0,  # Use 0 temperature for increased consistency
        messages=[
            {"role": "user", "content": assessment_prompt}
        ]
    )

    # Parse the JSON response from Claude
    assessment = json.loads(response.content[0].text)

    # Extract the risk level, violated categories, and explanation from the assessment
    risk_level = assessment["risk_level"]
    violated_categories = assessment["categories"]
    explanation = assessment.get("explanation")

    return risk_level, violated_categories, explanation

# Process each comment and print the results
for comment in user_comments:
    print(f"\nComment: {comment}")
    risk_level, violated_categories, explanation = assess_risk_level(comment, unsafe_categories)

    print(f"Risk Level: {risk_level}")
    if violated_categories:
        print(f"Violated Categories: {', '.join(violated_categories)}")
    if explanation:
        print(f"Explanation: {explanation}")
This code implements an assess_risk_level function that uses Claude to evaluate the risk level of a message. The function accepts a message and a list of unsafe categories as inputs.
Within the function, a prompt is generated for Claude that includes the message to be assessed, the unsafe categories, and specific instructions for assigning a risk level. The prompt instructs Claude to respond with a JSON object that includes the risk level, the violated categories, and an optional explanation.
This approach enables flexible content moderation by assigning risk levels. It can be seamlessly integrated into a larger system to automatically filter content or flag comments for human review based on their assessed risk level. For example, when executing this code, the comment 'Delete this post now or you better hide. I am coming after you and your family.' is identified as high risk due to its dangerous threat. Conversely, the comment 'Stay away from the 5G cellphones!! They are using 5G to control you.' is categorized as medium risk.
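As one illustration of such an integration, the following minimal sketch routes comments using the assess_risk_level function above. The thresholds and the block/review/allow actions are illustrative assumptions that you would tune for your own application.

```python
def route_comment(comment, unsafe_categories):
    # Route a comment based on its assessed risk level (thresholds are illustrative)
    risk_level, categories, explanation = assess_risk_level(comment, unsafe_categories)

    if risk_level >= 3:
        return "block", categories, explanation         # Automatically reject high-risk content
    elif risk_level >= 1:
        return "human_review", categories, explanation  # Queue low/medium-risk content for review
    else:
        return "allow", categories, explanation         # Publish content assessed as no risk

for comment in user_comments:
    action, categories, explanation = route_comment(comment, unsafe_categories)
    print(f"{action.upper()}: {comment}")
```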
Deploy your prompt
When you're confident in the quality of your solution, it's time to deploy it to production. Here are some best practices to follow when using content moderation in production:
- **Provide clear feedback to users:** When user input is blocked or a response is flagged due to content moderation, provide informative and constructive feedback to help users understand why their message was flagged and how they can rephrase it appropriately. In the code examples above, this is done through the explanation field in Claude's response.
- **Analyze moderated content:** Keep track of the types of content being flagged by your moderation system to identify trends and potential areas for improvement.
- **Continuously evaluate and improve:** Regularly assess the performance of your content moderation system using metrics such as precision and recall tracking, and use this data to iteratively refine your moderation prompts, keywords, and assessment criteria (a sketch of such an evaluation follows this list).
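The following is a minimal sketch of such an evaluation. It reuses the moderate_message function and the labeled allowed_user_comments and disallowed_user_comments lists from earlier in this guide, treating "should be flagged" as the positive class; in practice you would evaluate against a larger, held-out labeled set.

```python
def evaluate_moderation(allowed, disallowed, unsafe_categories):
    # Count flagged comments in each labeled set; moderate_message returns (violation, categories, explanation)
    true_positives = sum(moderate_message(c, unsafe_categories)[0] for c in disallowed)
    false_positives = sum(moderate_message(c, unsafe_categories)[0] for c in allowed)
    false_negatives = len(disallowed) - true_positives

    # Precision: of the comments we flagged, how many should have been flagged?
    precision = true_positives / (true_positives + false_positives) if (true_positives + false_positives) else 0.0
    # Recall: of the comments that should have been flagged, how many did we flag?
    recall = true_positives / (true_positives + false_negatives) if (true_positives + false_negatives) else 0.0
    return precision, recall

precision, recall = evaluate_moderation(allowed_user_comments, disallowed_user_comments, unsafe_categories)
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}")
```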
Improve performance
In complex scenarios, it may be helpful to consider additional strategies beyond standard prompt engineering techniques to improve performance. Here are some advanced strategies:
Define topics and provide examples
In addition to listing the unsafe categories in the prompt, further improvements could be made by providing definitions and related phrases for each category.
import anthropic
import json

# Initialize the Anthropic client
client = anthropic.Anthropic()

# Dictionary of categories considered unsafe for content moderation, along with their definitions
unsafe_category_definitions = {
    'Child Exploitation': 'Content that depicts child nudity or that enables, encourages, excuses, or depicts the sexual abuse of children.',
    'Conspiracy Theories': """Content that promotes or endorses unfounded, false, or misleading theories about events, situations, or individuals,
        which can incite harm or undermine public trust in institutions or facts.""",
    'Hate': """Content that expresses hatred toward groups of people based on protected characteristics
        (race, color, ethnicity, national origin, disability, religious affiliation, caste, sexual orientation, sex, gender identity, and serious disease),
        or content that perpetuates negative stereotypes.""",
    'Indiscriminate Weapons': 'Content that enables the creation of weapons of mass destruction.',
    'Intellectual Property': 'Content that violates the intellectual property rights of third parties.',
    'Non-Violent Crimes': 'Content that enables, encourages, or excuses the commission of non-violent crimes.',
    'Privacy': 'Content that contains sensitive, personal information about private individuals.',
    'Self-Harm': 'Content that enables, encourages, or excuses acts of intentional self-harm.',
    'Sex Crimes': 'Content that enables, encourages, or excuses the commission of sex crimes.',
    'Sexual Content': 'Content that contains erotic descriptions or explicit depictions of sex acts.',
    'Specialized Advice': """Content that contains financial, medical, or legal advice.
        Financial advice includes guidance on investments, stocks, bonds, or any financial planning.""",
    'Violent Crimes': 'Content that enables, encourages, or excuses the commission of violent crimes.',
}

def moderate_message_with_definitions(message, unsafe_category_definitions):
    # Format the unsafe categories string, with each category and its definition on a new line
    unsafe_category_str = '\n'.join(f'{category}: {definition}'
        for category, definition in unsafe_category_definitions.items()
    )

    # Construct the prompt for Claude, including the message and the unsafe categories
    assessment_prompt = f"""Determine whether the following message warrants moderation, based on the unsafe categories outlined below.

Message:
<message>{message}</message>

Unsafe Categories and Their Definitions:
<categories>
{unsafe_category_str}
</categories>

It's important that you remember all unsafe categories and their definitions.

Respond with ONLY a JSON object, using the format below:
{{
  "violation": <Boolean field denoting whether the message should be moderated>,
  "categories": [Comma-separated list of violated categories],
  "explanation": [Optional. Only include if there is a violation.]
}}"""

    # Send the content moderation request to Claude
    response = client.messages.create(
        model="claude-3-haiku-20240307",  # Using the Haiku model for lower costs
        max_tokens=200,
        temperature=0,  # Use 0 temperature for increased consistency
        messages=[
            {"role": "user", "content": assessment_prompt}
        ]
    )

    # Parse the JSON response from Claude
    assessment = json.loads(response.content[0].text)

    # Extract the violation status from the assessment
    contains_violation = assessment['violation']

    # If there's a violation, get the categories and explanation; otherwise, use empty defaults
    violated_categories = assessment.get('categories', []) if contains_violation else []
    explanation = assessment.get('explanation') if contains_violation else None

    return contains_violation, violated_categories, explanation

# Process each comment and print the results
for comment in user_comments:
    print(f"\nComment: {comment}")
    violation, violated_categories, explanation = moderate_message_with_definitions(comment, unsafe_category_definitions)

    if violation:
        print(f"Violated Categories: {', '.join(violated_categories)}")
        print(f"Explanation: {explanation}")
    else:
        print("No issues detected.")
The moderate_message_with_definitions function expands upon the earlier moderate_message function by pairing each unsafe category with a detailed definition. In the code, the unsafe_categories list from the original function is replaced with an unsafe_category_definitions dictionary, which maps each unsafe category to its corresponding definition. Both the category names and their definitions are included in the prompt.
Notably, the definition for the Specialized Advice category now spells out the types of financial advice that should be prohibited. As a result, the comment 'It is a great time to invest in gold!', which previously passed the moderate_message assessment, now triggers a violation.
Consider batch processing
In cases where real-time moderation isn't required, consider moderating messages in batches to reduce costs. Include multiple messages within the prompt's context, and ask Claude to assess which messages should be moderated.
import anthropic
import json

# Initialize the Anthropic client
client = anthropic.Anthropic()

def batch_moderate_messages(messages, unsafe_categories):
    # Convert the list of unsafe categories into a string, with each category on a new line
    unsafe_category_str = '\n'.join(unsafe_categories)

    # Format the messages string, wrapping each message in XML-like tags and giving it an ID
    messages_str = '\n'.join([f'<message id={idx}>{msg}</message>' for idx, msg in enumerate(messages)])

    # Construct the prompt for Claude, including the messages to assess and the unsafe categories
    assessment_prompt = f"""Determine the messages that warrant moderation, based on the unsafe categories outlined below.

Messages:
<messages>
{messages_str}
</messages>

Unsafe categories and their definitions:
<categories>
{unsafe_category_str}
</categories>

Respond with ONLY a JSON object, using the format below:
{{
  "violations": [
    {{
      "id": <message id>,
      "categories": [list of violated categories],
      "explanation": <Explanation of why there's a violation>
    }},
    ...
  ]
}}

Important Notes:
- Remember to analyze every message for a violation.
- Select any violation categories that reasonably apply."""

    # Send the content moderation request to Claude
    response = client.messages.create(
        model="claude-3-haiku-20240307",  # Using the Haiku model for lower costs
        max_tokens=2048,  # Increased maximum token count to handle batches
        temperature=0,  # Use 0 temperature for increased consistency
        messages=[
            {"role": "user", "content": assessment_prompt}
        ]
    )

    # Parse the JSON response from Claude
    assessment = json.loads(response.content[0].text)
    return assessment

# Process the batch of comments and get the response
response_obj = batch_moderate_messages(user_comments, unsafe_categories)

# Print the results for each detected violation
for violation in response_obj['violations']:
    print(f"""Comment: {user_comments[violation['id']]}
Violated Categories: {', '.join(violation['categories'])}
Explanation: {violation['explanation']}
""")
In this example, the batch_moderate_messages function handles the moderation of an entire batch of messages with a single Claude API call. Inside the function, a prompt is created that includes the list of messages to evaluate and the unsafe content categories. The prompt directs Claude to return a JSON object listing all messages that contain violations. Each message in the response is identified by its id, which corresponds to the message's position in the input list. Keep in mind that finding the optimal batch size for your specific needs may require some experimentation. While larger batch sizes can lower costs, they might also lead to a slight decrease in quality. Additionally, you may need to increase the max_tokens parameter in the Claude API call to accommodate longer responses. For details on the maximum number of tokens your chosen model can output, refer to the model comparison page.
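If your comment volume is too large for a single request, one simple approach is to split the messages into fixed-size chunks and call batch_moderate_messages once per chunk. The sketch below illustrates this; the batch size of 10 is only an assumed starting point that you would tune through the experimentation described above.

```python
def moderate_in_batches(messages, unsafe_categories, batch_size=10):
    # Collect violations across all batches, mapping per-batch ids back to the original messages
    all_violations = []
    for start in range(0, len(messages), batch_size):
        batch = messages[start:start + batch_size]
        result = batch_moderate_messages(batch, unsafe_categories)
        for violation in result['violations']:
            all_violations.append({
                'message': batch[violation['id']],
                'categories': violation['categories'],
                'explanation': violation['explanation'],
            })
    return all_violations

for v in moderate_in_batches(user_comments, unsafe_categories):
    print(f"Flagged: {v['message']} ({', '.join(v['categories'])})")
```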