Converting Speech to Multi-Speaker Subtitles for Free with Gemini 2.0

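The transcribed subtitles can carry speaker labels and second-level timestamps, and the model accurately picks out laughter and ringing bells and correctly identifies songs. Because of the output token limit, the longest audio it can transcribe is about 15 minutes. You can try this quickly in Google AI Studio; at the end of this post I include code that uses the free Gemini 2.0 model and Colab to transcribe speech into subtitle text at no cost.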

Prompt

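Generate a transcript of this audio. Include timestamps and identify speakers.
Speakers are:
- 小美
Example:
[00:00] Brady: Hello there.
[00:02] Tim: Hi Brady.
It is important to use the correct speaker names. Use the names you identified earlier. If you really cannot determine a speaker's name, use a letter instead, for example one unknown speaker can be labeled 'A' and another unknown speaker 'B'.
If there is music or a short jingle playing, mark it like this:
[01:02] [MUSIC] or [01:02] [JINGLE]
If you can identify the name of the music or jingle, use that name instead, for example:
[01:02] [Firework by Katy Perry] or [01:02] [The Sofa Shop jingle]
If some other sound is playing, try to identify it, for example:
[01:02] [Bell ringing]
Each caption should be as short as possible, a few short sentences at most.
Mark the end of the episode with [END].
Do not use any Markdown formatting, such as bold or italics.
Only use letters of the English alphabet, unless you are confident that characters from another language are needed.
Make sure you use the correct words and spell them accurately. Use the context of the podcast to help.
If the hosts discuss a movie, book, or celebrity, make sure the name of the movie, book, or celebrity is spelled correctly.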
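
The line format this prompt asks for is regular enough to parse mechanically. The following is a minimal sketch, assuming the model follows the [MM:SS] Speaker: text layout exactly. The names Caption and parse_transcript are hypothetical helpers introduced here for illustration and are not part of any library.

```python
import re
from typing import NamedTuple, Optional

# Matches "[MM:SS] rest of line", the timestamp layout the prompt requests.
LINE_RE = re.compile(r"^\[(\d{1,3}):(\d{2})\]\s*(.*)$")


class Caption(NamedTuple):
    seconds: int            # offset from the start of the audio
    speaker: Optional[str]  # None for sound events such as [MUSIC]
    text: str


def parse_transcript(transcript: str) -> list[Caption]:
    """Parse transcript text into captions, stopping at the [END] marker."""
    captions = []
    for raw in transcript.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "[END]":
            break
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip lines that do not follow the requested format
        minutes, secs, rest = match.groups()
        if rest == "[END]":
            break
        seconds = int(minutes) * 60 + int(secs)
        if rest.startswith("["):
            # Sound events look like "[MUSIC]" or "[Bell ringing]".
            captions.append(Caption(seconds, None, rest))
            continue
        # Assumption: the first colon separates the speaker from the text.
        speaker, sep, text = rest.partition(":")
        if sep:
            captions.append(Caption(seconds, speaker.strip(), text.strip()))
        else:
            captions.append(Caption(seconds, None, rest))
    return captions
```

Splitting on the first colon is a simplification: a line with no speaker label that happens to contain a colon would be misread, so real output may need a check against the known speaker list.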

Colab code

%pip install google-genai jinja2

import os
from google import genai

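# Create the client; the key is read from the GEMINI_API_KEY environment
# variable, with "xxx" as a placeholder to replace with your own key
api_key = os.getenv("GEMINI_API_KEY", "xxx")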
client = genai.Client(api_key=api_key)

from jinja2 import Template


# Path to the file to upload
file_path = "../assests/porsche.mp3"  # Replace with your own file path

# Upload the file to the File API
file = client.files.upload(file=file_path)

# Prompt template for the Gemini API; the speaker list is filled in by Jinja2
prompt_template = Template("""Generate a transcript of the episode. Include timestamps and identify speakers.

Speakers are: 
{% for speaker in speakers %}- {{ speaker }}{% if not loop.last %}\n{% endif %}{% endfor %}

eg:
[00:00] Brady: Hello there.
[00:02] Tim: Hi Brady.

It is important to include the correct speaker names. Use the names you identified earlier. If you really don't know the speaker's name, identify them with a letter of the alphabet, eg there may be an unknown speaker 'A' and another unknown speaker 'B'.

If there is music or a short jingle playing, signify like so:
[01:02] [MUSIC] or [01:02] [JINGLE]

If you can identify the name of the music or jingle playing then use that instead, eg:
[01:02] [Firework by Katy Perry] or [01:02] [The Sofa Shop jingle]

If there is some other sound playing try to identify the sound, eg:
[01:02] [Bell ringing]

Each individual caption should be quite short, a few short sentences at most.

Signify the end of the episode with [END].

Don't use any markdown formatting, like bolding or italics.

Only use characters from the English alphabet, unless you genuinely believe foreign characters are correct.

It is important that you use the correct words and spell everything correctly. Use the context of the podcast to help.
If the hosts discuss something like a movie, book or celebrity, make sure the movie, book, or celebrity name is spelled correctly.""")

# Define the speakers and render the prompt
speakers = ["John"]
prompt = prompt_template.render(speakers=speakers)

# Send the prompt together with the uploaded audio file
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=[prompt, file],
)

print(response.text)
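
The printed transcript is plain text, so turning it into a subtitle file takes one more step. Below is a rough sketch that converts [MM:SS] lines into SubRip (SRT) cues. It assumes each caption lasts until the next one starts, and that the last caption lasts last_duration seconds; both are assumptions made here, since the transcript carries only start times. The function names srt_time and transcript_to_srt are hypothetical.

```python
import re

# Matches "[MM:SS] caption text" as requested by the prompt.
LINE_RE = re.compile(r"^\[(\d{1,3}):(\d{2})\]\s*(.+)$")


def srt_time(total_seconds: int) -> str:
    """Format whole seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    hours, rem = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},000"


def transcript_to_srt(transcript: str, last_duration: int = 3) -> str:
    """Convert "[MM:SS] text" lines into SRT cues, stopping at [END]."""
    cues = []
    for raw in transcript.splitlines():
        line = raw.strip()
        if line == "[END]":
            break
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore lines without a leading timestamp
        minutes, secs, text = match.groups()
        if text == "[END]":
            break
        cues.append((int(minutes) * 60 + int(secs), text))

    blocks = []
    for index, (start, text) in enumerate(cues):
        if index + 1 < len(cues):
            end = cues[index + 1][0]  # assumed: cue ends when the next begins
        else:
            end = start + last_duration
        # Give cues that share a timestamp at least one second on screen.
        end = max(end, start + 1)
        blocks.append(
            f"{index + 1}\n{srt_time(start)} --> {srt_time(end)}\n{text}\n"
        )
    return "\n".join(blocks)
```

In the notebook above, passing response.text to transcript_to_srt and writing the result to a .srt file would give a file most video players can load. Speaker names stay inside the cue text because SRT has no separate speaker field.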