Converting Video and Voice to SRT Subtitles Using the Gemini 2.5 Pro Model

Published 9 months ago by AI Sharing Circle

I previously tried the approach described in "Convert Speech to Multi-Speaker Subtitles with Gemini 2.0 for free!", and the results were noteworthy. With Gemini 2.5 Pro now available, I tried it again.

First, find a reasonably standard SRT subtitle sample to use as a reference benchmark (produced in advance by running speech-to-text with a mainstream model on the market):

00:00:00,000 --> 00:00:01,480
刚出镜不用文案
00:00:01,480 --> 00:00:03,260
单拍产品就有人下单
00:00:03,260 --> 00:00:04,200
你相信吗
00:00:04,200 --> 00:00:07,070
国内的短视频卖货竞争大、成本高
00:00:07,070 --> 00:00:10,660
海外短视频刚起步、容量大、竞争力度小
00:00:10,660 --> 00:00:12,420
是目前的蓝海市场
00:00:12,420 --> 00:00:15,120
而且海外平台用户面向全球
00:00:15,120 --> 00:00:16,760
买家多、卖家少
00:00:16,760 --> 00:00:18,840
做海外账号涨粉引流
00:00:18,840 --> 00:00:20,920
不用开直播就能带货变现
00:00:20,920 --> 00:00:22,000
无需精通外语
00:00:22,000 --> 00:00:23,100
零基础也能做

First, let's try the prompt from "Convert Speech to Multi-Speaker Subtitles with Gemini 2.0 for free!". That article provides a prompt for multiple speakers, which should work just as well for a single speaker.

Output subtitles:

[00:00] 小美: 不用出境,不用文案,單拍產品就有人下單,你相信嗎?
[00:04] 小美: 國內的短視頻賣貨,競爭大,成本高。
[00:07] 小美: 海外短視頻剛起步,流量大,競爭力度小,是目前的藍海市場。
[00:12] 小美: 而且海外平台用戶面向全球,買家多,賣家少。
[00:16] 小美: 做海外賬號長粉引流,不用開直播就能帶貨變現。
[00:20] 小美: 無需精通外語,零基礎也能做。
[00:23] [END]

That prompt is better suited to conversational interview scenarios: it always keeps each speaker's complete long sentences together, so sentence breaking is slightly worse. Below, we design a prompt better suited to generating standard SRT subtitles, one that requires accurate word and sentence breaks that follow the speaker's tone of voice.
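For comparison, the `[MM:SS] speaker: text` output shown above can be converted to SRT blocks mechanically, which makes the limitation visible: timestamps only have one-second resolution and each block carries a whole long sentence. The following is a minimal sketch in Python, assuming each cue ends where the next one starts and that the final `[END]` line only supplies the last end time. The function names are hypothetical and introduced here for illustration.

```python
import re

# "[MM:SS] speaker: text" or "[MM:SS] [END]"; the speaker part is optional.
LINE_RE = re.compile(r"^\[(\d{2}):(\d{2})\]\s*(?:([^:：\[\]]+)[:：]\s*)?(.*)$")


def ms_to_srt(ms: int) -> str:
    """Format milliseconds as an SRT timestamp, HH:MM:SS,mmm."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def speaker_lines_to_srt(text: str) -> str:
    """Turn "[MM:SS] speaker: text" lines into numbered SRT blocks."""
    cues = []
    for raw in text.splitlines():
        match = LINE_RE.match(raw.strip())
        if not match:
            continue  # ignore anything that is not a cue line
        minutes, seconds, _speaker, body = match.groups()
        start_ms = (int(minutes) * 60 + int(seconds)) * 1000
        cues.append((start_ms, body.strip()))

    blocks = []
    # Assumption: a cue ends exactly where the next cue starts, so the last
    # line (the [END] marker in the sample) is used only as an end time.
    for index, ((start, body), (end, _)) in enumerate(zip(cues, cues[1:]), start=1):
        blocks.append(f"{index}\n{ms_to_srt(start)} --> {ms_to_srt(end)}\n{body}\n")
    return "\n".join(blocks)
```

Because the source format has no end times and no sub-second precision, this conversion cannot recover the pauses between sentences, which is one reason to ask the model for SRT directly.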

New Prompt

你是一个专业的字幕转录助手。你的任务是将我提供的文件转录为文本，并将转录结果格式化为符合 EBU-STL 标准的 SRT字幕文件。具体要求如下：

## 每个字幕块必须严格按照以下结构输出：

[行号]
[时间行]
[文字行]
[空行]

**该结构的说明**
- [行号] 是字幕块的序号，从 1 开始递增，例如  1、2  等。
- [时间行] 是时间戳，格式为 HH:MM:SS,FFF --> HH:MM:SS,FFF，表示字幕的起始和结束时间（FFF 表示3位毫秒，例如 000 到 999）。如果你无法精确计算时间，可以根据音频内容合理估算，确保时间间隔逻辑合理。
- [文字行] 是转录的文本内容。
- [空行] 是字幕块之间的分隔，确保每个字幕块后有一个空行。

## 限制
输出时，必须严格遵守上述格式，不要省略任何部分，也不要添加多余的文本或注释。
每块字幕的持续时间尽量控制在 3-15 秒之间，具体根据语速和语义自然分割。


现在，请根据我提供的文件进行转录，并按上述格式输出字幕内容。
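The block structure and the 3 to 15 second duration limit that the prompt describes can be checked mechanically once the model has responded. The following is a minimal sketch in Python under those stated rules. `check_srt` and its parameters are hypothetical names introduced here, not part of any library.

```python
import re

# Time line as the prompt specifies it: HH:MM:SS,FFF --> HH:MM:SS,FFF
TIME_LINE = re.compile(
    r"^(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})$"
)


def to_ms(hours: str, minutes: str, seconds: str, millis: str) -> int:
    """Convert the four captured timestamp fields to milliseconds."""
    return ((int(hours) * 60 + int(minutes)) * 60 + int(seconds)) * 1000 + int(millis)


def check_srt(text: str, min_s: float = 3.0, max_s: float = 15.0) -> list[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    blocks = [b for b in re.split(r"\n\s*\n", text.strip()) if b.strip()]
    for expected, block in enumerate(blocks, start=1):
        lines = block.splitlines()
        if len(lines) < 3:
            problems.append(f"block {expected}: expected index, time line and text")
            continue
        if lines[0].strip() != str(expected):
            problems.append(f"block {expected}: index is {lines[0].strip()!r}")
        match = TIME_LINE.match(lines[1].strip())
        if not match:
            problems.append(f"block {expected}: malformed time line {lines[1].strip()!r}")
            continue
        fields = match.groups()
        start, end = to_ms(*fields[:4]), to_ms(*fields[4:])
        duration = (end - start) / 1000
        if end <= start:
            problems.append(f"block {expected}: end time is not after start time")
        elif not min_s <= duration <= max_s:
            problems.append(
                f"block {expected}: duration {duration:.3f}s outside {min_s}-{max_s}s"
            )
    return problems
```

The prompt only asks the model to keep durations in that range "as far as possible", so duration findings are better treated as warnings than as hard failures.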

Output subtitles:

1
00:00:00,327 --> 00:01,187
不用出镜,不用文案,

2
00:01,747 --> 00:03,717
单拍产品就有人下单,你相信吗?

3
00:04,337 --> 00:06,637
国内的短视频卖货竞争大,成本高。

4
00:07,177 --> 00:08,627
海外短视频刚起步,

5
00:09,277 --> 00:10,597
流量大,竞争力度小,

6
00:11,067 --> 00:12,247
是目前的蓝海市场。

7
00:12,707 --> 00:14,637
而且海外平台用户面向全球,

8
00:15,167 --> 00:16,357
买家多,卖家少。

9
00:17,067 --> 00:18,467
做海外账号涨粉引流,

10
00:19,177 --> 00:20,687
不用开直播就能带货变现。

11
00:21,277 --> 00:22,927
无需精通外语,零基础也能做。
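
In the output above, most time lines omit the hours field (for example `00:01,747 --> 00:03,717`), which strict SRT parsers may reject. The following is a minimal sketch in Python that pads such timestamps to the full `HH:MM:SS,mmm` form. It assumes that a two-field timestamp means `MM:SS,mmm`, and the function names are hypothetical.

```python
import re

# Hours are optional so that "00:01,187" is accepted alongside "00:00:00,327".
STAMP = re.compile(r"(?:(\d{1,2}):)?(\d{1,2}):(\d{2}),(\d{3})")


def normalize_stamp(stamp: str) -> str:
    """Pad a timestamp to HH:MM:SS,mmm, treating a missing hours field as 00."""
    match = STAMP.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"unrecognized timestamp: {stamp!r}")
    hours, minutes, seconds, millis = match.groups()
    return f"{int(hours or 0):02d}:{int(minutes):02d}:{seconds},{millis}"


def normalize_time_lines(srt_text: str) -> str:
    """Rewrite every line containing '-->' with normalized timestamps."""
    fixed = []
    for line in srt_text.splitlines():
        if "-->" in line:
            # Assumption: '-->' appears only on time lines, never in subtitle text.
            start, end = (part.strip() for part in line.split("-->", 1))
            line = f"{normalize_stamp(start)} --> {normalize_stamp(end)}"
        fixed.append(line)
    return "\n".join(fixed)
```

Applied to the first block above, the time line becomes `00:00:00,327 --> 00:00:01,187`; index and text lines are left untouched.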

About the length of video and audio that can be processed

Audio and video of around 15 minutes are processed more stably. In theory, 1 to 2 hours of audio or video, once compressed, fits within the model's input and output length limits, but in practice the results are slightly worse (not tested in depth). Another important issue is that Gemini 2.5 Pro is not well suited to speech-to-text conversion: the longer the audio, the longer the "thinking" time and the less stable the output.
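
One way to act on the 15-minute observation is to transcribe a long recording in chunks and then merge the per-chunk subtitles. This is not something the article tested. The following is a minimal sketch in Python of the bookkeeping only, assuming the audio has already been cut at the computed boundaries with some external tool and that each chunk's SRT starts at zero. All names are hypothetical.

```python
import re

STAMP = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")


def chunk_ranges(total_seconds: float, chunk_seconds: float = 900.0):
    """Split [0, total_seconds) into consecutive (start, end) ranges of at most chunk_seconds."""
    ranges = []
    start = 0.0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        ranges.append((start, end))
        start = end
    return ranges


def shift_stamp(match: re.Match, offset_ms: int) -> str:
    """Add offset_ms to one matched HH:MM:SS,mmm timestamp."""
    hours, minutes, seconds, millis = map(int, match.groups())
    total = ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis + offset_ms
    hours, rem = divmod(total, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def merge_chunks(chunks) -> str:
    """Merge (chunk_start_seconds, srt_text) pairs into one renumbered SRT string."""
    merged = []
    for start_seconds, srt_text in chunks:
        offset_ms = round(start_seconds * 1000)
        for block in re.split(r"\n\s*\n", srt_text.strip()):
            lines = block.splitlines()
            if len(lines) < 3:
                continue  # skip anything that is not index + time line + text
            time_line = STAMP.sub(lambda m: shift_stamp(m, offset_ms), lines[1])
            merged.append((time_line, lines[2:]))
    return "\n".join(
        f"{index}\n{time_line}\n" + "\n".join(body) + "\n"
        for index, (time_line, body) in enumerate(merged, start=1)
    )
```

Cutting at fixed boundaries can split a sentence in two, so in practice the cut points would likely need to fall in pauses.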