Repost: Extracting JSON data from a 35-second screen recording using Google Gemini's multimodal capabilities

AI hands-on tutorials | Published 1 year ago | AI Sharing Circle


The other day, I found myself needing to add up some values scattered across twelve different emails.

I didn't want to copy and paste all the numbers one by one, so I decided to try something different: could I record my screen while browsing my Gmail account and then use Google Gemini to extract the numbers from that video?

As it turns out, this approach worked really well.

AI Studio and QuickTime

I used QuickTime Player on my Mac to record the video: File -> New Screen Recording. I dragged a box around the part of the screen showing my Gmail account, then clicked through each email in turn, pausing on each one for a few seconds.

Then I uploaded the recorded file directly to Google's AI Studio tool and entered the following prompt:

Convert this into a JSON array where each item contains a date in yyyy-mm-dd format and a floating-point amount for that date

And... it worked. It produced a JSON array that looks like this:

[
  {
    "date": "2023-01-01",
    "amount": 2...
  },
  ...
]


I wanted to paste it into Numbers, so I proceeded to type:

Convert that into copy-pasteable CSV

It gave me the same data in CSV format.
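The same JSON-to-CSV step, plus the addition I originally wanted, can also be done locally. The following is a minimal Python sketch, assuming the model's output has the `date` and `amount` keys shown above. The sample values and the function names `json_to_csv` and `total_amount` are hypothetical placeholders, not the real figures from my emails.

```python
import csv
import io
import json

# Hypothetical sample in the same shape as the model's output.
# These values are placeholders, not the real figures.
sample_json = """
[
  {"date": "2023-01-01", "amount": 10.5},
  {"date": "2023-02-01", "amount": 20.25}
]
"""


def json_to_csv(json_text: str) -> str:
    """Convert a JSON array of {date, amount} objects into CSV text."""
    rows = json.loads(json_text)
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["date", "amount"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def total_amount(json_text: str) -> float:
    """Add up the amount field across every item in the array."""
    return sum(item["amount"] for item in json.loads(json_text))


csv_text = json_to_csv(sample_json)
total = total_amount(sample_json)
print(csv_text)
print(total)
```

The CSV text can be pasted into Numbers or any other spreadsheet, and the total gives a quick cross-check against whatever the spreadsheet computes.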

You should never fully trust these tools to not make mistakes, so I rewatched this 35 second video and manually checked all the numbers. It was all correct.
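A structural check can catch some classes of mistake before the manual review, although it cannot tell you whether a number matches what was on screen. Here is a rough sketch, assuming the output format requested in the prompt above. The function names `is_valid_date` and `find_problems` are made up for illustration.

```python
import json
import re
from datetime import datetime

# Strict yyyy-mm-dd shape; strptime alone would also accept "2023-1-1".
DATE_PATTERN = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def is_valid_date(value) -> bool:
    """True if value is a yyyy-mm-dd string naming a real calendar date."""
    if not isinstance(value, str) or not DATE_PATTERN.fullmatch(value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True


def find_problems(json_text: str) -> list[str]:
    """Return human-readable problems found in the model's JSON output."""
    items = json.loads(json_text)
    if not isinstance(items, list):
        return ["top-level value is not a JSON array"]
    problems = []
    for index, item in enumerate(items):
        if not isinstance(item, dict):
            problems.append(f"item {index}: not an object")
            continue
        if not is_valid_date(item.get("date")):
            problems.append(f"item {index}: date is not a valid yyyy-mm-dd date")
        amount = item.get("amount")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(amount, bool) or not isinstance(amount, (int, float)):
            problems.append(f"item {index}: amount is not a number")
    return problems


print(find_problems('[{"date": "2023-02-30", "amount": "12.50"}]'))
```

An empty list only means the output is well-formed. Checking the values against the video is still a manual job.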

Originally I was going to use Gemini 1.5 Pro, which is Google's best model... but it turns out I forgot to select a model, and I actually used the much cheaper Gemini 1.5 Flash 002 for the whole process.

How much did it cost?

According to AI Studio, I used 11,018 tokens, of which 10,326 were for the video.

Gemini 1.5 Flash charges $0.075 per million tokens (the price was reduced in August).

11018/1000000 = 0.011018
0.011018 * $0.075 = $0.00082635

Therefore, this entire process should cost less than 1/10th of a cent!
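The arithmetic above can be reproduced in a few lines of Python. This sketch uses only the two figures quoted in this post, and it follows the calculation above in applying the single $0.075 rate to every token. The function name `cost_in_dollars` is just an illustrative choice.

```python
TOKENS_USED = 11_018           # total tokens reported by AI Studio
PRICE_PER_MILLION_USD = 0.075  # Gemini 1.5 Flash price quoted above


def cost_in_dollars(tokens: int, price_per_million: float) -> float:
    """Cost in US dollars for a token count at a per-million-token price."""
    return tokens / 1_000_000 * price_per_million


dollars = cost_in_dollars(TOKENS_USED, PRICE_PER_MILLION_USD)
cents = dollars * 100  # convert dollars to cents

print(f"${dollars:.8f} is {cents:.6f} cents")
print("Under a tenth of a cent:", cents < 0.1)
```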

In fact, it was free. Google AI Studio is currently "still free" in all supported regions, even if you have billing set up. I believe that means they can train on your data, though, which is something their paid APIs don't do.

The other alternatives aren't really that good

Let's look at other alternatives here.

I can click on the emails one by one and copy the data manually. This is error-prone and quite boring. Processing 12 emails is fine, but 100 would be a real pain.
Programmatically access my Gmail data. Every year this becomes harder and harder. It's still possible to access it via IMAP, as long as you set up a dedicated application password, but that still requires a lot of work for an ad hoc capture task. The official API is not pleasant to use either.
Use some sort of browser automation tool (like Playwright or similar) to automatically click through my Gmail account. Even with large language models to help write the code, this still requires more work, and it doesn't solve the problem of email formatting differences - I'd still have to solve the email parsing step separately.
Use some sort of more advanced existing AI tool to access my email. Another Google product (also called Gemini) can do this if you grant it access, but so far I've not been particularly happy with the results. AI tools are inherently unpredictable. I'm also reluctant to give any tool full access to my email account because of risks such as prompt injection.

Video capture technology is very powerful

The great thing about this video capture technique is that it works with _anything_ you can see on your screen... and you have complete control over what you expose to the AI model.

There is no website authentication or anti-scraping technology that prevents me from recording screen video while clicking through web applications.

The results I get depend entirely on how carefully I plan the screen capture area and the clicking action.

There's absolutely no setup cost for this process - just log into the site, hit record, browse at your leisure, and drop the video into Gemini.

The cost was so low that I had to recalculate three times to make sure I hadn't miscalculated.

I expect I will be using this technique more in the future. It also has applications in the field of data journalism, where there is often a need to scrape data from sources that don't want to be scraped.

Plus: a price calculator for large language models

While writing up this experiment, I got tired of calculating token prices by hand. I usually outsource this to ChatGPT Code Interpreter, but I've found that it sometimes makes mistakes when converting from dollars to cents, so I always have to double-check its results.

So I had Claude 3.5 Sonnet build this price calculator tool for me using Claude Artifacts (the source code is here):

You can manually set the input/output token prices, or click on the preset buttons to automatically populate the prices of different existing models (as of October 16, 2024 - I don't promise to keep them up to date in the future!)
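The core calculation behind a tool like this is small. The sketch below is not the code Claude wrote. It is a hypothetical Python equivalent that takes separate input and output prices and uses `Decimal`, so that the dollars-to-cents conversion is exact. The function name `price_breakdown` and its parameters are assumptions made for illustration.

```python
from decimal import Decimal


def price_breakdown(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: str,
    output_price_per_million: str,
) -> dict:
    """Return the cost of one request in dollars and in cents.

    Prices are US dollars per million tokens, passed as strings so that
    Decimal stores them exactly instead of as binary floats.
    """
    million = Decimal(1_000_000)
    dollars = (
        Decimal(input_tokens) * Decimal(input_price_per_million)
        + Decimal(output_tokens) * Decimal(output_price_per_million)
    ) / million
    return {"dollars": dollars, "cents": dollars * 100}


# Assumption: price all 11,018 tokens at the $0.075 rate quoted in this
# post, matching the hand calculation, with no separate output pricing.
result = price_breakdown(11_018, 0, "0.075", "0")
print(result["dollars"], "dollars")
print(result["cents"], "cents")
```

Returning both units from one function keeps the conversion in a single place, and the conversion is exactly the step I've seen go wrong.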

The entire calculator was written by Claude. Here is the full transcript of the conversation. We iterated through 10 different versions over 19 minutes.

Instead of looking up all the prices myself, I took a screenshot of each model provider's pricing page and pasted them directly into the conversation with Claude: