General Introduction
SQLite-Utils-Ask is a powerful tool designed to help users perform question-and-answer data queries on SQLite databases and CSV/JSON files with the aid of LLM (Large Language Model). The tool is capable of automatically generating appropriate SQL queries based on the user's questions and executing the queries to return results, greatly simplifying the process of data analysis and processing.
Function List
- natural language questioning: Ask questions about the data in the database through natural language, and the tool will automatically generate the corresponding SQL queries.
- Database compatibility: Supports SQLite database for easy data management and querying.
- Handling CSV/JSON files: Supports direct querying of CSV, TSV or JSON files.
- Multi-document search: Supports federated queries for multiple documents.
- command-line tool: Provides an easy-to-use command line interface that allows users to quickly execute queries.
- Plug-in Support: Can be integrated with tools such as sqlite-utils to extend functionality and application scenarios.
Asking questions about SQLite databases and CSV/JSON files in Terminal
I'm working on a CLI tool for my sqlite-utilsBuilt a new plugin that lets you ask human language questions directly to SQLite databases and CSV/JSON files on your computer.
Its name is sqlite-utils-ask. install it as follows:
sqlite-utils installation sqlite-utils-ask
It gets the API key from the environment variableOPENAI_API_KEY
, or you can install LLM and use thellm keys set openaiStore the key in a configuration file.
Then you can use it like this:
curl -O https://datasette.io/content.db
sqlite-utils asked content.db " What is the number of sqlite-utils pypi downloads in 2024?"
This command will extract the SQL schema of the supplied database file, send it through LLM with your question, return the SQL query and try to run it to produce results.
If all goes well, it will give the following answer:
SELECT SUM(downloads)
FROM stats
WHERE package = 'sqlite-utils' AND date >= '2024-01-01' AND date < '2025-01-01'.
[
{
"SUM(downloads)": 4300221
}
]
If the SQL query execution fails (due to some syntax error), it passes that error back to the model for correction and retries up to three times before giving up.
increase-v/--verbose
to see the exact tips for its use:
System prompt.
You will be given a SQLite schema followed by a question. Generate a single SQL query to answer that question.
Generate a single SQL query to answer that question. Return that query in a ```sql ...'' Return that query in a ``sql ...''
Return that query in a ``sql ...'' fenced code block.
Example: How many repos are there?
Answer.
Answer: ```sql
select count(*) from repos
Prompt.
...
CREATE TABLE [stats] (
[package] TEXT,
[date] TEXT,
[downloads] INTEGER,
PRIMARY KEY ([package], [date])
);
...
how many sqlite-utils pypi downloads in 2024?
I have truncated the above to include only the relevant tables - it actually contains the full schema for each table in that database.
By default, the tool will only send that database schema and your question to LLM, and if you add that `-e/--examples` option, it will also include five public values for each text column in that schema, with an average length of less than 32 characters. This can sometimes help get better results, for example, sending the values "CA" and "FL" and "TX" for the `state` column can suggest that the model should use the state abbreviations instead of full names.
#### Problems asking for CSV and JSON data
The core `sqlite-utils` CLI normally runs directly against SQLite files, but three years ago I added the ability to use the [sqlite-utils memory](https://simonwillison.net/2021/Jun/19/sqlite-utils-memory/) command to directly three years ago I added the ability to run SQL queries directly against CSV and JSON files using the [sqlite-utils memory]() command. This works by loading data into an in-memory SQLite database before executing the SQL query.
I decided to reuse this mechanism to enable LLM hints directly against CSV and JSON data.
The `sqlite-utils ask-files` command looks like this:
```shell
sqlite-utils ask-files transaction.csv "Total sales by year"
This command accepts one or more files, which you can supply in a mix of CSV, TSV and JSON formats. Each supplied file will be imported into a different table, allowing the model to construct join queries if necessary.
Description of the realization
The core implementation of the plugin is approximately250 lines of Python codeUsesqlite-utils
register_commands()Plugin hooks to addask
cap (a poem)ask-files
Command.
It adds LLM as a dependency and utilizes LLM'sPython APIto abstract the details of the dialog with the model. This means that thesqlite-utils-ask
Any model supported by LLM or its plugins can be used - if you want to pass the Claude 3.5 Sonnet runs the prompt and you can do the following:
sqlite-utils installs llm-claude-3
sqlite-utils ask content.db "count rows in news table" -m claude-3.5-sonnet
The plugin initially defaults to gpt-4o-mini to take advantage of the model's automatic hint caching: if you run multiple questions against the same schema, you'll end up sending the same long hint prefix multiple times, and OpenAI's hint caching should automatically kick in and provide a 50% discount for those input tokens.
Then I crunched the actual numbers and found thisgpt-4o-mini
It's cheap enough that even without caching 4,000 token hints (which is a pretty big SQL schema), the cost should be less than a tenth of a cent. So these cache savings aren't even worth mentioning!