OmniSQL: A Model for Transforming Natural Language into High-Quality SQL Queries



General Introduction

OmniSQL is an open-source project developed by the RUCKBReasoning team and hosted on GitHub. Its core function is to transform natural language questions into high-quality SQL queries so that users can interact with databases more easily. Built on an automated text-to-SQL data generation framework, the project released the SynSQL-2.5M dataset with 2.5 million samples, described as currently the largest cross-domain synthetic text-to-SQL dataset. OmniSQL is available in three model sizes, 7B, 14B, and 32B, to suit different needs, and supports data analysis, database management, and model research. The project is released under the Apache 2.0 license; users can download it for free and contribute improvements.

Function List

Turn natural language into SQL: users enter questions and the model generates accurate SQL queries.
Complex query support: generates SQL ranging from simple single-table queries to multi-table joins.
Dataset generation: SynSQL-2.5M is provided, containing 2.5 million high-quality samples.
Multi-scale modeling: Provides models with three parameter scales: 7B, 14B, and 32B.
Open source and free: the code and dataset are freely available on GitHub.

Usage Guide

OmniSQL is a code-based tool for users with some programming knowledge. Below is a detailed installation and usage guide to help you get started quickly.

Installation process

Preparing the environment
Make sure that Python 3.8 or later is installed on your computer. Open the command line and run python --version to check. If Python is not installed, you can download it from the Python website.
Download Project
Visit https://github.com/RUCKBReasoning/OmniSQL, click the "Code" button, and select "Download ZIP" to download the project archive. Unzip it to get the project folder. Alternatively, clone the repository with Git:

git clone https://github.com/RUCKBReasoning/OmniSQL.git

Installation of dependencies
Go to the project directory and run the following from the command line:

pip install -r requirements.txt

This installs the Python libraries the project needs. For model inference, you also need to install either vLLM or Transformers with one of the following commands:

pip install vllm

or

pip install transformers torch

Download models and datasets
OmniSQL offers three models and the SynSQL-2.5M dataset, which can be downloaded from the following links:

SynSQL-2.5M: HuggingFace
OmniSQL-7B: HuggingFace
OmniSQL-14B: HuggingFace
OmniSQL-32B: HuggingFace
After downloading, place the files in the project directory.

Running Projects
Go to the project directory and run python omnisql.py to check that the environment is working. A model must be loaded for actual use; see below.

Main Functions

1. Convert natural language to SQL

The core functionality of OmniSQL is to transform natural language questions into SQL queries. Using vLLM as an example, run the following code:

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer
# Define the prompt template
prompt = '''Task Overview:
You are a data science expert. Below, you are provided with a database schema and a natural language question. Your task is to generate a valid SQL query.
Database Engine: SQLite
Database Schema:
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
Question:
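Find the names of people in the users table who are older than 30.
Instructions:
- Output only the information the question asks for.
- Think step by step, then generate the SQL.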
Output Format:

-- Your SQL query

'''
# Load the model
model_path = "seeklhy/OmniSQL-7B"  # Replace with your model path
tokenizer = AutoTokenizer.from_pretrained(model_path)
llm = LLM(model=model_path, dtype="float16")
# Generate SQL
sampling_params = SamplingParams(temperature=0, max_tokens=2048)
chat_prompt = tokenizer.apply_chat_template([{"role": "user", "content": prompt}], add_generation_prompt=True, tokenize=False)
outputs = llm.generate([chat_prompt], sampling_params)
print(outputs[0].outputs[0].text)

The output may be:

SELECT name FROM users WHERE age > 30;
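
Before running a generated query against real data, it can help to check it against a throwaway database first. The following is a minimal sketch, assuming the example schema above and Python's built-in sqlite3 module. The helper name run_generated_sql and the sample rows are hypothetical and not part of OmniSQL.

```python
import sqlite3

# Schema from the example prompt above.
SCHEMA = "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);"

def run_generated_sql(schema, sql, rows=()):
    """Run sql against an in-memory SQLite database built from schema.

    rows are (name, age) pairs inserted into the users table first.
    Raises sqlite3.Error if the schema or the query is invalid.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema)
        conn.executemany("INSERT INTO users (name, age) VALUES (?, ?)", rows)
        return conn.execute(sql).fetchall()
    finally:
        conn.close()

# Made-up rows, used only to see what the query returns.
sample_rows = [("Alice", 42), ("Bob", 25), ("Carol", 31)]
generated_sql = "SELECT name FROM users WHERE age > 30;"
result = run_generated_sql(SCHEMA, generated_sql, sample_rows)
print(result)
```

A query that raises sqlite3.Error here is malformed or refers to tables or columns missing from the schema, which makes this a cheap filter before touching a production database.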

2. Using the SynSQL-2.5M dataset

The dataset contains 2.5 million samples, each including a database schema, a question, an SQL query, and a chain-of-thought solution. Once downloaded, it can be used directly for training or research. To view samples:

Unzip the dataset file.
Open any JSON file; each sample has the format {"db": ..., "question": ..., "sql": ..., "cot": ...}.
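
Samples in this format can also be inspected programmatically. The sketch below assumes a file holds a JSON array of objects with the four keys listed above; the actual layout of the downloaded files may differ, and the inline sample is made up for illustration. The helper name load_samples is hypothetical.

```python
import json

# A made-up sample in the format described above; real files are far larger.
raw = '''[
  {"db": "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);",
   "question": "Find the names of users older than 30.",
   "sql": "SELECT name FROM users WHERE age > 30;",
   "cot": "Filter users by age, then select the name column."}
]'''

REQUIRED_KEYS = {"db", "question", "sql", "cot"}

def load_samples(text):
    """Parse a JSON array of samples and keep those with all expected keys."""
    samples = json.loads(text)
    return [s for s in samples if REQUIRED_KEYS.issubset(s)]

samples = load_samples(raw)
for s in samples:
    print(s["question"], "->", s["sql"])
```

To read from disk instead, pass the file contents to load_samples, for example via open(path, encoding="utf-8").read(), where path points to one of the unzipped JSON files.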

3. Training and evaluation

The project provides training scripts in the train_and_evaluate folder. An example run:

python train.py --model OmniSQL-7B --data SynSQL-2.5M

The evaluation scripts are in the same folder and can be used to reproduce the official results.

Tips for use

Database support: Currently, only SQLite is supported; for other databases, the Data Generation Framework can be used to synthesize new data.
Hardware requirements: The 7B model requires about 14 GB of GPU memory, and the 32B model requires a higher configuration.
Examples: The project's examples folder provides sample prompt templates.
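
The 14 GB figure for the 7B model matches a common rule of thumb: weights stored in float16 take 2 bytes per parameter. The sketch below applies that estimate to the three nominal model sizes. It counts weights only and ignores the KV cache, activations, and framework overhead, so real requirements will be higher. The function name weight_memory_gb is hypothetical.

```python
# Bytes needed to store one parameter in each numeric format.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}

def weight_memory_gb(params_billion, dtype="float16"):
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * BYTES_PER_PARAM[dtype] / 1e9

# Nominal sizes from the model names; exact parameter counts may differ.
for size in (7, 14, 32):
    print(f"OmniSQL-{size}B: ~{weight_memory_gb(size):.0f} GB in float16")
```

By this estimate the 32B model needs roughly 64 GB for its weights in float16, which typically means multiple GPUs or a quantized variant.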

With these steps, you can quickly generate SQL with OmniSQL or investigate text-to-SQL techniques.

Application Scenarios

Data analysis
Data analysts enter a question, such as "Find the top 10 selling items", and OmniSQL generates the corresponding SQL, saving time.
Model research
Researchers train new models with SynSQL-2.5M to improve text-to-SQL capability.
Education
Students learn about database operations by entering questions and observing the generated SQL.

QA

What databases does OmniSQL support?
Currently only SQLite is supported. Support for other databases can be added in the future by synthesizing new data with the data generation framework.
How big is the dataset?
SynSQL-2.5M contains 2.5 million samples covering 16,000 databases.
How strong are the models?
On benchmarks such as Spider and BIRD, OmniSQL is reported to outperform models such as GPT-4o.