General Introduction
OpenSearch-SQL is an open source project that is a powerful Text-to-SQL tool that transforms users' natural language descriptions into SQL query statements, helping people unfamiliar with databases to access data easily. The project is developed by the OpenSearch-AI team and is free and open based on the Apache 2.0 license.In August 2024, it won the first place in the BIRD benchmarks, with an accuracy of 69.3% for the validation set and 72.28% for the test set.OpenSearch-SQL can run without additional training and supports models such as GPT OpenSearch-SQL runs without additional training and supports models such as GPT, DeepSeek, etc. It is suitable for data analysis and database querying.
Function List
- Convert natural language questions to SQL queries, such as "What is the tallest building" to
SELECT building_name FROM buildings ORDER BY height DESC LIMIT 1
The - CoT (Chain of Thought) enhancement methodology that supports self-learning to improve query generation accuracy.
- Provides a SQL-Like intermediate language to optimize complex SQL generation.
- Includes input-output alignment features to reduce errors in model generation (the illusion problem).
- It supports five modules: preprocessing, extraction, generation, optimization and alignment, covering the whole query process.
- Open source and free of charge, users can modify the code as needed or integrate it into their own projects.
Using Help
OpenSearch-SQL is a command line tool that needs to be installed and configured to work. Below are detailed steps to help you get started from scratch.
Installation process
- Preparing the Python Environment
Make sure you have Python 3.8 or above installed on your computer. You can do this in the terminal by typingpython --version
Check. If not, download and install it from the official Python website (https://www.python.org/). - Download project files
Open your browser and go to https://github.com/OpenSearch-AI/OpenSearch-SQL. Click the "Code" button in the upper right corner and select "Download ZIP " and select "Download ZIP" to download the source code, or use the Git command to clone it:
git clone https://github.com/OpenSearch-AI/OpenSearch-SQL.git
Download and unzip it locally, for example C:\OpenSearch-SQL
maybe /home/user/OpenSearch-SQL
The
- Installation of dependencies
Go to the project folder, open a terminal and run the following command to install the required libraries:
pip install -r requirements.txt
This will install the Python packages needed to run the project, such as libraries for processing data and calling models.
Data preprocessing
OpenSearch-SQL requires few-shot examples to improve query accuracy. Data can be generated using the official script provided.
- Prepare data
The project provides a sample filebird_dev.json
Located inBird/bird_dev.json
It is based on the DAIL-SQL methodology. It is generated based on the DAIL-SQL methodology and contains few-shot query examples. If you have your own data, you can replace this file. - Running preprocessing scripts
Run it in the project root directory:
sh run/run_preprocess.sh
This script processes the few-shot data, table structures, and other information. When it's done, the terminal displays the output for each directory. If you're a Windows user, you can run it with Git Bash or WSL, or execute the commands in the script manually.
Run the main program
- triggering program
Run it in the project root directory:
sh run/run_main.sh
This will call the src/runner/database_manager.py
file to start processing the query. The program path is set in the script.
- Check Output
After the main program runs, it generates SQL query results according to the configuration. The output file path is in thesrc/runner/database_manager.py
(used form a nominal expression)_set_paths
defined in the function, which can be adjusted as needed.
Featured Function Operation
- Natural Language to SQL
Enter a question like "Which city has the highest sales?" The program will generate it:
SELECT city FROM sales ORDER BY amount DESC LIMIT 1
You can find more information on the questions.json
Add your own questions to the file and run it to see the results.
- CoT Enhancement
The project provides examples in Query-CoT-SQL format. For example: - User input: "What is the average age?"
- CoT process: first find the age column, then calculate the mean.
- Output:
SELECT AVG(age) FROM users
The - alignment function
If the results are generated incorrectly, the program will automatically correct them. For example, if you enter "List all student grades" but the database has multiple tables, it will ensure that the JOIN operation is correct:
SELECT s.name, sc.score FROM students s JOIN scores sc ON s.id = sc.student_id
caveat
- If you need to test the BIRD dataset, you can do so directly with the
Bird/fewshot/questions.json
Documentation. - The program supports multiple models and may require an API key for the default configuration. If using GPT or DeepSeek, the key needs to be set in the code.
With these steps, you can easily turn natural language into SQL queries to handle a variety of data analysis tasks.
application scenario
- data analysis
Data analysts can use it to turn questions into SQL and quickly count sales data or user behavior. - Education and training
Students can use it to learn SQL by entering questions and comparing the query statements generated. - Automated Reports
Organizations can use it to automatically generate report SQL and reduce manual writing time.
QA
- Does OpenSearch-SQL require an internet connection?
If using a local model, no networking is required. However, networking and API keys are required when using an online model such as GPT. - What databases are supported?
It targets any database that supports SQL, as long as the table structure and data are provided. - BIRD What does it mean to be number one on the list?
It shows that it has the highest accuracy in Text-to-SQL tasks and can handle complex queries, with an August 2024 score of 72.28% for the test set.