Website • Paper • Data Update Log • Submission Guidance
- 2025-01-10: Please refer to the data update log to track changes to the evaluation examples. The leaderboard results will change accordingly.
- 2025-01-07: Please note that we do not recommend using the released Spider 2.0 gold SQL for SFT, as it may compromise the fairness of evaluation and hinder proper benchmarking of a model's SQL capabilities. The gold SQL is released to help users design prompts.
- 2024-12-26: Use Spider-Agent to benchmark your LLMs! Given the widespread attention to the traditional text-to-SQL setting, we now recommend pairing spider-agent-lite and spider-agent-snow with spider2-lite and spider2-snow when benchmarking your LLMs. The final output should be CSV files, not SQL (see the sketch after these updates).
- 2024-12-24: Given the many requests for evaluation, we have decided to release all examples and gold answers for self-evaluation; however, only a small amount of gold SQL is available. The leaderboard remains active. To have your method officially validated and your scores uploaded to the leaderboard, please follow the submission guidance.
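As a reference for the expected output format, here is a minimal sketch of saving a query result as a CSV final answer. The database file, table, and output name are illustrative placeholders, not part of the official Spider-Agent contract:

```python
import csv
import sqlite3

# Execute the predicted SQL against a local database (placeholder file/table).
conn = sqlite3.connect("example.sqlite")
cur = conn.execute("SELECT name, total FROM sales ORDER BY total DESC LIMIT 5")

# Submit the result table as a CSV file, not the SQL string itself.
with open("result.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([col[0] for col in cur.description])  # header row
    writer.writerows(cur.fetchall())                      # data rows
conn.close()
```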
| Setting | Task Type | #Examples | Databases | Cost |
|---|---|---|---|---|
| Spider 2.0-Snow | Text-to-SQL task | 547 | Snowflake (547) | No cost! 😊 |
| Spider 2.0-Lite | Text-to-SQL task | 547 | BigQuery (214), Snowflake (198), SQLite (135) | Some cost incurred |
| Spider 2.0 | Code agent task | 632 | BigQuery (214), Snowflake (198), Postgres (10), ClickHouse (7), SQLite (135), DuckDB (DBT) (68) | Some cost incurred |
In 2018, we introduced Spider 1.0, SParC, and CoSQL as part of the Yale Semantic Parsing and Text-to-SQL Challenge Series, attracting over 300 submissions from leading research labs worldwide.
Now, in the era of Large Language Models (LLMs), we present Spider 2.0 to advance code generation, particularly text-to-SQL capabilities.
This new benchmark offers a more realistic and challenging evaluation of LLMs on enterprise-level text-to-SQL workflows, which involve complex data environments (e.g., >3,000 columns), multiple SQL dialects (e.g., BigQuery, Snowflake), and diverse operations (e.g., transformation, analytics).
Notably, as shown below, even the most advanced LLMs, including GPT-4, solve only 6.0% of Spider 2.0 tasks, compared to 86.6% on Spider 1.0 and 57.4% on BIRD, highlighting the significant challenges posed by Spider 2.0.
| Method | Spider 1.0 dev | Spider 1.0 test | BIRD test | Spider 2.0-Lite | Spider 2.0-Snow |
|---|---|---|---|---|---|
| DAIL-SQL + GPT-4 | 82.4 | 86.6 | 57.4 | 5.6 | 2.2 |
| CodeS-15B | 85.4 | - | 59.3 | 0.7 | 0.0 |
The questions/instructions are in spider2-lite.jsonl and spider2-snow.jsonl.
We also release some gold SQL queries to help users design prompts and methods. Note that we do not recommend using the released Spider 2.0 gold SQL for fine-tuning.
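For example, here is a minimal sketch for inspecting the released task files (run from the directory containing the JSONL; the exact fields are whatever your copy of the file contains):

```python
import json

# Each line of the released file is one task encoded as a JSON object.
with open("spider2-lite.jsonl") as f:
    tasks = [json.loads(line) for line in f]

print(f"{len(tasks)} tasks loaded")
print(json.dumps(tasks[0], indent=2))  # inspect the first example's fields
```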
- Step 1: To sign up for a BigQuery account, please follow this guideline and obtain your own credentials.
- Step 2: Follow this guideline and fill out the Spider2 Snowflake Access form; we will then send you an account sign-up email that allows you to access the Snowflake database. (A connectivity sketch for both steps follows the notes below.)
Important Notes:
- If you want to access the FULL dataset of Spider 2.0 or Spider 2.0-Lite, you must complete both Step 1 and Step 2.
- If you only want to access the FULL dataset of Spider 2.0-Snow, you only need to complete Step 2.
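Once the steps above are done, a connectivity check might look like this minimal sketch. The credential path, user, password, and account identifier are placeholders; google-cloud-bigquery and snowflake-connector-python are the standard Python clients, though Spider 2.0 does not mandate a particular one:

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery
import snowflake.connector         # pip install snowflake-connector-python

# Step 1 check: authenticate with the BigQuery service-account JSON you obtained.
bq = bigquery.Client.from_service_account_json("bigquery_credential.json")  # placeholder path
print(list(bq.query("SELECT 1 AS ok").result()))

# Step 2 check: use the account details from the Snowflake sign-up email.
sf = snowflake.connector.connect(
    user="YOUR_USER",          # placeholder
    password="YOUR_PASSWORD",  # placeholder
    account="YOUR_ACCOUNT",    # placeholder, e.g. <org>-<account>
)
print(sf.cursor().execute("SELECT 1").fetchone())
sf.close()
```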
We highly recommend using Spider 2.0-Snow and Spider 2.0-Lite directly for benchmarking and research. To get started, run the Spider-Agent framework!
For more details, please refer to the following links:
We release gold answers for only a subset of the Spider 2.0, Spider 2.0-Lite, and Spider 2.0-Snow examples. To upload your score to the leaderboard, you must follow this submission guidance.
We thank Snowflake for their generous support in hosting the Spider 2.0 Challenge. We also thank Tianbao Xie, Yiheng Xu, Fan Zhou, Yuting Lan, Per Jacobsson, Yiming Huang, Canwen Xu, Zhewei Yao, and Binyuan Hui for their helpful feedback on this work. The website and submission guidelines are greatly inspired by BIRD-SQL, and we thank them for their contributions.
If you find our work helpful, please cite it as follows:
@misc{lei2024spider2,
title={Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows},
author={Fangyu Lei and Jixuan Chen and Yuxiao Ye and Ruisheng Cao and Dongchan Shin and Hongjin Su and Zhaoqing Suo and Hongcheng Gao and Wenjing Hu and Pengcheng Yin and Victor Zhong and Caiming Xiong and Ruoxi Sun and Qian Liu and Sida Wang and Tao Yu},
year={2024},
eprint={2411.07763},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2411.07763},
}