SeSQL: A High-Quality Large-Scale Session-Level Chinese Text-to-SQL Dataset

Huang, Saihao; Wang, Lijie; Li, Zhenghua; Liu, Zeyang; Dou, Chenhui; Yan, Fukang; Xiao, Xinyan; Wu, Hua; Zhang, Min

doi:10.1007/978-3-031-44693-1_42

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14302)

Included in the following conference series:

CCF International Conference on Natural Language Processing and Chinese Computing

Abstract

As the first session-level Chinese dataset, CHASE contains two separate parts, i.e., 2,003 sessions manually constructed from scratch (CHASE-C), and 3,456 sessions translated from English SParC (CHASE-T). We find that the two parts are highly discrepant and incompatible. In this work, we present SeSQL, a high-quality large-scale session-level Chinese text-to-SQL dataset, consisting of 5,028 sessions all manually constructed from scratch. Unlike previous datasets, to guarantee data quality, we adopt an iterative annotation workflow that facilitates intensive and timely review of previous-round natural language (NL) questions and SQL queries. Moreover, by completing all context-dependent NL questions, we obtain 27,012 context-independent question/SQL pairs, allowing SeSQL to be used as the largest dataset for single-round text-to-SQL parsing. We conduct benchmark session-level text-to-SQL parsing experiments on SeSQL using three competitive session-level parsers, and present a detailed analysis.
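The relationship between session-level data and the derived single-round pairs can be illustrated with a minimal sketch. The record layout below (`Turn`, `Session`, `completed_question`, `db_id`) and the toy English session are assumptions made purely for illustration; they are not the actual SeSQL release format or data.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Turn:
    question: str            # context-dependent question as asked in the session
    completed_question: str  # the same question rewritten to stand alone
    sql: str                 # SQL query annotated for this turn


@dataclass
class Session:
    db_id: str               # hypothetical identifier of the target database
    turns: List[Turn]


def to_single_round_pairs(sessions: List[Session]) -> List[Tuple[str, str, str]]:
    """Flatten sessions into (db_id, completed_question, sql) triples."""
    return [
        (s.db_id, t.completed_question, t.sql)
        for s in sessions
        for t in s.turns
    ]


# Invented toy session; not taken from the dataset.
demo = Session(
    db_id="toy_school",
    turns=[
        Turn("How many students are there?",
             "How many students are there?",
             "SELECT COUNT(*) FROM student"),
        Turn("And teachers?",
             "How many teachers are there?",
             "SELECT COUNT(*) FROM teacher"),
    ],
)

pairs = to_single_round_pairs([demo])
```

Under this assumed layout, a session-level parser would consume `question` together with the preceding turns, while a single-round parser would consume only the flattened `completed_question`/`sql` pairs.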

S. Huang and L. Wang—Equal contribution.

Notes

1. Please note that we ask annotators not to introduce identification information and ask them to anonymize the existing identification information.
2. The average salary is about 20 RMB for a part-time KFC employee in our city.
3. The values of “Both” for other datasets are inferred from their reported results of “Core.” and “Elli.”.

References

Bertomeu, N., Uszkoreit, H., Frank, A., Krieger, H.U., Jörg, B.: Contextual phenomena and thematic relations in database QA dialogues: results from a Wizard-of-Oz experiment. In: Proceedings of HLT-NAACL, pp. 1–8 (2006)
Cai, Y., Wan, X.: IGSQL: database schema interaction graph based neural model for context-dependent text-to-SQL generation. In: Proceedings of EMNLP, pp. 6903–6912 (2020)
Cao, R., Chen, L., Chen, Z., Zhao, Y., Zhu, S., Yu, K.: LGESQL: line graph enhanced text-to-SQL model with mixed local and non-local relations. In: Proceedings of ACL, pp. 2541–2555 (2021)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL-HLT, pp. 4171–4186 (2019)
Geva, M., Goldberg, Y., Berant, J.: Are we modeling the task or the annotator? An investigation of annotator bias in natural language understanding datasets. In: Proceedings of EMNLP-IJCNLP, pp. 1161–1166 (2019)
Guo, J., et al.: Chase: a large-scale and pragmatic Chinese dataset for cross-database context-dependent text-to-SQL. In: Proceedings of ACL, pp. 2316–2331 (2021)
Hui, B., et al.: Dynamic hybrid relation exploration network for cross-domain context-dependent semantic parsing. In: Proceedings of AAAI, pp. 13116–13124 (2021)
Scholak, T., Li, R., Bahdanau, D., de Vries, H., Pal, C.: DuoRAT: towards simpler text-to-SQL models. In: Proceedings of NAACL-HLT, pp. 1313–1321 (2021)
Tang, L.R., Mooney, R.J.: Using multiple clause constructors in inductive logic programming for semantic parsing. In: Proceedings of ECML, pp. 466–477 (2001)
Wang, B., Shin, R., Liu, X., Polozov, O., Richardson, M.: RAT-SQL: relation-aware schema encoding and linking for text-to-SQL parsers. In: Proceedings of ACL, pp. 7567–7578 (2020)
Wang, L., et al.: DuSQL: a large-scale and pragmatic Chinese text-to-SQL dataset. In: Proceedings of EMNLP, pp. 6923–6935 (2020)
Yu, T., et al.: CoSQL: a conversational text-to-SQL challenge towards cross-domain natural language interfaces to databases. In: Proceedings of EMNLP-IJCNLP, pp. 1962–1979 (2019)
Yu, T., Zhang, R., Polozov, A., Meek, C., Awadallah, A.H.: SCoRe: pre-training for context representation in conversational semantic parsing. In: Proceedings of ICLR (2020)
Yu, T., et al.: Spider: a large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. In: Proceedings of EMNLP, pp. 3911–3921 (2018)
Yu, T., et al.: SParC: cross-domain semantic parsing in context. In: Proceedings of ACL, pp. 4511–4523 (2019)
Zhang, R., et al.: Editing-based SQL query generation for cross-domain context-dependent questions. In: Proceedings of EMNLP-IJCNLP, pp. 5338–5349 (2019)
Zheng, Y., Wang, H., Dong, B., Wang, X., Li, C.: HIE-SQL: history information enhanced network for context-dependent text-to-SQL semantic parsing. In: Proceedings of ACL, pp. 2997–3007 (2022)
Zhong, V., Xiong, C., Socher, R.: Seq2SQL: generating structured queries from natural language using reinforcement learning. arXiv:1709.00103 (2017)

Acknowledgement

We thank all anonymous reviewers for their valuable comments, and all annotators for their great effort in data annotation and review. This work was supported by the National Natural Science Foundation of China (Grant No. 62176173) and the Project Funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions.

Author information

Authors and Affiliations

School of Computer Science and Technology, Soochow University, Suzhou, China
Saihao Huang, Zhenghua Li, Zeyang Liu, Chenhui Dou, Fukang Yan & Min Zhang
Baidu Inc., Beijing, China
Lijie Wang, Xinyan Xiao & Hua Wu

Corresponding author

Correspondence to Zhenghua Li.

Editor information

Editors and Affiliations

Emory University, Atlanta, GA, USA
Fei Liu
Microsoft Research Asia, Beijing, China
Nan Duan
Soochow University, Suzhou, China
Qingting Xu
Soochow University, Suzhou, China
Yu Hong

About this paper

Cite this paper

Huang, S. et al. (2023). SeSQL: A High-Quality Large-Scale Session-Level Chinese Text-to-SQL Dataset. In: Liu, F., Duan, N., Xu, Q., Hong, Y. (eds) Natural Language Processing and Chinese Computing. NLPCC 2023. Lecture Notes in Computer Science, vol 14302. Springer, Cham. https://doi.org/10.1007/978-3-031-44693-1_42

DOI: https://doi.org/10.1007/978-3-031-44693-1_42
Published: 08 October 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-44692-4
Online ISBN: 978-3-031-44693-1
eBook Packages: Computer Science, Computer Science (R0)
