SeSQL: A High-Quality Large-Scale Session-Level Chinese Text-to-SQL Dataset

  • Conference paper
Natural Language Processing and Chinese Computing (NLPCC 2023)

Abstract

As the first session-level Chinese dataset, CHASE contains two separate parts: 2,003 sessions manually constructed from scratch (CHASE-C) and 3,456 sessions translated from the English SParC dataset (CHASE-T). We find that the two parts are highly discrepant and incompatible. In this work, we present SeSQL, a high-quality large-scale session-level Chinese text-to-SQL dataset consisting of 5,028 sessions, all manually constructed from scratch. In contrast to previous datasets, and in order to guarantee data quality, we adopt an iterative annotation workflow that facilitates intensive and timely review of the previous round's natural language (NL) questions and SQL queries. Moreover, by completing all context-dependent NL questions, we obtain 27,012 context-independent question/SQL pairs, allowing SeSQL to be used as the largest dataset for single-round text-to-SQL parsing. We conduct benchmark session-level text-to-SQL parsing experiments on SeSQL using three competitive session-level parsers and present a detailed analysis.
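The relationship between a session and the single-round pairs derived from it can be illustrated with a minimal sketch. The field names (`question`, `completed_question`, `sql`, `db_id`) and the toy English session are assumptions made for illustration only; they are not the actual SeSQL file format or data.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Turn:
    question: str            # context-dependent NL question as asked in the session
    completed_question: str  # hypothetical field: the question rewritten to stand alone
    sql: str                 # SQL query annotated for this turn


@dataclass
class Session:
    db_id: str               # hypothetical identifier of the target database
    turns: List[Turn]


def to_single_round(sessions: List[Session]) -> List[Tuple[str, str, str]]:
    """Flatten sessions into (db_id, completed question, SQL) triples."""
    return [
        (s.db_id, t.completed_question, t.sql)
        for s in sessions
        for t in s.turns
    ]


# Toy English session invented for illustration; not taken from SeSQL.
example = Session(
    db_id="school",
    turns=[
        Turn(
            "Which students are older than 20?",
            "Which students are older than 20?",
            "SELECT name FROM student WHERE age > 20",
        ),
        Turn(
            "How many of them are there?",
            "How many students are older than 20?",
            "SELECT COUNT(*) FROM student WHERE age > 20",
        ),
    ],
)

pairs = to_single_round([example])
```

In this sketch the second turn depends on the first ("them"), while its completed form can be parsed without any session history, which is what makes the flattened pairs usable for single-round parsing.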

S. Huang and L. Wang contributed equally.

Notes

  1. Please note that we ask annotators not to introduce identifying information and to anonymize any existing identifying information.

  2. The average salary is about 20 RMB for a part-time KFC employee in our city.

  3. The values of “Both” for other datasets are inferred from their reported results of “Core.” and “Elli.”.

References

  1. Bertomeu, N., Uszkoreit, H., Frank, A., Krieger, H.U., Jörg, B.: Contextual phenomena and thematic relations in database QA dialogues: results from a Wizard-of-Oz experiment. In: Proceedings of HLT-NAACL, pp. 1–8 (2006)

  2. Cai, Y., Wan, X.: IGSQL: database schema interaction graph based neural model for context-dependent text-to-SQL generation. In: Proceedings of EMNLP, pp. 6903–6912 (2020)

  3. Cao, R., Chen, L., Chen, Z., Zhao, Y., Zhu, S., Yu, K.: LGESQL: line graph enhanced text-to-SQL model with mixed local and non-local relations. In: Proceedings of ACL, pp. 2541–2555 (2021)

  4. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL-HLT, pp. 4171–4186 (2019)

  5. Geva, M., Goldberg, Y., Berant, J.: Are we modeling the task or the annotator? An investigation of annotator bias in natural language understanding datasets. In: Proceedings of EMNLP-IJCNLP, pp. 1161–1166 (2019)

  6. Guo, J., et al.: Chase: a large-scale and pragmatic Chinese dataset for cross-database context-dependent text-to-SQL. In: Proceedings of ACL, pp. 2316–2331 (2021)

  7. Hui, B., et al.: Dynamic hybrid relation exploration network for cross-domain context-dependent semantic parsing. In: Proceedings of AAAI, pp. 13116–13124 (2021)

  8. Scholak, T., Li, R., Bahdanau, D., de Vries, H., Pal, C.: DuoRAT: towards simpler text-to-SQL models. In: Proceedings of NAACL-HLT, pp. 1313–1321 (2021)

  9. Tang, L.R., Mooney, R.J.: Using multiple clause constructors in inductive logic programming for semantic parsing. In: Proceedings of ECML, pp. 466–477 (2001)

  10. Wang, B., Shin, R., Liu, X., Polozov, O., Richardson, M.: RAT-SQL: relation-aware schema encoding and linking for text-to-SQL parsers. In: Proceedings of ACL, pp. 7567–7578 (2020)

  11. Wang, L., et al.: DuSQL: a large-scale and pragmatic Chinese text-to-SQL dataset. In: Proceedings of EMNLP, pp. 6923–6935 (2020)

  12. Yu, T., et al.: CoSQL: a conversational text-to-SQL challenge towards cross-domain natural language interfaces to databases. In: Proceedings of EMNLP-IJCNLP, pp. 1962–1979 (2019)

  13. Yu, T., Zhang, R., Polozov, A., Meek, C., Awadallah, A.H.: SCoRe: pre-training for context representation in conversational semantic parsing. In: Proceedings of ICLR (2020)

  14. Yu, T., et al.: Spider: a large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. In: Proceedings of EMNLP, pp. 3911–3921 (2018)

  15. Yu, T., et al.: SParC: cross-domain semantic parsing in context. In: Proceedings of ACL, pp. 4511–4523 (2019)

  16. Zhang, R., et al.: Editing-based SQL query generation for cross-domain context-dependent questions. In: Proceedings of EMNLP-IJCNLP, pp. 5338–5349 (2019)

  17. Zheng, Y., Wang, H., Dong, B., Wang, X., Li, C.: HIE-SQL: history information enhanced network for context-dependent text-to-SQL semantic parsing. In: Proceedings of ACL, pp. 2997–3007 (2022)

  18. Zhong, V., Xiong, C., Socher, R.: Seq2SQL: generating structured queries from natural language using reinforcement learning. arXiv:1709.00103 (2017)

Acknowledgement

We thank all anonymous reviewers for their valuable comments, and all annotators for their great effort in data annotation and review. This work was supported by the National Natural Science Foundation of China (Grant No. 62176173) and a Project Funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions.

Corresponding author

Correspondence to Zhenghua Li.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Huang, S. et al. (2023). SeSQL: A High-Quality Large-Scale Session-Level Chinese Text-to-SQL Dataset. In: Liu, F., Duan, N., Xu, Q., Hong, Y. (eds) Natural Language Processing and Chinese Computing. NLPCC 2023. Lecture Notes in Computer Science, vol 14302. Springer, Cham. https://doi.org/10.1007/978-3-031-44693-1_42

  • DOI: https://doi.org/10.1007/978-3-031-44693-1_42

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-44692-4

  • Online ISBN: 978-3-031-44693-1

  • eBook Packages: Computer Science, Computer Science (R0)
