ConDA: state-based data augmentation for context-dependent text-to-SQL

Wang, Dingzirui; Dou, Longxu; Che, Wanxiang; Wang, Jiaqi; Liu, Jinbo; Li, Lixin; Shang, Jingan; Tao, Lei; Zhang, Jie; Fu, Cong; Song, Xuri

doi:10.1007/s13042-023-02086-z


Original Article
Published: 17 February 2024

Volume 15, pages 3157–3168, (2024)

International Journal of Machine Learning and Cybernetics

Dingzirui Wang¹,
Longxu Dou¹,
Wanxiang Che¹ (ORCID: orcid.org/0000-0002-3907-0335),
Jiaqi Wang²,
Jinbo Liu³,
Lixin Li²,
Jingan Shang⁴,
Lei Tao²,
Jie Zhang⁴,
Cong Fu² &
Xuri Song²


Abstract

The context-dependent text-to-SQL task has profound real-world implications: it lets users extract knowledge from vast databases interactively, which improves accuracy. Unfortunately, current models struggle with this task because high annotation overhead makes training data scarce. The most straightforward remedy is data augmentation, which aims to scale up the parsing corpus. However, naive augmentation methods produce data with low diversity. To address this limitation, we propose state-based CONtext-dependent text-to-SQL Data Augmentation (ConDA), which generates and filters augmented data based on the dialogue state and thereby yields more diverse data. Experimental results show that ConDA improves performance on all experimental datasets, with an average gain of $1.6\%$, demonstrating the effectiveness of our method.
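The generate-then-filter idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify how a dialogue state is represented or how filtering is done, so the slot-set representation (`State`), the helper names `candidate_states` and `filter_novel`, and the example slots are all assumptions made for illustration only.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple

# Assumption: a dialogue state is modelled as a set of (clause, value) slots.
# The paper's actual state definition is not given in the abstract.
Slot = Tuple[str, str]
State = FrozenSet[Slot]


def candidate_states(state: State, edits: Iterable[Slot]) -> List[State]:
    """Generate step (hypothetical): extend the state by one new slot."""
    return [state | {edit} for edit in edits if edit not in state]


def filter_novel(candidates: Iterable[State], seen: Set[State]) -> List[State]:
    """Filter step (hypothetical): keep only states not already in the corpus."""
    known = set(seen)  # copy so the caller's set is not mutated
    kept: List[State] = []
    for cand in candidates:
        if cand not in known:
            known.add(cand)
            kept.append(cand)
    return kept


# Toy data, invented for this example.
current: State = frozenset({("SELECT", "name"), ("FROM", "singer")})
edits = [("WHERE", "age > 30"), ("ORDER BY", "age"), ("SELECT", "name")]
seen = {current | {("ORDER BY", "age")}}

augmented = filter_novel(candidate_states(current, edits), seen)
```

Here the `ORDER BY` candidate is dropped because its state already occurs in `seen`, and the `SELECT` edit is skipped because it would not change the state, so only the `WHERE` extension survives. Filtering on states rather than on surface text is one plausible way to favour diversity, in the spirit of the approach the abstract describes.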


Data availability

All used experimental datasets are publicly available.


Acknowledgements

This work is supported by the Science and Technology Program of State Grid Corporation of China under Grant No 5108-202212052A-1-1-ZN.

Author information

Authors and Affiliations

Harbin Institute of Technology, Harbin, 150001, Heilongjiang, China
Dingzirui Wang, Longxu Dou & Wanxiang Che
China Electric Power Research Institute, Beijing, 100192, Beijing, China
Jiaqi Wang, Lixin Li, Lei Tao, Cong Fu & Xuri Song
State Grid Corporation of China, Beijing, 100031, Beijing, China
Jinbo Liu
State Grid Tianjin Electric Power Company, Tianjin, 300010, Tianjin, China
Jingan Shang & Jie Zhang


Corresponding author

Correspondence to Wanxiang Che.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Cite this article

Wang, D., Dou, L., Che, W. et al. ConDA: state-based data augmentation for context-dependent text-to-SQL. Int. J. Mach. Learn. & Cyber. 15, 3157–3168 (2024). https://doi.org/10.1007/s13042-023-02086-z


Received: 22 May 2023
Accepted: 22 December 2023
Published: 17 February 2024
Issue Date: August 2024
DOI: https://doi.org/10.1007/s13042-023-02086-z
