Abstract
The context-dependent text-to-SQL task has profound real-world implications: it helps users extract knowledge from vast databases and acquire information interactively for better accuracy. Unfortunately, current models struggle with this task because the high annotation overhead makes training data scarce. The most straightforward remedy is data augmentation, which aims to scale up the parsing corpus. However, naive augmentation methods suffer from low diversity in the augmented data. To address this limitation, we propose state-based CONtext-dependent text-to-SQL Data Augmentation (ConDA), which generates and filters augmented data based on the dialogue state and thereby achieves higher diversity. Experimental results show that ConDA improves performance on all experimental datasets, with an average gain of \(1.6\%\), demonstrating the effectiveness of our method.
Data availability
All datasets used in the experiments are publicly available.
Acknowledgements
This work is supported by the Science and Technology Program of State Grid Corporation of China under Grant No 5108-202212052A-1-1-ZN.
About this article
Cite this article
Wang, D., Dou, L., Che, W. et al. ConDA: state-based data augmentation for context-dependent text-to-SQL. Int. J. Mach. Learn. & Cyber. 15, 3157–3168 (2024). https://doi.org/10.1007/s13042-023-02086-z
DOI: https://doi.org/10.1007/s13042-023-02086-z