ConDA: state-based data augmentation for context-dependent text-to-SQL

  • Original Article
International Journal of Machine Learning and Cybernetics

Abstract

The context-dependent text-to-SQL task has significant real-world value: it lets users query large databases interactively, refining their requests over multiple turns to retrieve more accurate information. However, current models struggle with this task because training data is scarce, a consequence of the high annotation cost. The most straightforward remedy is data augmentation, which scales up the parsing corpus, but naive augmentation methods produce data with low diversity. To address this limitation, we propose state-based CONtext-dependent text-to-SQL Data Augmentation (ConDA), which generates and filters augmented data based on the dialogue state and thereby yields more diverse data. Experimental results show that ConDA improves performance on all experimental datasets, with an average gain of \(1.6\%\), demonstrating the effectiveness of our method.

Data availability

All used experimental datasets are publicly available.

Acknowledgements

This work is supported by the Science and Technology Program of State Grid Corporation of China under Grant No. 5108-202212052A-1-1-ZN.

Author information

Corresponding author

Correspondence to Wanxiang Che.

About this article

Cite this article

Wang, D., Dou, L., Che, W. et al. ConDA: state-based data augmentation for context-dependent text-to-SQL. Int. J. Mach. Learn. & Cyber. 15, 3157–3168 (2024). https://doi.org/10.1007/s13042-023-02086-z

  • DOI: https://doi.org/10.1007/s13042-023-02086-z
