Improving the Accuracy of Text-to-SQL Tools Based on Large Language Models for Real-World Relational Databases

  • Conference paper
Database and Expert Systems Applications (DEXA 2024)

Abstract

Real-world relational databases (RW-RDBs) have large, complex schemas, often expressed in terms alien to end users. This scenario is challenging for LLM-based text-to-SQL tools, that is, tools that translate Natural Language (NL) sentences into SQL queries using a Large Language Model (LLM). Indeed, their accuracy on RW-RDBs is considerably lower than that reported for well-known synthetic benchmarks. This paper therefore introduces a technique to improve the accuracy of LLM-based text-to-SQL tools on RW-RDBs using Retrieval-Augmented Generation. The technique consists of two steps. Using the RW-RDB schema, the first step generates a synthetic dataset \(E\) of pairs \((Q_N,Q_S)\), where \(Q_N\) is an NL sentence and \(Q_S\) is the corresponding SQL translation. The core contribution of the paper is an algorithm that implements this first step. Given an input NL sentence \(Q_I\), the second step retrieves pairs \((Q_N,Q_S)\) from \(E\) based on the similarity of \(Q_I\) and \(Q_N\), and adds these pairs to the LLM prompt to improve accuracy. To argue in favor of the proposed technique, the paper includes experiments with an RW-RDB that is in production at an energy company and with a well-known text-to-SQL prompt strategy. It repeats the experiments with Mondial, an openly available database with a large schema. These experiments constitute a second contribution of the paper.
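
The second step described in the abstract (retrieve pairs from the dataset by similarity to the input sentence, then place them in the prompt) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the abstract does not say which similarity measure, prompt layout, or example pairs the authors use, so the sketch substitutes a simple bag-of-words cosine similarity, a hypothetical prompt template, and invented Mondial-like queries. The function names `retrieve` and `build_prompt` are likewise hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical synthetic dataset E of (Q_N, Q_S) pairs. The table and column
# names are invented for illustration and are not taken from the paper.
E = [
    ("Which countries have more than 50 million inhabitants?",
     "SELECT name FROM country WHERE population > 50000000"),
    ("List the cities located in Brazil.",
     "SELECT c.name FROM city c JOIN country k ON c.country = k.code "
     "WHERE k.name = 'Brazil'"),
    ("What is the longest river?",
     "SELECT name FROM river ORDER BY length DESC LIMIT 1"),
]


def bag_of_words(text):
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(q_i, dataset, k=2):
    """Return the k pairs whose NL sentence Q_N is most similar to Q_I.

    Assumption: bag-of-words cosine stands in for whatever similarity
    measure the paper actually uses.
    """
    q_vec = bag_of_words(q_i)
    ranked = sorted(
        dataset,
        key=lambda pair: cosine(q_vec, bag_of_words(pair[0])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(q_i, examples):
    """Assemble a few-shot prompt from retrieved pairs (hypothetical layout)."""
    parts = ["Translate the question into SQL."]
    for q_n, q_s in examples:
        parts.append(f"Question: {q_n}\nSQL: {q_s}")
    parts.append(f"Question: {q_i}\nSQL:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    question = "Which countries have more than 10 million inhabitants?"
    print(build_prompt(question, retrieve(question, E, k=2)))
```

A production system would more plausibly compare dense sentence embeddings than token counts, but the control flow (rank the pairs in \(E\) against \(Q_I\), keep the top k, prepend them to the prompt) stays the same.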

Notes

  1. https://yale-lily.github.io/spider

  2. https://bird-bench.github.io

  3. https://python.langchain.com

  4. https://www.dbis.informatik.uni-goettingen.de/Mondial/

  5. Available on request.

Acknowledgements

This work was partly funded by FAPERJ under grant E-26/202.818/2017; by CAPES under grants 88881.310592-2018/01, 88881.134081/2016-01, and 88882.164913/2010-01; by CNPq under grant 302303/2017-0; and by Petrobras.

Author information

Corresponding author

Correspondence to Marco A. Casanova.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Coelho, G.M.C. et al. (2024). Improving the Accuracy of Text-to-SQL Tools Based on Large Language Models for Real-World Relational Databases. In: Strauss, C., Amagasa, T., Manco, G., Kotsis, G., Tjoa, A.M., Khalil, I. (eds) Database and Expert Systems Applications. DEXA 2024. Lecture Notes in Computer Science, vol 14910. Springer, Cham. https://doi.org/10.1007/978-3-031-68309-1_8

  • DOI: https://doi.org/10.1007/978-3-031-68309-1_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-68308-4

  • Online ISBN: 978-3-031-68309-1

  • eBook Packages: Computer Science, Computer Science (R0)
