Improving the Accuracy of Text-to-SQL Tools Based on Large Language Models for Real-World Relational Databases

Coelho, Gustavo M. C.; Nascimento, Eduardo R. S.; Izquierdo, Yenier T.; García, Grettel M.; Feijó, Lucas; Lemos, Melissa; Garcia, Robinson L. S.; de Oliveira, Aiko R.; Pinheiro, João P.; Casanova, Marco A.

doi:10.1007/978-3-031-68309-1_8

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14910)

Included in the following conference series:

International Conference on Database and Expert Systems Applications

Abstract

Real-world relational databases (RW-RDBs) have large, complex schemas, often expressed in terms alien to end users. This scenario is challenging for LLM-based text-to-SQL tools, that is, tools that translate Natural Language (NL) sentences into SQL queries using a Large Language Model (LLM). Indeed, their accuracy on RW-RDBs is considerably lower than that reported for well-known synthetic benchmarks. This paper introduces a technique to improve the accuracy of LLM-based text-to-SQL tools on RW-RDBs using Retrieval-Augmented Generation. The technique consists of two steps. Using the RW-RDB schema, the first step generates a synthetic dataset $E$ of pairs $(Q_N,Q_S)$, where $Q_N$ is an NL sentence and $Q_S$ is the corresponding SQL translation. The core contribution of the paper is an algorithm that implements this first step. Given an input NL sentence $Q_I$, the second step retrieves pairs $(Q_N,Q_S)$ from $E$ based on the similarity of $Q_I$ and $Q_N$, and passes such pairs to the LLM in the prompt to improve accuracy. To argue in favor of the proposed technique, the paper includes experiments with an RW-RDB, which is in production at an energy company, and a well-known text-to-SQL prompt strategy. It repeats the experiments with Mondial, an openly available database with a large schema. These experiments constitute a second contribution of the paper.
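The second step described in the abstract (retrieve pairs $(Q_N,Q_S)$ similar to $Q_I$, then place them in the prompt) can be illustrated with a minimal sketch. This is not the authors' implementation: the bag-of-words cosine similarity is a stand-in assumption for whatever similarity measure the paper uses, the toy dataset `E` and its table and column names are hypothetical, and the function names `retrieve` and `build_prompt` are invented for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical toy dataset E of (Q_N, Q_S) pairs; the schema names are
# illustrative and are not taken from the paper.
E = [
    ("What is the population of each country?",
     "SELECT name, population FROM country"),
    ("List the rivers longer than 1000 km.",
     "SELECT name FROM river WHERE length > 1000"),
    ("Which cities are located in Brazil?",
     "SELECT c.name FROM city c JOIN country k ON c.country = k.code "
     "WHERE k.name = 'Brazil'"),
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(q_i, dataset, k=2):
    """Return the k pairs whose Q_N is most similar to the input Q_I.

    Assumption: similarity is bag-of-words cosine; an embedding model
    could be substituted without changing the rest of the sketch.
    """
    q_vec = Counter(tokenize(q_i))
    scored = [(cosine(q_vec, Counter(tokenize(q_n))), q_n, q_s)
              for q_n, q_s in dataset]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(q_n, q_s) for _, q_n, q_s in scored[:k]]


def build_prompt(q_i, examples):
    """Assemble a few-shot prompt from the retrieved pairs and Q_I."""
    parts = ["Translate the question into SQL. Examples:"]
    for q_n, q_s in examples:
        parts.append(f"Question: {q_n}\nSQL: {q_s}")
    parts.append(f"Question: {q_i}\nSQL:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    question = "Which cities are located in France?"
    print(build_prompt(question, retrieve(question, E)))
```

In this sketch the retrieved pairs act as few-shot examples, so the LLM sees SQL written against the actual schema vocabulary before it translates $Q_I$.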

Notes

1. https://yale-lily.github.io/spider.
2. https://bird-bench.github.io.
3. https://python.langchain.com.
4. https://www.dbis.informatik.uni-goettingen.de/Mondial/.
5. Available on request.

References

Affolter, K., Stockinger, K., Bernstein, A.: A comparative survey of recent natural language interfaces for databases. VLDB J. 28 (2019). https://doi.org/10.1007/s00778-019-00567-8
Gan, Y., et al.: Towards robustness of text-to-sql models against synonym substitution. CoRR abs/2106.01065 (2021). https://doi.org/10.48550/arXiv.2106.01065
Gan, Y., Chen, X., Purver, M.: Exploring underexplored limitations of cross-domain text-to-sql generalization. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 8926–8931, January 2021. https://doi.org/10.18653/v1/2021.emnlp-main.702
Gao, Y., et al.: Retrieval-augmented generation for large language models: a survey. arXiv preprint (2024). https://doi.org/10.48550/arXiv.2312.10997
Guo, C., et al.: Retrieval-augmented gpt-3.5-based text-to-sql framework with sample-aware prompting and dynamic revision chain. arXiv preprint (2023). https://doi.org/10.48550/arXiv.2307.05074
Katsogiannis-Meimarakis, G., Koutrika, G.: A survey on deep learning approaches for text-to-SQL. VLDB J. 32(4), 905–936 (2023). https://doi.org/10.1007/s00778-022-00776-8
Kim, H., So, B.H., Han, W.S., Lee, H.: Natural language to SQL: where are we today? Proc. VLDB Endow. 13(10), 1737–1750 (2020). https://doi.org/10.14778/3401960.3401970
Lewis, P., et al.: Retrieval-augmented generation for knowledge-intensive NLP tasks. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (eds.) Advances in Neural Information Processing Systems, vol. 33, pp. 9459–9474. Curran Associates, Inc. (2020). https://api.semanticscholar.org/CorpusID:218869575
Li, J., et al.: Can LLM already serve as a database interface? A big bench for large-scale database grounded text-to-SQLs. arXiv preprint (2023). https://doi.org/10.48550/arXiv.2305.03111
Manning, C.D.: Human language understanding & reasoning. Daedalus 151(2), 127–138 (2022). https://doi.org/10.1162/daed_a_01905
Nascimento, E.R., et al.: My database user is a large language model. In: Proceedings of the 26th International Conference on Enterprise Information Systems, vol. 1, pp. 800–806 (2024). https://doi.org/10.5220/0012697700003690
Nascimento, E.R., et al.: Text-to-SQL meets the real-world. In: Proceedings of the 26th International Conference on Enterprise Information Systems, vol. 1, pp. 61–72 (2024). https://doi.org/10.5220/0012555200003690
Panda, S., Gozluklu, B.: Build a robust text-to-sql solution generating complex queries, self-correcting, and querying diverse data sources. AWS Machine Learning Blog, 28 February 2024
Yu, T., et al.: Spider: a large-scale human-labeled dataset for complex and cross-domain semantic parsing and Text-to-SQL task. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3911–3921, Oct–Nov 2018. https://doi.org/10.18653/v1/D18-1425

Acknowledgements

This work was partly funded by FAPERJ under grant E-26/202.818/2017; by CAPES under grants 88881.310592-2018/01, 88881.134081/2016-01, and 88882.164913/2010-01; by CNPq under grant 302303/2017-0; and by Petrobras.

Author information

Authors and Affiliations

Instituto Tecgraf, PUC-Rio, Rio de Janeiro, RJ, 22451-900, Brazil
Gustavo M. C. Coelho, Eduardo R. S. Nascimento, Yenier T. Izquierdo, Grettel M. García, Lucas Feijó, Melissa Lemos, Aiko R. de Oliveira & Marco A. Casanova
Petrobras, Rio de Janeiro, RJ, 20031-912, Brazil
Robinson L. S. Garcia
Departamento de Informática, PUC-Rio, Rio de Janeiro, RJ, 22451-900, Brazil
Aiko R. de Oliveira, João P. Pinheiro & Marco A. Casanova

Corresponding author

Correspondence to Marco A. Casanova.

Editor information

Editors and Affiliations

University of Vienna, Vienna, Austria
Christine Strauss
University of Tsukuba, Tsukuba, Japan
Toshiyuki Amagasa
National Research Council (CNR), Rende, Italy
Giuseppe Manco
Johannes Kepler University Linz, Linz, Austria
Gabriele Kotsis
Vienna University of Technology, Vienna, Austria
A Min Tjoa
Johannes Kepler University Linz, Linz, Austria
Ismail Khalil

About this paper

Cite this paper

Coelho, G.M.C. et al. (2024). Improving the Accuracy of Text-to-SQL Tools Based on Large Language Models for Real-World Relational Databases. In: Strauss, C., Amagasa, T., Manco, G., Kotsis, G., Tjoa, A.M., Khalil, I. (eds) Database and Expert Systems Applications. DEXA 2024. Lecture Notes in Computer Science, vol 14910. Springer, Cham. https://doi.org/10.1007/978-3-031-68309-1_8

DOI: https://doi.org/10.1007/978-3-031-68309-1_8
Published: 18 August 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-68308-4
Online ISBN: 978-3-031-68309-1
eBook Packages: Computer Science, Computer Science (R0)
