LLM-Based Text-to-SQL for Real-World Databases

Nascimento, Eduardo R.; García, Grettel; Izquierdo, Yenier T.; Feijó, Lucas; Coelho, Gustavo M. C.; de Oliveira, Aiko R.; Lemos, Melissa; Garcia, Robinson L. S.; Leme, Luiz A. P. Paes; Casanova, Marco A.

doi:10.1007/s42979-025-03662-6

Original Research
Published: 31 January 2025

Volume 6, article number 130 (2025)

SN Computer Science

Eduardo R. Nascimento¹,
Grettel García¹,
Yenier T. Izquierdo¹,
Lucas Feijó¹,
Gustavo M. C. Coelho¹,
Aiko R. de Oliveira³,
Melissa Lemos¹,³,
Robinson L. S. Garcia²,
Luiz A. P. Paes Leme⁴ &
Marco A. Casanova¹,³ (ORCID: orcid.org/0000-0003-0765-9636)

Abstract

Text-to-SQL refers to the task defined as “given a relational database D and a natural language sentence S that describes a question on D, generate an SQL query Q over D that expresses S”. Several LLM-based text-to-SQL tools, that is, text-to-SQL tools that exploit Large Language Models (LLMs), have emerged and outperformed previous approaches on well-known benchmarks. This article first shows, however, that the performance of a selected set of LLM-based text-to-SQL tools is significantly lower when they are run on two challenging databases with a large number of tables, columns, and foreign keys. A closer analysis reveals that one of the problems lies in the fact that the relational schema is an inappropriate specification of the database from the point of view of the LLM. The article then introduces database specifications based on LLM-friendly views, which are close to the language of the users’ questions and eliminate frequently used joins, and on LLM-friendly data descriptions of the database values. The article proceeds to show that the use of a set of LLM-friendly views and data samples considerably improves the performance of a text-to-SQL prompt strategy over a real-world database. This result suggests that real-world databases require rethinking how schema specifications should be passed to the LLM to recover state-of-the-art performance.
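
The idea of an LLM-friendly view with data samples can be illustrated with a minimal sketch. It is not the authors' implementation: the toy schema, the view name `city_info`, the helper functions `describe_view` and `build_prompt`, and the prompt layout are all assumptions made for illustration, and no LLM is called. The sketch only shows how a view can hide a frequently used join behind column names close to the users' vocabulary, and how sample values could be attached to the specification passed to a model.

```python
import sqlite3

# Hypothetical toy database; every table, column, and view name is illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE country (code TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE city (
        id INTEGER PRIMARY KEY,
        name TEXT,
        country_code TEXT REFERENCES country(code),
        population INTEGER
    );
    INSERT INTO country VALUES ('BR', 'Brazil'), ('PT', 'Portugal');
    INSERT INTO city VALUES
        (1, 'Rio de Janeiro', 'BR', 6700000),
        (2, 'Niteroi', 'BR', 480000),
        (3, 'Lisbon', 'PT', 545000);

    -- The view pre-computes a frequently used join and renames columns
    -- with terms closer to the wording of user questions.
    CREATE VIEW city_info AS
        SELECT ci.name AS city_name,
               co.name AS country_name,
               ci.population AS city_population
        FROM city ci JOIN country co ON ci.country_code = co.code;
""")


def describe_view(conn, view, n_samples=2):
    """Return a text fragment listing the view's columns with a few sample values."""
    # PRAGMA table_info also works on views; field 1 of each row is the column name.
    columns = [row[1] for row in conn.execute(f"PRAGMA table_info({view})")]
    lines = [f"View {view}:"]
    for col in columns:
        rows = conn.execute(
            f"SELECT DISTINCT {col} FROM {view} ORDER BY {col} LIMIT {n_samples}"
        )
        samples = ", ".join(repr(r[0]) for r in rows)
        lines.append(f"  - {col} (e.g. {samples})")
    return "\n".join(lines)


def build_prompt(conn, views, question):
    """Assemble a prompt from view descriptions only (the base tables are hidden)."""
    spec = "\n".join(describe_view(conn, v) for v in views)
    return f"{spec}\n\nQuestion: {question}\nSQL:"


question = "Which is the most populous city in Brazil?"
prompt = build_prompt(conn, ["city_info"], question)
print(prompt)

# A query of the kind a model could write from the view alone: no join is needed.
candidate_sql = (
    "SELECT city_name FROM city_info "
    "WHERE country_name = 'Brazil' "
    "ORDER BY city_population DESC LIMIT 1"
)
answer = conn.execute(candidate_sql).fetchone()[0]
print(answer)
```

In this sketch the model would never see the foreign key between `city` and `country`; the join is resolved once, in the view definition, rather than in every generated query.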

Data availability

Data is partially available as indicated in “Results for the Mondial Benchmark”.

Acknowledgements

This work was partly funded by FAPERJ under grant E-26-204.322/2024; by CAPES under grants 88881.310592-2018/01, 88881.134081/2016-01, and 88882.164913/2010-01; by CNPq under grant 302303/2017-0; and by Petrobras.

Author information

Authors and Affiliations

Instituto Tecgraf, PUC-Rio, Rio de Janeiro, 22451-900, RJ, Brazil
Eduardo R. Nascimento, Grettel García, Yenier T. Izquierdo, Lucas Feijó, Gustavo M. C. Coelho, Melissa Lemos & Marco A. Casanova
Petrobras, Rio de Janeiro, 20031-912, RJ, Brazil
Robinson L. S. Garcia
Depto. Informática, PUC-Rio, Rio de Janeiro, 22451-900, RJ, Brazil
Aiko R. de Oliveira, Melissa Lemos & Marco A. Casanova
Instituto de Computação, UFF, Niterói, 24210-310, RJ, Brazil
Luiz A. P. Paes Leme

Contributions

All authors contributed equally and approved the final version of the manuscript.

Corresponding author

Correspondence to Marco A. Casanova.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Research involving humans and/or animals

Not applicable.

Informed consent

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Nascimento, E.R., García, G., Izquierdo, Y.T. et al. LLM-Based Text-to-SQL for Real-World Databases. SN COMPUT. SCI. 6, 130 (2025). https://doi.org/10.1007/s42979-025-03662-6

Received: 05 October 2024
Accepted: 25 December 2024
Published: 31 January 2025
DOI: https://doi.org/10.1007/s42979-025-03662-6

Part of a collection:

Advanced Research on Enterprise Information Systems (ICEIS 2024)