Abstract
Recent research has highlighted the potential of domain adaptation for improving the performance of generic Large Language Models (LLMs) [1, 5, 19, 20]. In this study, we explore the benefits of LLMs and of their adaptation to language- and domain-specific data for the public sector. Our focus is on investigating the impact of domain adaptation for a specific use case in the European public service: the clustering of pledges on the Transition Pathway for Tourism (see Note 1). First, a limited corpus of official documents and legislation on the Transition Pathway for Tourism was collected. Then, relying on existing approaches for domain adaptation of large language models, this corpus was used to adapt two pre-trained language models (BERT and RoBERTa) to the domain of interest. Finally, an innovative approach based on Azure OpenAI GPT-4 as a human emulator was used to evaluate the impact of domain adaptation. The results of our study reveal a nuanced impact of domain adaptation. While the domain-adapted LLMs did not generate quantitatively more coherent clusters than their pre-trained counterparts, they had a positive impact on model accuracy (on the order of 5%) when the qualitative aspect of the clusters’ content is considered. This suggests that domain adaptation can enhance the interpretability and usability of the clusters, even when working with a small dataset. However, it is worth noting that, in terms of interpretability and usability, a simpler model such as Word2Vec outperformed the LLMs even after domain adaptation.
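As a concrete illustration of the adaptation step, the sketch below shows continued masked-language-model pretraining of RoBERTa on a domain corpus with Hugging Face Transformers, in the spirit of Gururangan et al. [13]. The corpus file name, model checkpoint and hyperparameters are illustrative assumptions, not the authors' exact configuration.

```python
# A minimal sketch of domain-adaptive pretraining (continued MLM training).
# Corpus path, checkpoint and hyperparameters are assumptions for illustration.
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)
from datasets import load_dataset

model_name = "roberta-base"                  # the paper also adapts a BERT model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Hypothetical file of tourism-domain passages, one per line.
corpus = load_dataset("text", data_files={"train": "tourism_corpus.txt"})
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

# Standard 15% masking, as in the original BERT/RoBERTa pretraining objective.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="roberta-tourism",
                           num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=tokenized["train"],
    data_collator=collator)
trainer.train()                              # continued pretraining on the domain corpus
model.save_pretrained("roberta-tourism")
```

The adapted checkpoint can then be used like any pre-trained encoder to embed pledges before clustering.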
Notes
1. The Transition Pathway for Tourism is an initiative from the European Commission defining key actions, targets and conditions to achieve the green and digital transitions of the sector [9].
References
Alsentzer, E., et al.: Publicly available clinical BERT embeddings (2019)
Amigó, E., Gonzalo, J., Artiles, J., Verdejo, F.: A comparison of extrinsic clustering evaluation metrics based on formal constraints. Inf. Retr. Boston. 12(4), 461–486 (2009)
Araci, D.: FinBERT: financial sentiment analysis with pre-trained language models (2019)
Arefeva, V., Egger, R.: When BERT started traveling: TourBERT - a natural language processing model for the travel industry. Digital 2(4), 546–559 (2022). https://doi.org/10.3390/digital2040030, https://www.mdpi.com/2673-6470/2/4/30
Beltagy, I., Lo, K., Cohan, A.: SciBERT: a pretrained language model for scientific text (2019)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Burstein, J., Doran, C., Solorio, T. (eds.) Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Association for Computational Linguistics, Minneapolis, Minnesota (2019). https://doi.org/10.18653/v1/N19-1423, https://aclanthology.org/N19-1423
Dudek, A.: Silhouette index as clustering evaluation tool. In: Jajuga, K., Batóg, J., Walesiak, M. (eds.) SKAD 2019. SCDAKO, pp. 19–33. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-52348-0_2
Edwards, A., Camacho-Collados, J., De Ribaupierre, H., Preece, A.: Go simple and pre-train on domain-specific corpora: on the role of training data for text classification. In: Scott, D., Bel, N., Zong, C. (eds.) Proceedings of the 28th International Conference on Computational Linguistics, pp. 5522–5529. International Committee on Computational Linguistics, Barcelona, Spain (Online) (2020). https://doi.org/10.18653/v1/2020.coling-main.481, https://aclanthology.org/2020.coling-main.481
European Commission: First transition pathway co-created with industry and civil society for a resilient, green and digital tourism ecosystem. https://ec.europa.eu/commission/presscorner/detail/en/ip_22_850
European Commission: Shaping Europe’s digital future. https://digital-strategy.ec.europa.eu/en/activities/digital-programme
European Commission: Communication from the Commission: Artificial intelligence for Europe, COM(2018) 237 final (2018). https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=COM:2018:237:FIN#document1
European Commission: Transition pathway for tourism (2022). https://op.europa.eu/s/y7Ht
Gururangan, S., et al.: Don’t stop pretraining: adapt language models to domains and tasks. In: Jurafsky, D., Chai, J., Schluter, N., Tetreault, J. (eds.) Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 8342–8360. Association for Computational Linguistics (2020). https://doi.org/10.18653/v1/2020.acl-main.740, https://aclanthology.org/2020.acl-main.740
Hadi, M.U., et al.: Large language models: a comprehensive survey of its applications, challenges, limitations, and future prospects. https://api.semanticscholar.org/CorpusID:266378240
Howard, J., Ruder, S.: Universal language model fine-tuning for text classification. In: Gurevych, I., Miyao, Y. (eds.) Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 328–339. Association for Computational Linguistics, Melbourne, Australia (2018). https://doi.org/10.18653/v1/P18-1031, https://aclanthology.org/P18-1031
Huang, K., Altosaar, J., Ranganath, R.: ClinicalBERT: modeling clinical notes and predicting hospital readmission (2020)
Kumar, P.S., Reddy, P.V.: Document clustering using RoBERTa and convolution neural network model. Int. J. Intell. Syst. Appl. Eng. 12(8s), 221–230 (2023). https://www.ijisae.org/index.php/IJISAE/article/view/4112
Tangi, L., Combetto, M., Martin, B.J., Rodriguez, M.P.: Artificial intelligence for interoperability in the European public sector (KJ-NA-31-675-EN-N (online)) (2023). https://doi.org/10.2760/633646
Lee, J.S., Hsiang, J.: PatentBERT: patent classification with fine-tuning a pre-trained BERT model (2019)
Lee, J., et al.: BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36(4), 1234–1240 (2019). https://doi.org/10.1093/bioinformatics/btz682
Ling, C., et al.: Domain specialization as the key to make large language models disruptive: a comprehensive survey (2023)
Naveed, H., et al.: A comprehensive overview of large language models (2024)
OpenAI: Language models are few-shot learners. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (eds.) Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901. Curran Associates, Inc. (2020). https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
OpenAI: GPT-4 technical report (2023)
Pasta, A., et al.: Clustering users based on hearing aid use: an exploratory analysis of real-world data. Front. Digit. Health 3 (2021). https://doi.org/10.3389/fdgth.2021.725130
Semantic Interoperability Community: Text mining on GROW tourism pledges - documentation (2023). https://github.com/SEMICeu/semic_pledges
Subakti, A., Murfi, H., Hariadi, N.: The performance of BERT as data representation of text clustering. J. Big Data 9(1), 15 (2022). https://doi.org/10.1186/s40537-022-00564-9
Wang, H., Li, J., Wu, H., Hovy, E., Sun, Y.: Pre-trained language models and their applications. Engineering 25, 51–65 (2023). https://doi.org/10.1016/j.eng.2022.04.024, https://www.sciencedirect.com/science/article/pii/S2095809922006324
Wei, J., et al.: Emergent abilities of large language models (2022)
Yang, Y., Uy, M.C.S., Huang, A.: FinBERT: a pretrained language model for financial communications (2020)
Appendices
GPT Summarisation
For summarising a cluster with GPT, the prompt shown in Fig. 5 was used.
A set of examples obtained with this approach is shown in Fig. 6.
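Since the prompt itself appears only in Fig. 5, the sketch below illustrates how such a summarisation call can be issued against an Azure OpenAI GPT-4 deployment. The endpoint, deployment name and prompt wording are placeholders, not the authors' actual values.

```python
# A minimal sketch of the cluster-summarisation call.
# Endpoint, API version, deployment name and prompt text are assumptions.
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key="...",                           # supplied via environment in practice
    api_version="2024-02-01",
    azure_endpoint="https://<your-resource>.openai.azure.com")

def summarise_cluster(pledges: list[str]) -> str:
    """Ask GPT-4 for a short thematic summary of one cluster of pledges."""
    joined = "\n\n".join(pledges)
    response = client.chat.completions.create(
        model="gpt-4",                       # Azure deployment name (assumption)
        messages=[
            {"role": "system",
             "content": "You summarise clusters of tourism pledges."},  # placeholder, see Fig. 5
            {"role": "user",
             "content": f"Summarise the common theme of these pledges:\n\n{joined}"}],
        temperature=0)                       # deterministic output for reproducibility
    return response.choices[0].message.content
```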
GPT Classification
For classifying pledges with GPT, the prompt shown in Fig. 7 was used.
For instance, for the pledge displayed below and the clusters defined by Word2Vec, we obtained the results shown in Fig. 8; a sketch of such a classification call follows the pledge.
“Actually we as an association are still pretty much at the beginning due to the pandemic which took the better part of our resources. What we want to provide is a proper guideline for STR how to achieve, maintain and develop a sustainable business. Additionally our business is very fragmented and diverse. We have privat hosts with one or only a few apartments, professional hosts, local property managers of all sizes, platforms for STR (local, national and international level) and service providers for the industry of all kind. Our target for 2025 is to achieve that guideline.”
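Analogously, a classification call can pass the candidate clusters together with the pledge. The prompt wording in this sketch is again a placeholder (the real prompt is in Fig. 7), the cluster labels are assumed to be the summaries produced in the previous step, and `client` is the one created in the summarisation sketch above.

```python
# A minimal sketch of the pledge-classification call (GPT-4 as a human emulator).
# The instruction text is a placeholder; see Fig. 7 for the actual prompt.
def classify_pledge(pledge: str, cluster_summaries: list[str]) -> str:
    """Ask GPT-4 which cluster a pledge belongs to, emulating a human annotator."""
    options = "\n".join(f"{i}: {s}" for i, s in enumerate(cluster_summaries))
    response = client.chat.completions.create(
        model="gpt-4",                       # same Azure deployment as above (assumption)
        messages=[
            {"role": "system",
             "content": "You assign tourism pledges to the best-matching cluster."},  # placeholder, see Fig. 7
            {"role": "user",
             "content": f"Clusters:\n{options}\n\nPledge:\n{pledge}\n\n"
                        "Answer with the number of the best-matching cluster."}],
        temperature=0)
    return response.choices[0].message.content.strip()
```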
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Caudron, E., Ghesquière, N., Travers, W., Balahur, A. (2024). Adaptation of Large Language Models for the Public Sector: A Clustering Use Case. In: Rapp, A., Di Caro, L., Meziane, F., Sugumaran, V. (eds) Natural Language Processing and Information Systems. NLDB 2024. Lecture Notes in Computer Science, vol 14763. Springer, Cham. https://doi.org/10.1007/978-3-031-70242-6_31
DOI: https://doi.org/10.1007/978-3-031-70242-6_31
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-70241-9
Online ISBN: 978-3-031-70242-6