Abstract
Recent research has highlighted the potential of domain adaptation for improving the performance of generic Large Language Models (LLMs) [1, 5, 19, 20]. In this study, we explore the benefits of LLMs and of their adaptation to language- and domain-specific data for the public sector. Our focus is on investigating the impact of domain adaptation for a specific use case in the European public service: the clustering of pledges on the Transition Pathway for Tourism (see Note 1). First, a limited corpus of official documents and legislation on the Transition Pathway for Tourism was collected. Then, relying on existing approaches for domain adaptation of large language models, this corpus was used to adapt two pre-trained language models (BERT and RoBERTa) to the domain of interest. Finally, an innovative approach based on Azure OpenAI GPT-4 as a human emulator was used to evaluate the impact of domain adaptation. The results of our study reveal a nuanced impact of domain adaptation. While the domain-adapted LLMs did not generate quantitatively more coherent clusters than their pre-trained counterparts, they had a positive impact on model accuracy (on the order of 5%) when the qualitative aspect of the clusters’ content is considered. This suggests that domain adaptation can enhance the interpretability and usability of the clusters, even when working with a small dataset. However, it is worth noting that, in terms of interpretability and usability, a simpler model such as Word2Vec outperformed the LLMs even after domain adaptation.
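As a concrete illustration of the adaptation step, the sketch below shows continued masked-language-model pretraining of RoBERTa on a domain corpus with Hugging Face Transformers, in the spirit of Gururangan et al. [13]. The corpus file name, model checkpoint and hyperparameters are illustrative assumptions, not the authors' exact configuration.

```python
# A minimal sketch of domain-adaptive pretraining (continued MLM training).
# Corpus path, checkpoint and hyperparameters are assumptions for illustration.
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)
from datasets import load_dataset

model_name = "roberta-base"                  # the paper also adapts a BERT model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Hypothetical file of tourism-domain passages, one per line.
corpus = load_dataset("text", data_files={"train": "tourism_corpus.txt"})
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

# Standard 15% masking, as in the original BERT/RoBERTa pretraining objective.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="roberta-tourism",
                           num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=tokenized["train"],
    data_collator=collator)
trainer.train()                              # continued pretraining on the domain corpus
model.save_pretrained("roberta-tourism")
```

The adapted checkpoint can then be used like any pre-trained encoder to embed pledges before clustering.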
Notes
1. The Transition Pathway for Tourism is an initiative from the European Commission defining key actions, targets and conditions to achieve the green and digital transitions of the sector [9].
References
Alsentzer, E., et al.: Publicly available clinical BERT embeddings (2019)
Amigó, E., Gonzalo, J., Artiles, J., Verdejo, F.: A comparison of extrinsic clustering evaluation metrics based on formal constraints. Inf. Retr. Boston. 12(4), 461–486 (2009)
Araci, D.: FinBERT: financial sentiment analysis with pre-trained language models (2019)
Arefeva, V., Egger, R.: When BERT started traveling: TourBERT - a natural language processing model for the travel industry. Digital 2(4), 546–559 (2022). https://doi.org/10.3390/digital2040030, https://www.mdpi.com/2673-6470/2/4/30
Beltagy, I., Lo, K., Cohan, A.: SciBERT: a pretrained language model for scientific text (2019)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Burstein, J., Doran, C., Solorio, T. (eds.) Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Association for Computational Linguistics, Minneapolis, Minnesota (2019). https://doi.org/10.18653/v1/N19-1423, https://aclanthology.org/N19-1423
Dudek, A.: Silhouette index as clustering evaluation tool. In: Jajuga, K., Batóg, J., Walesiak, M. (eds.) SKAD 2019. SCDAKO, pp. 19–33. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-52348-0_2
Edwards, A., Camacho-Collados, J., De Ribaupierre, H., Preece, A.: Go simple and pre-train on domain-specific corpora: on the role of training data for text classification. In: Scott, D., Bel, N., Zong, C. (eds.) Proceedings of the 28th International Conference on Computational Linguistics, pp. 5522–5529. International Committee on Computational Linguistics, Barcelona, Spain (Online) (2020). https://doi.org/10.18653/v1/2020.coling-main.481, https://aclanthology.org/2020.coling-main.481
European Commission: First transition pathway co-created with industry and civil society for a resilient, green and digital tourism ecosystem. https://ec.europa.eu/commission/presscorner/detail/en/ip_22_850
European Commission: Shaping Europe’s digital future. https://digital-strategy.ec.europa.eu/en/activities/digital-programme
European Commission: Communication from the Commission: Artificial intelligence for Europe, COM(2018) 237 final (2018). https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=COM:2018:237:FIN#document1
European Commission: Transition pathway for tourism (2022). https://op.europa.eu/s/y7Ht
Gururangan, S., et al.: Don’t stop pretraining: adapt language models to domains and tasks. In: Jurafsky, D., Chai, J., Schluter, N., Tetreault, J. (eds.) Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 8342–8360. Association for Computational Linguistics (2020). https://doi.org/10.18653/v1/2020.acl-main.740, https://aclanthology.org/2020.acl-main.740
Hadi, M.U., et al.: Large language models: a comprehensive survey of its applications, challenges, limitations, and future prospects. https://api.semanticscholar.org/CorpusID:266378240
Howard, J., Ruder, S.: Universal language model fine-tuning for text classification. In: Gurevych, I., Miyao, Y. (eds.) Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 328–339. Association for Computational Linguistics, Melbourne, Australia (2018). https://doi.org/10.18653/v1/P18-1031, https://aclanthology.org/P18-1031
Huang, K., Altosaar, J., Ranganath, R.: ClinicalBERT: modeling clinical notes and predicting hospital readmission (2020)
Kumar, P.S., Reddy, P.V.: Document clustering using RoBERTa and convolution neural network model. Int. J. Intell. Syst. Appl. Eng. 12(8s), 221–230 (2023). https://www.ijisae.org/index.php/IJISAE/article/view/4112
Tangi, L., Combetto, M., Martin, B.J., Rodriguez, M.P.: Artificial intelligence for interoperability in the European public sector (KJ-NA-31-675-EN-N (online)) (2023). https://doi.org/10.2760/633646
Lee, J.S., Hsiang, J.: PatentBERT: patent classification with fine-tuning a pre-trained BERT model (2019)
Lee, J., et al.: BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36(4), 1234–1240 (2019). https://doi.org/10.1093/bioinformatics/btz682
Ling, C., et al.: Domain specialization as the key to make large language models disruptive: a comprehensive survey (2023)
Naveed, H., et al.: A comprehensive overview of large language models (2024)
OpenAI: Language models are few-shot learners. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (eds.) Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901. Curran Associates, Inc. (2020). https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
OpenAI: GPT-4 technical report (2023)
Pasta, A., et al.: Clustering users based on hearing aid use: an exploratory analysis of real-world data. Front. Digit. Health 3 (2021). https://doi.org/10.3389/fdgth.2021.725130
Semantic Interoperability Community: Text mining on GROW tourism pledges - documentation (2023). https://github.com/SEMICeu/semic_pledges
Subakti, A., Murfi, H., Hariadi, N.: The performance of BERT as data representation of text clustering. J. Big Data 9(1), 15 (2022). https://doi.org/10.1186/s40537-022-00564-9
Wang, H., Li, J., Wu, H., Hovy, E., Sun, Y.: Pre-trained language models and their applications. Engineering 25, 51–65 (2023). https://doi.org/10.1016/j.eng.2022.04.024, https://www.sciencedirect.com/science/article/pii/S2095809922006324
Wei, J., et al.: Emergent abilities of large language models (2022)
Yang, Y., Uy, M.C.S., Huang, A.: FinBERT: a pretrained language model for financial communications (2020)
Appendices
GPT Summarisation
For summarising a cluster with GPT, the prompt shown in Fig. 5 was used.
A set of examples obtained with this approach is shown in Fig. 6.
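Since the prompt itself appears only in Fig. 5, the sketch below illustrates how such a summarisation call can be issued against an Azure OpenAI GPT-4 deployment. The endpoint, deployment name and prompt wording are placeholders, not the authors' actual values.

```python
# A minimal sketch of the cluster-summarisation call.
# Endpoint, API version, deployment name and prompt text are assumptions.
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key="...",                           # supplied via environment in practice
    api_version="2024-02-01",
    azure_endpoint="https://<your-resource>.openai.azure.com")

def summarise_cluster(pledges: list[str]) -> str:
    """Ask GPT-4 for a short thematic summary of one cluster of pledges."""
    joined = "\n\n".join(pledges)
    response = client.chat.completions.create(
        model="gpt-4",                       # Azure deployment name (assumption)
        messages=[
            {"role": "system",
             "content": "You summarise clusters of tourism pledges."},  # placeholder, see Fig. 5
            {"role": "user",
             "content": f"Summarise the common theme of these pledges:\n\n{joined}"}],
        temperature=0)                       # deterministic output for reproducibility
    return response.choices[0].message.content
```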
GPT Classification
For classifying pledges with GPT, the prompt shown in Fig. 7 was used.
For instance, for the pledge displayed below and the clusters defined by Word2Vec, we obtained the results shown in Fig. 8; a sketch of such a classification call follows the pledge.
“Actually we as an association are still pretty much at the beginning due to the pandemic which took the better part of our resources. What we want to provide is a proper guideline for STR how to achieve, maintain and develop a sustainable business. Additionally our business is very fragmented and diverse. We have privat hosts with one or only a few apartments, professional hosts, local property managers of all sizes, platforms for STR (local, national and international level) and service providers for the industry of all kind. Our target for 2025 is to achieve that guideline.”
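Analogously, a classification call can pass the candidate clusters together with the pledge. The prompt wording in this sketch is again a placeholder (the real prompt is in Fig. 7), the cluster labels are assumed to be the summaries produced in the previous step, and `client` is the one created in the summarisation sketch above.

```python
# A minimal sketch of the pledge-classification call (GPT-4 as a human emulator).
# The instruction text is a placeholder; see Fig. 7 for the actual prompt.
def classify_pledge(pledge: str, cluster_summaries: list[str]) -> str:
    """Ask GPT-4 which cluster a pledge belongs to, emulating a human annotator."""
    options = "\n".join(f"{i}: {s}" for i, s in enumerate(cluster_summaries))
    response = client.chat.completions.create(
        model="gpt-4",                       # same Azure deployment as above (assumption)
        messages=[
            {"role": "system",
             "content": "You assign tourism pledges to the best-matching cluster."},  # placeholder, see Fig. 7
            {"role": "user",
             "content": f"Clusters:\n{options}\n\nPledge:\n{pledge}\n\n"
                        "Answer with the number of the best-matching cluster."}],
        temperature=0)
    return response.choices[0].message.content.strip()
```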
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Caudron, E., Ghesquière, N., Travers, W., Balahur, A. (2024). Adaptation of Large Language Models for the Public Sector: A Clustering Use Case. In: Rapp, A., Di Caro, L., Meziane, F., Sugumaran, V. (eds) Natural Language Processing and Information Systems. NLDB 2024. Lecture Notes in Computer Science, vol 14763. Springer, Cham. https://doi.org/10.1007/978-3-031-70242-6_31
DOI: https://doi.org/10.1007/978-3-031-70242-6_31
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-70241-9
Online ISBN: 978-3-031-70242-6