
Adaptation of Large Language Models for the Public Sector: A Clustering Use Case

  • Conference paper
Natural Language Processing and Information Systems (NLDB 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14763)


Abstract

Recent research has highlighted the potential of domain adaptation for improving the performance of generic Large Language Models (LLMs) [1, 5, 19, 20]. In this study, we explore the benefits of LLMs, and of their adaptation to language- and domain-specific data, for the public sector. Our focus is on investigating the impact of domain adaptation for a specific use case in the European public service: the clustering of pledges on the Transition Pathway for Tourism (see Note 1). First, a limited corpus of official documents and legislation on the Transition Pathway for Tourism was collected. Then, relying on existing approaches for domain adaptation of large language models, this corpus was used to adapt two pre-trained language models (BERT and RoBERTa) to the domain of interest. Finally, an innovative approach based on Azure OpenAI GPT-4 as a human emulator was used to evaluate the impact of domain adaptation. The results of our study reveal a nuanced impact of domain adaptation. While the domain-adapted LLMs did not generate quantitatively more coherent clusters than their pre-trained counterparts, they did have a positive impact on model accuracy (on the order of 5%) when the qualitative content of the clusters is considered. This suggests that domain adaptation can enhance the interpretability and usability of the clusters, even when working with a small dataset. It is worth noting, however, that in terms of interpretability and usability, a simpler model such as Word2Vec outperformed the LLMs even after domain adaptation.
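The adaptation step described in the abstract corresponds to continued pretraining of a masked language model on in-domain text [13]. Purely as an illustration, a minimal sketch of such domain-adaptive pretraining with the Hugging Face transformers and datasets libraries might look as follows; the corpus file name and all hyperparameters are placeholder assumptions, not the authors' settings:

```python
# Minimal sketch: continued masked-language-model pretraining of BERT on a
# small domain corpus. "tourism_corpus.txt" and all hyperparameters are
# illustrative assumptions, not the settings used in the paper.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "bert-base-uncased"  # or "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Domain corpus: one passage per line (hypothetical file).
corpus = load_dataset("text", data_files={"train": "tourism_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

# Randomly mask 15% of tokens, as in standard BERT pretraining.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="bert-tourism-adapted",
        num_train_epochs=3,
        per_device_train_batch_size=16,
    ),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
trainer.save_model("bert-tourism-adapted")  # domain-adapted checkpoint
```

The adapted checkpoint can then be loaded as an encoder to embed the pledges before clustering them.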


Notes

  1. The Transition Pathway for Tourism is an initiative from the European Commission defining key actions, targets and conditions to achieve the green and digital transitions of the sector [9].

References

  1. Alsentzer, E., et al.: Publicly available clinical BERT embeddings (2019)


  2. Amigó, E., Gonzalo, J., Artiles, J., Verdejo, F.: A comparison of extrinsic clustering evaluation metrics based on formal constraints. Inf. Retr. Boston. 12(4), 461–486 (2009)


  3. Araci, D.: FinBERT: financial sentiment analysis with pre-trained language models (2019)


  4. Arefeva, V., Egger, R.: When BERT started traveling: TourBERT, a natural language processing model for the travel industry. Digital 2(4), 546–559 (2022). https://doi.org/10.3390/digital2040030, https://www.mdpi.com/2673-6470/2/4/30

  5. Beltagy, I., Lo, K., Cohan, A.: SciBERT: a pretrained language model for scientific text (2019)


  6. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Burstein, J., Doran, C., Solorio, T. (eds.) Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Association for Computational Linguistics, Minneapolis, Minnesota (2019). https://doi.org/10.18653/v1/N19-1423, https://aclanthology.org/N19-1423

  7. Dudek, A.: Silhouette index as clustering evaluation tool. In: Jajuga, K., Batóg, J., Walesiak, M. (eds.) SKAD 2019. SCDAKO, pp. 19–33. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-52348-0_2


  8. Edwards, A., Camacho-Collados, J., De Ribaupierre, H., Preece, A.: Go simple and pre-train on domain-specific corpora: on the role of training data for text classification. In: Scott, D., Bel, N., Zong, C. (eds.) Proceedings of the 28th International Conference on Computational Linguistics, pp. 5522–5529. International Committee on Computational Linguistics, Barcelona, Spain (Online) (2020). https://doi.org/10.18653/v1/2020.coling-main.481, https://aclanthology.org/2020.coling-main.481

  9. European Commission: first transition pathway co-created with industry and civil society for a resilient, green and digital tourism ecosystem. https://ec.europa.eu/commission/presscorner/detail/en/ip_22_850

  10. European Commission: shaping Europe’s digital future. https://digital-strategy.ec.europa.eu/en/activities/digital-programme

  11. European Commission: communication from the commission: artificial intelligence for Europe, com(2018) 237 final (2018). https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=COM:2018:237:FIN#document1

  12. European Commission: transition pathway for tourism (2022). https://op.europa.eu/s/y7Ht

  13. Gururangan, S., et al.: Don’t stop pretraining: adapt language models to domains and tasks. In: Jurafsky, D., Chai, J., Schluter, N., Tetreault, J. (eds.) Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 8342–8360. Association for Computational Linguistics (2020). https://doi.org/10.18653/v1/2020.acl-main.740, https://aclanthology.org/2020.acl-main.740

  14. Hadi, M.U., et al.: Large language models: a comprehensive survey of its applications, challenges, limitations, and future prospects. https://api.semanticscholar.org/CorpusID:266378240

  15. Howard, J., Ruder, S.: Universal language model fine-tuning for text classification. In: Gurevych, I., Miyao, Y. (eds.) Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 328–339. Association for Computational Linguistics, Melbourne, Australia (2018). https://doi.org/10.18653/v1/P18-1031, https://aclanthology.org/P18-1031

  16. Huang, K., Altosaar, J., Ranganath, R.: ClinicalBERT: modeling clinical notes and predicting hospital readmission (2020)


  17. Kumar, P.S., Reddy, P.V.: Document clustering using RoBERTa and convolution neural network model. Int. J. Intell. Syst. Appl. Eng. 12(8s), 221–230 (2023). https://www.ijisae.org/index.php/IJISAE/article/view/4112

  18. Tangi, L., Combetto, M., Martin, B.J., Rodriguez, M.P.: Artificial intelligence for interoperability in the European public sector (KJ-NA-31-675-EN-N (online)) (2023). https://doi.org/10.2760/633646

  19. Lee, J.S., Hsiang, J.: PatentBERT: patent classification with fine-tuning a pre-trained BERT model (2019)


  20. Lee, J., et al.: BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36(4), 1234–1240 (2019). https://doi.org/10.1093/bioinformatics/btz682

  21. Ling, C., et al.: Domain specialization as the key to make large language models disruptive: a comprehensive survey (2023)


  22. Naveed, H., et al.: A comprehensive overview of large language models (2024)


  23. OpenAI: Language models are few-shot learners. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (eds.) Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901. Curran Associates, Inc. (2020). https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf

  24. OpenAI: GPT-4 technical report (2023)


  25. Pasta, A., et al.: Clustering users based on hearing aid use: an exploratory analysis of real-world data. Front. Digit. Health 3 (2021). https://doi.org/10.3389/fdgth.2021.725130

  26. Semantic Interoperability Community: text mining on grow tourism pledges - documentation (2023). https://github.com/SEMICeu/semic_pledges

  27. Subakti, A., Murfi, H., Hariadi, N.: The performance of BERT as data representation of text clustering. J. Big Data 9(1), 15 (2022). https://doi.org/10.1186/s40537-022-00564-9

  28. Wang, H., Li, J., Wu, H., Hovy, E., Sun, Y.: Pre-trained language models and their applications. Engineering 25, 51–65 (2023). https://doi.org/10.1016/j.eng.2022.04.024, https://www.sciencedirect.com/science/article/pii/S2095809922006324

  29. Wei, J., et al.: Emergent abilities of large language models (2022)


  30. Yang, Y., Uy, M.C.S., Huang, A.: FinBERT: a pretrained language model for financial communications (2020)



Author information

Correspondence to Emilien Caudron.

Appendices

GPT Summarisation

For summarising a cluster with GPT, the following prompt was used (Fig. 5):

Fig. 5. Prompt used for summarising a cluster of pledges using GPT.
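To make this step concrete, a hedged sketch of how such a summarisation call could be issued through the Azure OpenAI service is shown below, assuming the openai Python SDK (v1.x). The deployment name, environment variables, and prompt wording are illustrative assumptions, not the exact prompt of Fig. 5.

```python
# Sketch of a cluster-summarisation call via Azure OpenAI (openai SDK v1.x).
# Deployment name, environment variables and prompt text are assumptions.
import os

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

def summarise_cluster(pledges: list[str]) -> str:
    """Ask GPT-4 for a short, human-readable summary of one cluster of pledges."""
    joined = "\n".join(f"- {p}" for p in pledges)
    response = client.chat.completions.create(
        model="gpt-4",  # Azure deployment name (assumption)
        messages=[
            {"role": "system",
             "content": "You summarise groups of tourism pledges."},
            {"role": "user",
             "content": "Summarise the common theme of these pledges "
                        f"in two sentences:\n{joined}"},
        ],
        temperature=0,
    )
    return response.choices[0].message.content
```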

Here is a set of examples obtained using this approach (Fig. 6):

Fig. 6. Examples of answers generated by GPT for clusters created with the different models.

GPT Classification

For classifying pledges with GPT, the following prompt was used (Fig. 7):

Fig. 7. Prompt used for classifying pledges into clusters using GPT.

For instance, for the pledge displayed below and the clusters defined by Word2Vec, we obtained the following results (Fig. 8):

“Actually we as an association are still pretty much at the beginning due to the pandemic which took the better part of our resources. What we want to provide is a proper guideline for STR how to achieve, maintain and develop a sustainable business. Additionally our business is very fragmented and diverse. We have privat hosts with one or only a few apartments, professional hosts, local property managers of all sizes, platforms for STR (local, national and international level) and service providers for the industry of all kind. Our target for 2025 is to achieve that guideline.”

Fig. 8. Example of answers generated by GPT for classifying a pledge into the clusters created with Word2Vec.
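A corresponding sketch for the classification step is given below, reusing the AzureOpenAI client from the summarisation example above; the prompt wording and the single-number answer format are assumptions rather than the exact prompt of Fig. 7.

```python
# Sketch: ask GPT-4 to assign one pledge to the best-matching cluster.
# Reuses the `client` defined in the summarisation sketch above.
def classify_pledge(pledge: str, cluster_summaries: list[str]) -> int:
    """Return the index of the cluster GPT-4 assigns the pledge to."""
    options = "\n".join(f"{i}: {s}" for i, s in enumerate(cluster_summaries))
    response = client.chat.completions.create(
        model="gpt-4",  # Azure deployment name (assumption)
        messages=[
            {"role": "system",
             "content": "You assign tourism pledges to the best-matching cluster."},
            {"role": "user",
             "content": f"Clusters:\n{options}\n\nPledge:\n{pledge}\n\n"
                        "Answer with the single cluster number only."},
        ],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())
```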


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Caudron, E., Ghesquière, N., Travers, W., Balahur, A. (2024). Adaptation of Large Language Models for the Public Sector: A Clustering Use Case. In: Rapp, A., Di Caro, L., Meziane, F., Sugumaran, V. (eds) Natural Language Processing and Information Systems. NLDB 2024. Lecture Notes in Computer Science, vol 14763. Springer, Cham. https://doi.org/10.1007/978-3-031-70242-6_31


  • DOI: https://doi.org/10.1007/978-3-031-70242-6_31

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-70241-9

  • Online ISBN: 978-3-031-70242-6

  • eBook Packages: Computer Science, Computer Science (R0)
