Abstract
In an era of rapidly increasing numbers of scientific publications, researchers face the challenge of keeping pace with field-specific advances. This paper presents methodological advancements in topic modeling that leverage state-of-the-art language models. We introduce the AHAM methodology and a score for domain-specific adaptation of the BERTopic framework to enhance scientific text analysis. Using the LLaMa2 model, we generate topic definitions through one-shot learning, with domain experts helping to craft prompts that guide literature mining by asking the model to label topics. For inter-topic similarity assessment we employ scores from language generation and translation, aiming to minimize outlier topics and overlap between topic definitions. AHAM has been validated on a new corpus of scientific papers, proving effective in revealing novel insights across research areas. We also examine the impact of sentence-transformer domain adaptation on topic modeling precision, using datasets from arXiv and focusing on data size, the niche of adaptation, and the role of domain adaptation. Our findings indicate a significant interaction between domain adaptation and topic modeling accuracy, especially regarding outliers and topic clarity. We release our code at https://github.com/bkolosk1/aham.
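To make the described pipeline concrete, the sketch below assembles a minimal BERTopic run with a sentence-transformer encoder and a one-shot labeling prompt in the spirit of the AHAM loop. It is a sketch under stated assumptions, not the authors' released code (see the repository linked above): the encoder name, UMAP/HDBSCAN settings, prompt wording, and the 20-newsgroups stand-in corpus are illustrative placeholders.

```python
# Minimal sketch of a BERTopic pipeline with LLM-oriented topic labeling.
# Assumptions: bertopic, sentence-transformers, umap-learn, hdbscan, scikit-learn installed;
# encoder, clustering settings, corpus, and prompt are illustrative, not the paper's exact setup.
from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic import BERTopic

# Stand-in corpus; AHAM itself is applied to a corpus of scientific abstracts.
docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]

# Sentence-transformer encoder; the paper additionally studies domain adaptation of this model.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

topic_model = BERTopic(
    embedding_model=encoder,
    umap_model=UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine"),
    hdbscan_model=HDBSCAN(min_cluster_size=15, metric="euclidean", prediction_data=True),
)
topics, probs = topic_model.fit_transform(docs)


def one_shot_label_prompt(
    topic_id: int,
    example: str = "keywords: fish oil, blood viscosity, raynaud -> label: dietary lipids and vascular disease",
) -> str:
    """Build a one-shot prompt asking an LLM (e.g., LLaMa2) to name a topic from its top keywords.

    The in-context example above is a hypothetical illustration, not the paper's prompt."""
    keywords = ", ".join(word for word, _ in topic_model.get_topic(topic_id))
    return f"{example}\nkeywords: {keywords} -> label:"


# The generated labels would then be compared pairwise with generation/translation-style scores
# to flag overlapping or outlier topics.
print(one_shot_label_prompt(0))
```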
Acknowledgements
The authors acknowledge the financial support of the Slovenian Research and Innovation Agency through the research core funding programme Knowledge Technologies (No. P2-0103) and the research projects Research collaboration prediction using a literature-based discovery approach (No. J5-2552), Embeddings-based techniques for Media Monitoring Applications (No. L2-50070), and Hate speech in contemporary conceptualizations of nationalism, racism, gender and migration (No. J5-3102). The work of the first author was also supported by Young Researcher Grant PR-12394.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Koloski, B., Lavrač, N., Cestnik, B., Pollak, S., Škrlj, B., Kastrin, A. (2024). AHAM: Adapt, Help, Ask, Model Harvesting LLMs for Literature Mining. In: Miliou, I., Piatkowski, N., Papapetrou, P. (eds) Advances in Intelligent Data Analysis XXII. IDA 2024. Lecture Notes in Computer Science, vol 14641. Springer, Cham. https://doi.org/10.1007/978-3-031-58547-0_21
DOI: https://doi.org/10.1007/978-3-031-58547-0_21
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-58546-3
Online ISBN: 978-3-031-58547-0
eBook Packages: Computer Science, Computer Science (R0)