Abstract
In an era of rapidly increasing numbers of scientific publications, researchers face the challenge of keeping pace with field-specific advances. This paper presents methodological advancements in topic modeling that leverage state-of-the-art language models. We introduce the AHAM methodology and a score for domain-specific adaptation of the BERTopic framework to enhance scientific text analysis. Using the LLaMa2 model, we generate topic definitions through one-shot learning, with domain experts helping to craft prompts that guide literature mining by asking the model to label topics. For inter-topic similarity assessment we employ scores from language generation and translation, aiming to minimize outlier topics and overlap between topic definitions. AHAM has been validated on a new corpus of scientific papers, proving effective in revealing novel insights across research areas. We also examine the impact of sentence-transformer domain adaptation on topic modeling precision, using datasets from arXiv and focusing on data size, the niche of adaptation, and the role of domain adaptation. Our findings indicate a significant interaction between domain adaptation and topic modeling accuracy, especially regarding outliers and topic clarity. We release our code at https://github.com/bkolosk1/aham.
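To make the described pipeline concrete, the sketch below assembles a minimal BERTopic run with a sentence-transformer encoder and a one-shot labeling prompt in the spirit of the AHAM loop. It is a sketch under stated assumptions, not the authors' released code (see the repository linked above): the encoder name, UMAP/HDBSCAN settings, prompt wording, and the 20-newsgroups stand-in corpus are illustrative placeholders.

```python
# Minimal sketch of a BERTopic pipeline with LLM-oriented topic labeling.
# Assumptions: bertopic, sentence-transformers, umap-learn, hdbscan, scikit-learn installed;
# encoder, clustering settings, corpus, and prompt are illustrative, not the paper's exact setup.
from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic import BERTopic

# Stand-in corpus; AHAM itself is applied to a corpus of scientific abstracts.
docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]

# Sentence-transformer encoder; the paper additionally studies domain adaptation of this model.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

topic_model = BERTopic(
    embedding_model=encoder,
    umap_model=UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine"),
    hdbscan_model=HDBSCAN(min_cluster_size=15, metric="euclidean", prediction_data=True),
)
topics, probs = topic_model.fit_transform(docs)


def one_shot_label_prompt(
    topic_id: int,
    example: str = "keywords: fish oil, blood viscosity, raynaud -> label: dietary lipids and vascular disease",
) -> str:
    """Build a one-shot prompt asking an LLM (e.g., LLaMa2) to name a topic from its top keywords.

    The in-context example above is a hypothetical illustration, not the paper's prompt."""
    keywords = ", ".join(word for word, _ in topic_model.get_topic(topic_id))
    return f"{example}\nkeywords: {keywords} -> label:"


# The generated labels would then be compared pairwise with generation/translation-style scores
# to flag overlapping or outlier topics.
print(one_shot_label_prompt(0))
```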
Acknowledgements
The authors acknowledge the financial support of the Slovenian Research and Innovation Agency through the research core funding programme Knowledge Technologies (No. P2-0103) and the research projects Research collaboration prediction using a literature-based discovery approach (No. J5-2552), Embeddings-based techniques for Media Monitoring Applications (No. L2-50070), and Hate speech in contemporary conceptualizations of nationalism, racism, gender and migration (No. J5-3102). The work of the first author was also supported by Young Researcher Grant PR-12394.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Koloski, B., Lavrač, N., Cestnik, B., Pollak, S., Škrlj, B., Kastrin, A. (2024). AHAM: Adapt, Help, Ask, Model Harvesting LLMs for Literature Mining. In: Miliou, I., Piatkowski, N., Papapetrou, P. (eds) Advances in Intelligent Data Analysis XXII. IDA 2024. Lecture Notes in Computer Science, vol 14641. Springer, Cham. https://doi.org/10.1007/978-3-031-58547-0_21
DOI: https://doi.org/10.1007/978-3-031-58547-0_21
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-58546-3
Online ISBN: 978-3-031-58547-0
eBook Packages: Computer Science, Computer Science (R0)