AHAM: Adapt, Help, Ask, Model - Harvesting LLMs for Literature Mining

  • Conference paper
  • Published in: Advances in Intelligent Data Analysis XXII (IDA 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14641)

Abstract

In an era of rapidly increasing numbers of scientific publications, researchers face the challenge of keeping pace with field-specific advances. This paper presents methodological advancements in topic modeling that utilize state-of-the-art language models. We introduce the AHAM methodology and a score for domain-specific adaptation of the BERTopic framework to enhance scientific text analysis. Using the LLaMa2 model, we generate topic definitions through one-shot learning, with domain experts helping to craft the prompts that ask the model to label topics and thereby guide the literature mining. We employ language generation and translation scores to assess inter-topic similarity, aiming to minimize both outlier topics and overlap between topic definitions. AHAM has been validated on a new corpus of scientific papers, proving effective in revealing novel insights across research areas. We also examine the impact of sentence-transformer domain adaptation on topic modeling precision using datasets from arXiv, focusing on data size, adaptation niche, and the role of domain adaptation. Our findings indicate a significant interaction between domain adaptation and topic modeling accuracy, especially regarding outliers and topic clarity. We release our code at https://github.com/bkolosk1/aham.
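
The abstract compresses the pipeline into a few sentences: fit BERTopic on embeddings from a (possibly domain-adapted) sentence-transformer, track how many documents fall into the outlier topic, and have an LLM name each topic from a one-shot, expert-crafted prompt. As a rough orientation only, the sketch below wires these pieces together using the public BERTopic and sentence-transformers APIs; the encoder checkpoint, the prompt template, and the generate callable are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

    # Illustrative sketch of a BERTopic pipeline with LLM topic labelling.
    # Assumptions: the encoder checkpoint, ONE_SHOT_PROMPT, and the
    # `generate` callable are placeholders; the BERTopic and
    # SentenceTransformer calls themselves are the libraries' real APIs.
    from bertopic import BERTopic
    from sentence_transformers import SentenceTransformer

    # One-shot prompt: a single expert-crafted example, then the topic to label.
    ONE_SHOT_PROMPT = (
        "Keywords: gene, expression, rna, sequencing\n"
        "Topic label: transcriptomics\n\n"
        "Keywords: {keywords}\n"
        "Topic label:"
    )

    def fit_topics(docs, encoder_name="all-MiniLM-L6-v2"):
        """Fit BERTopic with a pluggable (optionally domain-adapted) encoder."""
        encoder = SentenceTransformer(encoder_name)
        topic_model = BERTopic(embedding_model=encoder)
        topics, _ = topic_model.fit_transform(docs)
        # BERTopic assigns outlier documents to topic -1.
        n_outliers = sum(1 for t in topics if t == -1)
        return topic_model, topics, n_outliers

    def label_topic(generate, topic_model, topic_id, top_n=10):
        """Ask a text-generation callable (e.g. a LLaMa-2 pipeline) for a label."""
        keywords = [word for word, _ in topic_model.get_topic(topic_id)[:top_n]]
        return generate(ONE_SHOT_PROMPT.format(keywords=", ".join(keywords))).strip()

Refitting with a base versus a domain-adapted encoder_name, then comparing n_outliers together with the pairwise similarity of the generated labels, mirrors the adaptation analysis the abstract describes.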


Acknowledgements

The authors acknowledge the financial support of the Slovenian Research and Innovation Agency through the research core funding project Knowledge Technologies (No. P2-0103) and the research projects Research collaboration prediction using literature-based discovery approach (No. J5-2552), Embeddings-based techniques for Media Monitoring Applications (No. L2-50070), and Hate speech in contemporary conceptualizations of nationalism, racism, gender and migration (No. J5-3102). The work of the first author was supported by Young Researcher Grant PR-12394.

Author information

Correspondence to Boshko Koloski.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Koloski, B., Lavrač, N., Cestnik, B., Pollak, S., Škrlj, B., Kastrin, A. (2024). AHAM: Adapt, Help, Ask, Model - Harvesting LLMs for Literature Mining. In: Miliou, I., Piatkowski, N., Papapetrou, P. (eds) Advances in Intelligent Data Analysis XXII. IDA 2024. Lecture Notes in Computer Science, vol 14641. Springer, Cham. https://doi.org/10.1007/978-3-031-58547-0_21

  • DOI: https://doi.org/10.1007/978-3-031-58547-0_21

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-58546-3

  • Online ISBN: 978-3-031-58547-0

  • eBook Packages: Computer Science, Computer Science (R0)
