Effective Information Retrieval, Question Answering and Abstractive Summarization on Large-Scale Biomedical Document Corpora

  • Conference paper
Natural Language Processing and Information Systems (NLDB 2023)

Abstract

During the COVID-19 pandemic, a concerted effort was made to collate published literature on SARS-CoV-2 and other coronaviruses for the benefit of the medical community. One such initiative is the COVID-19 Open Research Dataset, which contains over 400,000 published research articles. To expedite access to relevant information sources for health workers and researchers, it is vital to design effective information retrieval (IR) and information extraction systems. This article presents an IR approach that leverages transformer-based models to enable question answering and abstractive summarization. Various keyword-based and neural-network-based models are evaluated and incorporated to reduce the search space and identify relevant sentences from the vast corpus for ranked retrieval. For abstractive summarization, candidate sentences are selected using a combination of standard scoring metrics. Finally, the summary and the user query are used to support question answering. The proposed model is evaluated with standard metrics on the CovidQA dataset for both natural language and keyword queries. The proposed approach achieved promising performance for both query classes and outperformed various unsupervised baselines.
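
The abstract does not name the specific keyword-based model used to reduce the search space, so the following is only an illustrative sketch of that first stage, assuming an Okapi BM25 scorer over sentences. The function names `tokenize` and `bm25_rank`, the parameter defaults `k1=1.5` and `b=0.75`, and the three example sentences are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a real system would use a proper tokenizer)."""
    return text.lower().split()


def bm25_rank(query, sentences, k1=1.5, b=0.75, top_k=2):
    """Return the top_k (score, sentence) pairs ranked by an Okapi BM25 score.

    Assumed illustration of a keyword-based shortlist step, not the paper's code.
    """
    if not sentences:
        return []
    docs = [tokenize(s) for s in sentences]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of sentences that contain each term.
    df = Counter(term for d in docs for term in set(d))
    query_terms = tokenize(query)
    scored = []
    for sentence, doc in zip(sentences, docs):
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalisation penalises sentences longer than average.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scored.append((score, sentence))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Hypothetical toy corpus standing in for sentences drawn from a large collection.
sentences = [
    "Coronavirus incubation period is estimated at five days.",
    "Masks reduce transmission in indoor settings.",
    "The incubation period varies between patients.",
]
results = bm25_rank("incubation period", sentences)
for score, sentence in results:
    print(f"{score:.3f}  {sentence}")
```

In a pipeline like the one the abstract outlines, a shortlist of this kind would then be passed to a neural model for re-ranking before summarization and question answering; that later stage is not sketched here.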

Author information

Corresponding author

Correspondence to S. Sowmya Kamath.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Shenoy, N., Nayak, P., Jain, S., Sowmya Kamath, S., Sugumaran, V. (2023). Effective Information Retrieval, Question Answering and Abstractive Summarization on Large-Scale Biomedical Document Corpora. In: Métais, E., Meziane, F., Sugumaran, V., Manning, W., Reiff-Marganiec, S. (eds) Natural Language Processing and Information Systems. NLDB 2023. Lecture Notes in Computer Science, vol 13913. Springer, Cham. https://doi.org/10.1007/978-3-031-35320-8_29

  • DOI: https://doi.org/10.1007/978-3-031-35320-8_29

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-35319-2

  • Online ISBN: 978-3-031-35320-8

  • eBook Packages: Computer Science, Computer Science (R0)
