Effective Information Retrieval, Question Answering and Abstractive Summarization on Large-Scale Biomedical Document Corpora

Shenoy, Naveen; Nayak, Pratham; Jain, Sarthak; Sowmya Kamath, S.; Sugumaran, Vijayan

doi:10.1007/978-3-031-35320-8_29

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13913)

Included in the following conference series:

International Conference on Applications of Natural Language to Information Systems

Abstract

During the COVID-19 pandemic, a concentrated effort was made to collate published literature on SARS-CoV-2 and other coronaviruses for the benefit of the medical community. One such initiative is the COVID-19 Open Research Dataset, which contains over 400,000 published research articles. To expedite access to relevant information sources for health workers and researchers, it is vital to design effective information retrieval and information extraction systems. In this article, an information retrieval (IR) approach that leverages transformer-based models to enable question answering and abstractive summarization is presented. Several keyword-based and neural-network-based models are evaluated and incorporated to reduce the search space and identify relevant sentences in the corpus for ranked retrieval. For abstractive summarization, candidate sentences are selected using a combination of standard scoring metrics. Finally, the summary and the user query are used to support question answering. The proposed model is evaluated using standard metrics on the CovidQA dataset for both natural language and keyword queries. The proposed approach achieved promising performance for both query classes and outperformed several unsupervised baselines.
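The keyword-based stage that narrows the search space can be illustrated with a small sketch. The abstract does not say which keyword model the authors used, so Okapi BM25 is assumed here as a common choice. The function names, the whitespace tokenizer, and the parameter values (`k1`, `b`) are illustrative, not taken from the paper.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; a real system would
    # likely use a tokenizer suited to biomedical text.
    return text.lower().split()


def bm25_rank(query, sentences, k1=1.5, b=0.75):
    """Return (index, score) pairs sorted by descending BM25 score."""
    docs = [tokenize(s) for s in sentences]
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n if n else 0.0
    # Document frequency: number of sentences that contain each term.
    df = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))
    scores = []
    for i, d in enumerate(docs):
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append((i, score))
    # sorted() is stable, so ties keep their original order.
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


def top_k(query, sentences, k=2):
    """Keep up to k sentences with a positive score as candidates."""
    ranked = bm25_rank(query, sentences)[:k]
    return [sentences[i] for i, score in ranked if score > 0]


sentences = [
    "coronavirus incubation period is about five days",
    "bats are a natural reservoir of coronaviruses",
    "the weather was sunny",
]
candidates = top_k("incubation period", sentences)
```

In a pipeline like the one the abstract describes, the surviving candidates would then be passed to a neural re-ranker, which is not shown here.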

Author information

Authors and Affiliations

Department of Information Technology, National Institute of Technology Karnataka, Surathkal, 575025, India
Naveen Shenoy, Pratham Nayak, Sarthak Jain & S. Sowmya Kamath
Department of Decision and Information Sciences, Oakland University, Rochester, MI, 48309, USA
Vijayan Sugumaran

Corresponding author

Correspondence to S. Sowmya Kamath.

Editor information

Editors and Affiliations

Conservatoire National des Arts et Métiers, Paris, France
Elisabeth Métais
University of Derby, Derby, UK
Farid Meziane
Oakland University, Rochester, MI, USA
Vijayan Sugumaran
University of Derby, Derby, UK
Warren Manning
University of Derby, Derby, UK
Stephan Reiff-Marganiec

Cite this paper

Shenoy, N., Nayak, P., Jain, S., Sowmya Kamath, S., Sugumaran, V. (2023). Effective Information Retrieval, Question Answering and Abstractive Summarization on Large-Scale Biomedical Document Corpora. In: Métais, E., Meziane, F., Sugumaran, V., Manning, W., Reiff-Marganiec, S. (eds) Natural Language Processing and Information Systems. NLDB 2023. Lecture Notes in Computer Science, vol 13913. Springer, Cham. https://doi.org/10.1007/978-3-031-35320-8_29

DOI: https://doi.org/10.1007/978-3-031-35320-8_29
Published: 14 June 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-35319-2
Online ISBN: 978-3-031-35320-8
eBook Packages: Computer Science, Computer Science (R0)
