DataQuest: An Approach to Automatically Extract Dataset Mentions from Scientific Papers

  • Conference paper
Towards Open and Trustworthy Digital Societies (ICADL 2021)

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 13133))

Abstract

The rapid growth of scientific literature presents several challenges for the search and discovery of research artifacts. Datasets are the backbone of scientific experiments. Locating the datasets used or generated by previous research is crucial because building suitable datasets is costly in terms of time, money, and human labor. Hence, automated mechanisms for searching and discovering datasets in scientific publications can aid the reproducibility and reusability of these valuable scientific artifacts. In this work, using the next sentence prediction capability of language models, we show that a BERT-based entity recognition model with POS-aware embeddings can be used effectively to address this problem. Our investigation shows that first identifying the sentences that contain dataset mentions is critical to the task. Our method outperforms earlier ones and achieves an F1 score of 56.2 in extracting dataset mentions from research papers on a popular corpus of social science publications. Our code is available at https://github.com/sandeep82945/data_discovery.

S. Kumar and T. Ghosal contributed equally.

Notes

  1. B, I, and O denote the beginning, intermediate, and outside of a dataset mention.
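As an illustration of the B/I/O scheme described in this note, the following minimal sketch shows how per-token tags could be decoded back into dataset mention strings. It is not taken from the authors' released code: the function name `extract_mentions`, the bare tag strings "B", "I", and "O", and the example sentence with its dataset names are assumptions made for illustration only.

```python
from typing import List


def extract_mentions(tokens: List[str], tags: List[str]) -> List[str]:
    """Decode B/I/O tags into mention strings (hypothetical helper).

    A mention starts at a "B" tag and extends over the "I" tags that
    directly follow it. An "I" with no open mention is ignored.
    """
    mentions: List[str] = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            # A new mention begins; close any mention that is still open.
            if current:
                mentions.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            # Continue the mention that is currently open.
            current.append(token)
        else:
            # "O" or a stray "I": close the open mention, if any.
            if current:
                mentions.append(" ".join(current))
            current = []
    if current:
        mentions.append(" ".join(current))
    return mentions


# Illustrative sentence; the dataset names are made up for this example.
tokens = ["We", "use", "the", "Example", "Health", "Survey", "and", "XYZ", "data", "."]
tags = ["O", "O", "O", "B", "I", "I", "O", "B", "O", "O"]
print(extract_mentions(tokens, tags))  # ['Example Health Survey', 'XYZ']
```

In a tagger of the kind described in the abstract, the `tags` list would be the per-token output of the entity recognition model; here it is written by hand.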

Acknowledgement

Sandeep Kumar acknowledges the Prime Minister Research Fellowship (PMRF) program of the Government of India for its support. Asif Ekbal is a recipient of the Visvesvaraya Young Faculty Award and acknowledges Digital India Corporation, Ministry of Electronics and Information Technology, Government of India for supporting this research.

Author information

Corresponding author

Correspondence to Sandeep Kumar.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Kumar, S., Ghosal, T., Ekbal, A. (2021). DataQuest: An Approach to Automatically Extract Dataset Mentions from Scientific Papers. In: Ke, HR., Lee, C.S., Sugiyama, K. (eds) Towards Open and Trustworthy Digital Societies. ICADL 2021. Lecture Notes in Computer Science, vol 13133. Springer, Cham. https://doi.org/10.1007/978-3-030-91669-5_4

  • DOI: https://doi.org/10.1007/978-3-030-91669-5_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-91668-8

  • Online ISBN: 978-3-030-91669-5

  • eBook Packages: Computer Science, Computer Science (R0)
