A study of BERT-based methods for formal citation identification of scientific data

Yang, Ning; Zhang, Zhiqiang; Huang, Feihu

doi:10.1007/s11192-023-04833-z

Published: 16 September 2023

Volume 128, pages 5865–5881, (2023)

Scientometrics

892 Accesses
2 Citations

Abstract

Studying scientific data citation is crucial for promoting data sharing and underpins the measurement and analysis of scientific data. This requires identifying and labeling data reference information. Many supervised methods exist for entity recognition and relation extraction of diseases, drugs, proteins, symptoms, and similar entities, but their effectiveness for recognizing scientific data has not been examined. To fill this gap, this study examines how effectively classical machine learning models and deep learning models recognize scientific data citations. In the experiments, the study took the full text of scientific and technical papers as the research object and formed a dataset by classifying and annotating their references using rules and manual recognition. The empirical results showed that: (1) the methods used in this paper can automatically identify and extract data citations and can help automate the construction of citation relationships between scientific and technical literature and scientific data; (2) the BERT-based models were the most effective at recognizing scientific data citations, especially BioBERT and SciBERT; (3) full-text information has a crucial impact on the recognition results.
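The abstract mentions rule-based annotation of references but does not state the rules. As a rough illustration only, the sketch below flags reference strings that contain cues often associated with formal data citations. The cue patterns (a bracketed resource-type marker and a short list of repository names) and the function name `looks_like_data_citation` are assumptions made for this example, not the authors' rules.

```python
import re

# Hypothetical cue patterns (assumptions for illustration; the paper's
# actual annotation rules are not given in the abstract).
DATA_CUES = [
    # A bracketed resource-type marker such as "[dataset]" or "[database]".
    re.compile(r"\[(?:dataset|data set|database)\]", re.IGNORECASE),
    # The name of a general-purpose data repository appearing in the string.
    re.compile(r"\b(?:zenodo|figshare|dryad|pangaea)\b", re.IGNORECASE),
]


def looks_like_data_citation(reference: str) -> bool:
    """Return True if any cue pattern matches the reference string."""
    return any(pattern.search(reference) for pattern in DATA_CUES)


# Two real bibliographic strings: one points at a repository-hosted item,
# the other is an ordinary journal article.
examples = [
    "ESIP Data Preservation and Stewardship Committee. 2019. Data Citation "
    "Guidelines for Earth Science Data. Ver. 2. Earth Science Information "
    "Partners. https://doi.org/10.6084/m9.figshare.8441816",
    "Carletta, J. (1996). Assessing agreement on classification tasks: "
    "The kappa statistic. Computational Linguistics, 22(2), 249-254.",
]

flags = [looks_like_data_citation(ref) for ref in examples]
for ref, flag in zip(examples, flags):
    print(flag, ref[:60])
```

Surface rules like these can only pre-label candidates. A repository name in a DOI does not guarantee that the cited item is a dataset, which is presumably why the study pairs rules with manual recognition before training classifiers.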

Acknowledgements

We would like to thank the anonymous reviewers for their helpful suggestions.

Author information

Authors and Affiliations

Chengdu Library and Information Center, Chinese Academy of Sciences, Chengdu, 610041, People’s Republic of China
Ning Yang & Zhiqiang Zhang
Department of Library, Information and Archives Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing, 100190, People’s Republic of China
Ning Yang & Zhiqiang Zhang
College of Computer Science, Sichuan University, Chengdu, 610065, People’s Republic of China
Feihu Huang

Contributions

NY: Conceived and designed the analysis; Collected the data; Contributed data or analysis tools; Performed the analysis; Wrote the paper. ZZ: Conceived and designed the analysis; Wrote the paper. FH: Performed the analysis; Wrote the paper.

Corresponding author

Correspondence to Ning Yang.

Ethics declarations

Conflict of interest

The authors declare that they have no conflicts of interest related to this work.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Yang, N., Zhang, Z. & Huang, F. A study of BERT-based methods for formal citation identification of scientific data. Scientometrics 128, 5865–5881 (2023). https://doi.org/10.1007/s11192-023-04833-z

Received: 18 June 2021
Accepted: 06 September 2023
Published: 16 September 2023
Issue Date: November 2023
DOI: https://doi.org/10.1007/s11192-023-04833-z
