A study of BERT-based methods for formal citation identification of scientific data

Scientometrics

Abstract

Studying scientific data citation is crucial for promoting data sharing and underpins the measurement and analysis of scientific data. To this end, data reference information must be identified and labeled. Many supervised methods exist for entity recognition and relation extraction of diseases, drugs, proteins, symptoms, and similar entities, but their effectiveness for recognizing scientific data has not been examined. To fill this gap, this study evaluates classical machine learning models and deep learning models on the task of recognizing scientific data citations. In the experiments, the full text of scientific and technical papers served as the research object, and a dataset was formed by classifying and annotating the citations in their references using rules and manual recognition. The empirical results show that: (1) the methods used in this paper can automatically identify and extract data citations and can address the problem of automating the construction of citation relationships between scientific and technical literature and scientific data; (2) the BERT-based models are the most effective at recognizing scientific data citations, especially BioBERT and SciBERT; (3) full-text information has a crucial impact on the recognition results.

Acknowledgements

We would like to thank the anonymous reviewers for their helpful suggestions.

Author information

Contributions

NY: Conceived and designed the analysis; collected the data; contributed data or analysis tools; performed the analysis; wrote the paper. ZZ: Conceived and designed the analysis; wrote the paper. FH: Performed the analysis; wrote the paper.

Corresponding author

Correspondence to Ning Yang.

Ethics declarations

Conflict of interest

The authors declare that they have no conflicts of interest related to this work.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Yang, N., Zhang, Z. & Huang, F. A study of BERT-based methods for formal citation identification of scientific data. Scientometrics 128, 5865–5881 (2023). https://doi.org/10.1007/s11192-023-04833-z

  • DOI: https://doi.org/10.1007/s11192-023-04833-z
