Important citations identification with semi-supervised classification model

An, Xin; Sun, Xin; Xu, Shuo

doi:10.1007/s11192-021-04212-6

Published: 20 January 2022

Scientometrics, volume 127, pages 6533–6555 (2022)


Abstract

Given that citations are not equally important, various techniques based on supervised machine learning models have been proposed to identify important citations. However, only a small number of instances have been manually annotated with labels. To make full use of unlabeled instances and improve identification performance, this work applies the semi-supervised self-training technique to the identification of important citations. After six groups of features are engineered, the SVM and RF models are chosen as the base classifiers for the self-training strategy. Two experiments are then conducted on two different types of datasets. The experiment on the expert-labeled dataset from a single discipline shows that the semi-supervised versions of the SVM and RF models significantly improve on the performance of their conventional supervised versions when unannotated samples at the 75% and 95% confidence levels, respectively, are added to the training set. The AUC-PR and AUC-ROC of the SVM model are 0.8102 and 0.9622, and those of the RF model reach 0.9248 and 0.9841, outperforming both their supervised counterparts and the benchmark methods in the literature. This demonstrates the effectiveness of the semi-supervised self-training strategy for important citation identification. In the second experiment, on the author-labeled dataset from multiple disciplines, the semi-supervised models perform better than their supervised counterparts in terms of AUC-PR when the ratio of labeled instances is less than 20%. Compared with the first experiment, the insufficient number of instances from each discipline in the second experiment leaves the performance of the models unsatisfactory.
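The self-training loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: to stay dependency-free, a toy nearest-centroid classifier stands in for the SVM and RF base classifiers, the two-dimensional synthetic data stands in for the six groups of citation features, and all names (`fit_centroids`, `predict_proba`, `self_train`, `threshold`, `max_rounds`) are hypothetical. Only the overall idea, adding unlabeled instances whose predicted confidence reaches a chosen level back to the training set and refitting, comes from the text.

```python
import math
import random


def fit_centroids(X, y):
    """Toy base classifier: the mean feature vector of each class (0 and 1).

    Assumes both classes occur in y.
    """
    centroids = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict_proba(centroids, x):
    """Pseudo-probability of class 1 from squared distances to both centroids."""
    d0 = sum((a - b) ** 2 for a, b in zip(x, centroids[0]))
    d1 = sum((a - b) ** 2 for a, b in zip(x, centroids[1]))
    z = max(-50.0, min(50.0, d1 - d0))  # clamp to keep exp() in range
    return 1.0 / (1.0 + math.exp(z))


def self_train(X_lab, y_lab, X_unlab, threshold=0.95, max_rounds=10):
    """Self-training: repeatedly pseudo-label confident unlabeled instances."""
    X_train, y_train = list(X_lab), list(y_lab)
    pool = list(X_unlab)
    for _ in range(max_rounds):
        model = fit_centroids(X_train, y_train)
        confident, remaining = [], []
        for x in pool:
            p1 = predict_proba(model, x)
            if max(p1, 1.0 - p1) >= threshold:
                confident.append((x, 1 if p1 >= 0.5 else 0))
            else:
                remaining.append(x)
        if not confident:  # nothing passes the confidence level: stop early
            break
        for x, label in confident:
            X_train.append(x)
            y_train.append(label)
        pool = remaining
    # Refit once more so the returned model reflects every pseudo-label added.
    return fit_centroids(X_train, y_train), len(X_train) - len(X_lab)


def make_cluster(center, n, rng):
    """Draw n synthetic 2-D points around center (illustrative data only)."""
    return [[rng.gauss(center[0], 0.5), rng.gauss(center[1], 0.5)] for _ in range(n)]


rng = random.Random(0)
X_lab = make_cluster((0.0, 0.0), 3, rng) + make_cluster((3.0, 3.0), 3, rng)
y_lab = [0] * 3 + [1] * 3
X_unlab = make_cluster((0.0, 0.0), 50, rng) + make_cluster((3.0, 3.0), 50, rng)
y_true = [0] * 50 + [1] * 50  # held back; used only to score the final model

model, n_added = self_train(X_lab, y_lab, X_unlab, threshold=0.95)
predictions = [1 if predict_proba(model, x) >= 0.5 else 0 for x in X_unlab]
accuracy = sum(p == t for p, t in zip(predictions, y_true)) / len(y_true)
print(f"pseudo-labeled instances added: {n_added}, accuracy: {accuracy:.2f}")
```

Lowering `threshold` (for example to 0.75) admits more pseudo-labeled instances per round at the cost of noisier labels, which is the trade-off behind the different confidence levels reported for the two base classifiers.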

References

Abu-Jbara, A., Ezra, J., & Radev, D. (2013). Purpose and polarity of citation: Towards NLP-based bibliometrics. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 596–606).
Aljuaid, H., Iftikhar, R., Ahmad, S., Asif, M., & Afzal, M. T. (2021). Important citation identification using sentiment analysis of in-text citations. Telematics and Informatics, 56, 101492.
An, X., Sun, X., & Xu, S. (2021b). Important citations identification with semi-supervised classification model. The first Workshop on AI + Informetrics at the iConference 2021.
An, X., Sun, X., Xu, S., Hao, L., & Li, J. (2021a). Important citations identification by exploiting generative model into discriminative model. Journal of Information Science. https://doi.org/10.1177/0165551521991034
Bennett, K., & Demiriz, A. (1999). Semi-supervised support vector machines. Advances in Neural Information Processing Systems, 368–374.
Blum, A., & Mitchell, T. (1998). Combining labeled and unlabeled data with co-training. In Proceedings of the 11th Annual Conference on Computational Learning Theory (pp. 92–100).
Blum, A., & Chawla, S. (2001). Learning from labeled and unlabeled data using graph mincuts. In Proceedings of the 18th International Conference on Machine Learning (pp. 19–26).
Chapelle, O., Sindhwani, V., & Keerthi, S. S. (2008). Optimization techniques for semi-supervised support vector machines. Journal of Machine Learning Research, 9(2), 203–233.
Councill, I. G., Giles, C. L., & Kan, M. Y. (2008). ParsCit: An open-source CRF reference string parsing package. In Proceedings of the 6th International Conference on Language Resources and Evaluation (pp. 661–667).
Davis, J., & Goadrich, M. (2006). The relationship between Precision-Recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning (pp. 233–240).
Dietz, L., Bickel, S., & Scheffer, T. (2007). Unsupervised prediction of citation influences. In Proceedings of the 24th International Conference on Machine Learning (pp. 233–240). ACM.
Dong, C., & Schäfer, U. (2011). Ensemble-style self-training on citation classification. In Proceedings of 5th International Joint Conference on Natural Language Processing (pp. 623–631).
Garfield, E. (1965). Can citation indexing be automated. In Proceedings of the Symposium on Statistical Association Methods for Mechanized Documentation (pp. 189–192).
Garfield, E. (2006). Citation indexes for science. A new dimension in documentation through association of ideas. International Journal of Epidemiology, 35(5), 1123–1127.
Hassan, S. U., Akram, A., & Haddawy, P. (2017). Identifying important citations using contextual information from full text. In Proceedings of the 17th ACM/IEEE Joint Conference on Digital Libraries (JCDL) (pp. 1–8). IEEE.
Hassan, S. U., Imran, M., Iqbal, S., Aljohani, N. R., & Nawaz, R. (2018a). Deep context of citations using machine-learning models in scholarly full-text articles. Scientometrics, 117(3), 1645–1662.
Hassan, S. U., Safder, I., Akram, A., & Kamiran, F. (2018b). A novel machine-learning approach to measuring scientific knowledge flows using citation context analysis. Scientometrics, 116(2), 973–996.
He, Y., & Zhou, D. (2011). Self-training from labeled features for sentiment analysis. Information Processing and Management, 47(4), 606–616.
Hirsch, J. E. (2005). An index to quantify an individual’s scientific research output. Proceedings of the National Academy of Sciences, 102(46), 16569–16572.
Iqbal, S., Hassan, S. U., Aljohani, N. R., Alelyani, S., Nawaz, R., & Bornmann, L. (2021). A decade of in-text citation analysis based on natural language processing and machine learning techniques: An overview of empirical studies. Scientometrics, 126(8), 6551–6599.
Joachims, T. (1999). Transductive inference for text classification using support vector machines. In Proceedings of the 16th International Conference on Machine Learning (pp. 200–209).
Lazaridis, T. (2010). Ranking university departments using the mean h-index. Scientometrics, 82(2), 211–216.
Li, X., He, Y., Meyers, A., & Grishman, R. (2013, September). Towards fine-grained citation function classification. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (pp. 402–407).
Li, Y., Guan, C., Li, H., & Chin, Z. (2008). A self-training semi-supervised SVM algorithm and its application in an EEG-based brain computer interface speller system. Pattern Recognition Letters, 29(9), 1285–1294.
Qayyum, F., & Afzal, M. T. (2019). Identification of important citations by exploiting research articles’ metadata and cue-terms from content. Scientometrics, 118(1), 21–43.
Radoulov, R. (2008). Exploring automatic citation classification. Master's thesis, University of Waterloo.
Rosenberg, C., Hebert, M., & Schneiderman, H. (2005). Semi-supervised self-training of object detection models. In Proceedings of the 7th IEEE Workshop on Applications of Computer Vision (pp. 29–36).
Tanha, J., van Someren, M., & Afsarmanesh, H. (2017). Semi-supervised self-training for decision tree classifiers. International Journal of Machine Learning and Cybernetics, 8(1), 355–370.
Teufel, S., Siddharthan, A., & Tidhar, D. (2006). Automatic classification of citation function. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (pp. 103–110).
Valenzuela, M., Ha, V., & Etzioni, O. (2015). Identifying meaningful citations. In The 2015 AAAI Workshop on Scholarly Big Data: AI Perspectives, Challenges, and Ideas (pp. 21–26).
Van Engelen, J. E., & Hoos, H. H. (2020). A survey on semi-supervised learning. Machine Learning, 109(2), 373–440.
Vapnik, V. (1998). Statistical learning theory. Springer.
Wang, B., Spencer, B., Ling, C. X., & Zhang, H. (2008). Semi-supervised self-training for sentence subjectivity classification. In Proceedings of the 21st Conference of the Canadian Society for Computational Studies of Intelligence (pp. 344–355). Springer, Berlin, Heidelberg.
Wang, M., Zhang, J., Jiao, S., Zhang, X., Zhu, N., & Chen, G. (2020). Important citation identification by exploiting the syntactic and contextual information of citations. Scientometrics, 125(3), 2109–2129.
Xu, S., Ma, F., & Tao, L. (2007). Learn from the information contained in the false splice sites as well as in the true splice sites using SVM. In Proceedings of the International Conference on Intelligent Systems and Knowledge Engineering (pp. 65–71). Atlantis Press.
Xu, S., An, X., Qiao, X., Zhu, L., & Li, L. (2011). Semi-supervised least-squares support vector regression machines. Journal of Information and Computational Science, 8(6), 885–892.
Xu, S., Hao, L., An, X., Yang, G., & Wang, F. (2019). Emerging research topics detection with multiple machine learning models. Journal of Informetrics, 13(4), 100983.
Yarowsky, D. (1995). Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (pp. 189–196).
Zeng, T., & Acuna, D. E. (2020). Modeling citation worthiness by using attention-based bidirectional long short-term memory networks and interpretable models. Scientometrics, 124(1), 399–428.
Zhang, F., Pan, T., & Wang, B. (2021). Semi-supervised object detection with adaptive class-rebalancing self-training. arXiv preprint. arXiv:2107.05031.
Zhu, X., Ghahramani, Z., & Lafferty, J. D. (2003). Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the 20th International Conference on Machine Learning (pp. 912–919).
Zhu, X., Lafferty, J., & Rosenfeld, R. (2005). Semi-supervised learning with graphs. Doctoral dissertation. Carnegie Mellon University.
Zhu, X. J. (2008). Semi-supervised learning literature survey. Technical Report. University of Wisconsin-Madison.
Zhu, X., Turney, P., Lemire, D., & Vellino, A. (2015). Measuring academic influence: Not all citations are equal. Journal of the Association for Information Science and Technology, 66(2), 408–427.

Acknowledgements

The present study is an extended version of an article (An et al., 2021b) presented at the first Workshop on AI + Informetrics at the iConference 2021 on 17 March 2021. This research received financial support from the National Natural Science Foundation of China under grant numbers 72004012 and 72074014.

Author information

Authors and Affiliations

School of Economics and Management, Beijing Forestry University, Beijing, 100083, People’s Republic of China
Xin An
Institute of Scientific and Technical Information of China, Beijing, 100038, People’s Republic of China
Xin Sun
College of Economics and Management, Beijing University of Technology, Beijing, 100124, People’s Republic of China
Shuo Xu

Corresponding author

Correspondence to Xin An.

About this article

Cite this article

An, X., Sun, X. & Xu, S. Important citations identification with semi-supervised classification model. Scientometrics 127, 6533–6555 (2022). https://doi.org/10.1007/s11192-021-04212-6

Received: 19 June 2021
Accepted: 09 November 2021
Published: 20 January 2022
Issue Date: November 2022
DOI: https://doi.org/10.1007/s11192-021-04212-6
