Automatic noise reduction of domain-specific bibliographic datasets using positive-unlabeled learning

Chen, Guo; Chen, Jing; Shao, Yu; Xiao, Lu

doi:10.1007/s11192-022-04598-x

Published: 17 December 2022

Scientometrics, volume 128, pages 1187–1204 (2023)

Guo Chen¹,
Jing Chen¹,
Yu Shao² &
…
Lu Xiao ORCID: orcid.org/0000-0001-5485-1407³

Abstract

Constructing a bibliographic dataset is fundamental to domain analysis in bibliometric research. However, irrelevant documents (so-called “impurities”) in the initial domain dataset are inevitable and difficult to identify, and eliminating them requires considerable human effort. To solve this problem, we propose a weakly supervised noise reduction approach based on the Positive-Unlabeled Learning (PU-Learning) algorithm to clean the initial bibliographic dataset automatically. The basic idea is to use a batch of “absolutely positive sample sets” already available in the dataset to obtain a collection of “reliable negative sample sets,” from which a training set can be constructed for downstream supervised classification. We conducted a comparative experiment using reports in the Artificial Intelligence (AI) domain of the US National Technical Reports Library (NTIS) as an example. We compared schemes with different variables to explore the influence of various technical aspects on the final noise reduction performance. Our approach achieved significant improvements over the similarity-based unsupervised baseline: recall rose from 0.3742 to 0.8103, and precision rose from 0.6621 to 0.7383. We found that the choice of document representation algorithm is crucial, whereas the classification strategy and s_ratio in PU-Learning are not. Our approach requires no manually annotated data and can thus help bibliometric researchers construct high-quality bibliographic datasets.
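The two-step idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the cosine-similarity ranking, the nearest-centroid classifier, the toy two-dimensional document vectors, and the `neg_fraction` parameter are all assumptions made for illustration, and `neg_fraction` is not necessarily the paper's s_ratio.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def pu_denoise(positives, unlabeled, neg_fraction=0.3):
    """Flag each unlabeled document as in-domain (True) or noise (False).

    neg_fraction is a hypothetical knob: the share of unlabeled documents
    treated as reliable negatives in step 1.
    """
    # Step 1: rank unlabeled documents by similarity to the positive
    # centroid and take the least similar ones as reliable negatives.
    pos_center = centroid(positives)
    ranked = sorted(range(len(unlabeled)),
                    key=lambda i: cosine(unlabeled[i], pos_center))
    n_neg = max(1, int(len(unlabeled) * neg_fraction))
    reliable_negatives = [unlabeled[i] for i in ranked[:n_neg]]

    # Step 2: train a supervised classifier on positives vs. reliable
    # negatives. A nearest-centroid rule stands in for a real classifier.
    neg_center = centroid(reliable_negatives)
    return [cosine(doc, pos_center) >= cosine(doc, neg_center)
            for doc in unlabeled]


# Toy document vectors (assumed): the first axis loosely means "on-topic".
positives = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0]]
unlabeled = [[0.95, 0.1], [0.8, 0.3], [0.1, 1.0], [0.0, 0.9], [0.2, 0.8]]

kept = pu_denoise(positives, unlabeled, neg_fraction=0.4)
for doc, keep in zip(unlabeled, kept):
    print(doc, "keep" if keep else "drop")
```

In practice the document vectors would come from a representation model, which the abstract identifies as the decisive factor, and the nearest-centroid rule would be replaced by a trained classifier.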

Acknowledgements

This study is supported by the Humanities and Social Sciences Youth Foundation, the Ministry of Education of the People’s Republic of China (Grant No. 21YJC870003), and the Social Science Foundation of Jiangsu Province (Grant No. 21TQC002).

Author information

Authors and Affiliations

Department of Information Management, Nanjing University of Science and Technology, Xiaolingwei St 200, Nanjing, 210094, China
Guo Chen & Jing Chen
Information Centre, Northwest Engineering Corporation Limited, East Zhangba Rd 18, Xian, 710065, China
Yu Shao
School of Journalism, Nanjing University of Finance and Economics, Wenyuan Rd 3, Nanjing, 210023, China
Lu Xiao

Corresponding author

Correspondence to Lu Xiao.

Ethics declarations

Conflict of interest

All authors declare that they have no conflict of interest.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Chen, G., Chen, J., Shao, Y. et al. Automatic noise reduction of domain-specific bibliographic datasets using positive-unlabeled learning. Scientometrics 128, 1187–1204 (2023). https://doi.org/10.1007/s11192-022-04598-x

Received: 02 November 2021
Accepted: 23 November 2022
Published: 17 December 2022
Issue Date: February 2023
DOI: https://doi.org/10.1007/s11192-022-04598-x
