Abstract
Constructing a bibliographic dataset is fundamental to domain analysis in bibliometric research. However, irrelevant documents (so-called “impurities”) in the initial domain dataset are inevitable and difficult to identify, and eliminating them requires considerable human effort. To solve this problem, we propose a weakly supervised noise reduction approach based on the Positive-Unlabeled Learning (PU-Learning) algorithm to clean the initial bibliographic dataset automatically. The basic idea is to use a set of “absolutely positive” samples already available in the dataset to obtain a set of “reliable negative” samples, from which a training set can be constructed for downstream supervised classification. We conducted a comparative experiment using reports in the Artificial Intelligence (AI) domain from the US National Technical Reports Library (NTIS) as an example. We compared schemes with different variables to explore the influence of various technical aspects on the final noise reduction performance. Our approach achieved significant improvements over the similarity-based unsupervised baseline: recall rose from 0.3742 to 0.8103, and precision rose from 0.6621 to 0.7383. We found that the choice of document representation algorithm is crucial, whereas the classification strategy and s_ratio in PU-Learning are not. Our approach requires no manually annotated data and can therefore help bibliometric researchers construct high-quality bibliographic datasets.
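The two-step idea described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the toy 2-D vectors stand in for real document representations, cosine similarity to the positive centroid is one common heuristic for picking reliable negatives, the hypothetical `neg_fraction` parameter controls how many are picked, and a nearest-centroid rule stands in for whatever supervised classifier is actually used.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def reliable_negatives(positives, unlabeled, neg_fraction=0.5):
    """Step 1: treat the unlabeled documents least similar to the
    positive centroid as reliable negatives (illustrative heuristic)."""
    pos_center = centroid(positives)
    ranked = sorted(range(len(unlabeled)),
                    key=lambda i: cosine(unlabeled[i], pos_center))
    k = max(1, int(len(unlabeled) * neg_fraction))
    return sorted(ranked[:k])

def pu_clean(positives, unlabeled, neg_fraction=0.5):
    """Step 2: build a training set from positives and reliable negatives,
    then keep each unlabeled document that is closer to the positive class.
    A nearest-centroid rule is used here as a stand-in classifier."""
    neg_idx = reliable_negatives(positives, unlabeled, neg_fraction)
    pos_center = centroid(positives)
    neg_center = centroid([unlabeled[i] for i in neg_idx])
    keep = [i for i, vec in enumerate(unlabeled)
            if cosine(vec, pos_center) >= cosine(vec, neg_center)]
    return keep, neg_idx

# Toy data (hypothetical): the first axis loosely means "on-topic".
positives = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0]]
unlabeled = [[0.8, 0.3], [0.1, 1.0], [0.95, 0.05],
             [0.0, 0.9], [0.2, 0.8], [0.7, 0.4]]

keep, neg_idx = pu_clean(positives, unlabeled)
print("reliable negatives:", neg_idx)
print("documents kept as relevant:", keep)
```

In this toy run the three off-topic vectors are selected as reliable negatives and the remaining three documents are kept. With real data, the quality of the document vectors determines how well the first step separates the classes, which is consistent with the abstract's finding that document representation matters most.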
Acknowledgements
This study is supported by the Humanities and Social Sciences Youth Foundation, the Ministry of Education of the People’s Republic of China (Grant No. 21YJC870003), and the Social Science Foundation of Jiangsu Province (Grant No. 21TQC002).
Ethics declarations
Conflict of interest
All authors declare that they have no conflict of interest.
About this article
Cite this article
Chen, G., Chen, J., Shao, Y. et al. Automatic noise reduction of domain-specific bibliographic datasets using positive-unlabeled learning. Scientometrics 128, 1187–1204 (2023). https://doi.org/10.1007/s11192-022-04598-x
DOI: https://doi.org/10.1007/s11192-022-04598-x