Automatic noise reduction of domain-specific bibliographic datasets using positive-unlabeled learning

Scientometrics

Abstract

Constructing a bibliographic dataset is fundamental for domain analysis in bibliometric research. However, irrelevant documents (so-called “impurities”) in the initial domain dataset are inevitable and difficult to identify, and eliminating them requires considerable human effort. To solve this problem, we propose a weakly supervised noise reduction approach based on the Positive-Unlabeled Learning (PU-Learning) algorithm to clean the initial bibliographic dataset automatically. The basic idea is to use a batch of “absolutely positive” samples already available in the dataset to obtain a collection of “reliable negative” samples; together these form a training set for downstream supervised classification. We conducted a comparative experiment using reports in the Artificial Intelligence (AI) domain from the US National Technical Reports Library (NTIS) as an example. We compared schemes with different variables to explore how various technical aspects influence the final noise reduction performance. Our approach substantially outperformed the similarity-based unsupervised baseline: recall rose from 0.3742 to 0.8103, and precision rose from 0.6621 to 0.7383. We found that the choice of document representation algorithm has a crucial impact on performance, whereas the classification strategy and s_ratio in PU-Learning do not. Our approach needs no manually annotated data and can thus substantially help bibliometric researchers construct high-quality bibliographic datasets.
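To make the two-step idea in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes bag-of-words vectors as the document representation, cosine similarity to the positive centroid for picking reliable negatives, and a nearest-centroid rule as the downstream classifier. All function names, the sample texts, and the `ratio` parameter are hypothetical; `ratio` is not claimed to correspond to the paper's s_ratio.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts (a stand-in for any document representation)."""
    return Counter(text.lower().split())


def centroid(vectors):
    """Average a list of sparse term-count vectors."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {term: count / len(vectors) for term, count in total.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def reliable_negatives(positives, unlabeled, ratio):
    """Step 1: pick the share `ratio` of unlabeled docs least similar to the positives."""
    pos_centroid = centroid(positives)
    ranked = sorted(range(len(unlabeled)),
                    key=lambda i: cosine(unlabeled[i], pos_centroid))
    n_neg = max(1, int(len(unlabeled) * ratio))
    return ranked[:n_neg]


def pu_clean(positive_texts, unlabeled_texts, ratio=0.3):
    """Step 2: label each unlabeled doc by its nearer class centroid (True = keep)."""
    positives = [vectorize(t) for t in positive_texts]
    unlabeled = [vectorize(t) for t in unlabeled_texts]
    neg_idx = reliable_negatives(positives, unlabeled, ratio)
    pos_centroid = centroid(positives)
    neg_centroid = centroid([unlabeled[i] for i in neg_idx])
    return [cosine(v, pos_centroid) >= cosine(v, neg_centroid) for v in unlabeled]


def precision_recall(predicted, actual):
    """Precision and recall of the 'keep' decisions against known labels."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    precision = tp / sum(predicted) if any(predicted) else 0.0
    recall = tp / sum(actual) if any(actual) else 0.0
    return precision, recall


# Hypothetical toy data: two known-relevant documents and four unlabeled candidates.
seed_docs = ["neural network learning model",
             "machine learning algorithm for expert system"]
candidates = ["deep neural network for image learning",
              "soil erosion in river basins",
              "expert system with machine reasoning",
              "corrosion of steel pipes in soil"]
kept = pu_clean(seed_docs, candidates, ratio=0.5)
print(kept)  # [True, False, True, False]
```

On this toy input the two soil-related documents are taken as reliable negatives and flagged as impurities, while the two AI-related documents are kept. In practice the representation and the classifier would be swapped for stronger components; the abstract indicates that the representation is the choice that matters most.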



Acknowledgements

This study is supported by the Humanities and Social Sciences Youth Foundation, the Ministry of Education of the People’s Republic of China (Grant No. 21YJC870003), and the Social Science Foundation of Jiangsu Province (Grant No. 21TQC002).

Author information

Corresponding author

Correspondence to Lu Xiao.

Ethics declarations

Conflict of interest

All authors declare that they have no conflict of interest.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Chen, G., Chen, J., Shao, Y. et al. Automatic noise reduction of domain-specific bibliographic datasets using positive-unlabeled learning. Scientometrics 128, 1187–1204 (2023). https://doi.org/10.1007/s11192-022-04598-x

  • DOI: https://doi.org/10.1007/s11192-022-04598-x
