Abstract
Active learning (AL) has been widely used to address the shortage of labeled datasets. Yet most AL techniques require an initial set of labeled data as the knowledge base for active querying, and the informativeness of this initial labeled set strongly affects the subsequent queries and, hence, the performance of active learning. In this paper, a new clustering-based active learning framework, Active Learning using Clustering-based Sampling (ALCS), is proposed to simultaneously consider the representativeness and informativeness of samples without any prior label information. A density-based clustering approach is employed to explore the cluster structure of the data without exhaustive parameter tuning. A simple yet effective distance-based querying strategy adjusts the sampling weight between center-based and boundary-based selection. A novel bi-cluster boundary-based query procedure is introduced to select the most uncertain samples across the boundaries of adjacent clusters. Additionally, we developed an effective diversity exploration strategy to address redundancy among queried samples. Extensive experiments compare ALCS with state-of-the-art methods and show that it produces statistically better or comparable performance.
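The abstract outlines a pipeline that blends center-based (representative) and boundary-based (uncertain) selection under a sampling weight, followed by a diversity filter. The sketch below illustrates that general idea only; it is not the authors' implementation (which is linked under Code Availability). The mean-based cluster centers, the specific scoring formulas, the weight `alpha`, and the `min_sep` diversity threshold are all illustrative assumptions.

```python
import numpy as np


def cluster_centers(X, labels):
    # Illustrative stand-in for ALCS's density-based cluster centers:
    # here, simply the mean of each cluster.
    return np.array([X[labels == k].mean(axis=0) for k in np.unique(labels)])


def query_scores(X, labels, alpha):
    """Blend a center-based and a boundary-based score (assumed formulas).

    alpha = 1.0 favours representative samples close to their cluster center;
    alpha = 0.0 favours uncertain samples near a bi-cluster boundary, i.e.
    samples whose distances to the two nearest centers are nearly equal.
    """
    centers = cluster_centers(X, labels)
    # Pairwise distances from every sample to every center, shape (n, k).
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    d_sorted = np.sort(d, axis=1)
    center_score = -d_sorted[:, 0]                        # closer to own center is better
    boundary_score = -(d_sorted[:, 1] - d_sorted[:, 0])   # smaller gap = nearer a boundary
    return alpha * center_score + (1.0 - alpha) * boundary_score


def select_queries(X, labels, alpha, batch, min_sep):
    """Greedy batch selection with a simple diversity filter:
    skip any candidate within min_sep of an already chosen sample."""
    order = np.argsort(-query_scores(X, labels, alpha))
    chosen = []
    for i in order:
        if all(np.linalg.norm(X[i] - X[j]) >= min_sep for j in chosen):
            chosen.append(i)
        if len(chosen) == batch:
            break
    return chosen
```

With two well-separated clusters, `alpha=0.0` picks the sample sitting between them, while `alpha=1.0` picks the sample closest to a cluster center; raising `min_sep` spreads a batch of queries apart.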
Code Availability
The Python code was developed by the authors and is available at https://github.com/XuyangAbert/ALCS.
Acknowledgements
This paper is based on research sponsored by the Air Force Research Laboratory and the Office of the Secretary of Defense (OSD) under agreement number FA8750-15-2-0116. This work is also partially funded by the National Science Foundation (NSF) under grant number 2000320.
Funding
This study was funded by the Air Force Research Laboratory and the Office of the Secretary of Defense (OSD) under agreement number FA8750-15-2-0116. This work is also partially supported by the National Science Foundation under grant number 2000320.
Author information
Authors and Affiliations
Contributions
Xuyang Yan, Shabnam Nazmi, Biniam Gebru, and Mrinmoy Sarkar contributed to the conceptualization, methodology, software implementation and debugging, and validation. The first draft was prepared by Xuyang Yan, and all authors participated in editing the manuscript. Xuyang Yan and Dr. Abdollah Homaifar shaped the original idea of the new contribution for the first revision. Xuyang Yan, Mrinmoy Sarkar, and Kishor Datta Gupta implemented the new contribution and conducted the additional experiments. Drs. Mohd Anwar and Abdollah Homaifar suggested many important modifications to improve the overall writing quality and reorganization of the revised manuscript. This research was supervised by Professors Abdollah Homaifar and Mohd Anwar. The funding for this research was acquired by Dr. Abdollah Homaifar.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Availability of data and material
The data that support the findings of this study are available from the UCI machine learning repository https://archive.ics.uci.edu/ml/index.php.
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Yan, X., Nazmi, S., Gebru, B. et al. A clustering-based active learning method to query informative and representative samples. Appl Intell 52, 13250–13267 (2022). https://doi.org/10.1007/s10489-021-03139-y