Abstract
Active learning (AL) has been widely used to address the shortage of labeled datasets. Yet most AL techniques require an initial set of labeled data as the knowledge base for active querying, and the informativeness of this initial labeled set strongly affects the subsequent queries and, hence, the performance of active learning. In this paper, a new clustering-based active learning framework, Active Learning using Clustering-based Sampling (ALCS), is proposed to simultaneously consider the representativeness and informativeness of samples without any prior label information. A density-based clustering approach is employed to explore the cluster structure of the data without exhaustive parameter tuning. A simple yet effective distance-based querying strategy adjusts the sampling weight between center-based and boundary-based selection. A novel bi-cluster boundary-based query procedure is introduced to select the most uncertain samples across the boundaries of adjacent clusters. Additionally, we developed an effective diversity exploration strategy to address redundancy among queried samples. Extensive experiments compare ALCS with state-of-the-art methods and show that it produces statistically better or comparable performance.
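The abstract outlines a pipeline that blends center-based (representative) and boundary-based (uncertain) selection under a sampling weight, followed by a diversity filter. The sketch below illustrates that general idea only; it is not the authors' implementation (which is linked under Code Availability). The mean-based cluster centers, the specific scoring formulas, the weight `alpha`, and the `min_sep` diversity threshold are all illustrative assumptions.

```python
import numpy as np


def cluster_centers(X, labels):
    # Illustrative stand-in for ALCS's density-based cluster centers:
    # here, simply the mean of each cluster.
    return np.array([X[labels == k].mean(axis=0) for k in np.unique(labels)])


def query_scores(X, labels, alpha):
    """Blend a center-based and a boundary-based score (assumed formulas).

    alpha = 1.0 favours representative samples close to their cluster center;
    alpha = 0.0 favours uncertain samples near a bi-cluster boundary, i.e.
    samples whose distances to the two nearest centers are nearly equal.
    """
    centers = cluster_centers(X, labels)
    # Pairwise distances from every sample to every center, shape (n, k).
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    d_sorted = np.sort(d, axis=1)
    center_score = -d_sorted[:, 0]                        # closer to own center is better
    boundary_score = -(d_sorted[:, 1] - d_sorted[:, 0])   # smaller gap = nearer a boundary
    return alpha * center_score + (1.0 - alpha) * boundary_score


def select_queries(X, labels, alpha, batch, min_sep):
    """Greedy batch selection with a simple diversity filter:
    skip any candidate within min_sep of an already chosen sample."""
    order = np.argsort(-query_scores(X, labels, alpha))
    chosen = []
    for i in order:
        if all(np.linalg.norm(X[i] - X[j]) >= min_sep for j in chosen):
            chosen.append(i)
        if len(chosen) == batch:
            break
    return chosen
```

With two well-separated clusters, `alpha=0.0` picks the sample sitting between them, while `alpha=1.0` picks the sample closest to a cluster center; raising `min_sep` spreads a batch of queries apart.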
Code Availability
The Python code was developed by the authors and is available at https://github.com/XuyangAbert/ALCS.
Acknowledgements
This paper is based on research sponsored by the Air Force Research Laboratory and the Office of the Secretary of Defense (OSD) under agreement number FA8750-15-2-0116. This work is also partially funded by the National Science Foundation (NSF) under grant number 2000320.
Funding
This study was funded by the Air Force Research Laboratory and the Office of the Secretary of Defense (OSD) under agreement number FA8750-15-2-0116. This work is also partially supported by the National Science Foundation under grant number 2000320.
Author information
Authors and Affiliations
Contributions
Xuyang Yan, Shabnam Nazmi, Biniam Gebru, and Mrinmoy Sarkar contributed to the conceptualization, methodology, software implementation and debugging, and validation. The first draft was prepared by Xuyang Yan, and all authors participated in editing the manuscript. Xuyang Yan and Dr. Abdollah Homaifar shaped the original idea of the new contribution for the first revision. Xuyang Yan, Mrinmoy Sarkar, and Kishor Datta Gupta implemented the new contribution and conducted the additional experiments. Drs. Mohd Anwar and Abdollah Homaifar suggested many important modifications to improve the overall writing quality and reorganization of the revised manuscript. This research was supervised by Professors Abdollah Homaifar and Mohd Anwar. The funding for this research was acquired by Dr. Abdollah Homaifar.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Availability of data and material
The data that support the findings of this study are available from the UCI machine learning repository https://archive.ics.uci.edu/ml/index.php.
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Yan, X., Nazmi, S., Gebru, B. et al. A clustering-based active learning method to query informative and representative samples. Appl Intell 52, 13250–13267 (2022). https://doi.org/10.1007/s10489-021-03139-y