
A clustering-based active learning method to query informative and representative samples

Published in Applied Intelligence

Abstract

Active learning (AL) has been widely used to address the shortage of labeled data. Yet most AL techniques require an initial set of labeled samples as the knowledge base for active querying. The informativeness of this initial labeled set significantly affects subsequent queries, and hence the performance of active learning. In this paper, a new clustering-based active learning framework, Active Learning using Clustering-based Sampling (ALCS), is proposed to simultaneously consider the representativeness and informativeness of samples without any prior label information. A density-based clustering approach is employed to explore the cluster structure of the data without requiring exhaustive parameter tuning. A simple yet effective distance-based querying strategy adjusts the sampling weight between center-based and boundary-based selection. A novel bi-cluster boundary-based query procedure selects the most uncertain samples across the boundaries between adjacent clusters. Additionally, an effective diversity exploration strategy is developed to reduce redundancy among the queried samples. Extensive experiments compare ALCS with state-of-the-art methods and show that ALCS achieves statistically better or comparable performance.
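The querying idea described in the abstract can be sketched in a few lines of Python. This is a minimal illustrative sketch, not the authors' implementation: it substitutes a plain k-means step for the paper's density-based clustering, and the scoring functions (`rep`, `info`), the blending weight `w`, and the diversity gap are assumptions made for illustration only.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Toy stand-in for the paper's density-based clustering step."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers

def alcs_query(X, k=2, n_query=5, w=0.5):
    """Rank unlabeled samples by a weighted blend of representativeness
    (closeness to a cluster center) and informativeness (closeness to the
    boundary between the two nearest clusters), then pick a diverse subset."""
    centers = kmeans(X, k)
    d = np.linalg.norm(X[:, None] - centers[None], axis=2)
    d.sort(axis=1)                               # d[:, 0] = nearest center
    rep = 1.0 / (1.0 + d[:, 0])                  # high near a cluster center
    info = 1.0 / (1.0 + d[:, 1] - d[:, 0])       # high near a bi-cluster boundary
    scores = w * rep + (1.0 - w) * info
    # Greedy diversity: skip candidates too close to already-queried samples.
    min_gap = np.median(d[:, 0])
    picked = []
    for i in np.argsort(-scores):
        if all(np.linalg.norm(X[i] - X[j]) > min_gap for j in picked):
            picked.append(i)
        if len(picked) == n_query:
            break
    return picked
```

Setting `w` close to 1 favors center-based (representative) selection, while `w` close to 0 favors boundary-based (uncertain) selection, mirroring the adjustable sampling weight described above.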




Code Availability

The Python code was developed by the authors and is available at https://github.com/XuyangAbert/ALCS.




Acknowledgements

This paper is based on research sponsored by the Air Force Research Laboratory and the Office of the Secretary of Defense (OSD) under agreement number FA8750-15-2-0116. This work is also partially funded by the National Science Foundation (NSF) under grant number 2000320.

Funding

This study was funded by the Air Force Research Laboratory and the Office of the Secretary of Defense (OSD) under agreement number FA8750-15-2-0116. This work is also partially supported by the National Science Foundation under grant number 2000320.

Author information


Contributions

Xuyang Yan, Shabnam Nazmi, Biniam Gebru, and Mrinmoy Sarkar contributed the conceptualization, methodology, software implementation and debugging, and validation. Xuyang Yan prepared the first draft, and all authors participated in editing the manuscript. Xuyang Yan and Dr. Abdollah Homaifar shaped the original idea of the new contribution for the first revision. Xuyang Yan, Mrinmoy Sarkar, and Kishor Datta Gupta implemented the new contribution and conducted the additional experiments. Drs. Mohd Anwar and Abdollah Homaifar suggested many important modifications to improve the overall writing quality and the reorganization of the revised manuscript. Professors Abdollah Homaifar and Mohd Anwar supervised this research. Dr. Abdollah Homaifar acquired the funding for this research.

Corresponding author

Correspondence to Abdollah Homaifar.

Ethics declarations

Conflict of Interests

The authors declare that they have no conflict of interest.

Additional information

Availability of data and material

The data that support the findings of this study are available from the UCI machine learning repository https://archive.ics.uci.edu/ml/index.php.

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Yan, X., Nazmi, S., Gebru, B. et al. A clustering-based active learning method to query informative and representative samples. Appl Intell 52, 13250–13267 (2022). https://doi.org/10.1007/s10489-021-03139-y

