Abstract
Classification of cancerous genes from microarray data is an important research area in bioinformatics. Large amounts of microarray data are available, but labeling them is very costly. This paper proposes an active learning model, a semi-supervised classification approach, for labeling microarray data so that predictions can be made even with a small amount of labeled data. Initially, a pool of unlabeled instances is given, from which some instances are randomly chosen for labeling. Selection algorithms then determine which instances from the unlabeled pool are labeled in each successive round. The proposed method follows an ensemble approach: it combines the decisions of three classifiers to arrive at a consensus that predicts the class label more accurately, while ensuring that each individual classifier learns in an uncorrelated manner. Our method combines the heuristic techniques used by an active learning algorithm to choose training samples with the multiple learning paradigm attained by an ensemble, optimizing the search space by choosing efficiently from an already sparse learning pool. Evaluated on 10 microarray datasets, the proposed method achieves performance comparable with state-of-the-art methods. The code and datasets are available at https://github.com/anuran-Chakraborty/Active-learning.
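The selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not state which three classifiers are used, how their weights are derived, or which selection heuristic is applied, so everything below is an assumption made for illustration: the classifier weights are fixed hypothetical values, the consensus is a weighted average of class probabilities, and the query rule is margin sampling (pick the instances whose two most probable classes are closest). The function names are hypothetical and are not taken from the authors' repository.

```python
import random


def weighted_consensus(probas, weights):
    """Weighted average of class probabilities across classifiers.

    probas[c][i][k]: probability from classifier c that instance i has class k.
    weights[c]: non-negative weight of classifier c (assumed, e.g. a
    validation accuracy; the abstract does not specify the weighting).
    """
    total = sum(weights)
    n_instances = len(probas[0])
    n_classes = len(probas[0][0])
    consensus = []
    for i in range(n_instances):
        row = [
            sum(w * p[i][k] for w, p in zip(weights, probas)) / total
            for k in range(n_classes)
        ]
        consensus.append(row)
    return consensus


def margin(row):
    """Gap between the two most probable classes; a small gap means uncertain."""
    top = sorted(row, reverse=True)
    return top[0] - top[1]


def select_queries(consensus, pool_indices, batch_size):
    """Return the pool indices with the least certain consensus (margin sampling)."""
    ranked = sorted(zip(pool_indices, consensus), key=lambda pair: margin(pair[1]))
    return [idx for idx, _ in ranked[:batch_size]]


def initial_seed(pool_indices, n_seed, rng):
    """Randomly choose the first instances to label, as in the initial round."""
    return rng.sample(pool_indices, n_seed)


# Toy example: three classifiers, three unlabeled instances, two classes.
probas = [
    [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]],
    [[0.8, 0.2], [0.4, 0.6], [0.3, 0.7]],
    [[0.7, 0.3], [0.5, 0.5], [0.1, 0.9]],
]
weights = [0.5, 0.3, 0.2]      # hypothetical classifier weights
pool_indices = [10, 11, 12]    # hypothetical ids of the unlabeled instances

consensus = weighted_consensus(probas, weights)
queries = select_queries(consensus, pool_indices, batch_size=1)
seed = initial_seed(list(range(20)), n_seed=5, rng=random.Random(0))
```

In this toy example the second instance has the narrowest margin between its two classes, so it is the one sent for labeling. In a full loop, the newly labeled instance would be moved from the pool to the training set and the classifiers retrained before the next round.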

Flowchart of the proposed ensemble-based active learning framework





Cite this article
De, R., Chakraborty, A., Chatterjee, A. et al. A weighted ensemble-based active learning model to label microarray data. Med Biol Eng Comput 58, 2427–2441 (2020). https://doi.org/10.1007/s11517-020-02238-1
DOI: https://doi.org/10.1007/s11517-020-02238-1