Abstract
Traditional classification algorithms work well on small-scale microarray datasets, but at large scale their memory and time costs exceed what general-purpose machines can support. In this paper, we design a new application framework that performs the computation as quickly as possible. First, the synthetic minority over-sampling technique (SMOTE) is used to over-sample the minority classes and obtain balanced data. Then, a large-scale algorithm for \(L_{2}\)-SVM based on the stochastic gradient descent method is proposed and used for microarray classification. We also give a simple proof of the convergence of the stochastic gradient descent algorithm. Next, various large-scale algorithms for support vector machines are run on the microarray datasets to identify the most appropriate one. Finally, a comparative analysis of loss functions is carried out to clarify their differences. The experimental results show that combining the stochastic gradient descent algorithm with the squared hinge loss is an attractive choice that can achieve high accuracy in seconds.
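As a rough illustration of the idea named in the abstract, the following sketch runs stochastic gradient descent on a linear SVM with the squared hinge loss. It is not the authors' algorithm, whose details are not given in this excerpt. The objective form \((\lambda/2)\lVert w\rVert^2 + \frac{1}{n}\sum_i \max(0, 1 - y_i w^\top x_i)^2\), the decaying step size, the parameter names `lam` and `eta0`, the helper names `sgd_l2svm` and `predict`, and the toy data are all assumptions made for this example.

```python
import random


def sgd_l2svm(X, y, lam=0.01, eta0=0.05, epochs=50, seed=0):
    """SGD on (lam/2)*||w||^2 + mean(max(0, 1 - y_i * w.x_i)^2).

    Hypothetical helper: the schedule and defaults are illustrative only.
    """
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    w = [0.0] * d
    t = 0
    for _ in range(epochs):
        order = list(range(n))
        rng.shuffle(order)  # visit samples in a fresh random order each epoch
        for i in order:
            eta = eta0 / (1.0 + lam * eta0 * t)  # assumed decaying step size
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            slack = max(0.0, 1.0 - margin)
            # Gradient of the sampled term: lam*w - 2*slack*y_i*x_i
            w = [wj - eta * (lam * wj - 2.0 * slack * y[i] * xj)
                 for wj, xj in zip(w, X[i])]
            t += 1
    return w


def predict(w, x):
    """Return +1 or -1 according to the sign of the linear score."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1


# Toy, linearly separable data; the last feature is a constant bias term.
X = [[2.0, 1.5, 1.0], [1.5, 2.5, 1.0], [2.5, 2.0, 1.0],
     [-2.0, -1.5, 1.0], [-1.5, -2.5, 1.0], [-2.5, -2.0, 1.0]]
y = [1, 1, 1, -1, -1, -1]

w = sgd_l2svm(X, y)
train_predictions = [predict(w, x) for x in X]
```

Each update touches a single sample, so memory use stays proportional to the number of features rather than the number of samples, which is the property that makes this family of methods attractive at large scale. In practice the class balancing step (SMOTE) would be applied to the training data before a routine like this is run.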
Acknowledgements
This work was partially supported by the National Natural Science Foundation of China (No. 62066001), the National Natural Science Youth Science Foundation of China (No. 61907012), the Natural Science Foundation of Ningxia (No. 2021AAC03230), and North Minzu University Major Special Projects (No. 201804). The authors are grateful to all the reviewers and the Editor-in-Chief for their insightful comments on this paper.
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
About this article
Cite this article
Li, B., Han, B. & Qin, C. Application of large-scale L2-SVM for microarray classification. J Supercomput 78, 2265–2286 (2022). https://doi.org/10.1007/s11227-021-03962-7
DOI: https://doi.org/10.1007/s11227-021-03962-7