Abstract
The relevance vector machine (RVM) is a machine learning algorithm based on sparse Bayesian theory that shows good classification performance on small-scale data sets. However, the high runtime complexity \(O\left( n^{3}\right) \) and space complexity \(O\left( n^{2}\right) \) of the RVM make it difficult to train models on medium- or large-scale data sets. Therefore, a distributed ensemble of relevance vector machines on the Spark framework (DE-RVM) is proposed. In this approach, a data set is divided into a number of disjoint subsets, and on each subset a set of RVM classifiers is trained with sample-type-based AdaBoostRVM (STAB-RVM), following the principle of ensemble learning. A final classifier is generated by combining the RVM classifiers with a diversity-measure-based combination method, whose weights are obtained by solving a quadratic programming problem that minimizes the empirical loss of the combined classifier. The algorithm was applied to both artificial and real data sets. The experimental results show that the proposed method offers good classification performance and effectively improves the ability of the RVM to process large-scale data sets.
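The combination step described above can be sketched as a small quadratic program: given the outputs of the base classifiers on a validation set, find convex combination weights that minimize the empirical squared loss. The sketch below is illustrative only, under assumed conventions (labels and base-classifier outputs in {-1, +1}, squared loss, a generic `scipy` QP solve); the function name `combine_classifiers` is hypothetical and does not come from the paper.

```python
import numpy as np
from scipy.optimize import minimize

def combine_classifiers(H, y):
    """Find convex combination weights for base classifier outputs H
    (n_samples x n_classifiers, entries in {-1, +1}) that minimize the
    empirical squared loss against labels y, via quadratic programming.
    Hypothetical helper illustrating the combination idea, not the
    paper's exact formulation."""
    n, m = H.shape

    def loss(w):
        # Empirical squared loss of the weighted ensemble output.
        return np.sum((H @ w - y) ** 2)

    # Convexity constraints: weights non-negative and summing to one.
    constraints = ({'type': 'eq', 'fun': lambda w: np.sum(w) - 1.0},)
    bounds = [(0.0, 1.0)] * m
    w0 = np.full(m, 1.0 / m)  # start from the uniform ensemble

    result = minimize(loss, w0, method='SLSQP',
                      bounds=bounds, constraints=constraints)
    return result.x

# Toy example: classifier 0 is always correct, classifier 1 always wrong,
# so the optimal weight vector concentrates on classifier 0.
y = np.array([1, -1, 1, -1, 1], dtype=float)
H = np.column_stack([y, -y])
w = combine_classifiers(H, y)
print(w.round(2))
```

In practice the paper's formulation also incorporates a diversity measure among the base RVMs; the sketch keeps only the empirical-loss term to show the structure of the QP.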




Acknowledgements
This work is supported by the National Natural Science Foundation of China under projects 61402345 and 61735013.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
About this article
Cite this article
Qin, W., Liu, F., Tong, M. et al. A distributed ensemble of relevance vector machines for large-scale data sets on Spark. Soft Comput 25, 7119–7130 (2021). https://doi.org/10.1007/s00500-021-05671-y