Abstract
The relevance vector machine (RVM) is a machine learning algorithm based on sparse Bayesian theory that shows good classification performance on small-scale data sets. However, the high runtime complexity \(O\left( n^{3}\right) \) and space complexity \(O\left( n^{2}\right) \) of the RVM make it difficult to train models on medium- or large-scale data sets. Therefore, a distributed ensemble of relevance vector machines on the Spark framework (DE-RVM) is proposed. In this approach, a data set is divided into a number of disjoint subsets, and on each subset a set of RVM classifiers is trained with sample-type-based AdaBoostRVM (STAB-RVM), following the principle of ensemble learning. A final classifier is generated by combining the RVM classifiers with a diversity-measure-based combination method, whose weights are obtained by solving a quadratic programming problem that minimizes the empirical loss of the combined classifier. The algorithm was applied to both artificial and real data sets. The experimental results show that the proposed method offers good classification performance and effectively improves the ability of the RVM to process large-scale data sets.
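The combination step described above can be sketched as a small quadratic program: given the outputs of the base classifiers on a validation set, find convex combination weights that minimize the empirical squared loss. The sketch below is illustrative only, under assumed conventions (labels and base-classifier outputs in {-1, +1}, squared loss, a generic `scipy` QP solve); the function name `combine_classifiers` is hypothetical and does not come from the paper.

```python
import numpy as np
from scipy.optimize import minimize

def combine_classifiers(H, y):
    """Find convex combination weights for base classifier outputs H
    (n_samples x n_classifiers, entries in {-1, +1}) that minimize the
    empirical squared loss against labels y, via quadratic programming.
    Hypothetical helper illustrating the combination idea, not the
    paper's exact formulation."""
    n, m = H.shape

    def loss(w):
        # Empirical squared loss of the weighted ensemble output.
        return np.sum((H @ w - y) ** 2)

    # Convexity constraints: weights non-negative and summing to one.
    constraints = ({'type': 'eq', 'fun': lambda w: np.sum(w) - 1.0},)
    bounds = [(0.0, 1.0)] * m
    w0 = np.full(m, 1.0 / m)  # start from the uniform ensemble

    result = minimize(loss, w0, method='SLSQP',
                      bounds=bounds, constraints=constraints)
    return result.x

# Toy example: classifier 0 is always correct, classifier 1 always wrong,
# so the optimal weight vector concentrates on classifier 0.
y = np.array([1, -1, 1, -1, 1], dtype=float)
H = np.column_stack([y, -y])
w = combine_classifiers(H, y)
print(w.round(2))
```

In practice the paper's formulation also incorporates a diversity measure among the base RVMs; the sketch keeps only the empirical-loss term to show the structure of the QP.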




Acknowledgements
This work is supported by the National Natural Science Foundation of China under projects 61402345 and 61735013.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
About this article
Cite this article
Qin, W., Liu, F., Tong, M. et al. A distributed ensemble of relevance vector machines for large-scale data sets on Spark. Soft Comput 25, 7119–7130 (2021). https://doi.org/10.1007/s00500-021-05671-y