Abstract
In this article, we use a Kullback-Leibler random sample partition data model to generate a set of disjoint data blocks, each of which is a good representation of the entire data set: every random sample partition (RSP) block has a sample distribution function similar to that of the whole data set. To measure the statistical similarity between blocks, kernel density estimation (KDE) with a dual-tree recursion data structure is first applied to quickly estimate the probability density of each block. Then, using the Kullback-Leibler (KL) divergence, we compute the statistical similarity between a randomly selected RSP block and each of the other RSP blocks. We rank the RSP blocks by their divergence values in descending order and choose the first ten for ensemble classification learning. Classification models are built in parallel on the selected RSP blocks, and the final ensemble classification model is obtained with a weighted voting strategy. In our experiments, XGBoost models were built on these ten blocks in parallel and ensembled incrementally according to their KL values. The test results show that our method increases the generalization capability of the ensemble classification model. It reduces model-building time in a parallel computing environment by using less than \(15\%\) of the entire data set, which also alleviates the memory constraints of big data analysis.
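As a concrete illustration of the block-selection step described above, the sketch below estimates each block's density with tree-accelerated KDE and ranks blocks by their KL divergence from a randomly chosen reference block. This is a minimal sketch under stated assumptions, not the paper's implementation: scikit-learn's KernelDensity with a KD tree stands in for the dual-tree recursion, the KL divergence is approximated by Monte Carlo over the reference block's own points, and all names (rank_blocks_by_kl, rsp_blocks, bandwidth) are illustrative.

```python
# Hypothetical sketch of the RSP block-ranking step; names are illustrative.
import numpy as np
from sklearn.neighbors import KernelDensity

def rank_blocks_by_kl(rsp_blocks, bandwidth=1.0, top_k=10, seed=0):
    """Rank RSP blocks by KL divergence from a random reference block."""
    rng = np.random.default_rng(seed)
    # One KDE per block; scikit-learn evaluates densities with a KD tree,
    # a single-tree stand-in for the paper's dual-tree recursion.
    kdes = [KernelDensity(kernel="gaussian", bandwidth=bandwidth,
                          algorithm="kd_tree").fit(block)
            for block in rsp_blocks]
    # Monte Carlo estimate of KL(p_ref || p_i) over the reference block's
    # own points: mean(log p_ref(x) - log p_i(x)) for x drawn from p_ref.
    ref = int(rng.integers(len(rsp_blocks)))
    x_ref = rsp_blocks[ref]
    log_p_ref = kdes[ref].score_samples(x_ref)  # log densities
    kl = np.array([np.mean(log_p_ref - kde.score_samples(x_ref))
                   for kde in kdes])
    # Following the abstract: sort by divergence in descending order and
    # keep the first top_k blocks, excluding the reference itself.
    order = np.argsort(-kl)
    selected = [i for i in order if i != ref][:top_k]
    return ref, selected, kl
```

The selected blocks, along with their KL values for use as voting weights, then feed the parallel training stage sketched after the author notes below.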
This work was supported by the National Natural Science Foundation of China (61836005) and the Opening Project of Shanghai Trusted Industrial Control Platform (TICPSH202003008-ZC).
C. Wei and J. Zhang are joint first authors.
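The final stage of the pipeline, parallel model building and weighted voting, could look like the sketch below: one XGBoost classifier per selected block, trained concurrently, with class-probability outputs combined under normalized weights. How the weights are derived from the KL values is an assumption here; the abstract states only that the models are combined by a weighted voting ensemble strategy. fit_block, train_ensemble, and weighted_vote are illustrative names.

```python
# Hypothetical sketch of parallel per-block training and weighted voting.
from concurrent.futures import ProcessPoolExecutor

import numpy as np
from xgboost import XGBClassifier

def fit_block(block):
    """Train one classifier on a single RSP block (X, y)."""
    X, y = block
    return XGBClassifier(n_estimators=100).fit(X, y)

def train_ensemble(blocks):
    """Fit one model per block in parallel (blocks: list of (X, y) pairs)."""
    # Run under `if __name__ == "__main__":` on platforms that spawn workers.
    with ProcessPoolExecutor() as pool:
        return list(pool.map(fit_block, blocks))

def weighted_vote(models, weights, X):
    """Soft weighted voting over each model's class probabilities."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalize so the weights form a convex combination
    proba = sum(wi * m.predict_proba(X) for wi, m in zip(w, models))
    return proba.argmax(axis=1)
```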
Notes
- 1. Available: http://hadoop.apache.org/.
- 2. Available: https://spark.apache.org/.
- 3.
- 4. Available: http://archive.ics.uci.edu/ml/datasets/HIGGS.
- 5. Available: https://xgboost.readthedocs.io/en/latest/.
References
Chen, B.W., Wen, J., Rho, S.: Divide-and-conquer signal processing, feature extraction, and machine learning for big data. Neurocomputing 174, 383 (2016)
Shvachko, K., Kuang, H., Radia, S., Chansler, R.: The Hadoop distributed file system. In: IEEE 26th Symposium on Mass Storage Systems and Technologies, pp. 1–10 (2010)
Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)
Elteir, M., Lin, H., Feng, W.C.: Enhancing MapReduce via asynchronous data processing. In: IEEE International Conference on Parallel and Distributed Systems, pp. 397–405 (2010)
Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. In: Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, p. 10 (2010)
Salloum, S., Dautov, R., Chen, X., Peng, P.X., Huang, J.Z.: Big data analytics on Apache Spark. Int. J. Data Sci. Anal. 1, 145–164 (2016)
Gu, L., Li, H.: Memory or time: performance evaluation for iterative operation on Hadoop and Spark. In: IEEE 10th International Conference on High Performance Computing and Communications, pp. 721–727 (2013)
Wu, X., Zhu, X., Wu, G.Q., Ding, W.: Data mining with big data. IEEE Trans. Knowl. Data Eng. 26(1), 97–107 (2014)
Salloum, S., Huang, J.Z., He, Y.L.: Random sample partition: a distributed data model for big data analysis. IEEE Trans. Ind. Inform. 15(11), 5846–5854 (2019)
Dong, X., Yu, Z., Cao, W., Shi, Y., Ma, Q.: A survey on ensemble learning. Front. Comput. Sci. 14(2), 241–258 (2020)
Galicia, A., Talavera-Llames, R., Troncoso, A., Koprinska, I., Martínez-Álvarez, F.: Multi-step forecasting for big data time series based on ensemble learning. Knowl.-Based Syst. 163, 830–841 (2018)
Tang, Y., Wang, Y., Cooper, K.M.L., Li, L.: Towards big data Bayesian network learning - an ensemble learning based approach. In: IEEE International Congress on Big Data, pp. 355–357 (2014)
Khalifa, S., Martin, P., Young, R.: Label-aware distributed ensemble learning: a simplified distributed classifier training model for big data. Big Data Res. 15, 1–11 (2019)
Marrón, D., Ayguadé, E., Herrero, J.R., Read, J., Bifet, A.: Low-latency multi-threaded ensemble learning for dynamic big data streams. In: IEEE International Conference on Big Data, pp. 223–232 (2017)
Salloum, S., Huang, J.Z., He, Y.L., Chen, X.J.: An asymptotic ensemble learning framework for big data analysis. IEEE Access 7, 3675–3693 (2019)
Zhou, Z.H., Wu, J.X., Tang, W.: Ensembling neural networks: many could be better than all. Artif. Intell. 137(1–2), 239–263 (2002)
Giacinto, G., Roli, F.: An approach to the automatic design of multiple classifier ensembles. Pattern Recognit. Lett. 22(1), 25–33 (2001)
Cheng, X.Y., Guo, H.L.: The technology of selective multiple classifiers ensemble based on kernel clustering. In: International Symposium on Intelligent Information Technology Application, pp. 146–150 (2008)
Martínez-Muñoz, G., Suárez, A.: Using boosting to prune bagging ensembles. Pattern Recognit. Lett. 28(1), 156–165 (2007)
Martínez-Muñoz, G., Suárez, A.: Pruning in ordered bagging ensembles. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 609–616 (2006)
Breiman, L.: Out-of-bag estimation. Technical report, Statistics Department, University of California, Berkeley (1996)
Zhang, L., Zhou, W.D.: Sparse ensembles using weighted combination methods based on linear programming. Pattern Recognit. 44(1), 97–106 (2011)
Fan, C.T., Muller, M.E., Rezucha, I.: Development of sampling plans by using sequential (item by item) selection techniques and digital computers. J. Am. Stat. Assoc. 57(298), 387–402 (1962)
Haas, P.J.: Data-stream sampling: basic techniques and results. In: Data Stream Management. Springer, Berlin, Heidelberg (2016). https://doi.org/10.1007/978-3-540-28608-0_2
Oliphant, T.E.: SciPy: open source scientific tools for Python. Comput. Sci. Eng. 9(3), 10–20 (2007)
Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Podgurski, A., Yang, C.: Partition testing, stratified sampling, and cluster analysis. ACM SIGSOFT Softw. Eng. Notes 18(5), 169–181 (1993)
Kleiner, A., Talwalkar, A., Sarkar, P., Jordan, M.I.: The big data bootstrap. In: Proceedings of the 29th International Conference on Machine Learning, pp. 1787–1794 (2012)
Rosenblatt, M.: Remarks on some nonparametric estimates of a density function. Ann. Math. Stat. 27, 832–837 (1956)
Parzen, E.: On estimation of a probability density function and mode. Ann. Math. Stat. 33, 1065–1076 (1962)
Chen, S., Hong, X., Harris, C.J.: Sparse kernel density construction using orthogonal forward regression with leave-one-out test score and local regularization. IEEE Trans. Syst. Man Cybern. Part B Cybern. 34(4), 1708–1717 (2004)
Friedman, J.H., Bentley, J.L., Finkel, R.A.: An algorithm for finding best matches in logarithmic expected time. ACM Trans. Math. Softw. 3(3), 209–226 (1977)
Gray, A.G., Moore, A.W.: 'N-Body' problems in statistical learning. In: Advances in Neural Information Processing Systems, pp. 521–527 (2001)
Kullback, S., Leibler, R.A.: On information and sufficiency. Ann. Math. Stat. 22(1), 79–86 (1951)
Cite this paper
Wei, C., Zhang, J., Valiullin, T., Cao, W., Wang, Q., Long, H. (2020). Distributed and Parallel Ensemble Classification for Big Data Based on Kullback-Leibler Random Sample Partition. In: Qiu, M. (ed.) Algorithms and Architectures for Parallel Processing. ICA3PP 2020. Lecture Notes in Computer Science, vol. 12452. Springer, Cham. https://doi.org/10.1007/978-3-030-60245-1_31