Abstract
In this article, we use a Kullback-Leibler random sample partition data model to generate a set of disjoint data blocks, each of which is a good representation of the entire data set: every random sample partition (RSP) block has a sample distribution function similar to that of the whole data set. To measure the statistical similarity between blocks, kernel density estimation (KDE) with a dual-tree recursion data structure is first applied to quickly estimate the probability density of each block. Then, using the Kullback-Leibler (KL) divergence, we compute the statistical similarity between a randomly selected RSP block and each of the other RSP blocks. We rank the RSP blocks by their divergence values in descending order and choose the first ten for ensemble classification learning. Classification models are built in parallel on the selected RSP blocks, and the final ensemble classification model is obtained with a weighted voting strategy. In our experiments, XGBoost models were built on these ten blocks in parallel and ensembled incrementally according to their KL values. The test results show that our method increases the generalization capability of the ensemble classification model. It reduces model-building time in a parallel computing environment by using less than \(15\%\) of the entire data set, which also alleviates the memory constraints of big data analysis.
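As a concrete illustration of the block-selection step described above, the sketch below estimates each block's density with tree-accelerated KDE and ranks blocks by their KL divergence from a randomly chosen reference block. This is a minimal sketch under stated assumptions, not the paper's implementation: scikit-learn's KernelDensity with a KD tree stands in for the dual-tree recursion, the KL divergence is approximated by Monte Carlo over the reference block's own points, and all names (rank_blocks_by_kl, rsp_blocks, bandwidth) are illustrative.

```python
# Hypothetical sketch of the RSP block-ranking step; names are illustrative.
import numpy as np
from sklearn.neighbors import KernelDensity

def rank_blocks_by_kl(rsp_blocks, bandwidth=1.0, top_k=10, seed=0):
    """Rank RSP blocks by KL divergence from a random reference block."""
    rng = np.random.default_rng(seed)
    # One KDE per block; scikit-learn evaluates densities with a KD tree,
    # a single-tree stand-in for the paper's dual-tree recursion.
    kdes = [KernelDensity(kernel="gaussian", bandwidth=bandwidth,
                          algorithm="kd_tree").fit(block)
            for block in rsp_blocks]
    # Monte Carlo estimate of KL(p_ref || p_i) over the reference block's
    # own points: mean(log p_ref(x) - log p_i(x)) for x drawn from p_ref.
    ref = int(rng.integers(len(rsp_blocks)))
    x_ref = rsp_blocks[ref]
    log_p_ref = kdes[ref].score_samples(x_ref)  # log densities
    kl = np.array([np.mean(log_p_ref - kde.score_samples(x_ref))
                   for kde in kdes])
    # Following the abstract: sort by divergence in descending order and
    # keep the first top_k blocks, excluding the reference itself.
    order = np.argsort(-kl)
    selected = [i for i in order if i != ref][:top_k]
    return ref, selected, kl
```

The selected blocks, along with their KL values for use as voting weights, then feed the parallel training stage sketched after the author notes below.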
This work was supported by the National Natural Science Foundation of China (61836005) and the Opening Project of Shanghai Trusted Industrial Control Platform (TICPSH202003008-ZC).
C. Wei and J. Zhang are joint first authors.
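The final stage of the pipeline, parallel model building and weighted voting, could look like the sketch below: one XGBoost classifier per selected block, trained concurrently, with class-probability outputs combined under normalized weights. How the weights are derived from the KL values is an assumption here; the abstract states only that the models are combined by a weighted voting ensemble strategy. fit_block, train_ensemble, and weighted_vote are illustrative names.

```python
# Hypothetical sketch of parallel per-block training and weighted voting.
from concurrent.futures import ProcessPoolExecutor

import numpy as np
from xgboost import XGBClassifier

def fit_block(block):
    """Train one classifier on a single RSP block (X, y)."""
    X, y = block
    return XGBClassifier(n_estimators=100).fit(X, y)

def train_ensemble(blocks):
    """Fit one model per block in parallel (blocks: list of (X, y) pairs)."""
    # Run under `if __name__ == "__main__":` on platforms that spawn workers.
    with ProcessPoolExecutor() as pool:
        return list(pool.map(fit_block, blocks))

def weighted_vote(models, weights, X):
    """Soft weighted voting over each model's class probabilities."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalize so the weights form a convex combination
    proba = sum(wi * m.predict_proba(X) for wi, m in zip(w, models))
    return proba.argmax(axis=1)
```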
Notes
- 1. Available: http://hadoop.apache.org/.
- 2. Available: https://spark.apache.org/.
- 3.
- 4. Available: http://archive.ics.uci.edu/ml/datasets/HIGGS.
- 5. Available: https://xgboost.readthedocs.io/en/latest/.
References
Chen, B.W., Wen, J., Rho, S.: Divide-and-conquer signal processing, feature extraction, and machine learning for big data. Neurocomputing 174, 383 (2016)
Shvachko, K., Kuang, H., Radia, S., Chansler, R.: The Hadoop distributed file system. In: IEEE 26th Symposium on Mass Storage Systems and Technologies, pp. 1–10 (2010)
Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)
Elteir, M., Lin, H., Feng, W.C.: Enhancing MapReduce via asynchronous data processing. In: IEEE International Conference on Parallel and Distributed Systems, pp. 397–405 (2010)
Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. In: Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, p. 10 (2010)
Salloum, S., Dautov, R., Chen, X., Peng, P.X., Huang, J.Z.: Big data analytics on Apache Spark. Int. J. Data Sci. Anal. 1, 145–164 (2016)
Gu, L., Li, H.: Memory or time: performance evaluation for iterative operation on Hadoop and Spark. In: IEEE 10th International Conference on High Performance Computing and Communications, pp. 721–727 (2013)
Wu, X., Zhu, X., Wu, G.Q., Ding, W.: Data mining with big data. IEEE Trans. Knowl. Data Eng. 26(1), 97–107 (2014)
Salloum, S., Huang, J.Z., He, Y.L.: Random sample partition: a distributed data model for big data analysis. IEEE Trans. Ind. Inform. 15(11), 5846–5854 (2019)
Dong, X., Yu, Z., Cao, W., Shi, Y., Ma, Q.: A survey on ensemble learning. Front. Comput. Sci. 14(2), 241–258 (2020)
Galicia, A., Talavera-Llames, R., Troncoso, A., Koprinska, I., Martínez-Álvarez, F.: Multi-step forecasting for big data time series based on ensemble learning. Knowl.-Based Syst. 163, 830–841 (2018)
Tang, Y., Wang, Y., Cooper, K.M.L., Li, L.: Towards big data Bayesian network learning - an ensemble learning based approach. In: IEEE International Congress on Big Data, pp. 355–357 (2014)
Khalifa, S., Martin, P., Young, R.: Label-aware distributed ensemble learning: a simplified distributed classifier training model for big data. Big Data Res. 15, 1–11 (2019)
Marrón, D., Ayguadé, E., Herrero, J.R., Read, J., Bifet, A.: Low-latency multi-threaded ensemble learning for dynamic big data streams. In: IEEE International Conference on Big Data, pp. 223–232 (2017)
Salloum, S., Huang, J.Z., He, Y.L., Chen, X.J.: An asymptotic ensemble learning framework for big data analysis. IEEE Access 7, 3675–3693 (2019)
Zhou, Z.H., Wu, J.X., Tang, W.: Ensembling neural networks: many could be better than all. Artif. Intell. 137(1–2), 239–263 (2002)
Giacinto, G., Roli, F.: An approach to the automatic design of multiple classifier ensembles. Pattern Recognit. Lett. 22(1), 25–33 (2001)
Cheng, X.Y., Guo, H.L.: The technology of selective multiple classifiers ensemble based on kernel clustering. In: International Symposium on Intelligent Information Technology Application, pp. 146–150 (2008)
Martínez-Muñoz, G., Suárez, A.: Using boosting to prune bagging ensembles. Pattern Recognit. Lett. 28(1), 156–165 (2007)
Martínez-Muñoz, G., Suárez, A.: Pruning in ordered bagging ensembles. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 609–616 (2006)
Breiman, L.: Out-of-bag estimation. Technical report, Statistics Department, University of California, Berkeley (1996)
Zhang, L., Zhou, W.D.: Sparse ensembles using weighted combination methods based on linear programming. Pattern Recognit. 44(1), 97–106 (2011)
Fan, C.T., Muller, M.E., Rezucha, I.: Development of sampling plans by using sequential (item by item) selection techniques and digital computers. J. Am. Stat. Assoc. 57(298), 387–402 (1962)
Haas, P.J.: Data-stream sampling: basic techniques and results. In: Data Stream Management. Springer, Berlin, Heidelberg (2016). https://doi.org/10.1007/978-3-540-28608-0_2
Oliphant, T.E.: SciPy: open source scientific tools for Python. Comput. Sci. Eng. 9(3), 10–20 (2007)
Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Podgurski, A., Yang, C.: Partition testing, stratified sampling, and cluster analysis. ACM SIGSOFT Softw. Eng. Notes 18(5), 169–181 (1993)
Kleiner, A., Talwalkar, A., Sarkar, P., Jordan, M.I.: The big data bootstrap. In: Proceedings of the 29th International Conference on Machine Learning, pp. 1787–1794 (2012)
Rosenblatt, M.: Remarks on some nonparametric estimates of a density function. Ann. Math. Stat. 27, 832–837 (1956)
Parzen, E.: On estimation of a probability density function and mode. Ann. Math. Stat. 33, 1065–1076 (1962)
Chen, S., Hong, X., Harris, C.J.: Sparse kernel density construction using orthogonal forward regression with leave-one-out test score and local regularization. IEEE Trans. Syst. Man Cybern. Part B Cybern. 34(4), 1708–1717 (2004)
Friedman, J.H., Bentley, J.L., Finkel, R.A.: An algorithm for finding best matches in logarithmic expected time. ACM Trans. Math. Softw. 3(3), 209–226 (1977)
Gray, A.G., Moore, A.W.: 'N-Body' problems in statistical learning. In: Advances in Neural Information Processing Systems, pp. 521–527 (2001)
Kullback, S., Leibler, R.A.: On information and sufficiency. Ann. Math. Stat. 22(1), 79–86 (1951)
Cite this paper
Wei, C., Zhang, J., Valiullin, T., Cao, W., Wang, Q., Long, H. (2020). Distributed and Parallel Ensemble Classification for Big Data Based on Kullback-Leibler Random Sample Partition. In: Qiu, M. (ed.) Algorithms and Architectures for Parallel Processing. ICA3PP 2020. Lecture Notes in Computer Science, vol. 12452. Springer, Cham. https://doi.org/10.1007/978-3-030-60245-1_31