Distributed and Parallel Ensemble Classification for Big Data Based on Kullback-Leibler Random Sample Partition

  • Conference paper
  • First Online:

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12452)

Abstract

In this article, we use a Kullback-Leibler random sample partition data model to generate a set of disjoint data blocks, where each block is a good representation of the entire data set. Every random sample partition (RSP) block has a sample distribution function similar to that of the entire data set. To measure the statistical similarity between blocks, Kernel Density Estimation (KDE) with a dual-tree recursion data structure is first applied to quickly estimate the probability density of each block. Then, based on the Kullback-Leibler (KL) divergence measure, we compute the statistical similarity between a randomly selected RSP data block and the other RSP data blocks. We rank the RSP data blocks by their divergence values in descending order and choose the first ten for ensemble classification learning. The classification models are built in parallel on the selected RSP data blocks, and the final ensemble classification model is obtained with a weighted voting ensemble strategy. In the experiments, we built XGBoost models on these ten blocks in parallel and incrementally ensembled them according to their KL values. The test classification results show that our method can increase the generalization capability of the ensemble classification model. It can reduce the model building time in a parallel computing environment by using less than 15% of the entire data set, which also alleviates the memory constraints of big data analysis.
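The abstract outlines a three-step pipeline: density estimation per RSP block, KL-divergence ranking against a randomly chosen reference block, and a weighted-voting XGBoost ensemble over the selected blocks. The sketch below illustrates these steps under simplifying assumptions: scipy.stats.gaussian_kde stands in for the paper's dual-tree KDE, blocks are assumed to fit in memory and are processed sequentially rather than in parallel, and the inverse-divergence weights and all function names (kl_divergence, select_blocks, ensemble_fit_predict) are illustrative choices, not the paper's implementation.

```python
# Illustrative sketch only: gaussian_kde replaces the dual-tree KDE, blocks
# are processed sequentially instead of in parallel, and inverse-divergence
# weighting is one assumed instance of the "weighted voting" strategy.
import numpy as np
from scipy.stats import gaussian_kde, entropy
from xgboost import XGBClassifier

def kl_divergence(block_p, block_q, grid):
    """Approximate KL(P || Q) between two blocks of features by evaluating
    a Gaussian KDE of each block on a shared grid of sample points."""
    p = gaussian_kde(block_p.T)(grid) + 1e-12   # small constant avoids log(0)
    q = gaussian_kde(block_q.T)(grid) + 1e-12
    return entropy(p, q)                        # sum_i p_i * log(p_i / q_i)

def select_blocks(blocks, n_select=10, grid_size=200, seed=0):
    """blocks: list of (X, y) RSP blocks. Rank blocks by KL divergence to a
    randomly chosen reference block (features only) and keep the first
    n_select in descending order of divergence, as described above."""
    rng = np.random.default_rng(seed)
    ref_X, _ = blocks[rng.integers(len(blocks))]
    grid = ref_X[rng.choice(len(ref_X), size=grid_size, replace=False)].T
    scores = [kl_divergence(ref_X, X, grid) for X, _ in blocks]
    order = np.argsort(scores)[::-1][:n_select]  # descending divergence
    return [blocks[i] for i in order], [scores[i] for i in order]

def ensemble_fit_predict(selected, scores, X_test):
    """Fit one XGBoost classifier per selected block (done in parallel in the
    paper) and combine class probabilities by weighted voting, weighting each
    model by the inverse of its block's divergence. Assumes labels 0..K-1."""
    models = [XGBClassifier(n_estimators=100).fit(X, y) for X, y in selected]
    weights = 1.0 / (np.asarray(scores) + 1e-12)
    proba = sum(w * m.predict_proba(X_test) for m, w in zip(models, weights))
    return np.argmax(proba, axis=1)
```

Under these assumptions, the incremental ensembling experiment mentioned in the abstract roughly corresponds to calling ensemble_fit_predict on growing prefixes of the list returned by select_blocks.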

This work was supported by the National Natural Science Foundation of China (61836005) and the Opening Project of Shanghai Trusted Industrial Control Platform (TICPSH202003008-ZC).

C. Wei and J. Zhang—Joint first authors.

Notes

  1. Available: http://hadoop.apache.org/.
  2. Available: https://spark.apache.org/.
  3. Available: http://hadoop.apache.org/docs/current/hadoop-mapreduce-client.
  4. Available: http://archive.ics.uci.edu/ml/datasets/HIGGS.
  5. Available: https://xgboost.readthedocs.io/en/latest/.

Author information

Corresponding authors

Correspondence to Weipeng Cao or Qiang Wang.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Wei, C., Zhang, J., Valiullin, T., Cao, W., Wang, Q., Long, H. (2020). Distributed and Parallel Ensemble Classification for Big Data Based on Kullback-Leibler Random Sample Partition. In: Qiu, M. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2020. Lecture Notes in Computer Science, vol. 12452. Springer, Cham. https://doi.org/10.1007/978-3-030-60245-1_31
