A Distributed Big Data Discretization Algorithm Under Spark

Chan, Yeung; Zhang, Xia Jie; Zhu, Jing Hua

doi:10.1007/978-981-15-1899-7_8

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1120)

Included in the following conference series:

CCF Conference on Big Data

Abstract

Data discretization is an important step of data preprocessing in data mining: it can improve data quality and thereby the accuracy and time performance of the subsequent learning process. In the era of big data, traditional discretization methods are no longer applicable, and distributed discretization algorithms need to be designed. Hellinger-entropy, an important distance measure in information theory, is context-sensitive and feature-sensitive and therefore carries abundant useful information. In this paper we implement a Hellinger-entropy based distributed discretization algorithm under Apache Spark. We first measure the divergence of discrete intervals using Hellinger-entropy. We then select the top-k boundary points according to the information provided by the divergence values of the discrete intervals. Finally, we divide the range of the continuous variable into k discrete intervals. We verify the performance of the distributed discretization as a preprocessing step for random forest, Bayes, and multilayer perceptron classification on real sensor big data sets. Experimental results show that the proposed Hellinger-entropy based distributed discretization algorithm achieves better time performance and classification accuracy than existing algorithms.
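
The three steps in the abstract can be illustrated with a minimal single-machine sketch in plain Python. It is not the authors' implementation: the abstract does not give the exact Hellinger-entropy formula or the Spark job structure, so the sketch assumes equal-frequency initial intervals and scores each candidate boundary by the standard Hellinger distance between the class distributions of the two adjacent intervals. The function names (`hellinger`, `class_distribution`, `top_k_cut_points`) and the `n_init` parameter are hypothetical, and the sketch returns k - 1 interior cut points so that the range splits into k intervals; how the paper counts boundary points is not stated here.

```python
import math
from collections import Counter


def hellinger(p, q):
    """Hellinger distance between two discrete distributions given as dicts."""
    classes = set(p) | set(q)
    total = sum(
        (math.sqrt(p.get(c, 0.0)) - math.sqrt(q.get(c, 0.0))) ** 2
        for c in classes
    )
    return math.sqrt(total) / math.sqrt(2)


def class_distribution(labels):
    """Relative class frequencies of the labels falling into one interval."""
    n = len(labels)
    return {c: count / n for c, count in Counter(labels).items()}


def top_k_cut_points(values, labels, k, n_init=10):
    """Return k - 1 cut points that split the value range into k intervals.

    Assumption: initial intervals are equal-frequency chunks of the sorted
    data, and a boundary is scored by the Hellinger distance between the
    class distributions of its two neighbouring intervals.
    """
    pairs = sorted(zip(values, labels), key=lambda pair: pair[0])
    n = len(pairs)
    n_init = min(n_init, n)
    # Integer chunk edges; strictly increasing because n_init <= n.
    edges = [i * n // n_init for i in range(n_init + 1)]
    chunks = [pairs[edges[i]:edges[i + 1]] for i in range(n_init)]
    dists = [class_distribution([y for _, y in chunk]) for chunk in chunks]

    scored = []
    for i in range(n_init - 1):
        # Candidate boundary: midpoint between neighbouring chunks.
        cut = (chunks[i][-1][0] + chunks[i + 1][0][0]) / 2
        scored.append((hellinger(dists[i], dists[i + 1]), cut))

    # Keep the k - 1 highest-scoring boundaries, returned in ascending order.
    best = sorted(scored, reverse=True)[: k - 1]
    return sorted(cut for _, cut in best)


if __name__ == "__main__":
    xs = list(range(10))
    ys = ["a"] * 5 + ["b"] * 5
    print(top_k_cut_points(xs, ys, k=2))
```

In a Spark setting, the per-interval class counts would presumably be computed with a distributed aggregation before the boundary scoring, but that part is left out of this sketch.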

Author information

Authors and Affiliations

Heilongjiang University, Harbin, 150000, China
Yeung Chan, Xia Jie Zhang & Jing Hua Zhu

Corresponding author

Correspondence to Jing Hua Zhu.

Editor information

Editors and Affiliations

Huazhong University of Science and Technology, Wuhan, China
Hai Jin
East China Normal University, Shanghai, China
Xuemin Lin
Chinese Academy of Sciences, Beijing, China
Xueqi Cheng
Huazhong University of Science and Technology, Wuhan, China
Xuanhua Shi
National University of Defense Technology, Changsha, China
Nong Xiao
Nanjing University, Nanjing, China
Yihua Huang

About this paper

Cite this paper

Chan, Y., Zhang, X.J., Zhu, J.H. (2019). A Distributed Big Data Discretization Algorithm Under Spark. In: Jin, H., Lin, X., Cheng, X., Shi, X., Xiao, N., Huang, Y. (eds) Big Data. BigData 2019. Communications in Computer and Information Science, vol 1120. Springer, Singapore. https://doi.org/10.1007/978-981-15-1899-7_8

DOI: https://doi.org/10.1007/978-981-15-1899-7_8
Published: 28 November 2019
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-1898-0
Online ISBN: 978-981-15-1899-7
eBook Packages: Computer Science, Computer Science (R0)
