Abstract
Data discretization is an important step of data preprocessing in data mining. It can improve data quality and, in turn, the accuracy and time performance of the subsequent learning process. In the era of big data, traditional discretization methods are no longer applicable, and distributed discretization algorithms need to be designed. Hellinger-entropy, an important distance measure in information theory, is context-sensitive and feature-sensitive and therefore carries a large amount of useful information. In this paper we implement a Hellinger-entropy based distributed discretization algorithm under Apache Spark. We first measure the divergence of discrete intervals using Hellinger-entropy. We then select the top-k boundary points according to the divergence values of the discrete intervals. Finally, we divide the range of the continuous variable into k discrete intervals. We verify the performance of the distributed discretization as a preprocessing step for random forest, Bayes and multilayer perceptron classification on real sensor big data sets. Experimental results show that the proposed Hellinger-entropy based distributed discretization algorithm achieves better time performance and classification accuracy than the existing algorithms.
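The three steps named in the abstract (score intervals by a Hellinger-based divergence, keep the highest-scoring boundary points, split the range into k intervals) can be illustrated with a small single-machine sketch. This is not the authors' algorithm and it does not use Spark. It assumes that a candidate boundary is the midpoint between two adjacent distinct values, that its score is the standard Hellinger distance between the class distributions on either side of it, and that k intervals require k - 1 cut points. All function names are hypothetical.

```python
from bisect import bisect_right
from collections import Counter
from math import sqrt


def class_distribution(labels):
    """Relative frequency of each class label in a non-empty list."""
    counts = Counter(labels)
    total = len(labels)
    return {c: n / total for c, n in counts.items()}


def hellinger(p, q, classes):
    """Standard Hellinger distance between two discrete distributions (0..1)."""
    squared = sum((sqrt(p.get(c, 0.0)) - sqrt(q.get(c, 0.0))) ** 2 for c in classes)
    return sqrt(squared / 2)


def score_boundaries(values, labels):
    """Score every midpoint between adjacent distinct values.

    Assumption: the score is the Hellinger distance between the class
    distribution left of the boundary and the one right of it.
    """
    pairs = sorted(zip(values, labels))
    classes = set(labels)
    scored = []
    for i in range(1, len(pairs)):
        left_value, right_value = pairs[i - 1][0], pairs[i][0]
        if left_value == right_value:
            continue  # no boundary between equal values
        left = class_distribution([label for _, label in pairs[:i]])
        right = class_distribution([label for _, label in pairs[i:]])
        cut = (left_value + right_value) / 2
        scored.append((cut, hellinger(left, right, classes)))
    return scored


def top_k_cut_points(values, labels, k):
    """Keep the k - 1 highest-scoring boundaries, returned in ascending order."""
    scored = score_boundaries(values, labels)
    best = sorted(scored, key=lambda item: item[1], reverse=True)[: k - 1]
    return sorted(cut for cut, _ in best)


def discretize(value, cuts):
    """Map a continuous value to the index of its interval."""
    return bisect_right(cuts, value)


if __name__ == "__main__":
    values = [1, 2, 3, 10, 11, 12]
    labels = ["a", "a", "a", "b", "b", "b"]
    cuts = top_k_cut_points(values, labels, k=2)
    print(cuts, [discretize(v, cuts) for v in values])
```

In a distributed setting the per-boundary class counts would presumably be aggregated across partitions before scoring; the sketch leaves that part out and recomputes the distributions naively for readability.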
Copyright information
© 2019 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Chan, Y., Zhang, X.J., Zhu, J.H. (2019). A Distributed Big Data Discretization Algorithm Under Spark. In: Jin, H., Lin, X., Cheng, X., Shi, X., Xiao, N., Huang, Y. (eds) Big Data. BigData 2019. Communications in Computer and Information Science, vol 1120. Springer, Singapore. https://doi.org/10.1007/978-981-15-1899-7_8
DOI: https://doi.org/10.1007/978-981-15-1899-7_8
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-1898-0
Online ISBN: 978-981-15-1899-7
eBook Packages: Computer Science, Computer Science (R0)