A Distributed Big Data Discretization Algorithm Under Spark

  • Conference paper
Big Data (BigData 2019)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1120)

Abstract

Data discretization is an important step of data preprocessing in data mining; it can improve data quality and thereby the accuracy and time performance of the subsequent learning process. In the era of big data, traditional discretization methods are no longer applicable, and distributed discretization algorithms need to be designed. Hellinger-entropy, an important distance measure in information theory, is context-sensitive and feature-sensitive and thus carries abundant useful information. Therefore, in this paper we implement a Hellinger-entropy-based distributed discretization algorithm under Apache Spark. We first measure the divergence of discrete intervals using Hellinger-entropy. Then we select the top-k boundary points according to the divergence values of the discrete intervals. Finally, we divide the continuous variable range into k discrete intervals. We verify the distributed discretization performance as a preprocessing step for random forest, Bayes, and multilayer perceptron classification on real sensor big data sets. Experimental results show that the time performance and classification accuracy of the proposed Hellinger-entropy-based distributed discretization algorithm are better than those of the existing algorithms.
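
The three steps named in the abstract (score intervals, pick the best boundary points, cut the range) can be illustrated with a small single-machine sketch. This is not the paper's distributed Spark implementation, and the abstract does not define Hellinger-entropy precisely, so the sketch makes several assumptions: it scores each candidate boundary by the standard Hellinger distance between the class distributions of the two adjacent initial intervals, it builds those initial intervals by equal-frequency binning, and it keeps k-1 interior cut points so that k intervals result. The function names `hellinger`, `class_dist`, `top_k_cut_points`, and `discretize`, and the parameter `n_init`, are hypothetical.

```python
import bisect
import math
from collections import Counter


def hellinger(p, q):
    """Hellinger distance between two discrete distributions given as dicts."""
    keys = set(p) | set(q)
    total = sum((math.sqrt(p.get(c, 0.0)) - math.sqrt(q.get(c, 0.0))) ** 2
                for c in keys)
    return math.sqrt(total / 2.0)


def class_dist(labels):
    """Relative class frequencies of a non-empty list of labels."""
    n = len(labels)
    return {c: cnt / n for c, cnt in Counter(labels).items()}


def top_k_cut_points(values, labels, k, n_init=10):
    """Return up to k-1 cut points, i.e. at most k intervals (assumed reading)."""
    pairs = sorted(zip(values, labels), key=lambda t: t[0])
    # Assumption: initial intervals come from equal-frequency binning.
    size = max(1, len(pairs) // n_init)
    bins = [pairs[i:i + size] for i in range(0, len(pairs), size)]

    scored = []
    for left, right in zip(bins, bins[1:]):
        lo, hi = left[-1][0], right[0][0]
        if lo == hi:
            continue  # equal values cannot be separated by a cut point
        score = hellinger(class_dist([lab for _, lab in left]),
                          class_dist([lab for _, lab in right]))
        scored.append((score, (lo + hi) / 2.0))

    # Keep the k-1 boundaries with the largest divergence, in ascending order.
    best = sorted(scored, reverse=True)[:max(0, k - 1)]
    return sorted(cut for _, cut in best)


def discretize(value, cuts):
    """Index of the interval that `value` falls into, given sorted cut points."""
    return bisect.bisect_right(cuts, value)


values = [1, 2, 3, 4, 5, 6, 7, 8]
labels = ["a", "a", "a", "a", "b", "b", "b", "b"]
cuts = top_k_cut_points(values, labels, k=2, n_init=4)
bins_out = [discretize(v, cuts) for v in values]
```

In a distributed setting the per-interval class counts would presumably be aggregated across partitions before scoring; that part is left out here because the abstract gives no detail on it.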

Author information

Corresponding author

Correspondence to Jing Hua Zhu.

Copyright information

© 2019 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Chan, Y., Zhang, X.J., Zhu, J.H. (2019). A Distributed Big Data Discretization Algorithm Under Spark. In: Jin, H., Lin, X., Cheng, X., Shi, X., Xiao, N., Huang, Y. (eds) Big Data. BigData 2019. Communications in Computer and Information Science, vol 1120. Springer, Singapore. https://doi.org/10.1007/978-981-15-1899-7_8

  • DOI: https://doi.org/10.1007/978-981-15-1899-7_8

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-15-1898-0

  • Online ISBN: 978-981-15-1899-7

  • eBook Packages: Computer Science, Computer Science (R0)
