A Bayes Classifier-Based OVFDT Algorithm for Massive Stream Data Mining on Big Data Platform

  • Conference paper
Complex, Intelligent, and Software Intensive Systems (CISIS 2017)

Part of the book series: Advances in Intelligent Systems and Computing ((AISC,volume 611))

Abstract

Online incremental data mining has recently become a rapidly growing research area within stream data mining. The VFDT algorithm, an effective incremental decision tree classification algorithm, is widely used in online data mining. To optimize VFDT, a dynamic tie-breaking threshold strategy and a pre-pruning mechanism are used to reduce the size of the decision tree. Furthermore, a Bayes classifier is applied at the leaf nodes of the Hoeffding decision tree to improve classification accuracy. In this paper, the improved algorithm is called OVFDT (Optimized VFDT). To improve the performance of OVFDT on massive streaming data, we propose an implementation scheme for OVFDT on the MapReduce platform. To meet the need for real-time computing, an implementation scheme on the Storm platform is also designed. Three comparison experiments compare the tree size, the classification accuracy, and the execution time of the decision trees generated by the three algorithms. The simulation results show that, compared with the C4.5 and VFDT algorithms, OVFDT effectively reduces the size of the decision tree and also improves classification accuracy.
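The split test that VFDT-style Hoeffding trees rely on can be illustrated with a minimal Python sketch. It uses the standard Hoeffding bound, epsilon = sqrt(R^2 ln(1/delta) / (2n)), and a fixed tie-breaking threshold. The function names, the default values of `delta` and `tie_threshold`, and the use of a fixed rather than dynamic threshold are assumptions made for illustration; the dynamic threshold strategy and pre-pruning mechanism of OVFDT are not specified in this abstract and are not reproduced here.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Hoeffding bound epsilon after n observations of a variable whose
    values span `value_range`, holding with confidence 1 - delta."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain: float, second_gain: float, n: int,
                 n_classes: int = 2, delta: float = 1e-7,
                 tie_threshold: float = 0.05) -> bool:
    """Decide whether a leaf that has seen n examples should split.

    Assumption: the split measure is information gain, whose range is
    log2(n_classes). `tie_threshold` is a fixed, hypothetical value.
    """
    value_range = math.log2(n_classes)
    eps = hoeffding_bound(value_range, delta, n)
    # Split when the best attribute is clearly better than the runner-up,
    # or when the two are so close that waiting longer is not worthwhile.
    return (best_gain - second_gain) > eps or eps < tie_threshold


if __name__ == "__main__":
    print(should_split(0.30, 0.10, n=1000))  # clear winner
    print(should_split(0.30, 0.28, n=1000))  # too close, keep collecting
    print(should_split(0.30, 0.28, n=5000))  # tie broken by the threshold
```

In this sketch a smaller `tie_threshold` delays splits between near-equal attributes and tends to yield a smaller tree, which is the kind of trade-off a tie-breaking strategy controls.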


Acknowledgments

This work is sponsored by the National Natural Science Foundation of P. R. China (No. 61373017, No. 61572260, No. 61572261, No. 61672296, No. 61602261), the Natural Science Foundation of Jiangsu Province (No. BK20140886, No. BK20140888, No. BK20160089), the Scientific & Technological Support Project of Jiangsu Province (No. BE2015702, No. BE2016777, No. BE2016185), the China Postdoctoral Science Foundation (No. 2014M551636, No. 2014M561696), the Jiangsu Planned Projects for Postdoctoral Research Funds (No. 1302090B, No. 1401005B), and the Jiangsu High Technology Research Key Laboratory for Wireless Sensor Networks Foundation (No. WSNLBZY201508).

Corresponding author

Correspondence to Peng Li.

Copyright information

© 2018 Springer International Publishing AG

About this paper

Cite this paper

Li, L., Li, P., Xu, H., Chen, F. (2018). A Bayes Classifier-Based OVFDT Algorithm for Massive Stream Data Mining on Big Data Platform. In: Barolli, L., Terzo, O. (eds) Complex, Intelligent, and Software Intensive Systems. CISIS 2017. Advances in Intelligent Systems and Computing, vol 611. Springer, Cham. https://doi.org/10.1007/978-3-319-61566-0_49

  • DOI: https://doi.org/10.1007/978-3-319-61566-0_49

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-61565-3

  • Online ISBN: 978-3-319-61566-0

  • eBook Packages: Engineering, Engineering (R0)
