Abstract
Online incremental data mining has recently become a fast-growing area of stream data mining research. The VFDT algorithm, an incremental decision tree classification algorithm, is widely used in online data mining. To optimize VFDT, we use a dynamic tie-breaking threshold strategy and a pre-pruning mechanism to reduce the size of the decision tree. In addition, a Bayes classifier is applied at the leaf nodes of the Hoeffding decision tree to improve classification accuracy. We call the improved algorithm OVFDT (Optimized VFDT). To improve the performance of OVFDT on massive streaming data, we propose an implementation scheme on the MapReduce platform; to meet the need for real-time computing, we also design an implementation scheme on the Storm platform. Three experiments compare the size, classification accuracy, and execution time of the decision trees generated by the three algorithms. The simulation results show that, compared with C4.5 and VFDT, OVFDT effectively reduces the size of the decision tree and also improves classification accuracy.
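For readers unfamiliar with the split test that VFDT builds on, the following is a minimal sketch of the standard Hoeffding-bound check with a fixed tie-breaking threshold. It is not the OVFDT procedure: the abstract does not specify how the dynamic threshold or the pre-pruning rule are computed, so neither is reproduced here. The function names, the default values of `delta` and `tau`, and the use of information gain with range log2(number of classes) are assumptions made for illustration.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Return epsilon such that, with probability 1 - delta, the true mean of a
    variable with the given range lies within epsilon of the mean observed
    over n independent samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain, second_gain, n, n_classes=2, delta=1e-7, tau=0.05):
    """Standard VFDT-style split decision at a leaf (illustrative only).

    best_gain / second_gain: heuristic values (e.g. information gain) of the
    two best candidate attributes after n examples reached the leaf.
    tau: fixed tie-breaking threshold (assumed value; OVFDT adapts it
    dynamically, which is not modelled here).
    """
    # Information gain is bounded by log2 of the number of classes.
    value_range = math.log2(n_classes)
    eps = hoeffding_bound(value_range, delta, n)
    # Split if the best attribute is confidently better, or if the two
    # candidates are so close that waiting for more data is not worthwhile.
    return (best_gain - second_gain) > eps or eps < tau


if __name__ == "__main__":
    print(should_split(0.30, 0.05, n=200))   # clear winner
    print(should_split(0.30, 0.25, n=200))   # too close, too few examples
    print(should_split(0.30, 0.29, n=5000))  # near tie resolved by tau
```

With a fixed `tau`, a leaf splits on a near tie as soon as epsilon drops below the threshold; a dynamic threshold changes when that happens, which is one way tree size can be influenced.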
Acknowledgments
This work is sponsored by the National Natural Science Foundation of P. R. China (No. 61373017, No. 61572260, No. 61572261, No. 61672296, No. 61602261), the Natural Science Foundation of Jiangsu Province (No. BK20140886, No. BK20140888, No. BK20160089), the Scientific & Technological Support Project of Jiangsu Province (No. BE2015702, No. BE2016777, No. BE2016185), the China Postdoctoral Science Foundation (No. 2014M551636, No. 2014M561696), the Jiangsu Planned Projects for Postdoctoral Research Funds (No. 1302090B, No. 1401005B), and the Jiangsu High Technology Research Key Laboratory for Wireless Sensor Networks Foundation (No. WSNLBZY201508).
Copyright information
© 2018 Springer International Publishing AG
About this paper
Cite this paper
Li, L., Li, P., Xu, H., Chen, F. (2018). A Bayes Classifier-Based OVFDT Algorithm for Massive Stream Data Mining on Big Data Platform. In: Barolli, L., Terzo, O. (eds) Complex, Intelligent, and Software Intensive Systems. CISIS 2017. Advances in Intelligent Systems and Computing, vol 611. Springer, Cham. https://doi.org/10.1007/978-3-319-61566-0_49
DOI: https://doi.org/10.1007/978-3-319-61566-0_49
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-61565-3
Online ISBN: 978-3-319-61566-0
eBook Packages: Engineering, Engineering (R0)