Critical parameter analysis of Vertical Hoeffding Tree for optimized performance using SAMOA

  • Original Article
  • Published in: International Journal of Machine Learning and Cybernetics

Abstract

Streaming classification of big data is a stream-mining method that learns from continuous, ordered sequences of data arriving from diverse sources under limited computing and storage capabilities. SAMOA (Scalable Advanced Massive Online Analysis) is a machine learning framework for distributed data mining over streaming data. The Vertical Hoeffding Tree (VHT) in SAMOA is a variant of the Very Fast Decision Tree used for distributed classification of data streams. The performance of VHT depends on several critical parameters, such as the tie-threshold, grace value, confidence, and split criterion. Although VHT is widely accepted as an efficient streaming classifier, one of the challenges in streaming classification is that the distribution of incoming instances over the underlying classes varies across datasets, so the performance of VHT varies from dataset to dataset. Achieving optimal performance from a stream classifier such as VHT on different datasets is therefore a challenging task, and a fixed set of values for the critical parameters cannot be preconfigured for all types of datasets. This work explores the capabilities of SAMOA's VHT streaming classifier in light of benchmark performance statistics such as classification accuracy, kappa, and kappa temporal, and experimentally identifies values of the critical parameters of VHT that yield optimized performance on different datasets. This analytical study is therefore significant for developing streaming classifiers that achieve optimum performance via parameter tuning at run time.
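
For context, the parameters analysed here enter the standard Hoeffding-tree split rule that VHT inherits from the Very Fast Decision Tree: a leaf is re-evaluated every grace-period instances and is split either when the Hoeffding bound, derived from the confidence parameter, separates the best split candidate from the runner-up, or when that bound falls below the tie-threshold. The Java sketch below is illustrative only; the class and method names are assumptions and are not taken from SAMOA's source code.

    // Illustrative sketch of the Hoeffding-tree split decision that the critical
    // parameters of VHT (grace value, confidence, tie-threshold, split criterion)
    // control. Names are hypothetical; this is not SAMOA's actual implementation.
    public final class HoeffdingSplitSketch {

        // Hoeffding bound: with probability 1 - delta, the true mean of a variable
        // with range R lies within epsilon of the observed mean after n observations:
        // epsilon = sqrt(R^2 * ln(1/delta) / (2n)).
        static double hoeffdingBound(double range, double delta, long n) {
            return Math.sqrt((range * range * Math.log(1.0 / delta)) / (2.0 * n));
        }

        // bestMerit and secondBestMerit are the split-criterion scores (e.g.
        // information gain or Gini) of the two best candidate attributes at a leaf.
        static boolean shouldSplit(double bestMerit, double secondBestMerit,
                                   double range, double confidence,
                                   double tieThreshold, long seenAtLeaf,
                                   long gracePeriod) {
            if (seenAtLeaf % gracePeriod != 0) {
                return false;                  // grace value: only re-check periodically
            }
            double epsilon = hoeffdingBound(range, confidence, seenAtLeaf);
            // Split if the best attribute clearly wins, or if the bound has shrunk
            // below the tie-threshold (the candidates are treated as a tie).
            return (bestMerit - secondBestMerit > epsilon) || (epsilon < tieThreshold);
        }
    }

Which values of confidence, tie-threshold, and grace period make this rule behave well depends on how instances are distributed over the classes in a given stream, which is why the article tunes these parameters per dataset.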

Author information

Correspondence to Bakshi Rohit Prasad.

About this article

Cite this article

Prasad, B.R., Agarwal, S. Critical parameter analysis of Vertical Hoeffding Tree for optimized performance using SAMOA. Int. J. Mach. Learn. & Cyber. 8, 1389–1402 (2017). https://doi.org/10.1007/s13042-016-0513-3
