Critical parameter analysis of Vertical Hoeffding Tree for optimized performance using SAMOA

Prasad, Bakshi Rohit; Agarwal, Sonali

doi:10.1007/s13042-016-0513-3

Original Article
Published: 27 February 2016

Volume 8, pages 1389–1402, (2017)

International Journal of Machine Learning and Cybernetics

Bakshi Rohit Prasad¹ &
Sonali Agarwal¹

Abstract

Streaming classification of big data is a stream data mining method that learns from continuous, ordered sequences of data arriving from diverse sources, using limited computing and storage resources. SAMOA (Scalable Advanced Massive Online Analysis) is a machine learning framework for distributed data mining over streaming data. The Vertical Hoeffding Tree (VHT) in SAMOA is a variant of the very fast decision tree used for distributed classification of data streams. The performance of VHT depends on several critical parameters, such as tie-threshold, grace value, confidence and split criterion. Although VHT is widely accepted as an efficient streaming classifier, one challenge in streaming classification is that the distribution of incoming instances over the underlying classes varies from dataset to dataset, so the performance of VHT varies across datasets. Achieving optimal performance from a stream classifier such as VHT on different datasets is therefore difficult, and no fixed set of critical parameter values can be preconfigured for all types of datasets. This work explores the capabilities of the VHT streaming classifier of SAMOA in light of benchmark performance statistics such as classification accuracy, kappa and kappa temporal. It experimentally identifies suitable values of the critical parameters of VHT that yield optimized performance on different datasets. This analytical study is thus significant for developing streaming classifiers that achieve optimum performance via parameter tuning at run time.
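To make the roles of these parameters and statistics concrete, the following is a minimal Python sketch, not the SAMOA implementation. It assumes the standard Hoeffding-bound split test of the very fast decision tree (Domingos and Hulten) with information gain as the split criterion, and the usual definitions of kappa and kappa temporal from the stream-evaluation literature. All function names are hypothetical, and the default values for `delta`, `tie_threshold` and `grace_period` are illustrative only, not values reported in the article.

```python
import math
from collections import Counter


def hoeffding_bound(value_range, delta, n):
    """Hoeffding bound: with probability 1 - delta, the true mean of a
    variable with range `value_range` lies within this distance of the
    mean observed over n instances."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain, second_gain, n_seen, n_classes,
                 delta=1e-7, tie_threshold=0.05, grace_period=200):
    """Decide whether a leaf should split (illustrative defaults).

    grace_period: split attempts happen only every grace_period instances.
    delta: 1 - confidence that the best attribute is truly the best.
    tie_threshold: split anyway once the bound is this small (a tie).
    """
    if n_seen == 0 or n_seen % grace_period != 0:
        return False
    # Assumption: information gain, whose range is log2(number of classes).
    eps = hoeffding_bound(math.log2(n_classes), delta, n_seen)
    return (best_gain - second_gain) > eps or eps < tie_threshold


def _chance_corrected(p0, p_ref):
    """(p0 - p_ref) / (1 - p_ref); undefined (NaN) when p_ref == 1."""
    return float("nan") if p_ref == 1.0 else (p0 - p_ref) / (1.0 - p_ref)


def kappa(y_true, y_pred):
    """Cohen's kappa: accuracy corrected for chance agreement."""
    n = len(y_true)
    p0 = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    pc = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return _chance_corrected(p0, pc)


def kappa_temporal(y_true, y_pred):
    """Kappa temporal: accuracy corrected against a persistent classifier
    that always predicts the previous true label. The first instance is
    skipped because it has no predecessor."""
    n = len(y_true) - 1
    p0 = sum(t == p for t, p in zip(y_true[1:], y_pred[1:])) / n
    pe = sum(y_true[i] == y_true[i - 1] for i in range(1, len(y_true))) / n
    return _chance_corrected(p0, pe)
```

In this sketch, a smaller `delta` or a larger `grace_period` makes the tree split more conservatively, while a larger `tie_threshold` forces earlier splits between near-equal attributes. That trade-off is one reason a single fixed configuration is unlikely to suit every dataset.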

Author information

Authors and Affiliations

Indian Institute of Information Technology Allahabad, Jhalwa, Allahabad, India
Bakshi Rohit Prasad & Sonali Agarwal

Corresponding author

Correspondence to Bakshi Rohit Prasad.

About this article

Cite this article

Prasad, B.R., Agarwal, S. Critical parameter analysis of Vertical Hoeffding Tree for optimized performance using SAMOA. Int. J. Mach. Learn. & Cyber. 8, 1389–1402 (2017). https://doi.org/10.1007/s13042-016-0513-3

Received: 18 June 2015
Accepted: 16 February 2016
Published: 27 February 2016
Issue Date: August 2017
DOI: https://doi.org/10.1007/s13042-016-0513-3
