Abstract
Streaming classification of big data is a stream data mining method that learns from continuous, ordered sequences of data arriving from diverse sources, using limited computing and storage resources. SAMOA (Scalable Advanced Massive Online Analysis) is a machine learning framework for distributed data mining over streaming data. The Vertical Hoeffding Tree (VHT) in SAMOA is a variant of the very fast decision tree used for distributed classification of data streams. The performance of VHT depends on several critical parameters, such as tie-threshold, grace value, confidence, and split criterion. Although VHT is widely accepted as an efficient streaming classifier, the distribution of incoming data instances over the underlying classes varies from dataset to dataset, so the performance of VHT varies as well. Achieving optimal performance from a stream classifier such as VHT on different datasets is therefore challenging, and no fixed set of critical parameter values can be preconfigured for all types of datasets. This work explores the capabilities of the VHT streaming classifier of SAMOA in light of benchmark performance statistics such as classification accuracy, kappa, and kappa temporal. It experimentally identifies suitable values of the critical parameters of VHT that yield optimized performance on different datasets. This analytical study is thus significant for developing streaming classifiers that achieve optimum performance through parameter tuning at run time.
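To make the role of the critical parameters concrete, the following minimal Python sketch shows how confidence, tie-threshold, and grace value typically interact in the split test of a Hoeffding tree leaf. It is an illustration of the general Hoeffding-bound rule, not SAMOA's VHT implementation: the function names, the default values, and the simplification of checking the grace value against the total instance count at the leaf are all assumptions made for this example.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Hoeffding bound: with probability 1 - delta, the true mean of a
    variable with the given range lies within this distance of the mean
    observed over n instances."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain, second_gain, n_seen, n_classes=2,
                 delta=1e-7, tie_threshold=0.05, grace_period=200):
    """Decide whether a leaf should split (hypothetical helper).

    delta         -- confidence parameter (allowed error probability)
    tie_threshold -- split anyway once the bound falls below this value
    grace_period  -- only evaluate splits every this many instances
    All default values are illustrative assumptions.
    """
    # Grace value: skip the costly split evaluation between checkpoints.
    if n_seen == 0 or n_seen % grace_period != 0:
        return False
    # For information gain, the range of the criterion is log2(#classes).
    value_range = math.log2(n_classes)
    eps = hoeffding_bound(value_range, delta, n_seen)
    # Split when the best attribute is clearly ahead of the runner-up,
    # or when the two are so close that waiting longer is pointless.
    return (best_gain - second_gain > eps) or (eps < tie_threshold)


if __name__ == "__main__":
    print(should_split(0.30, 0.05, n_seen=200))   # clear winner
    print(should_split(0.30, 0.20, n_seen=200))   # not yet separable
    print(should_split(0.30, 0.29, n_seen=4000))  # near tie, bound is small
```

A smaller `delta` or a smaller `tie_threshold` delays splits and demands more evidence, while a larger grace value reduces how often the split test runs; this is why a single fixed configuration is unlikely to suit every dataset.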
Cite this article
Prasad, B.R., Agarwal, S. Critical parameter analysis of Vertical Hoeffding Tree for optimized performance using SAMOA. Int. J. Mach. Learn. & Cyber. 8, 1389–1402 (2017). https://doi.org/10.1007/s13042-016-0513-3