
Classification of high-dimensional evolving data streams via a resource-efficient online ensemble

Data Mining and Knowledge Discovery

Abstract

We propose a novel online ensemble strategy, ensemble BPegasos (EBPegasos), to solve the problems jointly caused by concept drift and the curse of dimensionality when classifying high-dimensional evolving data streams, a combination that has not been addressed in the literature. First, EBPegasos uses BPegasos, an online kernelized SVM-based algorithm, as its component classifier to handle the scale and sparsity of high-dimensional data. Second, EBPegasos exploits the characteristics of BPegasos to cope with various types of concept drift: it constructs diverse component classifiers by varying the budget size of BPegasos, equips each component with a drift detector that monitors and evaluates its performance, and modifies the ensemble structure only when a large performance degradation occurs. This conditional structural-modification strategy lets EBPegasos strike a good balance between exploiting and forgetting old knowledge. Finally, experiments first show that EBPegasos is more effective and resource-efficient than tree ensembles on high-dimensional data; comprehensive experiments on synthetic and real-life datasets then show that, when all ensembles use BPegasos as the base learner, EBPegasos copes with various types of concept drift significantly better than state-of-the-art ensemble frameworks.
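To make the base learner concrete, the following is a minimal sketch of a budgeted, kernelized Pegasos-style learner of the kind the abstract describes. It is illustrative only: the class name, the RBF kernel choice, and the smallest-coefficient removal rule for budget maintenance are our assumptions for this sketch, not the paper's exact BPegasos algorithm (which offers several budget-maintenance strategies, e.g. removal, projection, and merging).

```python
import numpy as np

def rbf(x1, x2, gamma=1.0):
    """RBF kernel between two feature vectors."""
    d = x1 - x2
    return np.exp(-gamma * np.dot(d, d))

class BudgetedKernelPegasos:
    """Hypothetical sketch of a budgeted kernelized Pegasos learner.

    Keeps at most `budget` support vectors; when the budget is
    exceeded, the SV with the smallest |coefficient| is removed
    (an illustrative rule, not necessarily the one BPegasos uses).
    """

    def __init__(self, lam=0.01, budget=50, gamma=1.0):
        self.lam = lam        # regularization parameter lambda
        self.budget = budget  # maximum number of support vectors
        self.gamma = gamma
        self.sv = []          # support vectors as (x, y) pairs
        self.alpha = []       # their coefficients
        self.t = 0            # time step

    def decision(self, x):
        return sum(a * rbf(s, x, self.gamma)
                   for a, (s, _) in zip(self.alpha, self.sv))

    def predict(self, x):
        return 1 if self.decision(x) >= 0 else -1

    def partial_fit(self, x, y):
        """One Pegasos sub-gradient step on example (x, y), y in {-1, +1}."""
        self.t += 1
        eta = 1.0 / (self.lam * self.t)   # step size 1/(lambda * t)
        # Shrink all existing coefficients (regularization part of the step).
        self.alpha = [(1.0 - eta * self.lam) * a for a in self.alpha]
        if y * self.decision(x) < 1.0:    # hinge loss is positive: add an SV
            self.sv.append((x, y))
            self.alpha.append(eta * y)
            if len(self.sv) > self.budget:  # enforce the budget constraint
                drop = min(range(len(self.alpha)),
                           key=lambda i: abs(self.alpha[i]))
                self.sv.pop(drop)
                self.alpha.pop(drop)
```

The fixed budget is what keeps per-example time and memory constant regardless of stream length, and in an EBPegasos-style ensemble, varying `budget` across components is one way to obtain the diversity the abstract refers to.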


Notes

  1. https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#gisette.

  2. http://qwone.com/~jason/20Newsgroups/.

  3. ASHTBag is excluded here since its ensemble strategy is dedicated to Hoeffding trees.

  4. The scripts for generating these datasets are available at http://cs.nju.edu.cn/rl/people/zhaitt/datasetsGenerateScript.txt.

  5. It can be downloaded from http://moa.cms.waikato.ac.nz/datasets/.

  6. It can be downloaded from http://www.cse.fau.edu/~xqzhu/stream.html.


Acknowledgements

This work is supported by the National NSF of China (Nos. 61432008, 61503178), NSF and Primary R&D Plan of Jiangsu Province, China (Nos. BE2015213, BK20150587), and the Collaborative Innovation Center of Novel Software Technology and Industrialization.

Author information

Correspondence to Yang Gao.

Additional information

Responsible editor: Thomas Gärtner, Mirco Nanni, Andrea Passerini and Céline Robardet.


Cite this article

Zhai, T., Gao, Y., Wang, H. et al. Classification of high-dimensional evolving data streams via a resource-efficient online ensemble. Data Min Knowl Disc 31, 1242–1265 (2017). https://doi.org/10.1007/s10618-017-0500-7

