Heterogeneous Ensemble for Feature Drifts in Data Streams

Nguyen, Hai-Long; Woon, Yew-Kwong; Ng, Wee-Keong; Wan, Li

doi:10.1007/978-3-642-30220-6_1

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7302)

Abstract

The nature of data streams requires classification algorithms to be real-time, efficient, and able to cope with high-dimensional data that arrive continuously. It is well known that in high-dimensional datasets, not all features are critical for training a classifier. To improve the performance of data stream classification, we propose an algorithm called HEFT-Stream (Heterogeneous Ensemble with Feature drifT for Data Streams) that incorporates feature selection into a heterogeneous ensemble to adapt to different types of concept drifts. As an example of the proposed framework, we first modify the FCBF [13] algorithm so that it dynamically updates the relevant feature subsets for data streams. Next, a heterogeneous ensemble is constructed from different online classifiers, including Online Naive Bayes and CVFDT [12]. Empirical results show that our ensemble classifier outperforms state-of-the-art ensemble classifiers (AWE [21] and OnlineBagging [15]) in terms of accuracy, speed, and scalability. The success of HEFT-Stream opens new research directions in understanding the relationship between feature selection techniques and ensemble learning to achieve better classification performance.
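To make the feature-selection idea in the abstract concrete, the following is a minimal sketch of the relevance measure that FCBF is built on, symmetrical uncertainty, applied to successive windows of a stream. It is not the authors' HEFT-Stream implementation: the windowing scheme, the `threshold` value, and the helper names `relevant_features` and `feature_drift` are assumptions made for illustration, and FCBF's redundancy-removal step is omitted.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def symmetrical_uncertainty(xs, ys):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), a value in [0, 1]."""
    hx, hy = entropy(xs), entropy(ys)
    if hx + hy == 0:
        return 0.0  # both variables are constant in this window
    hxy = entropy(list(zip(xs, ys)))  # joint entropy H(X, Y)
    mutual_info = hx + hy - hxy
    return 2.0 * mutual_info / (hx + hy)


def relevant_features(window, labels, threshold=0.1):
    """Indices of features whose SU with the class label reaches `threshold`.

    `window` is a list of equal-length rows of discrete feature values.
    Assumption: only the relevance step of FCBF is shown here.
    """
    selected = []
    for j in range(len(window[0])):
        column = [row[j] for row in window]
        if symmetrical_uncertainty(column, labels) >= threshold:
            selected.append(j)
    return selected


def feature_drift(previous, current):
    """Hypothetical drift signal: the relevant subset changed between windows."""
    return set(previous) != set(current)


# Two toy windows: the label follows feature 0 first, then feature 1.
window_a = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels_a = [0, 0, 1, 1]
window_b = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels_b = [0, 1, 0, 1]

subset_a = relevant_features(window_a, labels_a)
subset_b = relevant_features(window_b, labels_b)
print(subset_a, subset_b, feature_drift(subset_a, subset_b))
```

In this toy stream the relevant subset moves from feature 0 to feature 1, so the drift check fires. In a setting like the one the abstract describes, such a signal could be the point at which the ensemble's classifiers are retrained on the new subset, although how HEFT-Stream actually reacts is specified in the paper, not here.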

References

Bifet, A., Holmes, G., Kirkby, R.: MOA: Massive online analysis. The Journal of Machine Learning Research 11, 1601–1604 (2010)
Bifet, A., Holmes, G., Pfahringer, B., Kirkby, R., Gavaldà, R.: New ensemble methods for evolving data streams. In: 15th ACM SIGKDD, pp. 139–148. ACM (2009)
Breiman, L.: Bagging predictors. Machine Learning 24(2), 123–140 (1996)
Breiman, L.: Random forests. Machine Learning 45(1), 5–32 (2001)
Domingos, P., Hulten, G.: Mining high-speed data streams. In: The Sixth ACM SIGKDD, pp. 71–80. ACM (2000)
Domingos, P., Pazzani, M.: On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning 29(2-3), 103–130 (1997)
Eibl, G., Pfeiffer, K.-P.: Multiclass boosting for weak classifiers. The Journal of Machine Learning Research 6, 189–210 (2005)
Freund, Y., Schapire, R.: Experiments with a new boosting algorithm. In: The 13th ICML, pp. 148–156 (1996)
Friedman, J.H.: Stochastic gradient boosting. Computational Statistics & Data Analysis 38(4), 367–378 (2002)
Fumera, G., Roli, F.: A theoretical and experimental analysis of linear combiners for multiple classifier systems. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(6), 942–956 (2005)
Hsu, K.-W., Srivastava, J.: Diversity in Combinations of Heterogeneous Classifiers. In: Theeramunkong, T., Kijsirikul, B., Cercone, N., Ho, T.-B. (eds.) PAKDD 2009. LNCS, vol. 5476, pp. 923–932. Springer, Heidelberg (2009)
Hulten, G., Spencer, L., Domingos, P.: Mining time-changing data streams. In: ACM SIGKDD, pp. 97–106. ACM (2001)
Yu, L., Liu, H.: Feature selection for high-dimensional data: A fast correlation-based filter solution. In: The 20th ICML, pp. 856–863 (2003)
Liu, H., Yu, L.: Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on Knowledge and Data Engineering 17(4), 491–502 (2005)
Oza, N.C.: Online bagging and boosting. In: 2005 IEEE International Conference on Systems, Man and Cybernetics, vol. 3, pp. 2340–2345. IEEE (2005)
Hashemi, S., Yang, Y., Mirzamomen, Z., Kangavari, M.: Adapted one-vs-all decision trees for data stream classification. IEEE Transactions on Knowledge and Data Engineering 21, 624–637 (2009)
Shen, C., Li, H.: On the dual formulation of boosting algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence 32(12), 2216–2231 (2010)
Street, W.N., Kim, Y.: A streaming ensemble algorithm (SEA) for large-scale classification. In: The 7th ACM SIGKDD, pp. 377–382. ACM (2001)
Ho, T.K.: The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(8), 832–844 (1998)
Tumer, K., Ghosh, J.: Linear and order statistics combiners for pattern classification. Springer (1999)
Wang, H., Fan, W., Yu, P.S., Han, J.: Mining concept-drifting data streams using ensemble classifiers. In: ACM SIGKDD, pp. 226–235. ACM (2003)
Woods, K., Kegelmeyer, W.P., Bowyer, K.: Combination of multiple classifiers using local accuracy estimates. IEEE Transactions on Pattern Analysis and Machine Intelligence 19(4), 405–410 (1997)
Lu, Z., Wu, X., Bongard, J.: Active learning with adaptive heterogeneous ensembles. In: The 9th IEEE ICDM, pp. 327–336 (2009)

Author information

Authors and Affiliations

Nanyang Technological University, Singapore
Hai-Long Nguyen & Wee-Keong Ng
EADS Innovation Works, Singapore
Yew-Kwong Woon
New York University, USA
Li Wan

Editor information

Editors and Affiliations

Department of Computer Science and Engineering, Michigan State University, 428 S. Shaw Lane, 48824-1226, East Lansing, MI, USA
Pang-Ning Tan
School of Information Technologies, University of Sydney, 1 Cleveland St., 2006, Sydney, NSW, Australia
Sanjay Chawla
Faculty of Computing and Informatics, Jalan Multimedia, Multimedia University, 63100, Cyberjaya, Selangor, Malaysia
Chin Kuan Ho
Department of Computing and Information Systems, The University of Melbourne, 111 Barry Street, 3053, Melbourne, VIC, Australia
James Bailey

Cite this paper

Nguyen, HL., Woon, YK., Ng, WK., Wan, L. (2012). Heterogeneous Ensemble for Feature Drifts in Data Streams. In: Tan, PN., Chawla, S., Ho, C.K., Bailey, J. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2012. Lecture Notes in Computer Science, vol 7302. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-30220-6_1

DOI: https://doi.org/10.1007/978-3-642-30220-6_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-30219-0
Online ISBN: 978-3-642-30220-6
eBook Packages: Computer Science, Computer Science (R0)