DOI: 10.1145/956750.956778
Article

Mining concept-drifting data streams using ensemble classifiers

Published: 24 August 2003

ABSTRACT

Recently, mining data streams with concept drifts for actionable insights has become an important and challenging task for a wide range of applications, including credit card fraud protection, target marketing, and network intrusion detection. Conventional knowledge discovery tools face two challenges: the overwhelming volume of the streaming data and the concept drifts. In this paper, we propose a general framework for mining concept-drifting data streams using weighted ensemble classifiers. We train an ensemble of classification models, such as C4.5, RIPPER, and naive Bayesian, from sequential chunks of the data stream. The classifiers in the ensemble are judiciously weighted based on their expected classification accuracy on the test data under the time-evolving environment. Thus, the ensemble approach improves both the efficiency in learning the model and the accuracy in performing classification. Our empirical study shows that the proposed methods have a substantial advantage over single-classifier approaches in prediction accuracy, and that the ensemble framework is effective for a variety of classification models.
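The chunk-based, accuracy-weighted scheme the abstract describes can be made concrete with a short sketch. The code below is a minimal illustration rather than the paper's exact algorithm: it uses a scikit-learn decision tree as a stand-in for C4.5, scores each ensemble member by how much better than a class-distribution random guesser it performs on the most recent chunk, and retains the K best members. The class name WeightedEnsemble, the capacity K, and the integer-label assumption are illustrative choices, not from the paper.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class WeightedEnsemble:
    """Chunk-trained, accuracy-weighted classifier ensemble (sketch)."""

    def __init__(self, capacity=10):
        self.capacity = capacity      # K: maximum number of retained members
        self.members = []             # list of (weight, classifier) pairs

    @staticmethod
    def _mse(clf, X, y):
        # Mean squared error of the probability assigned to the true class.
        proba = clf.predict_proba(X)
        cols = np.searchsorted(clf.classes_, y)   # map labels to proba columns
        return np.mean((1.0 - proba[np.arange(len(y)), cols]) ** 2)

    def update(self, X_chunk, y_chunk):
        # Baseline: MSE of a classifier that guesses according to the class
        # distribution of the current chunk. Members scoring worse than this
        # random baseline get zero weight. Labels are assumed to be integers
        # 0..C-1 (required by np.bincount).
        p = np.bincount(y_chunk) / len(y_chunk)
        mse_r = float(np.sum(p * (1.0 - p) ** 2))

        # Train a new member on the incoming chunk, then re-weight every
        # candidate against that chunk, the freshest view of the concept.
        new_clf = DecisionTreeClassifier().fit(X_chunk, y_chunk)
        candidates = [new_clf] + [clf for _, clf in self.members]
        weighted = [(max(0.0, mse_r - self._mse(c, X_chunk, y_chunk)), c)
                    for c in candidates]
        weighted.sort(key=lambda wc: wc[0], reverse=True)
        self.members = weighted[:self.capacity]   # keep only the top K

    def predict(self, X):
        # Weighted vote over class-probability estimates. Assumes every
        # chunk contains all classes, so members share a classes_ ordering.
        votes = sum(w * clf.predict_proba(X) for w, clf in self.members)
        classes = self.members[0][1].classes_
        return classes[np.argmax(votes, axis=1)]
```

A driver loop would call update once per arriving chunk and predict between chunks. Pruning down to the K strongest members after each chunk is what keeps both training cost and memory bounded as the stream grows, and it is also how outdated models, whose weights decay once the concept drifts, fall out of the ensemble.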

References

  1. B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. In ACM Symposium on Principles of Database Systems (PODS), 2002.
  2. S. Babu and J. Widom. Continuous queries over data streams. SIGMOD Record, 30:109--120, 2001.
  3. Eric Bauer and Ron Kohavi. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36(1--2):105--139, 1999.
  4. Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang. Multi-dimensional regression analysis of time-series data streams. In Proc. of Very Large Data Bases (VLDB), Hong Kong, China, 2002.
  5. William Cohen. Fast effective rule induction. In Int'l Conf. on Machine Learning (ICML), pages 115--123, 1995.
  6. P. Domingos. A unified bias-variance decomposition and its applications. In Int'l Conf. on Machine Learning (ICML), pages 231--238, 2000.
  7. P. Domingos and G. Hulten. Mining high-speed data streams. In Int'l Conf. on Knowledge Discovery and Data Mining (SIGKDD), pages 71--80, Boston, MA, 2000. ACM Press.
  8. W. Fan, H. Wang, P. Yu, and S. Lo. Progressive modeling. In Int'l Conf. on Data Mining (ICDM), 2002.
  9. W. Fan, H. Wang, P. Yu, and S. Lo. Inductive learning in less than one sequential scan. In Int'l Joint Conf. on Artificial Intelligence (IJCAI), 2003.
  10. W. Fan, H. Wang, P. Yu, and S. Stolfo. A framework for scalable cost-sensitive learning based on combining probabilities and benefits. In SIAM Int'l Conf. on Data Mining (SDM), 2002.
  11. Wei Fan, Fang Chu, Haixun Wang, and Philip S. Yu. Pruning and dynamic scheduling of cost-sensitive ensembles. In Proc. of the 18th National Conference on Artificial Intelligence (AAAI), 2002.
  12. Yoav Freund and Robert E. Schapire. Experiments with a new boosting algorithm. In Int'l Conf. on Machine Learning (ICML), pages 148--156, 1996.
  13. L. Gao and X. Wang. Continually evaluating similarity-based pattern queries on a streaming time series. In Int'l Conf. on Management of Data (SIGMOD), Madison, Wisconsin, June 2002.
  14. J. Gehrke, V. Ganti, R. Ramakrishnan, and W. Loh. BOAT: optimistic decision tree construction. In Int'l Conf. on Management of Data (SIGMOD), 1999.
  15. S. Geman, E. Bienenstock, and R. Doursat. Neural networks and the bias/variance dilemma. Neural Computation, 4(1):1--58, 1992.
  16. M. Greenwald and S. Khanna. Space-efficient online computation of quantile summaries. In Int'l Conf. on Management of Data (SIGMOD), pages 58--66, Santa Barbara, CA, May 2001.
  17. S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams. In IEEE Symposium on Foundations of Computer Science (FOCS), pages 359--366, 2000.
  18. L. Hall, K. Bowyer, W. Kegelmeyer, T. Moore, and C. Chao. Distributed learning on very large data sets. In Workshop on Distributed and Parallel Knowledge Discovery, 2000.
  19. G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. In Int'l Conf. on Knowledge Discovery and Data Mining (SIGKDD), pages 97--106, San Francisco, CA, 2001. ACM Press.
  20. J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
  21. J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier for data mining. In Proc. of Very Large Data Bases (VLDB), 1996.
  22. S. Stolfo, W. Fan, W. Lee, A. Prodromidis, and P. Chan. Credit card fraud detection using meta-learning: Issues and initial results. In AAAI-97 Workshop on Fraud Detection and Risk Management, 1997.
  23. W. Nick Street and YongSeog Kim. A streaming ensemble algorithm (SEA) for large-scale classification. In Int'l Conf. on Knowledge Discovery and Data Mining (SIGKDD), 2001.
  24. Kagan Tumer and Joydeep Ghosh. Error correlation and error reduction in ensemble classifiers. Connection Science, 8(3--4):385--403, 1996.
  25. P. E. Utgoff. Incremental induction of decision trees. Machine Learning, 4:161--186, 1989.

Published in

KDD '03: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2003, 736 pages
ISBN: 1581137370
DOI: 10.1145/956750

Copyright © 2003 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

KDD '03 paper acceptance rate: 46 of 298 submissions, 15%.
Overall acceptance rate: 1,133 of 8,635 submissions, 13%.
