ABSTRACT
Recently, mining data streams with concept drifts for actionable insights has become an important and challenging task for a wide range of applications, including credit card fraud protection, target marketing, and network intrusion detection. Conventional knowledge discovery tools face two challenges: the overwhelming volume of the streaming data, and the concept drifts. In this paper, we propose a general framework for mining concept-drifting data streams using weighted ensemble classifiers. We train an ensemble of classification models, such as C4.5, RIPPER, and naive Bayesian, from sequential chunks of the data stream. The classifiers in the ensemble are judiciously weighted based on their expected classification accuracy on the test data under the time-evolving environment. Thus, the ensemble approach improves both the efficiency in learning the model and the accuracy in performing classification. Our empirical study shows that the proposed methods have a substantial advantage over single-classifier approaches in prediction accuracy, and that the ensemble framework is effective for a variety of classification models.
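The chunk-based weighting idea described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation, and it rests on several assumptions: the class names (`NearestCentroid`, `WeightedChunkEnsemble`) and the `max_models` parameter are hypothetical, a toy nearest-centroid learner stands in for C4.5, RIPPER, or naive Bayesian, and each member's accuracy on the most recent chunk is used as a simple proxy for its expected accuracy on upcoming test data.

```python
from collections import Counter, defaultdict


class NearestCentroid:
    """Toy base learner (a stand-in for C4.5, RIPPER, naive Bayesian, ...)."""

    def fit(self, X, y):
        counts = Counter(y)
        sums = {}
        for xi, yi in zip(X, y):
            acc = sums.setdefault(yi, [0.0] * len(xi))
            for j, v in enumerate(xi):
                acc[j] += v
        # One mean vector per class label.
        self.centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}
        return self

    def predict_one(self, x):
        def sq_dist(c):
            return sum((a - b) ** 2 for a, b in zip(x, self.centroids[c]))
        return min(self.centroids, key=sq_dist)


def accuracy(model, X, y):
    return sum(model.predict_one(xi) == yi for xi, yi in zip(X, y)) / len(y)


class WeightedChunkEnsemble:
    """Keeps the best `max_models` classifiers, weighted on the latest chunk."""

    def __init__(self, max_models=5):
        self.max_models = max_models
        self.members = []  # list of (model, weight) pairs

    def update(self, X, y):
        # Train one new model on the newest chunk, then re-weight every
        # candidate by its accuracy on that chunk. Simplification: the new
        # model is scored on its own training data, which is optimistic.
        candidates = [m for m, _ in self.members]
        candidates.append(NearestCentroid().fit(X, y))
        scored = [(m, accuracy(m, X, y)) for m in candidates]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        self.members = scored[: self.max_models]

    def predict_one(self, x):
        # Weighted vote: each member adds its weight to the label it predicts.
        votes = defaultdict(float)
        for model, weight in self.members:
            votes[model.predict_one(x)] += weight
        return max(votes, key=votes.get)


if __name__ == "__main__":
    ens = WeightedChunkEnsemble(max_models=2)
    chunk = [[0.1], [0.2], [0.8], [0.9]]
    ens.update(chunk, [0, 0, 1, 1])   # original concept
    ens.update(chunk, [1, 1, 0, 0])   # drifted concept: labels flipped
    print(ens.predict_one([0.15]))
```

In this sketch, a model trained before the drift receives a low weight once it misclassifies the newest chunk, so the weighted vote follows the current concept without retraining on the full stream.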
Index Terms
- Mining concept-drifting data streams using ensemble classifiers