DOI: 10.1145/956750.956778
Article

Mining concept-drifting data streams using ensemble classifiers

Published: 24 August 2003

ABSTRACT

Recently, mining data streams with concept drifts for actionable insights has become an important and challenging task for a wide range of applications, including credit card fraud protection, target marketing, and network intrusion detection. Conventional knowledge discovery tools face two challenges: the overwhelming volume of the streaming data and the concept drifts. In this paper, we propose a general framework for mining concept-drifting data streams using weighted ensemble classifiers. We train an ensemble of classification models, such as C4.5, RIPPER, and naive Bayesian, from sequential chunks of the data stream. The classifiers in the ensemble are judiciously weighted based on their expected classification accuracy on the test data under the time-evolving environment. Thus, the ensemble approach improves both the efficiency in learning the model and the accuracy in performing classification. Our empirical study shows that the proposed methods have a substantial advantage over single-classifier approaches in prediction accuracy, and that the ensemble framework is effective for a variety of classification models.
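The chunk-based, accuracy-weighted scheme the abstract describes can be made concrete with a short sketch. The code below is a minimal illustration rather than the paper's exact algorithm: it uses a scikit-learn decision tree as a stand-in for C4.5, scores each ensemble member by how much better than a class-distribution random guesser it performs on the most recent chunk, and retains the K best members. The class name WeightedEnsemble, the capacity K, and the integer-label assumption are illustrative choices, not from the paper.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class WeightedEnsemble:
    """Chunk-trained, accuracy-weighted classifier ensemble (sketch)."""

    def __init__(self, capacity=10):
        self.capacity = capacity      # K: maximum number of retained members
        self.members = []             # list of (weight, classifier) pairs

    @staticmethod
    def _mse(clf, X, y):
        # Mean squared error of the probability assigned to the true class.
        proba = clf.predict_proba(X)
        cols = np.searchsorted(clf.classes_, y)   # map labels to proba columns
        return np.mean((1.0 - proba[np.arange(len(y)), cols]) ** 2)

    def update(self, X_chunk, y_chunk):
        # Baseline: MSE of a classifier that guesses according to the class
        # distribution of the current chunk. Members scoring worse than this
        # random baseline get zero weight. Labels are assumed to be integers
        # 0..C-1 (required by np.bincount).
        p = np.bincount(y_chunk) / len(y_chunk)
        mse_r = float(np.sum(p * (1.0 - p) ** 2))

        # Train a new member on the incoming chunk, then re-weight every
        # candidate against that chunk, the freshest view of the concept.
        new_clf = DecisionTreeClassifier().fit(X_chunk, y_chunk)
        candidates = [new_clf] + [clf for _, clf in self.members]
        weighted = [(max(0.0, mse_r - self._mse(c, X_chunk, y_chunk)), c)
                    for c in candidates]
        weighted.sort(key=lambda wc: wc[0], reverse=True)
        self.members = weighted[:self.capacity]   # keep only the top K

    def predict(self, X):
        # Weighted vote over class-probability estimates. Assumes every
        # chunk contains all classes, so members share a classes_ ordering.
        votes = sum(w * clf.predict_proba(X) for w, clf in self.members)
        classes = self.members[0][1].classes_
        return classes[np.argmax(votes, axis=1)]
```

A driver loop would call update once per arriving chunk and predict between chunks. Pruning down to the K strongest members after each chunk is what keeps both training cost and memory bounded as the stream grows, and it is also how outdated models, whose weights decay once the concept drifts, fall out of the ensemble.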

References

  1. B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. In ACM Symposium on Principles of Database Systems (PODS), 2002.
  2. S. Babu and J. Widom. Continuous queries over data streams. SIGMOD Record, 30:109--120, 2001.
  3. Eric Bauer and Ron Kohavi. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36(1--2):105--139, 1999.
  4. Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang. Multi-dimensional regression analysis of time-series data streams. In Proc. of Very Large Data Bases (VLDB), Hong Kong, China, 2002.
  5. William Cohen. Fast effective rule induction. In Int'l Conf. on Machine Learning (ICML), pages 115--123, 1995.
  6. P. Domingos. A unified bias-variance decomposition and its applications. In Int'l Conf. on Machine Learning (ICML), pages 231--238, 2000.
  7. P. Domingos and G. Hulten. Mining high-speed data streams. In Int'l Conf. on Knowledge Discovery and Data Mining (SIGKDD), pages 71--80, Boston, MA, 2000. ACM Press.
  8. W. Fan, H. Wang, P. Yu, and S. Lo. Progressive modeling. In Int'l Conf. on Data Mining (ICDM), 2002.
  9. W. Fan, H. Wang, P. Yu, and S. Lo. Inductive learning in less than one sequential scan. In Int'l Joint Conf. on Artificial Intelligence (IJCAI), 2003.
  10. W. Fan, H. Wang, P. Yu, and S. Stolfo. A framework for scalable cost-sensitive learning based on combining probabilities and benefits. In SIAM Int'l Conf. on Data Mining (SDM), 2002.
  11. Wei Fan, Fang Chu, Haixun Wang, and Philip S. Yu. Pruning and dynamic scheduling of cost-sensitive ensembles. In Proc. of the 18th National Conference on Artificial Intelligence (AAAI), 2002.
  12. Yoav Freund and Robert E. Schapire. Experiments with a new boosting algorithm. In Int'l Conf. on Machine Learning (ICML), pages 148--156, 1996.
  13. L. Gao and X. Wang. Continually evaluating similarity-based pattern queries on a streaming time series. In Int'l Conf. on Management of Data (SIGMOD), Madison, Wisconsin, June 2002.
  14. J. Gehrke, V. Ganti, R. Ramakrishnan, and W. Loh. BOAT: optimistic decision tree construction. In Int'l Conf. on Management of Data (SIGMOD), 1999.
  15. S. Geman, E. Bienenstock, and R. Doursat. Neural networks and the bias/variance dilemma. Neural Computation, 4(1):1--58, 1992.
  16. M. Greenwald and S. Khanna. Space-efficient online computation of quantile summaries. In Int'l Conf. on Management of Data (SIGMOD), pages 58--66, Santa Barbara, CA, May 2001.
  17. S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams. In IEEE Symposium on Foundations of Computer Science (FOCS), pages 359--366, 2000.
  18. L. Hall, K. Bowyer, W. Kegelmeyer, T. Moore, and C. Chao. Distributed learning on very large data sets. In Workshop on Distributed and Parallel Knowledge Discovery, 2000.
  19. G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. In Int'l Conf. on Knowledge Discovery and Data Mining (SIGKDD), pages 97--106, San Francisco, CA, 2001. ACM Press.
  20. J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
  21. J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier for data mining. In Proc. of Very Large Data Bases (VLDB), 1996.
  22. S. Stolfo, W. Fan, W. Lee, A. Prodromidis, and P. Chan. Credit card fraud detection using meta-learning: Issues and initial results. In AAAI-97 Workshop on Fraud Detection and Risk Management, 1997.
  23. W. Nick Street and YongSeog Kim. A streaming ensemble algorithm (SEA) for large-scale classification. In Int'l Conf. on Knowledge Discovery and Data Mining (SIGKDD), 2001.
  24. Kagan Tumer and Joydeep Ghosh. Error correlation and error reduction in ensemble classifiers. Connection Science, 8(3--4):385--403, 1996.
  25. P. E. Utgoff. Incremental induction of decision trees. Machine Learning, 4:161--186, 1989.

Published in

KDD '03: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2003, 736 pages
ISBN: 1581137370
DOI: 10.1145/956750

Copyright © 2003 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

KDD '03 paper acceptance rate: 46 of 298 submissions, 15%.
Overall acceptance rate: 1,133 of 8,635 submissions, 13%.
