skip to main content
10.1145/1150402.1150496acmconferencesArticle/Chapter ViewAbstractPublication PageskddConference Proceedingsconference-collections
Article

Suppressing model overfitting in mining concept-drifting data streams

Published: 20 August 2006 Publication History

Abstract

Mining data streams of changing class distributions is important for real-time business decision support. The stream classifier must evolve to reflect the current class distribution. This poses a serious challenge. On the one hand, relying on historical data may increase the chances of learning obsolete models. On the other hand, learning only from the latest data may lead to biased classifiers, as the latest data is often an unrepresentative sample of the current class distribution. The problem is particularly acute in classifying rare events, when, for example, instances of the rare class do not even show up in the most recent training data. In this paper, we use a stochastic model to describe the concept shifting patterns and formulate this problem as an optimization one: from the historical and the current training data that we have observed, find the most-likely current distribution, and learn a classifier based on the most-likely distribution. We derive an analytic solution and approximate this solution with an efficient algorithm, which calibrates the influence of historical data carefully to create an accurate classifier. We evaluate our algorithm with both synthetic and real-world datasets. Our results show that our algorithm produces accurate and efficient classification.

References

[1]
B. Babcock, S. Babu, M. Datar, R. Motawani, and J. Widom. Models and issues in data stream systems. In PODS, 2002.
[2]
Eric Bauer and Ron Kohavi. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36(1-2):105--139, 1999.
[3]
Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang. Multi-dimensional regression analysis of time-series data streams. In VLDB, Hongkong, China, 2002.
[4]
Yun Chi, Haixun Wang, Philip S. Yu, and Richard R. Muntz. Moment: Maintaining closed frequent itemsets over a stream sliding window data streams. In ICDM, 2004.
[5]
Yun Chi, Philip S. Yu, Haixun Wang, and Richard Muntz. Loadstar: A load shedding scheme for classifying data streams. In SIAM Data Mining, 2005.
[6]
Graham Cormode and S. Muthukrishnan. Summarizing and mining skewed data streams. In SDM, 2005.
[7]
P. Domingos and G. Hulten. Mining high-speed data streams. In SIGKDD, pages 71--80, Boston, MA, 2000. ACM Press.
[8]
Yoav Freund and Robert E. Schapire. Experiments with a new boosting algorithm. In ICML, pages 148--156, 1996.
[9]
L. Gao and X. Wang. Continually evaluating similarity-based pattern queries on a streaming time series. In SIGMOD, Madison, Wisconsin, June 2002.
[10]
J. Gehrke, V. Ganti, R. Ramakrishnan, and W. Loh. BOAT - optimistic decision tree construction. In SIGMOD, 1999.
[11]
M. Greenwald and S. Khanna. Space-efficient online computation of quantile summaries. In SIGMOD, pages 58--66, Santa Barbara, CA, May 2001.
[12]
S. Guha, N. Milshra, R. Motwani, and L. O'Callaghan. Clustering data streams. In FOCS, pages 359--366, 2000.
[13]
Sudipto Guha and Boulos Harb. Wavelet synopsis for data streams: minimizing non-euclidean error. In KDD, pages 88--97, 2005.
[14]
G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. In SIGKDD, pages 97--106, San Francisco, CA, 2001. ACM Press.
[15]
Lawrence R. Rabiner. A tutorial on hidden markov models and selected applications in speech recognition. In Readings in speech recognition, pages 267--296. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1990.
[16]
W. Nick Street and YongSeog Kim. A streaming ensemble algorithm (SEA) for large-scale classification. In SIGKDD, 2001.
[17]
Haixun Wang, Wei Fan, Philip S. Yu, and Jiawei Han. Mining concept-drifting data streams using ensemble classifiers. In SIGKDD, 2003.
[18]
Peng Wang, Haixun Wang, Xiaochen Wu, Wei Wang, and Baile Shi .On reducing classifier granularity in mining concept-drifting data streams. In ICDM, 2005.
[19]
Ying Yang, Xindong Wu, and Xingquan Zhu. Combining proactive and reactive predictions for data streams. In SIGKDD, pages 710--715, 2005.

Cited By

View all
  • (2022)Solving Periodic Investment Portfolio Selection Problems by a Data-Assisted Multiobjective Evolutionary ApproachIEEE Transactions on Cybernetics10.1109/TCYB.2021.310897752:11(11418-11430)Online publication date: Nov-2022
  • (2022)Advanced IDSS Topics and ApplicationsIntelligent Decision Support Systems10.1007/978-3-030-87790-3_9(583-766)Online publication date: 29-Mar-2022
  • (2021)A novel handwritten Turkish letter recognition model based on convolutional neural networkConcurrency and Computation: Practice and Experience10.1002/cpe.642933:21Online publication date: 28-May-2021
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
KDD '06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2006
986 pages
ISBN:1595933395
DOI:10.1145/1150402
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 August 2006

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. classifier
  2. classifier ensemble
  3. concept drift
  4. data streams

Qualifiers

  • Article

Conference

KDD06

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Upcoming Conference

KDD '25

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)6
  • Downloads (Last 6 weeks)1
Reflects downloads up to 02 Mar 2025

Other Metrics

Citations

Cited By

View all
  • (2022)Solving Periodic Investment Portfolio Selection Problems by a Data-Assisted Multiobjective Evolutionary ApproachIEEE Transactions on Cybernetics10.1109/TCYB.2021.310897752:11(11418-11430)Online publication date: Nov-2022
  • (2022)Advanced IDSS Topics and ApplicationsIntelligent Decision Support Systems10.1007/978-3-030-87790-3_9(583-766)Online publication date: 29-Mar-2022
  • (2021)A novel handwritten Turkish letter recognition model based on convolutional neural networkConcurrency and Computation: Practice and Experience10.1002/cpe.642933:21Online publication date: 28-May-2021
  • (2019)Hierarchical and Robust Convolutional Neural Network for Very High-Resolution Remote Sensing Object DetectionIEEE Transactions on Geoscience and Remote Sensing10.1109/TGRS.2019.2900302(1-14)Online publication date: 2019
  • (2017)Nonlinear Dynamics of Information Diffusion in Social NetworksACM Transactions on the Web10.1145/305774111:2(1-40)Online publication date: 24-Apr-2017
  • (2017)Ecosystem on the WebWorld Wide Web10.1007/s11280-016-0389-x20:3(439-465)Online publication date: 1-May-2017
  • (2016)Regularizing Deep Convolutional Neural Networks with a Structured Decorrelation Constraint2016 IEEE 16th International Conference on Data Mining (ICDM)10.1109/ICDM.2016.0063(519-528)Online publication date: Dec-2016
  • (2015)The Web as a JungleProceedings of the 24th International Conference on World Wide Web10.1145/2736277.2741092(721-731)Online publication date: 18-May-2015
  • (2014)Event Characterization and Prediction Based on Temporal Patterns in Dynamic Data SystemIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2013.6026:1(144-156)Online publication date: 1-Jan-2014
  • (2014)Fast and accurate detection of changes in data streamsStatistical Analysis and Data Mining10.1002/sam.112167:2(125-139)Online publication date: 1-Apr-2014
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media