Abstract
Concept drift is a challenging problem for the machine learning and data mining community and appears frequently in real-world stream classification tasks. It is usually defined as an unforeseeable change in the concept underlying the target variable of a prediction task. In this paper, we focus on recurring contexts, a special subtype of concept drift that has not yet received adequate attention from the research community. With recurring contexts, concepts may reappear in the future, so older classification models may be useful for future classifications. We propose a general framework for classifying data streams that exploits stream clustering to dynamically build and update an ensemble of incremental classifiers. To achieve this, we propose a transformation function that maps batches of examples into a new conceptual representation model. A clustering algorithm is then applied to group batches of examples into concepts and to identify recurring contexts. The ensemble is produced by creating and maintaining an incremental classifier for every concept discovered in the data stream. An experimental study is performed using (a) two new real-world concept-drifting datasets from the email domain, (b) an instantiation of the proposed framework, and (c) five methods for dealing with drifting concepts. The results indicate the effectiveness of the proposed representation and the suitability of concept-specific classifiers for problems with recurring contexts.
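The pipeline described in the abstract (map each batch to a conceptual vector, cluster the vectors, keep one incremental classifier per cluster) can be illustrated with a minimal Python sketch. The abstract does not specify the transformation function, the clustering algorithm, or the base classifier, so everything concrete here is an assumption made for illustration: the conceptual vector is taken to be smoothed estimates of P(feature = 1 | class) over binary features, clustering is a simple distance-threshold scheme, and the names `conceptual_vector`, `CountingNaiveBayes`, `ConceptEnsemble`, and `threshold` are hypothetical.

```python
import math
from collections import defaultdict


def conceptual_vector(batch, n_features, classes=(0, 1)):
    """Map a batch of (binary feature tuple, label) pairs to one vector.

    Assumed representation (not taken from the abstract): the Laplace-smoothed
    estimate of P(feature_i = 1 | class) for every class/feature pair.
    """
    vec = []
    for c in classes:
        rows = [x for x, y in batch if y == c]
        for i in range(n_features):
            ones = sum(x[i] for x in rows)
            vec.append((ones + 1) / (len(rows) + 2))
    return vec


class CountingNaiveBayes:
    """Tiny incremental naive Bayes over binary features (illustrative only)."""

    def __init__(self, n_features, classes=(0, 1)):
        self.n, self.classes = n_features, classes
        self.class_count = defaultdict(int)
        self.ones = defaultdict(int)  # (class, feature index) -> number of 1s

    def update(self, batch):
        for x, y in batch:
            self.class_count[y] += 1
            for i, v in enumerate(x):
                self.ones[(y, i)] += v

    def predict(self, x):
        total = sum(self.class_count.values())

        def score(c):
            s = math.log((self.class_count[c] + 1) / (total + len(self.classes)))
            for i, v in enumerate(x):
                p = (self.ones[(c, i)] + 1) / (self.class_count[c] + 2)
                s += math.log(p if v else 1 - p)
            return s

        return max(self.classes, key=score)


class ConceptEnsemble:
    """One incremental classifier per cluster of conceptual vectors."""

    def __init__(self, n_features, threshold):
        self.n, self.threshold = n_features, threshold
        self.centroids, self.sizes, self.models = [], [], []
        self.active = None  # index of the concept matched by the latest batch

    def learn_batch(self, batch):
        z = conceptual_vector(batch, self.n)
        dists = [math.dist(z, c) for c in self.centroids]
        if dists and min(dists) <= self.threshold:
            # Known (possibly recurring) concept: move its centroid toward z.
            k = dists.index(min(dists))
            self.sizes[k] += 1
            m = self.sizes[k]
            self.centroids[k] = [c + (v - c) / m
                                 for c, v in zip(self.centroids[k], z)]
        else:
            # New concept: open a cluster and a fresh classifier for it.
            k = len(self.centroids)
            self.centroids.append(z)
            self.sizes.append(1)
            self.models.append(CountingNaiveBayes(self.n))
        self.models[k].update(batch)
        self.active = k
        return k

    def predict(self, x):
        return self.models[self.active].predict(x)
```

In this sketch a batch is labeled before it is clustered, and new examples are classified by the model of the most recently matched concept. When a concept reappears, its batch falls back into the existing cluster, so the classifier trained earlier on that concept is reused instead of being relearned.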
Additional information
A preliminary version of this paper appears in the proceedings of the 18th European Conference on Artificial Intelligence, Patras, Greece, 2008.
About this article
Cite this article
Katakis, I., Tsoumakas, G. & Vlahavas, I. Tracking recurring contexts using ensemble classifiers: an application to email filtering. Knowl Inf Syst 22, 371–391 (2010). https://doi.org/10.1007/s10115-009-0206-2
DOI: https://doi.org/10.1007/s10115-009-0206-2