Tracking recurring contexts using ensemble classifiers: an application to email filtering

Katakis, Ioannis; Tsoumakas, Grigorios; Vlahavas, Ioannis

doi:10.1007/s10115-009-0206-2

Regular Paper
Published: 24 April 2009

Knowledge and Information Systems, volume 22, pages 371–391 (2010)

Ioannis Katakis¹,
Grigorios Tsoumakas¹ &
Ioannis Vlahavas¹

Abstract

Concept drift is a challenging problem for the machine learning and data mining community, and it frequently appears in real-world stream classification problems. It is usually defined as an unforeseeable change in the concept of the target variable in a prediction task. In this paper, we focus on the problem of recurring contexts, a special sub-type of concept drift that has not yet received proper attention from the research community. In the case of recurring contexts, concepts may reappear in the future, and thus older classification models might be beneficial for future classifications. We propose a general framework for classifying data streams that exploits stream clustering to dynamically build and update an ensemble of incremental classifiers. To achieve this, we propose a transformation function that maps batches of examples into a new conceptual representation model. A clustering algorithm is then applied to group batches of examples into concepts and identify recurring contexts. The ensemble is produced by creating and maintaining an incremental classifier for every concept discovered in the data stream. An experimental study is performed using (a) two new real-world concept-drifting datasets from the email domain, (b) an instantiation of the proposed framework and (c) five methods for dealing with drifting concepts. Results indicate the effectiveness of the proposed representation and the suitability of the concept-specific classifiers for problems with recurring contexts.
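The pipeline outlined in the abstract (map each batch to a conceptual vector, cluster those vectors into concepts, keep one incremental classifier per concept) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the transformation (`conceptual_vector`, joint feature/class frequencies over binary features), the clustering rule (nearest centroid with a distance `threshold`), the stand-in classifier (`CountingNB`, a small incremental naive Bayes), and the choice to predict with the classifier of the most recently learned batch.

```python
import math
from collections import defaultdict


class CountingNB:
    """Tiny incremental naive Bayes over binary features (illustrative stand-in)."""

    def __init__(self, n_features):
        self.class_counts = defaultdict(int)
        self.feat_counts = defaultdict(lambda: [0] * n_features)

    def update(self, X, y):
        # Incremental update: only counts are stored, never the examples.
        for x, label in zip(X, y):
            self.class_counts[label] += 1
            counts = self.feat_counts[label]
            for i, v in enumerate(x):
                counts[i] += v

    def predict(self, x):
        total = sum(self.class_counts.values())
        best, best_lp = None, -math.inf
        for label, n in self.class_counts.items():
            lp = math.log(n / total)
            for i, v in enumerate(x):
                p = (self.feat_counts[label][i] + 1) / (n + 2)  # Laplace smoothing
                lp += math.log(p if v else 1 - p)
            if lp > best_lp:
                best, best_lp = label, lp
        return best


def conceptual_vector(X, y, labels):
    """Assumed transformation: joint frequency of (feature i = 1, class = label)."""
    n = len(X)
    vec = []
    for label in labels:
        rows = [x for x, t in zip(X, y) if t == label]
        for i in range(len(X[0])):
            vec.append(sum(r[i] for r in rows) / n)
    return vec


class ConceptEnsemble:
    """One incremental classifier per concept (cluster of batch vectors)."""

    def __init__(self, n_features, labels, threshold):
        self.n_features = n_features
        self.labels = labels
        self.threshold = threshold  # assumed: max distance to join a cluster
        self.centroids, self.sizes, self.models = [], [], []
        self.active = None  # index of the concept seen in the latest batch

    def learn_batch(self, X, y):
        vec = conceptual_vector(X, y, self.labels)
        dists = [math.dist(vec, c) for c in self.centroids]
        k = min(range(len(dists)), key=dists.__getitem__) if dists else None
        if k is None or dists[k] > self.threshold:
            # No close cluster: treat the batch as a new concept.
            self.centroids.append(vec)
            self.sizes.append(1)
            self.models.append(CountingNB(self.n_features))
            k = len(self.centroids) - 1
        else:
            # Recurring concept: move the centroid toward the new batch.
            m = self.sizes[k]
            self.centroids[k] = [(c * m + v) / (m + 1)
                                 for c, v in zip(self.centroids[k], vec)]
            self.sizes[k] = m + 1
        self.models[k].update(X, y)
        self.active = k
        return k

    def predict(self, x):
        if self.active is None:
            return None
        return self.models[self.active].predict(x)
```

In this sketch, feeding a batch whose feature/class relationship matches an earlier batch returns the earlier concept index, so the classifier trained on that concept is reused and updated rather than relearned from scratch.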

Author information

Authors and Affiliations

Department of Informatics, Aristotle University of Thessaloniki, 54124, Thessaloniki, Greece
Ioannis Katakis, Grigorios Tsoumakas & Ioannis Vlahavas

Corresponding author

Correspondence to Ioannis Katakis.

Additional information

A preliminary version of this paper appears in the proceedings of the 18th European Conference on Artificial Intelligence, Patras, Greece, 2008.

About this article

Cite this article

Katakis, I., Tsoumakas, G. & Vlahavas, I. Tracking recurring contexts using ensemble classifiers: an application to email filtering. Knowl Inf Syst 22, 371–391 (2010). https://doi.org/10.1007/s10115-009-0206-2

Received: 16 July 2008
Revised: 31 January 2009
Accepted: 15 March 2009
Published: 24 April 2009
Issue Date: March 2010
DOI: https://doi.org/10.1007/s10115-009-0206-2
