Abstract
Real-time email classification is challenging because it is an online task subject to concept drift. Spam detection, where only two labels exist, has received great attention in the literature. We are instead interested in classification into multiple folders, which adds a further source of complexity. Moreover, neither cross-validation nor other sampling procedures are suitable for evaluating data streams, so alternative metrics such as the prequential error have been proposed. The prequential error has problems of its own, however, which can be alleviated with mechanisms such as fading factors. In this paper we present GNUsmail, an open-source, extensible framework for email classification, and focus on its ability to perform online evaluation. GNUsmail’s architecture supports incremental and online learning, and it can be used to compare different online mining methods using state-of-the-art evaluation metrics. We show how GNUsmail, which includes a tool for launching replicable experiments, can be used to compare different algorithms.
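To make the evaluation idea concrete, the following is a minimal sketch of a prequential (test-then-train) error computed with an optional fading factor. It is not taken from GNUsmail: the function name `prequential_error`, the parameter `alpha`, and the toy loss sequence are illustrative assumptions, and the recurrence used is the commonly cited faded-sum form, in which both the accumulated loss and the example count are multiplied by `alpha` before each new example is added.

```python
from typing import Iterable, List


def prequential_error(losses: Iterable[float], alpha: float = 1.0) -> List[float]:
    """Return the running prequential error after each example.

    `losses` holds the loss of each prediction, made before the model
    was trained on that example (e.g. 0 for a correct folder, 1 otherwise).
    `alpha` is a hypothetical name for the fading factor: alpha = 1.0
    gives the plain prequential error, and values below 1.0 weight
    recent examples more heavily.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")

    faded_loss = 0.0   # S_i = L_i + alpha * S_{i-1}
    faded_count = 0.0  # N_i = 1   + alpha * N_{i-1}
    errors = []
    for loss in losses:
        faded_loss = loss + alpha * faded_loss
        faded_count = 1.0 + alpha * faded_count
        errors.append(faded_loss / faded_count)
    return errors


# Toy stream: the classifier is wrong four times, then right four times.
toy_losses = [1, 1, 1, 1, 0, 0, 0, 0]
plain = prequential_error(toy_losses)             # no fading
faded = prequential_error(toy_losses, alpha=0.5)  # strong fading
```

On this toy stream the plain error still reports 0.5 after the last example, because early mistakes never lose weight, whereas the faded estimate ends well below that and so tracks the classifier's recent behaviour more closely. The value `alpha=0.5` is chosen only to make the effect visible on eight examples; it is not a recommended setting.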
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Carmona-Cejudo, J.M., Baena-García, M., del Campo-Ávila, J., Bifet, A., Gama, J., Morales-Bueno, R. (2011). Online Evaluation of Email Streaming Classifiers Using GNUsmail. In: Gama, J., Bradley, E., Hollmén, J. (eds) Advances in Intelligent Data Analysis X. IDA 2011. Lecture Notes in Computer Science, vol 7014. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24800-9_11
DOI: https://doi.org/10.1007/978-3-642-24800-9_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-24799-6
Online ISBN: 978-3-642-24800-9
eBook Packages: Computer Science, Computer Science (R0)