Online Evaluation of Email Streaming Classifiers Using GNUsmail

  • Conference paper
Advances in Intelligent Data Analysis X (IDA 2011)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7014)

Abstract

Real-time email classification is challenging because it is an online task subject to concept drift. Identifying spam, where only two labels exist, has received great attention in the literature. We are nevertheless interested in classification involving multiple folders, which is an additional source of complexity. Moreover, neither cross-validation nor other sampling procedures are suitable for evaluating data streams. Therefore, other metrics, such as the prequential error, have been proposed. However, the prequential error poses some problems, which can be alleviated by using mechanisms such as fading factors. In this paper we present GNUsmail, an open-source extensible framework for email classification, and focus on its ability to perform online evaluation. GNUsmail's architecture supports incremental and online learning, and it can be used to compare different online mining methods using state-of-the-art evaluation metrics. We show how GNUsmail can be used to compare different algorithms, and it includes a tool for launching replicable experiments.
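To make the evaluation idea concrete, the following is a minimal sketch of a prequential (test-then-train) error with a fading factor. It is not taken from GNUsmail; it assumes the commonly used recursive form S_i = L_i + α·S_(i-1), B_i = 1 + α·B_(i-1), E_i = S_i / B_i, where L_i is the loss on the i-th message. The function name `prequential_error` and the parameter `alpha` are illustrative choices.

```python
from typing import Iterable, List


def prequential_error(losses: Iterable[float], alpha: float = 1.0) -> List[float]:
    """Return the prequential error after each example in the stream.

    losses: per-example losses (e.g. 0 for a correct folder, 1 for a wrong one),
            each computed BEFORE the model is trained on that example.
    alpha:  fading factor in (0, 1]; alpha = 1 gives the plain prequential
            error, smaller values discount older examples more strongly.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")

    faded_loss = 0.0   # S_i: faded sum of losses
    faded_count = 0.0  # B_i: faded count of examples
    errors: List[float] = []
    for loss in losses:
        faded_loss = loss + alpha * faded_loss
        faded_count = 1.0 + alpha * faded_count
        errors.append(faded_loss / faded_count)
    return errors


if __name__ == "__main__":
    # Hypothetical stream: two misclassified messages, then two correct ones.
    stream_losses = [1, 1, 0, 0]
    print(prequential_error(stream_losses, alpha=1.0))
    print(prequential_error(stream_losses, alpha=0.5))
```

With `alpha=1.0` the early mistakes keep weighing on the estimate, whereas a smaller `alpha` lets the error track recent performance, which is the motivation for fading factors under concept drift.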

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Carmona-Cejudo, J.M., Baena-García, M., del Campo-Ávila, J., Bifet, A., Gama, J., Morales-Bueno, R. (2011). Online Evaluation of Email Streaming Classifiers Using GNUsmail. In: Gama, J., Bradley, E., Hollmén, J. (eds) Advances in Intelligent Data Analysis X. IDA 2011. Lecture Notes in Computer Science, vol 7014. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24800-9_11

  • DOI: https://doi.org/10.1007/978-3-642-24800-9_11

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-24799-6

  • Online ISBN: 978-3-642-24800-9

  • eBook Packages: Computer Science, Computer Science (R0)
