
Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5077)

Abstract

Unsolicited commercial e-mail (UCE), more commonly known as spam, is a growing problem on the Internet. Every day, people receive large numbers of unwanted advertising e-mails that flood their mailboxes. Fortunately, several spam-filtering approaches can detect and automatically delete such messages. However, spammers have adopted techniques that reduce the effectiveness of these filters by introducing noise into their messages. This work presents a new pre-processing technique for noise identification and reduction, and reports preliminary results obtained when it is applied with a Flexible Bayes classifier. The experimental analysis confirms the advantages of using the proposed technique to improve spam filter accuracy.



Editor information

Petra Perner

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Cid, I., Janeiro, L.R., Méndez, J.R., Glez-Peña, D., Fdez-Riverola, F. (2008). The Impact of Noise in Spam Filtering: A Case Study. In: Perner, P. (ed.) Advances in Data Mining. Medical Applications, E-Commerce, Marketing, and Theoretical Aspects. ICDM 2008. Lecture Notes in Computer Science, vol 5077. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-70720-2_18

  • DOI: https://doi.org/10.1007/978-3-540-70720-2_18

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-70717-2

  • Online ISBN: 978-3-540-70720-2

  • eBook Packages: Computer Science, Computer Science (R0)
