Spam Detection Using Character N-Grams

  • Conference paper
Advances in Artificial Intelligence (SETN 2006)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3955)

Abstract

This paper presents a content-based approach to spam detection based on low-level information. Instead of the traditional ‘bag of words’ representation, we use a ‘bag of character n-grams’ representation, which avoids the sparse data problem that arises with word-level n-grams. Moreover, it is language-independent and does not require a lemmatizer or any ‘deep’ text preprocessing. We evaluate the proposed representation in combination with support vector machines through experiments on the Ling-Spam corpus. Both the binary and the term-frequency representations achieve high precision while maintaining recall at an equally high level, a crucial factor for anti-spam filtering, which is a cost-sensitive application.
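To make the ‘bag of character n-grams’ idea concrete, the following is a minimal sketch of how a message might be turned into binary and term-frequency features. It is an illustration only, not the authors' implementation: the function names are hypothetical, term frequency is assumed here to be a raw count, and the abstract does not state the n-gram length or any preprocessing, so `n` is left as a parameter. The resulting feature maps would then be passed to a classifier such as a support vector machine, which is not shown.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> list[str]:
    """Return every overlapping character n-gram of `text`, spaces included."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tf_features(text: str, n: int) -> dict[str, int]:
    """Term-frequency features: n-gram -> number of occurrences.

    Assumption: raw counts; the paper may normalise differently.
    """
    return dict(Counter(char_ngrams(text, n)))


def binary_features(text: str, n: int) -> dict[str, int]:
    """Binary features: n-gram -> 1 if it occurs at least once."""
    return {gram: 1 for gram in set(char_ngrams(text, n))}


if __name__ == "__main__":
    message = "free free"          # toy input, not taken from Ling-Spam
    print(char_ngrams("spam", 3))  # ['spa', 'pam']
    print(tf_features(message, 4)["free"])      # 2
    print(binary_features(message, 4)["free"])  # 1
```

No tokenizer, stemmer, or lemmatizer appears anywhere in the sketch, since the features are read directly off the character sequence.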

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kanaris, I., Kanaris, K., Stamatatos, E. (2006). Spam Detection Using Character N-Grams. In: Antoniou, G., Potamias, G., Spyropoulos, C., Plexousakis, D. (eds) Advances in Artificial Intelligence. SETN 2006. Lecture Notes in Computer Science, vol 3955. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11752912_12

  • DOI: https://doi.org/10.1007/11752912_12

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-34117-8

  • Online ISBN: 978-3-540-34118-5

  • eBook Packages: Computer Science, Computer Science (R0)
