Spam Detection Using Character N-Grams

Kanaris, Ioannis; Kanaris, Konstantinos; Stamatatos, Efstathios

doi:10.1007/11752912_12

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3955)

Included in the following conference series:

Hellenic Conference on Artificial Intelligence

Abstract

This paper presents a content-based approach to spam detection based on low-level information. Instead of the traditional 'bag of words' representation, we use a 'bag of character n-grams' representation, which avoids the sparse data problem that arises with word-level n-grams. Moreover, it is language-independent and requires neither a lemmatizer nor 'deep' text preprocessing. Based on experiments on the Ling-Spam corpus, we evaluate the proposed representation in combination with support vector machines. Both binary and term-frequency representations achieve high precision while maintaining recall at an equally high level, which is crucial for anti-spam filters, a cost-sensitive application.
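
As a rough illustration of the 'bag of character n-grams' idea, the sketch below builds binary and term-frequency vectors from raw message text using only the Python standard library. It is not the authors' implementation: the function names, the choice of n = 3, the vocabulary built from the most frequent n-grams, and the length-normalised term frequency are all assumptions made for the example. The resulting vectors are the kind of input a support vector machine could be trained on.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text` (spaces included)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_vocabulary(messages, n, max_features=None):
    """Map n-grams seen in the training messages to column indices.

    Assumption: when `max_features` is given, the most frequent n-grams
    are kept. The paper may use a different feature selection criterion.
    """
    counts = Counter()
    for msg in messages:
        counts.update(char_ngrams(msg, n))
    most_common = counts.most_common(max_features)
    return {gram: idx for idx, (gram, _) in enumerate(most_common)}


def vectorize(message, vocab, n, binary=False):
    """Turn one message into a binary or term-frequency feature vector.

    Assumption: term frequency is the n-gram count divided by the total
    number of n-grams in the message.
    """
    counts = Counter(char_ngrams(message, n))
    total = sum(counts.values()) or 1  # guard against messages shorter than n
    vec = [0.0] * len(vocab)
    for gram, count in counts.items():
        idx = vocab.get(gram)
        if idx is not None:  # n-grams outside the vocabulary are ignored
            vec[idx] = 1.0 if binary else count / total
    return vec


# Toy data, invented for this example only.
messages = ["buy cheap meds now", "meeting notes attached"]
vocab = build_vocabulary(messages, n=3)
binary_vec = vectorize("cheap meds", vocab, n=3, binary=True)
tf_vec = vectorize("cheap meds", vocab, n=3)
```

No tokenizer, stop-word list, or lemmatizer appears anywhere in the sketch, which is what makes a representation of this kind language-independent.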

Author information

Authors and Affiliations

Dept. of Information and Communication Systems Eng., University of the Aegean, 83200, Karlovassi, Greece
Ioannis Kanaris & Efstathios Stamatatos
Dept. of Mathematics, University of the Aegean, 83200, Karlovassi, Greece
Konstantinos Kanaris

Editor information

Editors and Affiliations

Computer Science Department of University of Crete, Greece
Grigoris Antoniou
Institute of Computer Science, Foundation for Research & Technology – Hellas (FORTH), Vassilika Vouton, P.O. Box 1385, 71110, Heraklion, Greece
George Potamias
Institute of Informatics and Telecommunications, NCSR "Demokritos", 15310 A., Paraskevi Attikis, Greece
Costas Spyropoulos
Institute of Computer Science, FO.R.T.H., Vassilika Vouton, P.O. Box 1385, GR 71110, Heraklion, Greece
Dimitris Plexousakis

About this paper

Cite this paper

Kanaris, I., Kanaris, K., Stamatatos, E. (2006). Spam Detection Using Character N-Grams. In: Antoniou, G., Potamias, G., Spyropoulos, C., Plexousakis, D. (eds) Advances in Artificial Intelligence. SETN 2006. Lecture Notes in Computer Science, vol 3955. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11752912_12

DOI: https://doi.org/10.1007/11752912_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-34117-8
Online ISBN: 978-3-540-34118-5
eBook Packages: Computer Science, Computer Science (R0)
