Abstract
In this paper, we propose methods and heuristics with high accuracy and low time complexity for filtering spam e-mails. The methods are based on the n-gram approach, and a heuristic referred to as the first n-words heuristic is devised. Although the main concern of the research is the applicability of these methods to Turkish e-mails, they were also applied to English e-mails. A data set for both languages was compiled, and extensive tests were performed with different parameters. Success rates of about 97% for Turkish e-mails and above 98% for English e-mails were obtained. In addition, it is shown that the time complexity can be reduced significantly without sacrificing the success rate.
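The abstract does not spell out the model order, the smoothing scheme, or the decision rule, so the following is only a minimal sketch of how an n-gram filter combined with a first n-words cutoff might look. The word-bigram model, add-one smoothing, the default cutoff of 50 words, and all function and class names are assumptions made for illustration, not the authors' implementation.

```python
import math
from collections import Counter


def first_n_words(text, n):
    """Keep only the first n whitespace-separated tokens, lowercased."""
    return text.lower().split()[:n]


def bigrams(tokens):
    """Return adjacent word pairs, with a start marker before the first word."""
    padded = ["<s>"] + tokens
    return list(zip(padded, padded[1:]))


class BigramModel:
    """Word-bigram counts for one class (assumed: add-one smoothing)."""

    def __init__(self):
        self.pair_counts = Counter()
        self.context_counts = Counter()
        self.vocab = set()

    def train(self, messages, n_words):
        for msg in messages:
            tokens = first_n_words(msg, n_words)
            self.vocab.update(tokens)
            for prev, word in bigrams(tokens):
                self.pair_counts[(prev, word)] += 1
                self.context_counts[prev] += 1

    def log_prob(self, message, n_words, vocab_size):
        """Smoothed log-probability of the truncated message under this class."""
        total = 0.0
        for prev, word in bigrams(first_n_words(message, n_words)):
            num = self.pair_counts[(prev, word)] + 1
            den = self.context_counts[prev] + vocab_size
            total += math.log(num / den)
        return total


def classify(message, spam_model, legit_model, n_words=50):
    """Label a message by comparing class log-probabilities (hypothetical rule)."""
    # One extra vocabulary slot stands in for words unseen during training.
    vocab_size = len(spam_model.vocab | legit_model.vocab) + 1
    spam_score = spam_model.log_prob(message, n_words, vocab_size)
    legit_score = legit_model.log_prob(message, n_words, vocab_size)
    return "spam" if spam_score > legit_score else "legitimate"


# Toy training data, invented purely for demonstration.
spam = BigramModel()
spam.train(["win free money now", "free money offer click now"], n_words=50)
legit = BigramModel()
legit.train(["meeting agenda for tomorrow", "please review the project report"], n_words=50)

print(classify("free money now", spam, legit))
print(classify("please review the agenda", spam, legit))
```

Because both training and scoring look at no more than `n_words` tokens per message, the work per e-mail is bounded by the cutoff rather than by the message length, which is one plausible way such a heuristic could lower running time.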
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Güngör, T., Çıltık, A. (2007). Developing Methods and Heuristics with Low Time Complexities for Filtering Spam Messages. In: Kedad, Z., Lammari, N., Métais, E., Meziane, F., Rezgui, Y. (eds) Natural Language Processing and Information Systems. NLDB 2007. Lecture Notes in Computer Science, vol 4592. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-73351-5_4
DOI: https://doi.org/10.1007/978-3-540-73351-5_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-73350-8
Online ISBN: 978-3-540-73351-5
eBook Packages: Computer Science, Computer Science (R0)