Text Mining for Spam Filtering

Kołcz, Aleksander

doi:10.1007/978-1-4899-7687-1_828

Reference work entry
First Online: 01 January 2017

Synonyms

Commercial email filtering; Junk email filtering; Spam detection; Unsolicited commercial email filtering

Definition

Spam filtering is the process of detecting unsolicited commercial email (UCE) messages on behalf of an individual recipient or a group of recipients. Machine learning applied to this problem is used to create discriminating models based on labeled and unlabeled examples of spam and nonspam. Such models can serve populations of users (e.g., departments, corporations, ISP customers) or they can be personalized to reflect the judgments of an individual. An important aspect of spam detection is the way in which textual information contained in email is extracted and used for the purpose of discrimination.
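
As a rough illustration of how a discriminating model can be built from labeled examples of spam and nonspam, the sketch below trains a small multinomial naive Bayes classifier on word counts extracted from message text. It is a minimal sketch under stated assumptions, not the method of any particular filter: the tokenizer, the class name NaiveBayesSpamFilter, the add-one smoothing, the zero decision threshold, and the four toy training messages are all hypothetical choices made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesSpamFilter:
    """Multinomial naive Bayes over word counts with add-one smoothing.

    Hypothetical illustration; assumes at least one training message
    per class before log_odds() is called.
    """

    LABELS = ("spam", "ham")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = Counter()
        self.vocabulary = set()

    def train(self, text, label):
        """Add one labeled message to the model."""
        tokens = tokenize(text)
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocabulary.update(tokens)

    def log_odds(self, text):
        """Return log P(spam | text) - log P(ham | text)."""
        # Tokens never seen in training carry no evidence, so skip them.
        tokens = [t for t in tokenize(text) if t in self.vocabulary]
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label in self.LABELS:
            total_words = sum(self.word_counts[label].values())
            score = math.log(self.doc_counts[label] / total_docs)
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            scores[label] = score
        return scores["spam"] - scores["ham"]

    def is_spam(self, text, threshold=0.0):
        """Flag the message as spam when the log-odds exceed the threshold."""
        return self.log_odds(text) > threshold


# Toy training data, invented for this example only.
spam_filter = NaiveBayesSpamFilter()
spam_filter.train("Win cash now, claim your free prize", "spam")
spam_filter.train("Cheap pills, free offer, buy now", "spam")
spam_filter.train("Meeting agenda for Monday attached", "ham")
spam_filter.train("Can we reschedule lunch to Friday", "ham")

print(spam_filter.is_spam("claim your free cash prize"))     # True
print(spam_filter.is_spam("agenda for the Monday meeting"))  # False
```

Training one such model on mail pooled from many users would correspond to the population-level setting, while keeping one instance per recipient would correspond to the personalized setting. Raising the threshold is one simple way to make a filter more conservative about discarding legitimate mail.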

Motivation and Background

Spam has become the bane of existence for both Internet users and entities providing email services. Time is lost when sifting through unwanted messages, and important emails may be lost through omission or accidental deletion. According to...

Author information

Authors and Affiliations

Microsoft, One Microsoft Way, Redmond, WA, USA
Aleksander Kołcz

Editor information

Editors and Affiliations

The University of New South Wales, Sydney, NSW, Australia
Claude Sammut
Faculty of Information Technology, Monash University, Melbourne, VIC, Australia
Geoffrey I. Webb

About this entry

Cite this entry

Kołcz, A. (2017). Text Mining for Spam Filtering. In: Sammut, C., Webb, G.I. (eds) Encyclopedia of Machine Learning and Data Mining. Springer, Boston, MA. https://doi.org/10.1007/978-1-4899-7687-1_828

DOI: https://doi.org/10.1007/978-1-4899-7687-1_828
Published: 14 April 2017
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4899-7685-7
Online ISBN: 978-1-4899-7687-1
eBook Packages: Computer Science; Reference Module Computer Science and Engineering
