Text Mining for Spam Filtering

Synonyms

Commercial email filtering; Junk email filtering; Spam detection; Unsolicited commercial email filtering

Definition

Spam filtering is the process of detecting unsolicited commercial email (UCE) messages on behalf of an individual recipient or a group of recipients. Machine learning is applied to this problem to build discriminative models from labeled and unlabeled examples of spam and nonspam. Such models can serve populations of users (e.g., departments, corporations, ISP customers), or they can be personalized to reflect the judgments of an individual. An important aspect of spam detection is how the textual information contained in email is extracted and used for discrimination.
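
As a rough illustration of the definition above, the following Python sketch extracts word tokens from message text and fits a multinomial naive Bayes model to a handful of labeled examples. Naive Bayes is used here only because it is a common and compact choice, not because this entry prescribes it. The function names, the "spam"/"ham" labels, and the toy messages are all hypothetical, and a real filter would also handle headers, markup, and far larger training sets.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(examples):
    """Count documents and word occurrences per class.

    examples: iterable of (text, label) pairs with label "spam" or "ham".
    Both classes must appear at least once in the examples.
    """
    word_counts = {"spam": Counter(), "ham": Counter()}
    doc_counts = Counter()
    for text, label in examples:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, doc_counts, vocab


def spam_log_odds(text, model):
    """Return log P(spam | text) - log P(ham | text) under naive Bayes."""
    word_counts, doc_counts, vocab = model
    totals = {label: sum(c.values()) for label, c in word_counts.items()}
    # Start from the class prior estimated from document counts.
    score = math.log(doc_counts["spam"] / doc_counts["ham"])
    for token in tokenize(text):
        if token not in vocab:
            continue  # words never seen in training carry no evidence
        # Laplace (add-one) smoothing avoids zero probabilities.
        p_spam = (word_counts["spam"][token] + 1) / (totals["spam"] + len(vocab))
        p_ham = (word_counts["ham"][token] + 1) / (totals["ham"] + len(vocab))
        score += math.log(p_spam / p_ham)
    return score


# Hypothetical toy data, invented purely for illustration.
TOY_EXAMPLES = [
    ("Win cash now, claim your free prize", "spam"),
    ("Cheap pills, free offer, buy now", "spam"),
    ("Meeting agenda for the project review", "ham"),
    ("Lunch tomorrow? Please review the draft report", "ham"),
]

model = train(TOY_EXAMPLES)
```

A positive value of `spam_log_odds(message, model)` means the model considers the message more likely spam than nonspam. Training on one user's own labels instead of a pooled set would give the personalized variant mentioned in the definition.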

Motivation and Background

Spam has become the bane of existence for both Internet users and entities providing email services. Time is wasted sifting through unwanted messages, and important emails may be lost through omission or accidental deletion. According to...

Recommended Reading

  • Bratko A, Cormack GV, Filipic B, Lynam TR, Zupan B (2006) Spam filtering using statistical data compression models. J Mach Learn Res 7:2673–2698

  • Carreras X, Màrquez L (2001) Boosting trees for anti-spam email filtering. In: Proceedings of RANLP-01, the 4th international conference on recent advances in natural language processing. ACM, New York

  • Cormack GV, Lynam TR (2006) On-line supervised spam filter evaluation. ACM Trans Inf Syst 25(3):11

  • Dalvi N, Domingos P, Sanghai MS, Verma D (2004) Adversarial classification. In: Proceedings of the tenth international conference on knowledge discovery and data mining, vol 1. ACM, New York, pp 99–108

  • Drucker H, Wu D, Vapnik VN (1999) Support vector machines for spam categorization. IEEE Trans Neural Netw 10(5):1048–1054

  • Fawcett T (2003) ‘In vivo’ spam filtering: a challenge problem for data mining. KDD Explor 5(2):140–148

  • Goodman J, Yih W (2006) Online discriminative spam filter training. In: Proceedings of the third conference on email and anti-spam (CEAS-2006), Mountain View

  • Kołcz A (2005) Local sparsity control for naive bayes with extreme misclassification costs. In: Proceedings of the eleventh ACM SIGKDD international conference on knowledge discovery and data mining. ACM, New York

  • Kołcz A, Alspector J (2001) SVM-based filtering of e-mail spam with content-specific misclassification costs. In: TextDM’2001 (IEEE ICDM-2001 workshop on text mining), San Jose

  • Kołcz A, Bond M, Sargent J (2006) The challenges of service-side personalized spam filtering: scalability and beyond. In: Proceedings of the first international conference on scalable information systems (INFOSCALE). ACM, New York

  • Kołcz AM, Chowdhury A (2007) Hardening fingerprinting by context. In: Proceedings of the fourth international conference on email and anti-spam, Mountain View

  • Lowd D, Meek C (2005) Good word attacks on statistical spam filters. In: Proceedings of the second conference on email and anti-spam (CEAS-2005), Mountain View

  • Metsis V, Androutsopoulos I, Paliouras G (2006) Spam filtering with naive bayes – which naive bayes? In: Proceedings of the third conference on email and anti-spam (CEAS-2006), Mountain View

  • Rigoutsos I, Huynh T (2004) Chung-Kwei: a pattern-discovery-based system for the automatic identification of unsolicited e-mail messages (SPAM). In: Proceedings of the first conference on email and anti-spam (CEAS-2004), Mountain View

  • Sahami M, Dumais S, Heckerman D, Horvitz E (1998) A Bayesian approach to filtering junk email. In: AAAI workshop on learning for text categorization, Madison. AAAI technical report WS-98-05

  • Sakkis G, Androutsopoulos I, Paliouras G, Karkaletsis V, Spyropoulos CD, Stamatopoulos P (2001) Stacking classifiers for anti-spam filtering of e-mail. In: Lee L, Harman D (eds) Proceedings of empirical methods in natural language processing (EMNLP 2001), pp 44–50. http://www.cs.cornell.edu/home/llee/emnlp/proceeding.html

  • Sculley D, Wachman G (2007) Relaxed online support vector machines for spam filtering. In: Proceedings of the 30th annual international ACM SIGIR conference on research and development in information retrieval. ACM, New York

  • Segal R, Crawford J, Kephart J, Leiba B (2004) SpamGuru: an enterprise anti-spam filtering system. In: Proceedings of the first conference on email and anti-spam (CEAS-2004), Mountain View

  • Siefkes C, Assis F, Chhabra S, Yerazunis W (2004) Combining winnow and orthogonal sparse bigrams for incremental spam filtering. In: Proceedings of the European conference on principles and practice of knowledge discovery in databases. Springer, New York

  • Yoshida K, Adachi F, Washio T, Motoda H, Homma T, Nakashima A et al (2004) Density-based spam detection. In: Proceedings of the tenth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, New York, pp 486–493

Copyright information

© 2017 Springer Science+Business Media New York

Cite this entry

Kołcz, A. (2017). Text Mining for Spam Filtering. In: Sammut, C., Webb, G.I. (eds) Encyclopedia of Machine Learning and Data Mining. Springer, Boston, MA. https://doi.org/10.1007/978-1-4899-7687-1_828
