Combining Supervised and Semi-supervised Classifier for Personalized Spam Filtering

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2007)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4426)

Abstract

This paper addresses the problem of spam filtering for an individual email user when only labeled emails from the public domain are available as training data and all emails in the user’s inbox are unlabeled. Because wording and email distribution differ between the two sources, a conventional supervised classifier such as an SVM cannot produce accurate results: it assumes that the training and test data come from the same source and follow the same distribution. We model these discrepancies as a variation of the decision hyperplane and derive a criterion for selecting reliable emails whose classified labels are likely to agree with the user’s own judgment. A semi-supervised classifier then uses these emails as its training set and propagates the label information to the remaining unlabeled emails by exploiting their distribution in feature space. Experimental results show that this combined strategy classifies emails for an individual user with high accuracy.
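The two-stage idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not state the selection criterion or name the semi-supervised algorithm, so everything here is an assumption made for illustration. A nearest-centroid linear scorer stands in for the SVM, a fixed distance-from-boundary threshold (`margin`) stands in for the reliability criterion, and a simple clamped label propagation over a Gaussian similarity graph stands in for the semi-supervised classifier. All function names, the 2-D toy features, and the parameter values are hypothetical.

```python
import math

def train_centroid_classifier(X, y):
    """Stand-in for the supervised stage (the paper uses an SVM).
    Returns (w, b) for a linear scorer whose boundary lies halfway
    between the spam (+1) and ham (-1) centroids."""
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == -1]
    dim = len(X[0])
    mu_pos = [sum(x[d] for x in pos) / len(pos) for d in range(dim)]
    mu_neg = [sum(x[d] for x in neg) / len(neg) for d in range(dim)]
    w = [p - n for p, n in zip(mu_pos, mu_neg)]
    b = -0.5 * sum(wi * (p + n) for wi, p, n in zip(w, mu_pos, mu_neg))
    return w, b

def score(w, b, x):
    """Signed score; positive means spam, negative means ham."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def select_reliable(w, b, X_user, margin):
    """Assumed criterion: keep only user emails whose score lies at
    least `margin` away from the decision boundary."""
    seeds = {}
    for i, x in enumerate(X_user):
        s = score(w, b, x)
        if abs(s) >= margin:
            seeds[i] = 1 if s > 0 else -1
    return seeds

def propagate_labels(X_user, seeds, sigma=1.0, iterations=50):
    """Clamped label propagation: each unlabeled email repeatedly takes
    the similarity-weighted average of the other emails' soft labels,
    while seed emails keep their labels fixed."""
    n = len(X_user)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((p - q) ** 2 for p, q in zip(X_user[i], X_user[j]))
                W[i][j] = math.exp(-d2 / (2 * sigma ** 2))
    f = [float(seeds.get(i, 0)) for i in range(n)]
    for _ in range(iterations):
        new_f = []
        for i in range(n):
            if i in seeds:
                new_f.append(float(seeds[i]))
            else:
                total = sum(W[i])
                avg = sum(W[i][j] * f[j] for j in range(n)) / total if total > 0 else 0.0
                new_f.append(avg)
        f = new_f
    return [1 if v > 0 else -1 for v in f]

# Toy public-domain training set (labeled).
X_public = [[2, 2], [3, 3], [-2, -2], [-3, -3]]
y_public = [1, 1, -1, -1]

# Toy user inbox (unlabeled); its spam cluster is shifted relative to the public data.
X_user = [[4, 4], [3, 1], [2, -1], [1.5, -2], [-4, -4], [-3, -5], [-5, -3]]

w, b = train_centroid_classifier(X_public, y_public)
seeds = select_reliable(w, b, X_user, margin=10.0)
labels = propagate_labels(X_user, seeds, sigma=1.0)
print(seeds)
print(labels)
```

In this toy inbox, the email at index 3 falls on the ham side of the linear boundary learned from the public data, yet it sits next to the user's spam cluster in feature space. Because it is not confident enough to be selected as a seed, its final label comes from propagation, which assigns it to spam.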

Editor information

Zhi-Hua Zhou Hang Li Qiang Yang

Copyright information

© 2007 Springer Berlin Heidelberg

About this paper

Cite this paper

Cheng, V., Li, Ch. (2007). Combining Supervised and Semi-supervised Classifier for Personalized Spam Filtering. In: Zhou, ZH., Li, H., Yang, Q. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2007. Lecture Notes in Computer Science, vol 4426. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-71701-0_45

  • DOI: https://doi.org/10.1007/978-3-540-71701-0_45

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-71700-3

  • Online ISBN: 978-3-540-71701-0

  • eBook Packages: Computer Science, Computer Science (R0)