Combining Supervised and Semi-supervised Classifier for Personalized Spam Filtering

Cheng, Victor; Li, Chun-hung

doi:10.1007/978-3-540-71701-0_45


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4426)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining


Abstract

This paper addresses the problem of spam filtering for an individual email user under the condition that only publicly available labeled emails are given as training data and all emails in the user's inbox are unlabeled. Because the wording and distribution of emails differ between the two sources, a conventional supervised classifier such as an SVM cannot produce accurate results, since it assumes that the training and testing data come from the same source and have the same distribution. We model these discrepancies as variation of the decision hyperplane and derive a criterion for selecting reliably classified emails whose labels the user is likely to agree with. A semi-supervised classifier then uses these emails as its training set and propagates the label information to the remaining unlabeled emails by exploiting their distribution in feature space. Experimental results show that this combined classifier strategy can classify emails for an individual user with high accuracy.
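
The two-stage strategy described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: a mean-difference linear scorer stands in for the SVM, a fixed score threshold (`margin`) stands in for the paper's hyperplane-variation criterion, and a simple RBF-graph label propagation stands in for the semi-supervised classifier. All function names, parameters, and the toy 2-D data are assumptions made for illustration only.

```python
import math

def train_linear_scorer(X, y):
    """Stand-in for the supervised classifier (assumption: not an SVM).
    w points from the ham mean to the spam mean; the hyperplane passes
    through the midpoint of the two class means."""
    dim = len(X[0])
    pos = [x for x, t in zip(X, y) if t == 1]
    neg = [x for x, t in zip(X, y) if t == -1]
    mu_p = [sum(x[j] for x in pos) / len(pos) for j in range(dim)]
    mu_n = [sum(x[j] for x in neg) / len(neg) for j in range(dim)]
    w = [p - n for p, n in zip(mu_p, mu_n)]
    b = -sum(wj * (p + n) / 2 for wj, p, n in zip(w, mu_p, mu_n))
    return w, b

def decision(w, b, x):
    """Signed score of x relative to the hyperplane."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def select_reliable(scores, margin):
    """Hypothetical selection rule: keep emails scored far from the hyperplane."""
    return {i: (1 if s > 0 else -1) for i, s in enumerate(scores) if abs(s) >= margin}

def propagate_labels(X, seeds, sigma=1.0, iters=50):
    """Iterative label propagation on a dense RBF graph; seed labels stay clamped."""
    n = len(X)
    W = [[0.0 if i == j else
          math.exp(-sum((a - c) ** 2 for a, c in zip(X[i], X[j])) / (2 * sigma ** 2))
          for j in range(n)] for i in range(n)]
    f = [float(seeds.get(i, 0)) for i in range(n)]
    for _ in range(iters):
        new_f = []
        for i in range(n):
            if i in seeds:
                new_f.append(float(seeds[i]))
            else:
                total = sum(W[i])
                new_f.append(sum(W[i][j] * f[j] for j in range(n)) / total if total > 0 else 0.0)
        f = new_f
    return [1 if v > 0 else -1 for v in f]

# Toy "public" labeled emails (+1 = spam, -1 = ham).
X_pub = [(2, 2), (3, 2), (2, 3), (-2, -2), (-3, -2), (-2, -3)]
y_pub = [1, 1, 1, -1, -1, -1]

# Toy unlabeled inbox whose clusters are skewed relative to the public data:
# the first three points form the user's spam cluster, the last three the ham cluster.
X_inbox = [(4, 1), (3, -1), (2.5, -3), (-4, -1), (-3, 1), (-2.5, 3)]

w, b = train_linear_scorer(X_pub, y_pub)
scores = [decision(w, b, x) for x in X_inbox]
supervised_labels = [1 if s > 0 else -1 for s in scores]

seeds = select_reliable(scores, margin=5.0)    # stage 1: confident labels only
final_labels = propagate_labels(X_inbox, seeds)  # stage 2: spread along the inbox clusters

print("supervised only:", supervised_labels)
print("seeds:", seeds)
print("after propagation:", final_labels)
```

On this toy data the supervised scorer alone mislabels the two inbox emails that lie closest to its hyperplane, while propagation from the four confidently labeled seeds assigns them to the cluster they belong to. The threshold and kernel width are arbitrary choices here and would need tuning on real email features.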



Author information

Authors and Affiliations

Department of Computer Science, Hong Kong Baptist University, Hong Kong
Victor Cheng & Chun-hung Li


Editor information

Zhi-Hua Zhou, Hang Li, Qiang Yang


About this paper

Cite this paper

Cheng, V., Li, Ch. (2007). Combining Supervised and Semi-supervised Classifier for Personalized Spam Filtering. In: Zhou, ZH., Li, H., Yang, Q. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2007. Lecture Notes in Computer Science, vol 4426. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-71701-0_45


DOI: https://doi.org/10.1007/978-3-540-71701-0_45
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-71700-3
Online ISBN: 978-3-540-71701-0
eBook Packages: Computer Science, Computer Science (R0)
