Abstract
This paper addresses spam filtering for an individual email user when only public-domain labeled emails are available as training data and all emails in the user's inbox are unlabeled. Because the wording and distribution of the user's emails differ from those of the public data, a conventional supervised classifier such as an SVM cannot produce accurate results, since it assumes that the training and test data come from the same source and follow the same distribution. We model these discrepancies as variation of the decision hyperplane and derive a criterion for selecting reliably classified emails whose labels the user is likely to agree with. A semi-supervised classifier then uses these emails as its training set and propagates their label information to the remaining unlabeled emails by exploiting their distribution in feature space. Experimental results show that this combined-classifier strategy classifies emails for individual users with high accuracy.
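The two-stage idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: a centroid-difference linear classifier stands in for the SVM, a plain distance-to-hyperplane threshold stands in for the paper's selection criterion (which the abstract does not spell out), and a clamped Gaussian-kernel label propagation stands in for the semi-supervised classifier. All function names, the `margin` and `sigma` parameters, and the toy 2-D data are assumptions made for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def train_centroid_classifier(X, y):
    """Stand-in for the supervised stage (the paper uses an SVM).
    Returns (w, b) of the hyperplane halfway between the class means."""
    pos = [x for x, t in zip(X, y) if t == 1]
    neg = [x for x, t in zip(X, y) if t == -1]
    mu_p = [sum(c) / len(pos) for c in zip(*pos)]
    mu_n = [sum(c) / len(neg) for c in zip(*neg)]
    w = [p - n for p, n in zip(mu_p, mu_n)]
    mid = [(p + n) / 2 for p, n in zip(mu_p, mu_n)]
    return w, -dot(w, mid)

def select_seeds(X, w, b, margin):
    """Assumed criterion: keep only emails whose signed distance to the
    hyperplane is at least `margin`, so small hyperplane shifts would
    not flip their label. Returns {index: label}."""
    norm = math.sqrt(dot(w, w))
    seeds = {}
    for i, x in enumerate(X):
        d = (dot(w, x) + b) / norm
        if abs(d) >= margin:
            seeds[i] = 1 if d > 0 else -1
    return seeds

def propagate(X, seeds, sigma=1.0, iters=50):
    """Iterative label propagation: seed labels stay clamped, every other
    point repeatedly takes the kernel-weighted mean of its neighbours."""
    n = len(X)
    W = [[0.0 if i == j else
          math.exp(-sum((a - c) ** 2 for a, c in zip(X[i], X[j]))
                   / (2 * sigma ** 2))
          for j in range(n)] for i in range(n)]
    f = [float(seeds.get(i, 0)) for i in range(n)]
    for _ in range(iters):
        f = [float(seeds[i]) if i in seeds else dot(W[i], f) / sum(W[i])
             for i in range(n)]
    return [1 if v > 0 else -1 for v in f]

# Toy data: +1 = spam, -1 = legitimate. The user's emails are distributed
# differently from the public training set.
public_X = [(2, 2), (3, 2), (2, 3), (-2, -2), (-3, -2), (-2, -3)]
public_y = [1, 1, 1, -1, -1, -1]
user_X = [(3, 1), (2.5, 0), (2, -1), (1.5, -2),      # user's spam
          (-3, -1), (-2.5, 0), (-2, 1), (-1.5, 2)]   # user's legitimate mail

w, b = train_centroid_classifier(public_X, public_y)
supervised_only = [1 if dot(w, x) + b > 0 else -1 for x in user_X]
seeds = select_seeds(user_X, w, b, margin=1.5)
combined = propagate(user_X, seeds)
```

In this toy setting the supervised stage alone puts the last email of each user cluster on the wrong side of the public-data hyperplane, while propagation from the confidently classified seeds assigns both of them to their own cluster's label. The threshold trades off the number of seeds against their reliability.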
© 2007 Springer Berlin Heidelberg
About this paper
Cite this paper
Cheng, V., Li, Ch. (2007). Combining Supervised and Semi-supervised Classifier for Personalized Spam Filtering. In: Zhou, ZH., Li, H., Yang, Q. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2007. Lecture Notes in Computer Science, vol 4426. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-71701-0_45
DOI: https://doi.org/10.1007/978-3-540-71701-0_45
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-71700-3
Online ISBN: 978-3-540-71701-0
eBook Packages: Computer Science, Computer Science (R0)