Abstract
Because of the volume of spam email and its evolving nature, any deployed Machine Learning-based spam filtering system will need procedures for case-base maintenance. Key to this will be procedures that edit the case-base to remove noise and eliminate redundancy. In this paper we present a two-stage process to do this. We present a new noise reduction algorithm called Blame-Based Noise Reduction that removes cases that are observed to cause misclassification. We also present an algorithm called Conservative Redundancy Reduction that is much less aggressive than the state-of-the-art alternatives and has significantly better generalisation performance in this domain. These new techniques are evaluated against the alternatives in the literature on four datasets of 1000 emails each (50% spam and 50% non-spam).
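The idea of removing cases that are observed to cause misclassification can be illustrated with a small sketch. This is not the Blame-Based Noise Reduction algorithm itself, whose details the abstract does not give. It is a simplified blame-counting edit written under stated assumptions: a k-nearest-neighbour classifier with majority voting, squared Euclidean distance, leave-one-out evaluation, and a removal rule (more wrong votes than right votes) chosen purely for illustration. The names `nearest`, `blame_edit` and the toy `cases` data are hypothetical.

```python
from collections import Counter

def nearest(case_base, query, k, skip):
    """Indices of the k cases closest to `query`, ignoring index `skip`."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    idx = [i for i in range(len(case_base)) if i != skip]
    return sorted(idx, key=lambda i: dist(case_base[i][0], query))[:k]

def blame_edit(case_base, k=3):
    """Assumed rule: drop cases that cast more wrong votes than right ones."""
    blame, credit = Counter(), Counter()
    for i, (features, label) in enumerate(case_base):
        neighbours = nearest(case_base, features, k, skip=i)
        votes = Counter(case_base[j][1] for j in neighbours)
        predicted = votes.most_common(1)[0][0]
        for j in neighbours:
            if case_base[j][1] != predicted:
                continue  # this neighbour voted against the outcome
            if predicted == label:
                credit[j] += 1  # helped classify case i correctly
            else:
                blame[j] += 1   # helped misclassify case i
    return [c for j, c in enumerate(case_base) if blame[j] <= credit[j]]

# Toy one-feature case-base: two "ham" cases sit inside the spam region.
cases = [((0.0,), "spam"), ((1.0,), "spam"), ((2.0,), "spam"), ((3.0,), "spam"),
         ((3.2,), "ham"), ((3.3,), "ham"),
         ((10.0,), "ham"), ((11.0,), "ham"), ((12.0,), "ham"), ((13.0,), "ham")]
edited = blame_edit(cases, k=3)
```

On this toy data the two "ham" cases at 3.2 and 3.3 outvote the spam case at 3.0 and are removed, while the spam cases they are themselves misclassified by are kept because they help classify other cases correctly. Note the contrast with editing rules that delete any case that is itself misclassified: here a case is judged by the damage it does to others.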
This research was supported by funding from Enterprise Ireland under grant no. CFTD/03/219 and funding from Science Foundation Ireland under grant no. SFI-02IN.1I111.
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Delany, S.J., Cunningham, P. (2004). An Analysis of Case-Base Editing in a Spam Filtering System. In: Funk, P., González Calero, P.A. (eds) Advances in Case-Based Reasoning. ECCBR 2004. Lecture Notes in Computer Science, vol 3155. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-28631-8_11
DOI: https://doi.org/10.1007/978-3-540-28631-8_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-22882-0
Online ISBN: 978-3-540-28631-8
eBook Packages: Springer Book Archive