ABSTRACT
We show that a set of independently developed spam filters may be combined in simple ways to provide substantially better filtering than any of the individual filters. The results of fifty-three spam filters evaluated at the TREC 2005 Spam Track were combined post-hoc so as to simulate the parallel on-line operation of the filters. The combined results were evaluated using the TREC methodology, yielding more than a factor of two improvement over the best filter. The simplest method -- averaging the binary classifications returned by the individual filters -- yields a remarkably good result. A new method -- averaging log-odds estimates based on the scores returned by the individual filters -- yields a somewhat better result, and provides input to SVM- and logistic-regression-based stacking methods. The stacking methods appear to provide further improvement, but only for very large corpora. Of the stacking methods, logistic regression yields the better result. Finally, we show that it is possible to select a priori small subsets of the filters that, when combined, still outperform the best individual filter by a substantial margin.
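The two averaging methods named in the abstract can be illustrated with a minimal Python sketch. It assumes that each filter returns a binary verdict (1 for spam, 0 for ham) and a numeric score where higher means more spam-like. The abstract does not give the formula for the log-odds estimate, so the smoothed count ratio in `log_odds_estimate`, the `eps` parameter, and all function names are hypothetical stand-ins rather than the authors' method.

```python
import math
from bisect import bisect_left, bisect_right


def vote_average(verdicts):
    """Fraction of filters that classified the message as spam (1) rather than ham (0)."""
    return sum(verdicts) / len(verdicts)


def log_odds_estimate(score, spam_scores, ham_scores, eps=0.5):
    """Hypothetical log-odds-of-spam estimate for one filter's score.

    spam_scores and ham_scores are sorted lists of the scores this filter
    gave to earlier messages whose true labels are now known. The smoothed
    ratio below is an illustrative assumption, not the paper's formula.
    """
    # Earlier spam messages scored at or below the current score.
    spam_at_or_below = bisect_right(spam_scores, score)
    # Earlier ham messages scored at or above the current score.
    ham_at_or_above = len(ham_scores) - bisect_left(ham_scores, score)
    return math.log((spam_at_or_below + eps) / (ham_at_or_above + eps))


def log_odds_average(scores, histories):
    """Mean of the per-filter log-odds estimates for one message.

    scores[i] is filter i's score for the message; histories[i] is that
    filter's (sorted spam scores, sorted ham scores) on earlier messages.
    """
    estimates = [
        log_odds_estimate(s, spam, ham)
        for s, (spam, ham) in zip(scores, histories)
    ]
    return sum(estimates) / len(estimates)


if __name__ == "__main__":
    # Made-up history, shared by three filters for brevity.
    spam_hist = [0.6, 0.8, 0.9]
    ham_hist = [0.1, 0.2, 0.4]
    histories = [(spam_hist, ham_hist)] * 3

    print(vote_average([1, 1, 0]))
    print(log_odds_average([0.95, 0.7, 0.3], histories))
```

In an on-line setting the per-filter histories would grow as the true label of each message becomes known, and the fused value would be compared against a threshold to produce the final classification. The fused log-odds values could also serve as features for a stacking learner such as logistic regression, which this sketch does not cover.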
Index Terms
- On-line spam filter fusion