DOI: 10.1145/1148170.1148195
Article

On-line spam filter fusion

Published: 06 August 2006

ABSTRACT

We show that a set of independently developed spam filters may be combined in simple ways to provide substantially better filtering than any of the individual filters. The results of fifty-three spam filters evaluated at the TREC 2005 Spam Track were combined post-hoc so as to simulate the parallel on-line operation of the filters. The combined results were evaluated using the TREC methodology, yielding more than a factor of two improvement over the best filter. The simplest method -- averaging the binary classifications returned by the individual filters -- yields a remarkably good result. A new method -- averaging log-odds estimates based on the scores returned by the individual filters -- yields a somewhat better result, and provides input to SVM- and logistic-regression-based stacking methods. The stacking methods appear to provide further improvement, but only for very large corpora. Of the stacking methods, logistic regression yields the better result. Finally, we show that it is possible to select a priori small subsets of the filters that, when combined, still outperform the best individual filter by a substantial margin.
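
The two simplest combination rules named in the abstract can be sketched as follows. This is an illustrative sketch, not the authors' implementation: it assumes each filter's score is already a spam probability strictly between 0 and 1, whereas the paper derives its log-odds estimates from the raw scores in a way the abstract does not detail. The function names, the clipping constant `eps`, and the sample values are hypothetical.

```python
import math


def vote_fusion(verdicts):
    """Average binary classifications (1 = spam, 0 = ham) across filters."""
    return sum(verdicts) / len(verdicts)


def log_odds(p, eps=1e-6):
    """Convert a probability to log-odds, clipping to avoid infinities.

    Assumption: p is a calibrated spam probability; eps is an arbitrary guard.
    """
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))


def log_odds_fusion(probabilities):
    """Average the per-filter log-odds; a positive result leans toward spam."""
    return sum(log_odds(p) for p in probabilities) / len(probabilities)


# Hypothetical outputs of four filters for one message.
verdicts = [1, 1, 0, 1]
probabilities = [0.9, 0.8, 0.4, 0.95]

vote_score = vote_fusion(verdicts)            # fraction of filters voting spam
fused_score = log_odds_fusion(probabilities)  # mean log-odds across filters
is_spam = fused_score > 0                     # threshold of 0 is an assumption
```

Either fused score can be thresholded or ranked like the score of a single filter. The averaged log-odds also form a natural feature vector for a stacking learner such as logistic regression, which is the role the abstract describes for them.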

      • Published in

        SIGIR '06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
        August 2006
        768 pages
        ISBN:1595933697
        DOI:10.1145/1148170

        Copyright © 2006 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        Overall Acceptance Rate: 792 of 3,983 submissions, 20%
