On-line spam filter fusion

Authors:

Thomas R. Lynam,

Gordon V. Cormack,

David R. Cheriton

SIGIR '06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval

Pages 123 - 130

https://doi.org/10.1145/1148170.1148195

Published: 06 August 2006

Abstract

We show that a set of independently developed spam filters may be combined in simple ways to provide substantially better filtering than any of the individual filters. The results of fifty-three spam filters evaluated at the TREC 2005 Spam Track were combined post-hoc so as to simulate the parallel on-line operation of the filters. The combined results were evaluated using the TREC methodology, yielding more than a factor of two improvement over the best filter. The simplest method -- averaging the binary classifications returned by the individual filters -- yields a remarkably good result. A new method -- averaging log-odds estimates based on the scores returned by the individual filters -- yields a somewhat better result, and provides input to SVM- and logistic-regression-based stacking methods. The stacking methods appear to provide further improvement, but only for very large corpora. Of the stacking methods, logistic regression yields the better result. Finally, we show that it is possible to select a priori small subsets of the filters that, when combined, still outperform the best individual filter by a substantial margin.
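The two simplest combination rules named in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each filter returns a binary vote (1 for spam, 0 for ham) and a score that is already a probability of spam in (0, 1). The abstract does not say how log-odds are estimated from raw filter scores, so that calibration step is assumed away here. The function names `vote_fusion`, `log_odds`, and `log_odds_fusion` are hypothetical.

```python
import math


def vote_fusion(votes):
    """Average binary votes (1 = spam, 0 = ham) from several filters."""
    return sum(votes) / len(votes)


def log_odds(p, eps=1e-6):
    """Convert a probability estimate to log-odds.

    The probability is clamped to [eps, 1 - eps] so that a filter reporting
    exactly 0 or 1 does not produce an infinite value.
    """
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def log_odds_fusion(probabilities):
    """Average the per-filter log-odds; a positive result leans towards spam."""
    return sum(log_odds(p) for p in probabilities) / len(probabilities)


# Hypothetical outputs of four filters for a single message.
votes = [1, 1, 0, 1]
probs = [0.9, 0.8, 0.4, 0.95]  # assumed to be calibrated spam probabilities

fused_vote = vote_fusion(votes)       # fraction of filters voting "spam"
fused_logit = log_odds_fusion(probs)  # mean log-odds across filters
is_spam = fused_logit > 0.0           # threshold 0 corresponds to even odds
```

Both functions return a single fused score per message, so either can be thresholded or ranked in the same way as the output of an individual filter. The threshold of 0 on the fused log-odds is an illustrative choice, not one taken from the paper.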

Index Terms

On-line spam filter fusion
1. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Document filtering
      2. Information extraction

Published In

SIGIR '06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval

August 2006

768 pages

ISBN:1595933697

DOI:10.1145/1148170

General Chair: Efthimis N. Efthimiadis, University of Washington

Program Chairs: Susan Dumais, Microsoft Research, Redmond; David Hawking, CSIRO ICT Centre, Canberra, Australia; Kalervo Järvelin, University of Tampere, Finland

Copyright © 2006 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

SIGIR06

SIGIR06: The 29th Annual International SIGIR Conference

August 6 - 11, 2006

Seattle, Washington, USA

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%
