Abstract
Traditional document classification frameworks, which apply a learned classifier to each document in a corpus one by one, are infeasible for extremely large document corpora such as the Web or large corporate intranets. We consider the classification problem on a corpus that has been processed primarily for the purpose of searching, so that our access to documents is solely through the inverted index of a large-scale search engine. Our main goal is to build the “best” short query that characterizes a document class using operators normally available within search engines. We show that surprisingly good classification accuracy can be achieved on average over multiple classes by queries with as few as 10 terms. As part of our study, we enhance some of the feature-selection techniques found in the literature by forcing the inclusion of terms that are negatively correlated with the target class and by making use of term correlations; we show that both of these techniques can offer significant advantages. Moreover, we show that optimizing the efficiency of query execution by careful selection of terms can further reduce query costs. More precisely, we show that on our setup the best 10-term query can achieve 93% of the accuracy of the best SVM classifier (14,000 terms), and if we are willing to tolerate a reduction to 89% of the best SVM, we can build a 10-term query that can be executed more than twice as fast as the best 10-term query.
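The core idea of the abstract can be sketched as follows: score each vocabulary term by how strongly it correlates with the target class, then assemble a short boolean query from the top positively correlated terms plus a few NOT-clauses for strongly anti-correlated terms. This is a minimal illustrative sketch, not the paper's actual algorithm; the smoothed log-odds score, the `neg_fraction` split, and the function names are assumptions made for the example.

```python
from collections import Counter
import math

def select_query_terms(docs, labels, k=10, neg_fraction=0.3):
    """Pick k terms most correlated (or anti-correlated) with the class.

    Scores each term by smoothed log-odds of its document frequency in
    positive vs. negative documents; the most negative scores identify
    terms to include as NOT-clauses in the query.
    """
    pos_df, neg_df = Counter(), Counter()
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    for doc, y in zip(docs, labels):
        for term in set(doc.split()):
            (pos_df if y else neg_df)[term] += 1

    def log_odds(t):
        # Laplace-smoothed probability of the term appearing in each class
        p = (pos_df[t] + 1) / (n_pos + 2)
        q = (neg_df[t] + 1) / (n_neg + 2)
        return math.log(p / q)

    vocab = set(pos_df) | set(neg_df)
    # Sort ascending by score; break ties alphabetically for determinism
    ranked = sorted(vocab, key=lambda t: (log_odds(t), t))
    n_neg_terms = int(k * neg_fraction)
    negatives = ranked[:n_neg_terms]          # strongly anti-correlated
    positives = ranked[-(k - n_neg_terms):]   # strongly correlated
    return positives, negatives

def build_query(positives, negatives):
    """Render the terms as a boolean query string (- marks a NOT-clause)."""
    return " ".join(positives + [f"-{t}" for t in negatives])
```

On a toy corpus, `select_query_terms` picks class-indicative words as positive terms and off-class words as NOT-clauses, yielding a compact query that a search engine could evaluate directly over its inverted index.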
Anagnostopoulos, A., Broder, A. & Punera, K. Effective and efficient classification on a search-engine model. Knowl Inf Syst 16, 129–154 (2008). https://doi.org/10.1007/s10115-007-0102-6