poster

Online stratified sampling: evaluating classifiers at web-scale

Authors:
Paul N. Bennett, Microsoft Research, Redmond, WA, USA
Vitor R. Carvalho, Microsoft, Redmond, WA, USA

CIKM '10: Proceedings of the 19th ACM international conference on Information and knowledge management, October 2010, Pages 1581–1584. https://doi.org/10.1145/1871437.1871677

Published: 26 October 2010

ABSTRACT

Deploying a classifier to large-scale systems such as the web requires careful feature design and performance evaluation. Evaluation is particularly challenging because these large collections change frequently. In this paper, we adapt stratified sampling techniques to evaluate the precision of classifiers deployed in large-scale systems. We investigate different stratification strategies and then derive a new online sampling algorithm that incrementally approximates the theoretically optimal disproportionate sampling strategy. In experiments, the proposed algorithm significantly outperforms both simple random sampling and other types of stratified sampling, with an average reduction of about 20% in labeling effort to reach the same confidence and interval bounds on precision.
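
The idea of incrementally approximating an optimal disproportionate allocation can be illustrated with a small sketch. The code below is not the authors' algorithm; it is a generic example that assumes the classical Neyman allocation (labels per stratum proportional to stratum size times the within-stratum standard deviation) as the target, re-estimates each stratum's standard deviation as labels arrive, and greedily labels the stratum that is furthest below its target share. The function names, the add-one smoothing, and the synthetic strata are all assumptions made for illustration.

```python
import math
import random


def smoothed_std(positives, labeled):
    """Std. dev. of a 0/1 label under an add-one smoothed rate (an assumption)."""
    p = (positives + 1.0) / (labeled + 2.0)
    return math.sqrt(p * (1.0 - p))


def online_stratified_precision(strata, budget, rng):
    """Estimate precision by labeling one item at a time.

    strata: list of lists of 0/1 values (1 = the prediction was correct).
            Reading a value stands in for asking a human judge.
    Returns (precision estimate, number of labels spent per stratum).
    """
    sizes = [len(s) for s in strata]
    pools = [rng.sample(s, len(s)) for s in strata]  # shuffled copies
    labeled = [0] * len(strata)
    positives = [0] * len(strata)

    for _ in range(budget):
        candidates = [h for h in range(len(strata)) if labeled[h] < sizes[h]]
        if not candidates:
            break  # every item has already been labeled
        # Neyman target: n_h proportional to N_h * sigma_h, with sigma_h
        # re-estimated from the labels seen so far. Choosing the largest
        # weight / (labels + 1) keeps the running allocation close to
        # that proportion.
        h = max(
            candidates,
            key=lambda i: sizes[i] * smoothed_std(positives[i], labeled[i])
            / (labeled[i] + 1),
        )
        positives[h] += pools[h][labeled[h]]
        labeled[h] += 1

    # Size-weighted mean of per-stratum precision, renormalized over the
    # strata that received at least one label.
    covered = [h for h in range(len(strata)) if labeled[h] > 0]
    total = sum(sizes[h] for h in covered)
    estimate = sum(sizes[h] * positives[h] / labeled[h] for h in covered) / total
    return estimate, labeled


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical strata: one where predictions are always right, one mixed.
    strata = [[1] * 1000, [1] * 500 + [0] * 500]
    print(online_stratified_precision(strata, budget=100, rng=rng))
```

In this toy setup the mixed stratum has the higher label variance, so the greedy rule spends more of the labeling budget there than proportional allocation would, which is the behavior a disproportionate strategy is meant to produce.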

Index Terms

Information systems
  Information retrieval
    Evaluation of retrieval results

Published in
CIKM '10: Proceedings of the 19th ACM international conference on Information and knowledge management
October 2010
2036 pages
ISBN: 9781450300995
DOI: 10.1145/1871437
General Chair: Jimmy Huang (York University, Canada)
Program Chairs: Nick Koudas (University of Toronto, Canada), Gareth Jones (Dublin City University, Ireland), Xindong Wu (University of Vermont, USA), Kevyn Collins-Thompson (Microsoft Research, USA), Aijun An (York University, Canada)
Copyright © 2010 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
classification
stratified sampling
web scale
Qualifiers
- poster
Conference

Acceptance Rates
Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%