poster

Generalising multiple capture-recapture to non-uniform sample sizes

Author:
Paul Thomas

CSIRO ICT Centre, Canberra, Australia


SIGIR '08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, July 2008, pages 839–840. https://doi.org/10.1145/1390334.1390531

Published: 20 July 2008


ABSTRACT

Algorithms in distributed information retrieval often rely on accurate knowledge of the size of a collection. The "multiple capture-recapture" method of Shokouhi et al. is one of the more reliable algorithms for determining collection size, but it relies on samples with a uniform number of documents. Such uniform samples are often hard to obtain in a working system.

A simple generalisation of multiple capture-recapture does not rely on uniform sample sizes. Simulations show it is as accurate as the original method even when sample sizes vary considerably, making it a useful technique in real tools.
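The abstract does not state the estimator itself, so the following is only a minimal sketch of one natural way to drop the uniform-size requirement, not necessarily the formula used in the poster. It assumes that two independent random samples of sizes m_i and m_j drawn from a collection of N documents share about m_i * m_j / N documents, which suggests estimating N as the sum of m_i * m_j over all pairs of samples divided by the total number of pairwise duplicates. The function names `estimate_collection_size` and `simulate`, and all parameter values, are hypothetical.

```python
import random
from itertools import combinations


def estimate_collection_size(samples):
    """Estimate collection size from samples of possibly different sizes.

    Assumed estimator (not taken from the source):
        N ~ sum over pairs (m_i * m_j) / total pairwise duplicates
    Returns None when no two samples share a document.
    """
    sets = [set(s) for s in samples]
    size_products = 0  # sum of m_i * m_j over all pairs of samples
    duplicates = 0     # documents shared by each pair, summed over pairs
    for a, b in combinations(sets, 2):
        size_products += len(a) * len(b)
        duplicates += len(a & b)
    if duplicates == 0:
        return None  # no overlap observed, so no estimate is possible
    return size_products / duplicates


def simulate(true_size=10_000, sizes=(50, 200, 120, 400, 80, 300), seed=0):
    """Draw uniform random samples of varying size and estimate true_size."""
    rng = random.Random(seed)
    samples = [rng.sample(range(true_size), m) for m in sizes]
    return estimate_collection_size(samples)


if __name__ == "__main__":
    print(simulate())
```

When every sample has the same size m and there are T samples, the numerator becomes T(T-1)m²/2, so under this assumption the sketch reduces to a uniform-size pairwise estimator. The simulation draws truly uniform random samples, which samples obtained by querying a live search engine generally are not.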

References

K. Bharat and A. Broder. A technique for measuring the relative size and overlap of public web search engines. In Proc. WWW, 1998.
J. Callan and M. Connell. Query-based sampling of text databases. ACM Trans. Info. Systems, 19(2), 2001.
K.-L. Liu, A. Santoso, C. Yu, W. Meng, and C. Zhang. Discovering the representative of a search engine. In Proc. CIKM, 2001. Poster.
M. Shokouhi, J. Zobel, F. Scholer, and S. M. M. Tahaghoghi. Capturing collection size for distributed non-cooperative retrieval. In Proc. ACM SIGIR, 2006.
P. Thomas and D. Hawking. Evaluating sampling methods for uncooperative collections. In Proc. ACM SIGIR, 2007.

Index Terms

Generalising multiple capture-recapture to non-uniform sample sizes
1. Information systems
  1. Information retrieval
    1. Search engine architectures and scalability
      1. Distributed retrieval
      2. Peer-to-peer retrieval
  2. Information storage systems
    1. Storage architectures
      1. Distributed storage



Published in
SIGIR '08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
July 2008
934 pages
ISBN:9781605581644
DOI:10.1145/1390334
General Chairs: Tat-Seng Chua (National University of Singapore), Mun-Kew Leong (National Library Board, Singapore)
Program Chairs: Syung Hyon Myaeng (Information and Communications University, Korea), Douglas W. Oard (University of Maryland, College Park, USA), Fabrizio Sebastiani (Consiglio Nazionale delle Ricerche, Italy)
Copyright © 2008 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
size estimation
Qualifiers
- poster

Acceptance Rates
Overall Acceptance Rate: 792 of 3,983 submissions, 20%
