ABSTRACT
Algorithms in distributed information retrieval often rely on accurate knowledge of the size of a collection. The "multiple capture-recapture" method of Shokouhi et al. is one of the more reliable algorithms for determining collection size, but it relies on samples with a uniform number of documents. Such uniform samples are often hard to obtain in a working system.
A simple generalisation of multiple capture-recapture does not rely on uniform sample sizes. Simulations show it is as accurate as the original method even when sample sizes vary considerably, making it a useful technique in real tools.
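The idea can be illustrated with a minimal sketch, under assumptions not stated in the abstract: each sample is drawn uniformly at random without replacement, and the estimator is the simple moment estimator that follows from the expected overlap of two such samples, m_i * m_j / N. Summing over all pairs of samples gives N ≈ (sum over i<j of m_i * m_j) / D, where D is the total number of pairwise duplicates. This is one natural way to drop the uniform-size requirement and may differ in detail from the paper's formulation; the function names and sample sizes are hypothetical.

```python
import random
from itertools import combinations


def estimate_collection_size(samples):
    """Estimate collection size from samples whose sizes may differ.

    Assumes each sample is a uniform random draw of document ids.
    Uses the moment estimator N ~ sum(m_i * m_j) / D over all sample pairs.
    """
    size_products = 0  # sum of m_i * m_j over all pairs of samples
    duplicates = 0     # documents shared by a pair, summed over all pairs
    for a, b in combinations(samples, 2):
        set_a, set_b = set(a), set(b)
        size_products += len(set_a) * len(set_b)
        duplicates += len(set_a & set_b)
    if duplicates == 0:
        raise ValueError("no overlap between samples; cannot estimate size")
    return size_products / duplicates


def simulate(true_size=10_000, sample_sizes=(200, 500, 800, 300, 1000), seed=0):
    """Draw samples of varying size from a synthetic collection and estimate its size."""
    rng = random.Random(seed)
    docs = range(true_size)
    samples = [rng.sample(docs, m) for m in sample_sizes]
    return estimate_collection_size(samples)


if __name__ == "__main__":
    print(f"estimated size: {simulate():.0f} (true size: 10000)")
```

When every sample has the same size m and there are T samples, the numerator reduces to T(T-1)m^2/2, so the uniform case is a special instance of the same expression.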
- K. Bharat and A. Broder. A technique for measuring the relative size and overlap of public web search engines. In Proc. WWW, 1998.
- J. Callan and M. Connell. Query-based sampling of text databases. ACM Trans. Info. Systems, 19(2), 2001.
- K.-L. Liu, A. Santoso, C. Yu, W. Meng, and C. Zhang. Discovering the representative of a search engine. In Proc. CIKM, 2001. Poster.
- M. Shokouhi, J. Zobel, F. Scholer, and S. M. M. Tahaghoghi. Capturing collection size for distributed non-cooperative retrieval. In Proc. ACM SIGIR, 2006.
- P. Thomas and D. Hawking. Evaluating sampling methods for uncooperative collections. In Proc. ACM SIGIR, 2007.
Index Terms
- Generalising multiple capture-recapture to non-uniform sample sizes