Abstract
In this paper, we perform a number of experiments with large-scale query sets to analyze the retrieval bias of standard retrieval models. These experiments examine the extent to which different retrieval models differ in the retrieval bias they impose on a collection. Alongside the retrieval bias analysis, we also identify a limitation of the standard retrievability scoring function and propose a normalized retrievability scoring function. The results of the retrieval bias experiments show that when a collection has a highly skewed distribution, the standard retrievability scoring function does not account for differences in vocabulary richness across the documents of the collection. In such cases, documents with large vocabularies produce many more queries, and such documents thus have a theoretically higher probability of retrievability via a much larger number of queries. We therefore propose a normalized retrievability scoring function that mitigates this effect by normalizing the retrievability score of each document relative to its total number of queries. This provides an unbiased representation of the retrieval bias that may arise from vocabulary differences between the documents of a collection, without automatically penalizing retrieval models that favor or disfavor long documents. Finally, to examine which retrievability scoring function is more effective at correctly producing the retrievability ranks of documents, we compare the two functions using a known-item search method. Experiments on known-item search show that the normalized retrievability scoring function is more effective than the standard retrievability scoring function.
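The normalization described above can be illustrated with a minimal sketch. Assumptions not taken verbatim from the paper: a cumulative scoring function that counts a query as retrieving a document when it appears in the top-c results, a `results` mapping from each query to its ranked list of document ids, and a `doc_queries` mapping from each document to the number of queries its vocabulary can generate.

```python
def retrievability(doc, results, c=100):
    """Standard retrievability: count the queries that retrieve `doc`
    within the top-c ranks (cumulative cutoff function)."""
    return sum(1 for ranked in results.values() if doc in ranked[:c])

def normalized_retrievability(doc, results, doc_queries, c=100):
    """Normalized retrievability: divide by the number of queries the
    document can produce, so vocabulary-rich documents that simply
    generate more queries are not automatically favored."""
    n_q = doc_queries.get(doc, 0)
    return retrievability(doc, results, c) / n_q if n_q else 0.0

# Toy example: d2's richer vocabulary generates more queries (6 vs. 2),
# so its raw score is discounted accordingly.
results = {
    "q1": ["d1", "d2"],
    "q2": ["d2", "d1"],
    "q3": ["d2"],
}
doc_queries = {"d1": 2, "d2": 6}
print(retrievability("d2", results, c=1))                           # -> 2
print(normalized_retrievability("d2", results, doc_queries, c=1))   # -> 0.333...
```

With a cutoff of c=1, d2 is retrieved at rank 1 by two queries, but its normalized score of 2/6 reflects that those two successes come from a pool of six possible queries.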
Notes
Available at http://www.uspto.gov/.
The complete query set for all collections is available at http://www.ifs.tuwien.ac.at/~bashir/Analyzing_Retrievability.htm.
Cite this article
Bashir, S., Khattak, A.S. Producing efficient retrievability ranks of documents using normalized retrievability scoring function. J Intell Inf Syst 42, 457–484 (2014). https://doi.org/10.1007/s10844-013-0274-3