Abstract
The present study concerns depth-k pooling for building IR test collections. At TREC, pooled documents are traditionally presented to the assessors in random order to avoid judgement bias. In contrast, an approach that has been used widely at NTCIR is to prioritise the pooled documents based on “pseudorelevance,” in the hope of letting assessors quickly form an idea of what constitutes a relevant document and thereby judge more efficiently and reliably. While the recent TREC 2017 Common Core Track went beyond depth-k pooling and adopted a method for selecting documents to judge dynamically, even this task let the assessors process the usual depth-10 pools first: the idea was to give the assessors a “burn-in” period, which appears to echo the rationale behind the NTCIR approach. Our research questions are: (1) Which depth-k ordering strategy enables more efficient assessments, randomisation or prioritisation by pseudorelevance? (2) Similarly, which of the two strategies enables higher inter-assessor agreement? Our experiments, based on two English web search test collections with multiple sets of graded relevance assessments, suggest that randomisation outperforms prioritisation in both respects on average, although the results are statistically inconclusive. We then discuss a plan for a much larger experiment with sufficient statistical power to reach a final verdict.
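To make the two presentation strategies concrete, the sketch below builds a depth-k pool from a set of system runs and then orders it either randomly (the TREC-style presentation) or by a simple pseudorelevance score. The scoring used here (number of runs that retrieved the document within the top k, with the best rank as a tie-breaker) is only an illustrative assumption, not the exact prioritisation formula used at NTCIR or in this paper.

```python
import random
from collections import defaultdict

def build_depth_k_pool(runs, k):
    """Union of the top-k documents of every run.

    runs: dict mapping run_id -> ranked list of doc_ids (best first).
    Returns the set of pooled doc_ids.
    """
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:k])
    return pool

def randomised_order(pool, seed=0):
    """TREC-style presentation: shuffle the pooled documents."""
    docs = sorted(pool)                 # fix iteration order before shuffling
    random.Random(seed).shuffle(docs)
    return docs

def prioritised_order(pool, runs, k):
    """NTCIR-style presentation: sort by a pseudorelevance score.

    Here the score is the number of runs that returned the document
    within their top k ("votes"), with the best rank across runs as a
    tie-breaker.  This scoring is an illustrative assumption only.
    """
    votes = defaultdict(int)
    best_rank = defaultdict(lambda: k + 1)
    for ranking in runs.values():
        for rank, doc in enumerate(ranking[:k], start=1):
            if doc in pool:
                votes[doc] += 1
                best_rank[doc] = min(best_rank[doc], rank)
    return sorted(pool, key=lambda d: (-votes[d], best_rank[d], d))

if __name__ == "__main__":
    runs = {
        "runA": ["d3", "d1", "d7", "d2"],
        "runB": ["d1", "d5", "d3", "d9"],
    }
    pool = build_depth_k_pool(runs, k=3)
    print(randomised_order(pool))            # random presentation order
    print(prioritised_order(pool, runs, 3))  # pseudorelevance-first order
```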
Notes
1.
2.
3. The “sort by document number” advice from TREC should not be taken literally: if the publication date is embedded in the document identifier, then sorting by document ID would mean sorting by time, which is not what we want. Similarly, if the target document collection consists of multiple subcollections and the document IDs contain different prefixes accordingly, such a sort would actually cluster documents by source (see [5]), which again is not what we want. Throughout this study, we interpret the advice from TREC as “randomise”.
4.
5.
6. Although it is debatable whether making fewer judgement corrections is better, it does imply higher efficiency.
7. We refrain from treating the official assessments as the gold data: we argue that they are also just one version of qrels.
8. Microsoft version of normalised discounted cumulative gain, cutoff version of Q-measure, and normalised expected reciprocal rank, respectively [13].
9. “It is astonishing how many papers report work in which a slight effect is investigated with a small number of trials. Given that such investigations would generally fail even if the hypothesis was correct, it seems likely that many interesting research questions are unnecessarily discarded.” [22, p. 225].
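The concern quoted in note 9 is exactly what a prospective power analysis guards against, and it motivates the larger experiment planned in the abstract. As a minimal sketch, assuming the per-topic efficiency (or agreement) scores are compared with a paired t-test, the number of topics needed for a given standardised effect size can be estimated with statsmodels; the effect size, significance level, and target power below are placeholders, not the values adopted in the paper.

```python
# Prospective sample-size estimate for a paired t-test, using statsmodels.
# The effect size, alpha, and power are illustrative placeholders only.
from statsmodels.stats.power import TTestPower

analysis = TTestPower()
n_topics = analysis.solve_power(
    effect_size=0.2,          # assumed standardised mean difference (Cohen's d)
    alpha=0.05,               # two-sided significance level
    power=0.80,               # desired probability of detecting the effect
    alternative="two-sided",
)
print(f"Topics needed: {n_topics:.0f}")
```

With a small assumed effect (d = 0.2), the estimate comes out at roughly two hundred topics, which is why an underpowered pilot with a few dozen topics would "generally fail even if the hypothesis was correct".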
References
Allan, J., Carterette, B., Aslam, J.A., Pavlu, V., Dachev, B., Kanoulas, E.: Million query track 2007 overview. In: Proceedings of TREC 2007 (2008)
Allan, J., Harman, D., Kanoulas, E., Li, D., Van Gysel, C., Voorhees, E.: TREC common core track overview. In: Proceedings of TREC 2017 (2018)
Carterette, B., Pavlu, V., Fang, H., Kanoulas, E.: Million query track 2009 overview. In: Proceedings of TREC 2009 (2010)
Cormack, G.V., Palmer, C.R., Clarke, C.L.: Efficient construction of large test collections. In: Proceedings of ACM SIGIR 1998, pp. 282–289 (1998)
Damessie, T.T., Culpepper, J.S., Kim, J., Scholer, F.: Presentation ordering effects on assessor agreement. In: Proceedings of ACM CIKM 2018, pp. 723–732 (2018)
Eisenberg, M., Barry, C.: Order effects: a study of the possible influence of presentation order on user judgments of document relevance. J. Am. Soc. Inf. Sci. 39(5), 293–300 (1988)
Harlow, L.L., Mulaik, S.A., Steiger, J.H.: What If There Were No Significance Tests? (Classic Edition). Routledge, London (2016)
Harman, D.K.: The TREC test collections. In: Voorhees, E.M., Harman, D.K. (eds.) TREC: Experiment and Evaluation in Information Retrieval (Chapter 2). The MIT Press, Cambridge (2005)
Huang, M.H., Wang, H.Y.: The influence of document presentation order and number of documents judged on users’ judgments of relevance. J. Am. Soc. Inf. Sci. 55(11), 970–979 (2004)
Kando, N.: Evaluation of information access technologies at the NTCIR workshop. In: Peters, C., Gonzalo, J., Braschler, M., Kluck, M. (eds.) CLEF 2003. LNCS, vol. 3237, pp. 29–43. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-30222-3_4
Losada, D.E., Parapar, J., Barreiro, Á.: Multi-armed bandits for ordering judgements in pooling-based evaluation. Inf. Process. Manag. 53(3), 1005–1025 (2017)
Losada, D.E., Parapar, J., Barreiro, Á.: When to stop making relevance judgments? A study of stopping methods for building information retrieval test collections. J. Assoc. Inf. Sci. Technol. 70(1), 49–60 (2018)
Luo, C., Sakai, T., Liu, Y., Dou, Z., Xiong, C., Xu, J.: Overview of the NTCIR-13 We Want Web task. In: Proceedings of NTCIR-13, pp. 394–401 (2017)
Mao, J., Sakai, T., Luo, C., Xiao, P., Liu, Y., Dou, Z.: Overview of the NTCIR-14 We Want Web task. In: Proceedings of NTCIR-14 (2019)
Rosenthal, R.: The “file drawer problem” and tolerance for null results. Psychol. Bull. 86(3), 638–641 (1979)
Sakai, T.: Statistical significance, power, and sample sizes: a systematic review of SIGIR and TOIS, 2006–2015. In: Proceedings of ACM SIGIR 2016, pp. 5–14 (2016)
Sakai, T.: Laboratory Experiments in Information Retrieval: Sample Sizes, Effect Sizes, and Statistical Power. TIRS, vol. 40. Springer, Singapore (2018). https://doi.org/10.1007/978-981-13-1199-4
Sakai, T.: How to run an evaluation task. Information Retrieval Evaluation in a Changing World. TIRS, vol. 41, pp. 71–102. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-22948-1_3
Sakai, T., et al.: Overview of the NTCIR-7 ACLIA IR4QA task. In: Proceedings of NTCIR-7, pp. 77–114 (2008)
Voorhees, E.M.: The philosophy of information retrieval evaluation. In: Peters, C., Braschler, M., Gonzalo, J., Kluck, M. (eds.) CLEF 2001. LNCS, vol. 2406, pp. 355–370. Springer, Heidelberg (2002). https://doi.org/10.1007/3-540-45691-0_34
Zobel, J.: How reliable are the results of large-scale information retrieval experiments? In: Proceedings of ACM SIGIR 1998, pp. 307–314 (1998)
Zobel, J.: Writing for Computer Science, 3rd edn. Springer, London (2014). https://doi.org/10.1007/978-1-4471-6639-9
Acknowledgements
This work was partially supported by JSPS KAKENHI Grant Number 16H01756.