
On Annotation Methodologies for Image Search Evaluation

Published: 27 March 2019

Abstract

Image search engines differ significantly from general web search engines in how they present search results. This difference leads to different interaction and examination behavior patterns, and therefore requires changes in evaluation methodology. However, image search evaluation still relies on methods designed for general web search. In particular, offline metrics are calculated from coarse-grained topical relevance judgments under the assumption that users examine results sequentially.
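As a concrete illustration of the sequential-examination assumption mentioned above, the following minimal Python sketch computes nDCG from graded topical relevance judgments. It is not the exact metric configuration used in the article, and the sample judgments are hypothetical.

```python
import math

def dcg_at_k(relevance, k):
    """Discounted Cumulative Gain: each result's gain is discounted by its
    rank, encoding the assumption that users scan results top-down."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevance[:k]))

def ndcg_at_k(relevance, k):
    """Normalize DCG by the DCG of an ideally ordered result list."""
    ideal = dcg_at_k(sorted(relevance, reverse=True), k)
    return dcg_at_k(relevance, k) / ideal if ideal > 0 else 0.0

# Hypothetical four-level (0-3) topical relevance judgments for one query.
judgments = [3, 2, 0, 3, 1, 0, 2, 0, 0, 1]
print(f"nDCG@5 = {ndcg_at_k(judgments, 5):.3f}")
```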
In this article, we investigate annotation methods via crowdsourcing for image search evaluation based on a lab-based user study. Using user satisfaction as the golden standard, we make several interesting findings. First, instead of item-based annotation, annotating relevance in a row-based way is more efficient without hurting performance. Second, besides topical relevance, image quality plays a crucial role when evaluating the image search results, and the importance of image quality changes with search intent. Third, compared to traditional four-level scales, the fine-grain annotation method outperforms significantly. To our best knowledge, our work is the first to systematically study how diverse factors in data annotation impact image search evaluation. Our results suggest different strategies for exploiting the crowdsourcing to get data annotated under different conditions.
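Using satisfaction as the gold standard is typically operationalized by correlating per-query metric scores with users' satisfaction ratings. The sketch below illustrates this meta-evaluation step; the data values and the SciPy-based correlation choice are assumptions for illustration, not the article's actual data or analysis.

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-query data: an offline metric score computed from
# crowdsourced annotations, and the satisfaction rating reported by the
# user for the same query in the lab study (e.g., a 5-point scale).
metric_scores = [0.62, 0.84, 0.41, 0.95, 0.33, 0.77, 0.58, 0.90]
satisfaction  = [3, 4, 2, 5, 2, 4, 3, 5]

# An annotation method is judged better if metrics built on its labels
# correlate more strongly with user-reported satisfaction.
r, p = pearsonr(metric_scores, satisfaction)
rho, p_rho = spearmanr(metric_scores, satisfaction)
print(f"Pearson  r   = {r:.3f} (p = {p:.3f})")
print(f"Spearman rho = {rho:.3f} (p = {p_rho:.3f})")
```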




    Published In

    ACM Transactions on Information Systems, Volume 37, Issue 3
    July 2019
    335 pages
    ISSN:1046-8188
    EISSN:1558-2868
    DOI:10.1145/3320115
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 27 March 2019
    Accepted: 01 January 2019
    Revised: 01 November 2018
    Received: 01 August 2018
    Published in TOIS Volume 37, Issue 3


    Author Tags

    1. Image search
    2. crowdsourcing annotation
    3. offline evaluation
    4. user satisfaction

    Qualifiers

    • Research-article
    • Research
    • Refereed


    Cited By

    • (2024) Comparing point-wise and pair-wise relevance judgment with brain signals. Journal of the Association for Information Science and Technology. DOI: 10.1002/asi.24936. Online publication date: 18-Jun-2024.
    • (2023) Reference: An algorithm for recognizing the main melody of orchestral music based on artificial intelligence of music melody contour. Applied Mathematics and Nonlinear Sciences 9:1. DOI: 10.2478/amns.2023.1.00089. Online publication date: 28-Apr-2023.
    • (2023) Users Meet Clarifying Questions: Toward a Better Understanding of User Interactions for Search Clarification. ACM Transactions on Information Systems 41:1, 1-25. DOI: 10.1145/3524110. Online publication date: 9-Jan-2023.
    • (2022) On the effect of relevance scales in crowdsourcing relevance assessments for Information Retrieval evaluation. Information Processing and Management 58:6. DOI: 10.1016/j.ipm.2021.102688. Online publication date: 22-Apr-2022.
    • (2022) From linear to non-linear: investigating the effects of right-rail results on complex SERPs. Advances in Computational Intelligence 2:1. DOI: 10.1007/s43674-021-00028-2. Online publication date: 10-Jan-2022.
    • (2022) Shallow pooling for sparse labels. Information Retrieval Journal 25:4, 365-385. DOI: 10.1007/s10791-022-09411-0. Online publication date: 20-Jul-2022.
    • (2021) An Intent Taxonomy for Questions Asked in Web Search. Proceedings of the 2021 Conference on Human Information Interaction and Retrieval, 85-94. DOI: 10.1145/3406522.3446027. Online publication date: 14-Mar-2021.
    • (2020) Preference-based Evaluation Metrics for Web Image Search. Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 369-378. DOI: 10.1145/3397271.3401146. Online publication date: 25-Jul-2020.
    • (2019) Towards Context-Aware Evaluation for Image Search. Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 1209-1212. DOI: 10.1145/3331184.3331343. Online publication date: 18-Jul-2019.
