
On Annotation Methodologies for Image Search Evaluation

Published: 27 March 2019

Abstract

Image search engines differ significantly from general web search engines in how they present search results. This difference leads to different interaction and examination behavior patterns, and therefore requires changes in evaluation methodology. However, image search evaluation still relies on methods designed for general web search. In particular, offline metrics are calculated from coarse-grained topical relevance judgments under the assumption that users examine results sequentially.
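As a concrete illustration of the sequential-examination assumption mentioned above, the following minimal Python sketch computes nDCG from graded topical relevance judgments. It is not the exact metric configuration used in the article, and the sample judgments are hypothetical.

```python
import math

def dcg_at_k(relevance, k):
    """Discounted Cumulative Gain: each result's gain is discounted by its
    rank, encoding the assumption that users scan results top-down."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevance[:k]))

def ndcg_at_k(relevance, k):
    """Normalize DCG by the DCG of an ideally ordered result list."""
    ideal = dcg_at_k(sorted(relevance, reverse=True), k)
    return dcg_at_k(relevance, k) / ideal if ideal > 0 else 0.0

# Hypothetical four-level (0-3) topical relevance judgments for one query.
judgments = [3, 2, 0, 3, 1, 0, 2, 0, 0, 1]
print(f"nDCG@5 = {ndcg_at_k(judgments, 5):.3f}")
```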
In this article, we investigate annotation methods via crowdsourcing for image search evaluation based on a lab-based user study. Using user satisfaction as the golden standard, we make several interesting findings. First, instead of item-based annotation, annotating relevance in a row-based way is more efficient without hurting performance. Second, besides topical relevance, image quality plays a crucial role when evaluating the image search results, and the importance of image quality changes with search intent. Third, compared to traditional four-level scales, the fine-grain annotation method outperforms significantly. To our best knowledge, our work is the first to systematically study how diverse factors in data annotation impact image search evaluation. Our results suggest different strategies for exploiting the crowdsourcing to get data annotated under different conditions.
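Using satisfaction as the gold standard is typically operationalized by correlating per-query metric scores with users' satisfaction ratings. The sketch below illustrates this meta-evaluation step; the data values and the SciPy-based correlation choice are assumptions for illustration, not the article's actual data or analysis.

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-query data: an offline metric score computed from
# crowdsourced annotations, and the satisfaction rating reported by the
# user for the same query in the lab study (e.g., a 5-point scale).
metric_scores = [0.62, 0.84, 0.41, 0.95, 0.33, 0.77, 0.58, 0.90]
satisfaction  = [3, 4, 2, 5, 2, 4, 3, 5]

# An annotation method is judged better if metrics built on its labels
# correlate more strongly with user-reported satisfaction.
r, p = pearsonr(metric_scores, satisfaction)
rho, p_rho = spearmanr(metric_scores, satisfaction)
print(f"Pearson  r   = {r:.3f} (p = {p:.3f})")
print(f"Spearman rho = {rho:.3f} (p = {p_rho:.3f})")
```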




    Published In

    ACM Transactions on Information Systems, Volume 37, Issue 3
    July 2019
    335 pages
    ISSN:1046-8188
    EISSN:1558-2868
    DOI:10.1145/3320115
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 27 March 2019
    Accepted: 01 January 2019
    Revised: 01 November 2018
    Received: 01 August 2018
    Published in TOIS Volume 37, Issue 3


    Author Tags

    1. Image search
    2. crowdsourcing annotation
    3. offline evaluation
    4. user satisfaction

    Qualifiers

    • Research-article
    • Research
    • Refereed


    Cited By

    • (2024) Comparing point-wise and pair-wise relevance judgment with brain signals. Journal of the Association for Information Science and Technology. DOI: 10.1002/asi.24936. Online publication date: 18-Jun-2024.
    • (2023) Reference: An algorithm for recognizing the main melody of orchestral music based on artificial intelligence of music melody contour. Applied Mathematics and Nonlinear Sciences 9:1. DOI: 10.2478/amns.2023.1.00089. Online publication date: 28-Apr-2023.
    • (2023) Users Meet Clarifying Questions: Toward a Better Understanding of User Interactions for Search Clarification. ACM Transactions on Information Systems 41:1, 1-25. DOI: 10.1145/3524110. Online publication date: 9-Jan-2023.
    • (2022) On the effect of relevance scales in crowdsourcing relevance assessments for Information Retrieval evaluation. Information Processing and Management 58:6. DOI: 10.1016/j.ipm.2021.102688. Online publication date: 22-Apr-2022.
    • (2022) From linear to non-linear: investigating the effects of right-rail results on complex SERPs. Advances in Computational Intelligence 2:1. DOI: 10.1007/s43674-021-00028-2. Online publication date: 10-Jan-2022.
    • (2022) Shallow pooling for sparse labels. Information Retrieval Journal 25:4, 365-385. DOI: 10.1007/s10791-022-09411-0. Online publication date: 20-Jul-2022.
    • (2021) An Intent Taxonomy for Questions Asked in Web Search. Proceedings of the 2021 Conference on Human Information Interaction and Retrieval, 85-94. DOI: 10.1145/3406522.3446027. Online publication date: 14-Mar-2021.
    • (2020) Preference-based Evaluation Metrics for Web Image Search. Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 369-378. DOI: 10.1145/3397271.3401146. Online publication date: 25-Jul-2020.
    • (2019) Towards Context-Aware Evaluation for Image Search. Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 1209-1212. DOI: 10.1145/3331184.3331343. Online publication date: 18-Jul-2019.
