SIGIR '23 Conference Proceedings · Short Paper
DOI: 10.1145/3539618.3591841

Practice and Challenges in Building a Business-oriented Search Engine Quality Metric

Published: 18 July 2023

Abstract

One of the most challenging aspects of operating a large-scale web search engine is accurately evaluating and monitoring the engine's result quality across all search types. From a business perspective, in the face of such challenges, it is important to establish a universal search quality metric that can be easily understood by the entire organisation. In this paper, we introduce a model-based quality metric that uses an Explainable Boosting Machine as the classifier and online user behaviour signals as features to predict search quality. The proposed metric accounts for a variety of search types and offers good interpretability. To examine the metric's performance, we constructed a large dataset of user behaviour on search engine results pages (SERPs), with SERP quality ratings provided by professional annotators. On this dataset, we compared the model in our metric against other black-box machine learning models. We also share experiences from the organisation-wide adoption of this metric within our company that are relevant to metric design.
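As a concrete illustration of the modelling setup described in the abstract, the sketch below trains an Explainable Boosting Machine on per-SERP user behaviour features to predict a binary quality label, then averages the predicted probabilities into a slice-level score. It uses the InterpretML library's ExplainableBoostingClassifier; the feature names, synthetic data, and aggregation are illustrative assumptions, not the authors' actual signals or metric definition.

# Minimal sketch of a model-based SERP quality metric, assuming an
# InterpretML Explainable Boosting Machine as the classifier.
# All feature names and data below are hypothetical stand-ins for the
# paper's (non-public) online behaviour signals and annotator labels.
import numpy as np
import pandas as pd
from interpret.glassbox import ExplainableBoostingClassifier

rng = np.random.default_rng(0)
n = 1_000

# Hypothetical behaviour signals logged per SERP impression.
X = pd.DataFrame({
    "click_count": rng.poisson(1.5, n),           # result clicks on the SERP
    "time_to_first_click_s": rng.exponential(8.0, n),
    "dwell_time_s": rng.exponential(30.0, n),     # time spent on clicked results
    "scroll_depth": rng.uniform(0.0, 1.0, n),     # fraction of the page scrolled
    "query_reformulated": rng.integers(0, 2, n),  # was a follow-up query issued?
})

# Hypothetical binary label distilled from professional annotator ratings
# (1 = good-quality SERP, 0 = poor-quality SERP).
y = rng.integers(0, 2, n)

ebm = ExplainableBoostingClassifier(random_state=0)
ebm.fit(X, y)

# One plausible slice-level metric: the mean predicted probability of
# "good quality" over all SERP impressions in a traffic slice.
quality = ebm.predict_proba(X)[:, 1].mean()
print(f"mean predicted SERP quality: {quality:.3f}")

# Glass-box interpretability: per-feature shape functions and importances,
# which is what lets the whole organisation inspect the metric's behaviour.
global_explanation = ebm.explain_global()

Because the EBM is an additive model, each behaviour signal's learned contribution can be plotted and audited individually, which is the interpretability property the abstract contrasts with black-box models.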

Supplemental Material

MP4 File (video presentation)
This video presentation covers the practice and challenges of developing a universal search quality metric for large-scale web search engines. The proposed Universal Search Quality Metric (USQM) leverages online user behaviour signals and uses an Explainable Boosting Machine as the classifier; it addresses the diverse criteria of search quality while remaining interpretable. Evaluated against other models, the metric accurately evaluates and monitors search result quality. The presentation also discusses the importance of a universal metric and its potential impact on decision-making within organisations.


Cited By

  • (2025) Decoy Effect in Search Interaction: Understanding User Behavior and Measuring System Vulnerability. ACM Transactions on Information Systems 43, 2 (Jan 2025), 1--58. https://doi.org/10.1145/3708884


    Published In

    SIGIR '23: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval
    July 2023 · 3567 pages
    ISBN: 9781450394086
    DOI: 10.1145/3539618

    Publisher

    Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. evaluation metric
    2. explainable model
    3. search quality
    4. user behaviour


    Conference

    SIGIR '23

    Acceptance Rates

    Overall acceptance rate: 792 of 3,983 submissions (20%)
