short-paper

Sogou-QCL: A New Dataset with Click Relevance Label

Authors:

Shaoping Ma

SIGIR '18: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval

Pages 1117 - 1120

https://doi.org/10.1145/3209978.3210092

Published: 27 June 2018

Abstract

Data is vital to the development of machine learning technologies. Recently, a number of neural ranking frameworks have been proposed in the information retrieval field to address ad-hoc search. These models usually need a large number of query-document relevance judgments for training, but obtaining such judgments requires considerable money and manual effort. To address this problem, researchers have sought to use implicit feedback from search engine users to improve ranking performance. In this paper, we present a new dataset, Sogou-QCL, which contains 537,366 queries and five kinds of weak relevance labels for over 12 million query-document pairs. We apply the Sogou-QCL dataset to train recent neural ranking models and show its potential to serve as weak supervision for ranking. We believe that Sogou-QCL will have a broad impact on related areas.

Index Terms

Sogou-QCL: A New Dataset with Click Relevance Label
1. Information systems
  1. Information retrieval
    1. Evaluation of retrieval results
      1. Relevance assessment
      2. Test collections
  2. World Wide Web
    1. Web mining
      1. Web log analysis
    2. Web searching and information discovery
      1. Web search engines
        1. Web crawling

Published In

SIGIR '18: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval

June 2018

1509 pages

ISBN:9781450356572

DOI:10.1145/3209978

General Chairs:
Kevyn Collins-Thompson, University of Michigan, United States
Qiaozhu Mei, University of Michigan, United States

Program Chairs:
Brian Davison, Lehigh University, United States
Yiqun Liu, Tsinghua University, China
Emine Yilmaz, University College London, United Kingdom

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Sponsors

SIGIR: ACM Special Interest Group on Information Retrieval

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

Natural Science Foundation of China
National Key Basic Research Program

Conference

SIGIR '18: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval

July 8 - 12, 2018

Ann Arbor, MI, USA

Acceptance Rates

SIGIR '18 Paper Acceptance Rate 86 of 409 submissions, 21%;

Overall Acceptance Rate 792 of 3,983 submissions, 20%
