research-article

Copulas for information retrieval

Authors: Carsten Eickhoff, Arjen P. de Vries, Kevyn Collins-Thompson

SIGIR '13: Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval

Pages 663 - 672

https://doi.org/10.1145/2484028.2484066

Published: 28 July 2013

Abstract

In many domains of information retrieval, system estimates of document relevance are based on multidimensional quality criteria that have to be accommodated in a unidimensional result ranking. Current solutions to this challenge are often inconsistent with the formal probabilistic framework in which constituent scores were estimated, or use sophisticated learning methods that make it difficult for humans to understand the origin of the final ranking. To address these issues, we introduce the use of copulas, a powerful statistical framework for modeling complex multi-dimensional dependencies, to information retrieval tasks. We provide a formal background to copulas and demonstrate their effectiveness on standard IR tasks such as combining multidimensional relevance estimates and fusion of results from multiple search engines. We introduce copula-based versions of standard relevance estimators and fusion methods and show that these lead to significant performance improvements on several tasks, as evaluated on large-scale standard corpora, compared to their non-copula counterparts. We also investigate criteria for understanding the likely effect of using copula models in a given retrieval scenario.
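As a rough illustration of the idea of fusing several relevance dimensions through a copula, the sketch below maps each raw score to a rank-based value in (0, 1) and combines the two values with a Gumbel copula. The abstract does not say which copula families or estimation procedure the authors use, so the Gumbel family, the fixed `theta` value, and the helper names `empirical_cdf`, `gumbel_copula`, and `rank_documents` are all assumptions made for this example, not the paper's method.

```python
import math


def empirical_cdf(sample):
    """Return a function mapping a raw score to a rank-based value in (0, 1)."""
    ordered = sorted(sample)
    n = len(ordered)

    def cdf(x):
        # Count observations <= x; the +0.5 and n+1 keep the result strictly
        # inside (0, 1) so the logarithms in the copula stay finite.
        rank = sum(1 for v in ordered if v <= x)
        return (rank + 0.5) / (n + 1)

    return cdf


def gumbel_copula(u, v, theta):
    """Gumbel copula CDF for u, v in (0, 1).

    theta >= 1 controls dependence strength; theta = 1 gives the
    independence copula u * v. (Choice of family is an assumption.)
    """
    if theta < 1.0:
        raise ValueError("theta must be >= 1")
    s = (-math.log(u)) ** theta + (-math.log(v)) ** theta
    return math.exp(-(s ** (1.0 / theta)))


def rank_documents(topical, quality, theta=2.0):
    """Rank document indices by the copula-fused score, best first.

    `topical` and `quality` are hypothetical per-document score lists;
    theta is fixed here rather than fitted to data.
    """
    f_t = empirical_cdf(topical)
    f_q = empirical_cdf(quality)
    fused = [gumbel_copula(f_t(t), f_q(q), theta)
             for t, q in zip(topical, quality)]
    return sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)


if __name__ == "__main__":
    print(rank_documents([0.9, 0.1, 0.5], [0.8, 0.2, 0.4]))
```

With `theta = 1` the fused score is simply the product of the two transformed scores, which corresponds to treating the dimensions as independent; larger values of `theta` model stronger positive dependence between them.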

Index Terms

Copulas for information retrieval
1. Information systems
  1. Information retrieval
    1. Retrieval models and ranking

Published In

SIGIR '13: Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval

July 2013

1188 pages

ISBN: 9781450320344

DOI: 10.1145/2484028

General Chairs: Gareth J.F. Jones (Dublin City University, Ireland), Páraic Sheridan (Dublin City University, Ireland)

Program Chairs: Diane Kelly (University of North Carolina, Chapel Hill, USA), Maarten de Rijke (University of Amsterdam, The Netherlands), Tetsuya Sakai (Microsoft Research Asia, China)

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGIR: ACM Special Interest Group on Information Retrieval

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SIGIR '13: The 36th International ACM SIGIR conference on research and development in Information Retrieval

July 28 - August 1, 2013

Dublin, Ireland

Acceptance Rates

SIGIR '13 Paper Acceptance Rate 73 of 366 submissions, 20%

Overall Acceptance Rate 792 of 3,983 submissions, 20%
