DOI: 10.1145/2484028.2484066
research-article

Copulas for information retrieval

Published: 28 July 2013

Abstract

In many domains of information retrieval, system estimates of document relevance are based on multidimensional quality criteria that have to be accommodated in a unidimensional result ranking. Current solutions to this challenge are often inconsistent with the formal probabilistic framework in which constituent scores were estimated, or use sophisticated learning methods that make it difficult for humans to understand the origin of the final ranking. To address these issues, we introduce the use of copulas, a powerful statistical framework for modeling complex multi-dimensional dependencies, to information retrieval tasks. We provide a formal background to copulas and demonstrate their effectiveness on standard IR tasks such as combining multidimensional relevance estimates and fusion of results from multiple search engines. We introduce copula-based versions of standard relevance estimators and fusion methods and show that these lead to significant performance improvements on several tasks, as evaluated on large-scale standard corpora, compared to their non-copula counterparts. We also investigate criteria for understanding the likely effect of using copula models in a given retrieval scenario.
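
To make the idea of combining per-criterion relevance estimates through a copula concrete, here is a minimal sketch. It is not the model from the paper: the choice of the Gumbel copula family, the dependence parameter theta, the empirical-CDF marginals, and the names empirical_cdf, gumbel_copula and copula_rank are all assumptions made for illustration.

```python
import math
from bisect import bisect_right


def empirical_cdf(sample):
    """Build an empirical CDF from observed scores; values lie in [0, 1)."""
    ordered = sorted(sample)
    n = len(ordered)

    def cdf(x):
        # Divide by n + 1 so that the largest observed score maps below 1.
        return bisect_right(ordered, x) / (n + 1)

    return cdf


def gumbel_copula(u, v, theta=1.0):
    """Bivariate Gumbel copula C(u, v); theta = 1 gives independence (u * v)."""
    if theta < 1.0:
        raise ValueError("the Gumbel copula requires theta >= 1")
    if u <= 0.0 or v <= 0.0:
        return 0.0  # boundary case: C(0, v) = C(u, 0) = 0
    a = (-math.log(u)) ** theta
    b = (-math.log(v)) ** theta
    return math.exp(-((a + b) ** (1.0 / theta)))


def copula_rank(docs, theta=2.0):
    """Rank documents by the copula of their two marginal CDF values.

    docs maps a document id to a pair of scores (hypothetical criteria,
    e.g. topical relevance and a second quality dimension).
    """
    cdf_a = empirical_cdf([scores[0] for scores in docs.values()])
    cdf_b = empirical_cdf([scores[1] for scores in docs.values()])
    joint = {
        doc_id: gumbel_copula(cdf_a(a), cdf_b(b), theta)
        for doc_id, (a, b) in docs.items()
    }
    return sorted(joint, key=joint.get, reverse=True)


if __name__ == "__main__":
    # Made-up scores for four hypothetical documents.
    example = {"d1": (0.9, 0.2), "d2": (0.6, 0.7), "d3": (0.1, 0.9), "d4": (0.3, 0.1)}
    print(copula_rank(example, theta=2.0))
```

With theta = 1 the combined score is simply the product of the two marginal values. Larger theta models stronger positive dependence and moves the combined score towards min(u, v), so a document needs to do well on both criteria to rank highly.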


    Published In

    SIGIR '13: Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval
    July 2013
    1188 pages
    ISBN: 9781450320344
    DOI: 10.1145/2484028
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. data fusion
    2. multivariate relevance
    3. probabilistic framework
    4. ranking
    5. relevance models

    Qualifiers

    • Research-article

    Conference

    SIGIR '13

    Acceptance Rates

    SIGIR '13 Paper Acceptance Rate 73 of 366 submissions, 20%;
    Overall Acceptance Rate 792 of 3,983 submissions, 20%

    Article Metrics

    • Downloads (last 12 months): 4
    • Downloads (last 6 weeks): 1
    Reflects downloads up to 28 Feb 2025

    Cited By

    • (2024) A systematic review of multidimensional relevance estimation in information retrieval. WIREs Data Mining and Knowledge Discovery, 14:5. DOI: 10.1002/widm.1541. Online publication date: 7-May-2024
    • (2022) Exploiting hierarchical dependence structures for unsupervised rank fusion in information retrieval. Journal of Intelligent Information Systems, 60:3 (853-876). DOI: 10.1007/s10844-022-00751-3. Online publication date: 18-Oct-2022
    • (2022) Supercalifragilisticexpialidocious: Why Using the “Right” Readability Formula in Children’s Web Search Matters. Advances in Information Retrieval (3-18). DOI: 10.1007/978-3-030-99736-6_1. Online publication date: 5-Apr-2022
    • (2020) Approximate Projection-Based Control of Networks. 2020 59th IEEE Conference on Decision and Control (CDC) (5573-5579). DOI: 10.1109/CDC42340.2020.9304414. Online publication date: 14-Dec-2020
    • (2017) Theory of the GMM Kernel. Proceedings of the 26th International Conference on World Wide Web (1053-1062). DOI: 10.1145/3038912.3052679. Online publication date: 3-Apr-2017
    • (2017) Modeling Document Networks with Tree-Averaged Copula Regularization. Proceedings of the Tenth ACM International Conference on Web Search and Data Mining (691-699). DOI: 10.1145/3018661.3018666. Online publication date: 2-Feb-2017
    • (2017) Aggregation operators in Information Retrieval. Fuzzy Sets and Systems, 324:C (3-19). DOI: 10.1016/j.fss.2016.12.018. Online publication date: 1-Oct-2017
    • (2017) A Gaussian copula regression model for movie box-office revenues prediction (基于高斯连接回归模型的电影票房预测). Science China Information Sciences, 60:9. DOI: 10.1007/s11432-015-0905-6. Online publication date: 25-Apr-2017
    • (2016) When time meets information retrieval. Journal of Information Science, 42:6 (725-747). DOI: 10.1177/0165551515607277. Online publication date: 1-Dec-2016
    • (2016) A new readability measure for web documents and its evaluation on an effective web search engine. Proceedings of the 18th International Conference on Information Integration and Web-based Applications and Services (355-362). DOI: 10.1145/3011141.3011172. Online publication date: 28-Nov-2016
