DOI: 10.1145/1555400.1555426
research-article

Generative model-based metasearch for data fusion in information retrieval

Published: 15 June 2009

ABSTRACT

"Data fusion" refers to the problem in information retrieval (IR) where several lists of documents ranked against a query are to be merged into a single ranked list for presentation to a user. Data fusion is also known as "metasearch." In a digital library setting data fusion may support operations such as federated search based on multiple repository representations. This paper presents a novel approach to the fusion problem: generative model-based Metasearch (GeM). We suggest viewing the appearance of documents in a return set as the outcome of a probabilistic process; some documents are likely to occur in the model, while others are unlikely. Using Bayesian parameter estimation to fit a multinomial distribution based on the return sets to be merged, GeM achieves a final ranking by listing documents in decreasing probability of generation under the induced model. We also introduce what we call "the impatient reader" approach to normalizing document ranks in service to the fusion operation. We report results from several experiments on TREC data suggesting that GeM, informed with impatient reader document scores, operates at state-of-the-art levels of effectiveness.

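The abstract describes fitting a multinomial distribution over documents by Bayesian parameter estimation and then ranking documents by their probability under the fitted model. A minimal sketch of that idea follows. It makes several assumptions that the abstract does not state: a symmetric Dirichlet prior with a hypothetical concentration parameter `prior`, fractional pseudo-counts accumulated from rank-based scores, and a geometric decay (`decay ** rank`) used as a stand-in for rank normalization. The decay function is not the paper's "impatient reader" formula, which the abstract does not define, and the function names `rank_scores` and `gem_fuse` are illustrative only.

```python
from collections import defaultdict


def rank_scores(ranked_list, decay=0.9):
    """Map each document to a rank-based score (top rank scores 1.0).

    ASSUMPTION: geometric decay is a hypothetical placeholder, not the
    paper's impatient-reader normalization.
    """
    scores = {}
    for rank, doc in enumerate(ranked_list):
        # Keep the best (earliest) rank if a document repeats in one list.
        scores.setdefault(doc, decay ** rank)
    return scores


def gem_fuse(ranked_lists, prior=0.1, decay=0.9):
    """Fuse ranked lists by posterior-mean multinomial probability.

    ASSUMPTION: symmetric Dirichlet(prior) over the pooled documents,
    with rank scores treated as fractional observation counts.
    """
    pseudo_counts = defaultdict(float)
    for ranked_list in ranked_lists:
        for doc, score in rank_scores(ranked_list, decay).items():
            pseudo_counts[doc] += score

    # Posterior mean of a multinomial parameter under a Dirichlet prior:
    # (count_i + prior) / (sum of counts + K * prior), K = number of documents.
    total = sum(pseudo_counts.values()) + prior * len(pseudo_counts)
    posterior = {doc: (c + prior) / total for doc, c in pseudo_counts.items()}

    # Rank by decreasing probability; break ties by document id for stability.
    ranking = sorted(posterior, key=lambda d: (-posterior[d], d))
    return ranking, posterior


if __name__ == "__main__":
    runs = [["d1", "d2", "d3"], ["d2", "d1", "d4"], ["d2", "d5"]]
    ranking, posterior = gem_fuse(runs)
    for doc in ranking:
        print(doc, round(posterior[doc], 4))
```

In this sketch a document that appears near the top of many lists accumulates more pseudo-count mass and so receives a higher posterior probability, while the prior keeps documents seen only once from being driven to zero.
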
Published in

JCDL '09: Proceedings of the 9th ACM/IEEE-CS joint conference on Digital libraries
June 2009
502 pages
ISBN: 9781605583228
DOI: 10.1145/1555400

Copyright © 2009 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

Overall Acceptance Rate: 415 of 1,482 submissions, 28%
