Generative model-based metasearch for data fusion in information retrieval

Author:
Miles Efron

University of Texas, Austin, TX, USA

JCDL '09: Proceedings of the 9th ACM/IEEE-CS joint conference on Digital libraries
June 2009, Pages 153–162
https://doi.org/10.1145/1555400.1555426

Published: 15 June 2009

ABSTRACT

"Data fusion" refers to the problem in information retrieval (IR) where several lists of documents ranked against a query are to be merged into a single ranked list for presentation to a user. Data fusion is also known as "metasearch." In a digital library setting data fusion may support operations such as federated search based on multiple repository representations. This paper presents a novel approach to the fusion problem: generative model-based Metasearch (GeM). We suggest viewing the appearance of documents in a return set as the outcome of a probabilistic process; some documents are likely to occur in the model, while others are unlikely. Using Bayesian parameter estimation to fit a multinomial distribution based on the return sets to be merged, GeM achieves a final ranking by listing documents in decreasing probability of generation under the induced model. We also introduce what we call "the impatient reader" approach to normalizing document ranks in service to the fusion operation. We report results from several experiments on TREC data suggesting that GeM, informed with impatient reader document scores, operates at state-of-the-art levels of effectiveness.
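
The abstract gives only the outline of GeM, so the following is a minimal sketch of the general idea rather than the paper's actual estimator. It assumes a symmetric Dirichlet prior (`alpha`), a vocabulary limited to the documents that appear in at least one input list, and a hypothetical geometric rank discount (`impatient_weight`, with a made-up `patience` parameter) standing in for the "impatient reader" scores, whose real definition is not stated here. Documents are ranked by their posterior-mean probability under the fitted multinomial.

```python
from collections import defaultdict


def impatient_weight(rank, patience=0.8):
    """Hypothetical rank discount (an assumption, not the paper's formula):
    the chance that a reader who continues past each result with
    probability `patience` reaches 1-based position `rank`."""
    return patience ** (rank - 1)


def gem_fuse(ranked_lists, alpha=0.01, weight=impatient_weight):
    """Fuse ranked lists by decreasing probability under a smoothed multinomial.

    Each appearance of a document adds a rank-weighted pseudo-count.
    A symmetric Dirichlet(alpha) prior gives the posterior-mean estimate
    (count + alpha) / (total + K * alpha) over the K observed documents.
    """
    counts = defaultdict(float)
    for docs in ranked_lists:
        for rank, doc in enumerate(docs, start=1):
            counts[doc] += weight(rank)

    total = sum(counts.values()) + alpha * len(counts)
    probs = {doc: (c + alpha) / total for doc, c in counts.items()}

    # Decreasing probability; ties broken by document id for determinism.
    ranking = sorted(probs, key=lambda d: (-probs[d], d))
    return ranking, probs


# Three toy return sets for the same query (illustrative ids only).
runs = [
    ["d1", "d2", "d3"],
    ["d2", "d1", "d4"],
    ["d2", "d5", "d1"],
]
ranking, probs = gem_fuse(runs)
print(ranking)  # ['d2', 'd1', 'd5', 'd3', 'd4']
```

In this toy case `d2` outranks `d1` because it sits at the top of two lists, while `d1` drops to third place in one of them; the prior mainly matters for documents that appear rarely or deep in a list.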

Index Terms

1. Applied computing
  1. Computers in other domains
    1. Digital libraries and archives
2. Information systems

Published in
JCDL '09: Proceedings of the 9th ACM/IEEE-CS joint conference on Digital libraries
June 2009
502 pages
ISBN: 9781605583228
DOI: 10.1145/1555400
General Chairs: Fred Heath (University of Texas Libraries, USA), Mary Lynn Rice-Lively (University of Texas at Austin, USA)
Program Chair: Richard Furuta (Texas A&M University, USA)
Copyright © 2009 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
data fusion
digital libraries
generative models
information retrieval
metasearch
probabilistic models
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 415 of 1,482 submissions, 28%