ABSTRACT
"Data fusion" refers to the problem in information retrieval (IR) in which several lists of documents, each ranked against a query, are merged into a single ranked list for presentation to a user. Data fusion is also known as "metasearch." In a digital library setting, data fusion may support operations such as federated search based on multiple repository representations. This paper presents a novel approach to the fusion problem: Generative Model-based Metasearch (GeM). We suggest viewing the appearance of documents in a return set as the outcome of a probabilistic process; some documents are likely to be generated by the model, while others are unlikely. Using Bayesian parameter estimation to fit a multinomial distribution to the return sets to be merged, GeM produces a final ranking by listing documents in decreasing order of their probability of generation under the induced model. We also introduce what we call the "impatient reader" approach to normalizing document ranks in service of the fusion operation. We report results from several experiments on TREC data suggesting that GeM, informed by impatient reader document scores, operates at state-of-the-art levels of effectiveness.
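The fusion idea described in the abstract can be illustrated with a minimal sketch. It rests on assumptions that the abstract does not confirm: a symmetric Dirichlet prior (the standard conjugate prior for a multinomial) with the posterior mean as the parameter estimate, and a geometric decay over ranks as a stand-in for the "impatient reader" score, whose actual definition is not given here. The names `rank_weight`, `gem_fuse`, `patience`, and `alpha` are hypothetical.

```python
from collections import defaultdict


def rank_weight(rank, patience=0.5):
    """Hypothetical rank-to-score mapping (NOT the paper's formula).

    Assumes a reader who continues past each result with probability
    `patience`, so the document at rank r gets weight patience**(r - 1).
    """
    return patience ** (rank - 1)


def gem_fuse(ranked_lists, alpha=0.01):
    """Fuse ranked lists by fitting a multinomial over documents.

    Assumption: a symmetric Dirichlet(alpha) prior, with the posterior
    mean used as the estimate of each document's generation probability.
    Returns (doc, probability) pairs in decreasing order of probability.
    """
    # Accumulate rank-derived pseudo-counts for each document.
    counts = defaultdict(float)
    for docs in ranked_lists:
        for rank, doc in enumerate(docs, start=1):
            counts[doc] += rank_weight(rank)

    # Posterior mean of a Dirichlet-multinomial:
    #   theta_d = (c_d + alpha) / (sum_j c_j + alpha * V)
    vocab_size = len(counts)
    total = sum(counts.values()) + alpha * vocab_size
    probs = {doc: (c + alpha) / total for doc, c in counts.items()}

    # Rank by decreasing probability; break ties by document id.
    return sorted(probs.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    runs = [
        ["d1", "d2", "d3"],
        ["d2", "d1", "d4"],
        ["d2", "d3", "d1"],
    ]
    for doc, p in gem_fuse(runs):
        print(f"{doc}\t{p:.4f}")
```

In this toy input, `d2` is ranked first by two of the three systems and second by the other, so it accumulates the largest pseudo-count and heads the fused list. The smoothing constant `alpha` only matters for documents with very small counts.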
Index Terms
- Generative model-based metasearch for data fusion in information retrieval