DOI: 10.1145/1835449.1835508
research-article

Estimating probabilities for effective data fusion

Published: 19 July 2010

Abstract

Data fusion is the combination of a number of independent search results, relating to the same document collection, into a single result to be presented to the user. A number of probabilistic data fusion models have been shown to be effective in empirical studies. These typically attempt to estimate the probability that particular documents will be relevant, based on training data. However, little attempt has been made to gauge how the accuracy of these estimations affects fusion performance. The focus of this paper is twofold: firstly, to show that accurate estimation of the probability of relevance results in effective data fusion; and secondly, to show that an effective approximation of this probability can be made based on less training data than has previously been employed. This is based on the observation that the distribution of relevant documents follows a similar pattern in most high-quality result sets. Curve fitting suggests that this can be modelled by a simple function that is less complex than other models that have been proposed. The use of existing IR evaluation metrics is proposed as a substitution for probability calculations. Mean Average Precision is used to demonstrate the effectiveness of this approach, with evaluation results demonstrating competitive performance when compared with related algorithms with more onerous requirements for training data.
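
As a rough illustration of the idea in the abstract, the following Python sketch computes Mean Average Precision for an input system and then uses that single number in place of per-position probability estimates when fusing ranked lists. The abstract does not state the "simple function" it fits, so the 1/rank decay used here is purely an assumption for illustration, and the function names (`average_precision`, `mean_average_precision`, `fuse`) are hypothetical rather than taken from the paper.

```python
from collections import defaultdict


def average_precision(ranking, relevant):
    """Average precision of one ranked list against a set of relevant ids."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank  # precision at this relevant document
    return total / len(relevant)


def mean_average_precision(runs, qrels):
    """MAP over training queries.

    runs:  {query_id: ranked list of doc ids} for one input system
    qrels: {query_id: set of relevant doc ids}
    """
    if not runs:
        return 0.0
    ap_values = [average_precision(r, qrels.get(q, set())) for q, r in runs.items()]
    return sum(ap_values) / len(ap_values)


def fuse(rankings, system_map):
    """Fuse ranked lists from several systems for a single query.

    ASSUMPTION: the estimated probability of relevance at a given rank is
    approximated as (system MAP) / rank. The 1/rank shape is chosen here
    for illustration only; the abstract does not specify the function.
    """
    scores = defaultdict(float)
    for system, ranking in rankings.items():
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += system_map[system] / rank
    # Highest score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))
```

In this sketch the only training data needed per input system is one MAP value, which is the sense in which an existing evaluation metric stands in for a table of estimated probabilities. A document returned by several systems accumulates score from each of them, so agreement between systems pushes it up the fused list.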

    Published In

    SIGIR '10: Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
    July 2010
    944 pages
    ISBN:9781450301534
    DOI:10.1145/1835449

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. information retrieval
    2. probabilistic data fusion
    3. results merging

    Qualifiers

    • Research-article

    Conference

    SIGIR '10

    Acceptance Rates

    SIGIR '10 Paper Acceptance Rate 87 of 520 submissions, 17%;
    Overall Acceptance Rate 792 of 3,983 submissions, 20%
