Original Research
Clustering-based fusion for medical information retrieval

https://doi.org/10.1016/j.jbi.2022.104213
Under an Elsevier user license
open archive

Highlights:

  • A clustering-based fusion method is proposed for selecting a subset of retrieval systems.

  • The major characteristic of the proposed method is that both the performance and the diversity of component retrieval systems are considered.

  • Experiments with two medical retrieval data sets from TREC demonstrate the validity of the proposed method.

Abstract

Medicine is a fast-moving field, and the number of medical publications has increased rapidly in recent years. Finding relevant information in this vast pool of research effectively and efficiently has therefore become highly challenging. Previous studies have demonstrated that data fusion can improve search performance if properly utilized. However, in most cases effectiveness is the only concern and efficiency is not considered. A fusion-based system is by nature more complicated and computationally expensive than other retrieval models such as BM25, because many component retrieval systems and an extra layer of fusion are required. The number of component retrieval systems involved is an important indicator of the complexity of a fusion-based system. We aim to select the optimal k-subset of component retrieval systems for any given number k, so as to both optimize fusion performance and reduce the cost of data fusion. A clustering-based approach is proposed. First, all the candidates are divided into clusters by the Chameleon clustering algorithm; then representatives from every cluster are chosen by Sequential Forward Selection for fusion. Evaluated on two datasets from TREC, the proposed method performs significantly better than the baseline methods, including the state-of-the-art subset selection method. When either of the two typical fusion methods is used, an improvement rate of over 10% is observed for both Mean Average Precision and Recall-level Precision, and an improvement rate of over 5% is observed for both Precision at 10 document level and Mean Reciprocal Rank.
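
The two-stage idea described above (cluster the candidate systems, then pick one representative per cluster by forward selection) can be illustrated with a minimal sketch. It is not the authors' implementation, and several points are assumptions made for illustration. The cluster labels are taken as given rather than computed with Chameleon. CombSUM with min-max normalization stands in for the fusion step, since the abstract does not name the two fusion methods used. Average precision on a single toy query stands in for the evaluation measures. The number of clusters is assumed to equal k. All function and variable names are hypothetical.

```python
from typing import Dict, List, Set

Run = Dict[str, float]  # doc id -> score from one retrieval system (one query)


def comb_sum(runs: List[Run]) -> List[str]:
    """Fuse runs by summing min-max normalized scores; return ranked doc ids."""
    fused: Dict[str, float] = {}
    for run in runs:
        lo, hi = min(run.values()), max(run.values())
        span = (hi - lo) or 1.0  # guard against constant scores
        for doc, score in run.items():
            fused[doc] = fused.get(doc, 0.0) + (score - lo) / span
    # Descending fused score; ties broken by doc id so the output is deterministic.
    return sorted(fused, key=lambda d: (-fused[d], d))


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Average precision of one ranked list against a set of relevant doc ids."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def select_representatives(
    systems: Dict[str, Run], clusters: Dict[str, int], relevant: Set[str]
) -> List[str]:
    """Greedy forward selection: add one system per cluster, each time choosing
    the candidate whose addition gives the best fused average precision."""
    chosen: List[str] = []
    used: Set[int] = set()
    n_clusters = len(set(clusters.values()))
    while len(used) < n_clusters:
        # Only systems from clusters that have no representative yet.
        candidates = [s for s in sorted(systems) if clusters[s] not in used]
        best = max(
            candidates,
            key=lambda s: average_precision(
                comb_sum([systems[c] for c in chosen + [s]]), relevant
            ),
        )
        chosen.append(best)
        used.add(clusters[best])
    return chosen


# Toy data (hypothetical): systems "a" and "b" behave alike, "c" differs.
systems = {
    "a": {"d1": 3.0, "d2": 2.0, "d3": 1.0},
    "b": {"d1": 2.9, "d2": 2.0, "d3": 1.0},
    "c": {"d3": 3.0, "d1": 2.0, "d2": 1.0},
}
clusters = {"a": 0, "b": 0, "c": 1}  # assumed output of a clustering step
relevant = {"d1", "d3"}

chosen = select_representatives(systems, clusters, relevant)
print(chosen)
```

In a realistic setting the selection criterion would be averaged over many training queries (for example, Mean Average Precision) rather than computed on one query, and the cluster labels would come from a similarity measure between the systems' result lists.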

Keywords

Medical information retrieval
Data fusion
Subset selection
Clustering
Efficiency and effectiveness
