Introduction

In recent years, there has been a phenomenal growth of online medical document collections. Collections such as PubMed and MedlinePlus provide comprehensive coverage of medical literature and teaching materials. In searching these collections, it is desirable to retrieve only those documents pertaining to a specific medical “scenario,” where a scenario is defined as a frequently-reappearing medical task. For example, in treating a lung cancer patient, a physician may pose the query lung cancer treatment in order to find the latest treatment techniques for this disease. Here, treatment is the medical task that marks the scenario of this query. Recent studies (Ely et al., 1999, 2000; Haynes et al., 1990; Hersh et al., 1996; Wilczynski et al., 2001) reveal that in clinical practice, as many as 60% of physicians’ queries center on a limited number of scenarios, e.g., treatment, diagnosis, etiology, etc. While the contextual information in such queries (e.g., the particular disease of a patient such as lung cancer, the age group of that patient, etc.) varies from case to case, the set of frequently-asked medical scenarios remains unchanged. Retrieving documents that are specifically related to the query's scenario is referred to as scenario-specific retrieval.

Scenario-specific retrieval is not adequately addressed by traditional text retrieval systems (e.g., SMART (Salton and McGill, 1983) or INQUERY (Callan et al., 1992)). Such systems suffer from the fundamental problem of query-document mismatch (Efthimiadis, 1996) when handling scenario-specific queries. Scenario terms in these queries are expressed using general terms, e.g., the term treatment in the query lung cancer treatment. In contrast, full-text medical documents use more specialized terms, such as lung excision or chemotherapy, to express the same topic. Such term mismatch leads to poor retrieval performance (Tse and Soergel, 2003; Zeng et al., 2002).

There has been a substantial amount of research on query expansion (Buckley et al., 1994, 1995; Jing and Croft, 1994; Mitra et al., 1998; Qiu and Frei, 1993; Robertson et al., 1994; Srinivasan, 1996; Xu and Croft, 1996) to ameliorate the query-document mismatch problem. However, such techniques also have difficulty handling scenario-specific queries. Query expansion appends to the original query specialized terms that have a statistical cooccurrence relationship with the original query terms in medical literature. Although appending such specialized terms makes the expanded query a better match for relevant documents, the expansion is not scenario-specific. For example, in handling the query lung cancer treatment, existing query expansion techniques will append not only terms such as lung excision or chemotherapy that are relevant to the treatment scenario, but also irrelevant terms like smoking and lymph node, simply because the latter terms cooccur with lung cancer in medical literature. Appending non-scenario-specific terms leads to the retrieval of documents that are irrelevant to the original query's scenario, diverging from our goal of scenario-specific retrieval.

In the domain of medical text retrieval, researchers have proposed exploiting the Unified Medical Language System (UMLS), a full-fledged knowledge source in the medical domain, to expand the original query with related terms and improve retrieval performance. Current approaches either use the synonym relationships defined in UMLS to expand synonyms of the original query terms (Aronson and Rindflesch, 1997; Guo et al., 2004; Plovnick and Zeng, 2004) or use the hypernym/hyponym relationships to expand terms with wider/narrower meanings than the original query terms (Hersh et al., 2000). Extensive evaluation of these approaches has been performed on standard testbeds such as OHSUMED (Aronson and Rindflesch, 1997; Hersh et al., 2000) and the TREC Genomics ad hoc topics (Guo et al., 2004). However, no study has consistently produced significant differences in retrieval effectiveness before and after expansion. In particular, we note that when handling scenario-specific queries, such solutions still generally suffer from the query-document mismatch problem. For example, the synonyms, hypernyms, and hyponyms for the terms in the query lung cancer treatment, as defined by the knowledge source, are lung carcinoma, cancer, therapy, medical procedure, etc. Even with such terms expanded, the query will still have difficulty matching documents that extensively use specialized terms such as chemotherapy and lung excision.

In this paper, we propose a knowledge-based query expansion technique to support scenario-specific retrieval. Our technique exploits domain knowledge to restrict query expansion to scenario-specific terms and yields better retrieval performance than that of traditional query expansion approaches. The following are challenges in developing such a knowledge-based technique:

  • Using domain knowledge to automatically identify scenario-specific terms. It is impractical to ask users or domain experts to manually identify scenario-specific terms for every query and all possible scenarios, so an automatic approach is highly desirable. However, the distinction between scenario-specific expansion terms and non-scenario-specific ones, while apparent to a human expert, can be very difficult for a program to draw. To make this distinction automatically, we propose to exploit a domain-specific knowledge source.

  • Incompleteness of knowledge sources. Knowledge sources are usually not specifically designed for the purpose of scenario-specific retrieval. As a result, scenarios frequently appearing in medical queries may not be adequately supported by those knowledge sources. We propose a knowledge-acquisition methodology to supplement the existing knowledge sources with additional knowledge that supports undefined scenarios.

The rest of this paper is organized as follows. We first present a framework for knowledge-based query expansion in Section 2. We then describe the detailed method in this framework in Section 3. We experimentally evaluate the method and report the results in Section 4. In Section 5, we address the issue of supplementing a knowledge source via knowledge acquisition. We further discuss the relevancy of expansion terms judged by domain experts in Section 6.

A framework for knowledge-based query expansion

Figure 1 depicts the components of a knowledge-based query expansion and retrieval framework. For a given query, Statistical Query Expansion (whose scope is marked by the inner dotted rectangle) first derives candidate expansion concepts that statistically cooccur with the given query concepts (Section 3.1) and assigns a weight to each candidate concept according to this statistical cooccurrence. These weights are carried through the framework.

Fig. 1. A knowledge-based query expansion and retrieval framework

Based on the candidate concepts derived by statistical expansion, Knowledge-based Query Expansion (whose scope is marked by the outer dotted rectangle) further derives the scenario-specific expansion concepts, with the aid of a domain knowledge source such as UMLS (NLM, 2001) (Section 3.2). Such knowledge may be incomplete and fail to include all possible query scenarios. Therefore, in an offline process, we apply a Knowledge Acquisition and Supplementation module to supplement the incomplete knowledge (Section 5).

After the query is expanded with scenario-specific concepts, we employ a Vector Space Model (VSM) to compute the similarity between the expanded query and each document. The top-ranked documents with the highest similarity measures are output to the user.
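To make the ranking step concrete, below is a minimal sketch of the VSM similarity computation, assuming the expanded query and each document have already been converted to sparse term-weight vectors (the actual indexing and weighting details appear in Section 4.1):

```python
from math import sqrt

def cosine_similarity(query_vec, doc_vec):
    """Cosine similarity between two sparse term-weight vectors,
    each a dict mapping a term to its weight."""
    dot = sum(w * doc_vec.get(term, 0.0) for term, w in query_vec.items())
    q_len = sqrt(sum(w * w for w in query_vec.values()))
    d_len = sqrt(sum(w * w for w in doc_vec.values()))
    return dot / (q_len * d_len) if q_len and d_len else 0.0
```

Documents are then sorted by this score, and the top-ranked ones are returned to the user.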

Method

Formally, the problem for knowledge-based query expansion can be stated as follows: Given a scenario-specific query with a key concept denoted c_key (e.g., lung cancer or keratoconus) and a set of scenario concepts denoted c_s (e.g., treatment or diagnosis), we need to derive specialized concepts that are related to c_key, where the relations are specific to the scenarios defined by c_s.

In this section, we describe how to derive such scenario-specific concepts. We first present existing statistical query expansion methods, which generate candidates for scenario-specific concepts. We then propose a knowledge-based method that selects scenario-specific concepts from this candidate set with the aid of a domain knowledge source.

Deriving statistically-related expansion concepts

Statistical expansion is also referred to as automatic query expansion (Efthimiadis, 1996; Mitra et al., 1998). The basic idea is to derive concepts that are statistically related to the given query concepts, where the statistical correlation is derived from a document collection (e.g., OHSUMED; Hersh et al., 1994). Appending such concepts to the original query makes the query expression more specialized and helps the query better match relevant documents. Depending on how such statistically-related concepts are derived, statistical expansion methods fall into two major categories:

  • Cooccurrence-thesaurus-based expansion (Jing and Croft, 1994; Qiu and Frei, 1993; Xu and Croft, 1996). In this method, a concept cooccurrence thesaurus is first constructed automatically offline. Given a vocabulary of M concepts, the thesaurus is an M×M matrix, where the 〈i, j〉 element quantifies the cooccurrence between concept i and concept j. When a query is posed, we look up the thesaurus to find all concepts that statistically cooccur with the concepts in the given query and assign weights to those cooccurring concepts according to the values in the cooccurrence matrix. A detailed procedure for computing the cooccurrence matrix and for assigning weights to expansion concepts can be found in Qiu and Frei (1993); a minimal sketch follows this list.

  • Pseudo-relevance-feedback-based expansion (Buckley et al., 1994, 1995; Efthimiadis and Biron, 1993; Mitra et al., 1998; Robertson et al., 1994). In pseudo-relevance feedback, the original query is used to perform an initial retrieval. Concepts extracted from the top-ranked documents in this initial retrieval are considered statistically related and are appended to the original query. This approach resembles the well-known relevance feedback approach except that, instead of asking users to identify relevant documents as feedback, the top-ranked (e.g., top-10) documents are automatically treated as “pseudo” relevant documents and inserted into the feedback loop. Weight assignment in pseudo-relevance feedback (Buckley et al., 1994) typically follows the same 〈α, β, γ〉 weighting scheme as conventional relevance feedback (Rocchio, 1971).
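To illustrate the cooccurrence-thesaurus-based method, here is a minimal sketch in Python. It assumes each document is given as a list of concept identifiers (as produced by concept detection against the Metathesaurus) and stores only the nonzero matrix entries; the normalization is a simplified, cosine-style variant in the spirit of Qiu and Frei (1993), not their exact formula.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

def build_cooccurrence_thesaurus(docs):
    """Build a sparse concept-cooccurrence thesaurus from a corpus.

    `docs` is a list of documents, each a list of concept identifiers.
    Returns a dict mapping a concept pair (ci, cj), with ci < cj, to a
    normalized cooccurrence value co(ci, cj) in [0, 1].
    """
    pair_counts = Counter()   # number of documents where both concepts occur
    doc_freq = Counter()      # number of documents where each concept occurs
    for doc in docs:
        concepts = set(doc)
        doc_freq.update(concepts)
        for ci, cj in combinations(sorted(concepts), 2):
            pair_counts[(ci, cj)] += 1
    # cosine-style normalization keeps each value within [0, 1]
    return {pair: n / sqrt(doc_freq[pair[0]] * doc_freq[pair[1]])
            for pair, n in pair_counts.items()}

def lookup(co, query_concept, top_s=15):
    """Return the top-s concepts cooccurring with `query_concept`."""
    related = []
    for (ci, cj), value in co.items():
        if ci == query_concept:
            related.append((cj, value))
        elif cj == query_concept:
            related.append((ci, value))
    related.sort(key=lambda pair: pair[1], reverse=True)
    return related[:top_s]
```

For instance, lookup(co, "keratoconus") would return a ranked list of cooccurring concepts analogous to Table 1.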

We note that the choice of statistical expansion method is orthogonal to the design of the knowledge-based expansion framework (Fig. 1). In our current experimental evaluation, we used the cooccurrence-thesaurus-based method to derive statistically-related concepts. For convenience of discussion, we use co(c_i, c_j) to denote the cooccurrence between concepts c_i and c_j, the value that appears as the 〈i, j〉 element of the M×M cooccurrence matrix. Table 1 lists the top-15 concepts that are statistically related to keratoconus under this cooccurrence measure, computed from the OHSUMED corpus that will be described in detail in Section 4.1.

Table 1 Concepts that statistically correlate to keratoconus

Deriving scenario-specific expansion concepts

Using a statistical expansion method, we can derive a set of concepts that are statistically related to the key concept c_key of the given query. Only a subset of these concepts is relevant to the given query's scenario, e.g., treatment. For example, the 5th and 8th concepts in Table 1, acute hydrops and corneal, are not related to the treatment of keratoconus. Therefore, in deriving expansion concepts for the query keratoconus treatment, these concepts should be filtered out. In this section, we first describe the type of knowledge structure that enables us to perform this filtering and then present the filtering procedure.

UMLS—The knowledge source

The Unified Medical Language System (UMLS) is a standard medical knowledge source developed by the National Library of Medicine. It consists of the Metathesaurus, the Semantic Network, and the SPECIALIST lexicon. The Semantic Network provides the essential knowledge structures for deriving scenario-specific expansion concepts, and is the primary focus of the following discussion. The Metathesaurus, which defines over 800,000 medical concepts and the hypernym/hyponym relationships among them, is used in our study for two purposes: (1) detecting concepts in both queries and documents and (2) expanding hypernyms/hyponyms of a query's key concept. The second purpose is further illustrated in Section 3.3. The SPECIALIST lexicon is mainly used for unifying lexical features in medical-text-related natural language processing (NLP) and is not used in our study.

The Semantic Network defines about 100 semantic types, such as Disease or Syndrome, Body Part, etc. Each semantic type corresponds to a class/category of concepts. The semantic type Disease or Syndrome, for instance, corresponds to 44,000 concepts in the Metathesaurus, such as keratoconus, lung cancer, and diabetes. Besides the list of semantic types, the Semantic Network also defines relations among semantic types, such as treats and diagnoses. Such relations link isolated semantic types into a graph/network structure. The top half of Fig. 2 presents a fragment of this network, which includes all semantic types that have a treats relation with the semantic type Disease or Syndrome. Relations such as treats in Fig. 2 should be interpreted as follows: any concept that belongs to the semantic type Therapeutic or Preventive Procedure, e.g., penetrating keratoplasty or chemotherapy, has the potential to treat concepts that belong to the semantic type Disease or Syndrome, e.g., keratoconus or lung cancer. However, the network does not indicate whether such a relation concretely holds between two specific concepts, e.g., a treats relation between penetrating keratoplasty and lung cancer.

Fig. 2. Using knowledge to identify scenario-specific concept relationships

A knowledge-based method to derive scenario-specific expansion concepts

Given the knowledge structure in the Semantic Network, the basic idea in identifying scenario-specific expansion concepts is to use this structure to filter out statistically-correlated concepts that do not belong to the “desirable” semantic types. Let us illustrate this idea through Fig. 2, using the treatment scenario as an example. In this figure, we start with the set of concepts that are statistically related to keratoconus. Our goal in applying the knowledge structure is to identify that (1) concepts such as penetrating keratoplasty, contact lens, and griffonia have the scenario-specific relation, i.e., treats, with keratoconus and should be kept during expansion, and (2) concepts such as acute hydrops and corneal do not have the scenario-specific relation with keratoconus and should be filtered out.

In this figure, each solid circle represents one concept, and the solid lines connecting these solid circles indicate strong statistical correlations computed for a pair of concepts, e.g., the solid line between keratoconus and contact lens. A dotted circle represents a class of concepts, and a dotted line links that class of concepts to a corresponding semantic type. For example, concepts keratoconus and lung cancer are in the class that links to Disease or Syndrome.

We identify scenario-specific expansion concepts using the following process. Given the key concept c_key of a query, we first identify the semantic type that c_key belongs to; for example, we identify Disease or Syndrome for the key concept keratoconus. Starting from that semantic type, we follow the relations marked by the query's scenario and reach a set of relevant semantic types. In the example, given the query's scenario, treatment, we follow the treats relation to reach the three other semantic types shown in Fig. 2. Finally, we identify as scenario-specific those statistically-related concepts that belong to the relevant semantic types, and filter out the statistically-related concepts that do not satisfy this criterion. In the example, this final step identifies penetrating keratoplasty, contact lens, and griffonia as scenario-specific expansion concepts and filters out non-scenario-specific ones such as acute hydrops and corneal.
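A minimal sketch of this filtering step follows. The helper semantic_types_of (mapping a concept to its set of UMLS semantic types) and the precomputed scenario_types set (the semantic types reached by following the scenario's relation, e.g., treats, from the key concept's type) are hypothetical stand-ins for Metathesaurus and Semantic Network lookups.

```python
def filter_scenario_specific(candidates, scenario_types, semantic_types_of):
    """Keep only the candidate concepts whose semantic types are relevant.

    `candidates` maps each statistically-related concept to its
    cooccurrence with the key concept; a concept survives if at least
    one of its semantic types appears in `scenario_types`.
    """
    return {
        concept: weight
        for concept, weight in candidates.items()
        if semantic_types_of(concept) & scenario_types
    }
```

With scenario_types set to the types linked to Disease or Syndrome by treats, this filter would keep penetrating keratoplasty and contact lens from Table 1 while dropping acute hydrops and corneal.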

Table 2(a) and (b) lists the concepts that treat and diagnose keratoconus, respectively. We derived these concepts using the process described above and show the top-15 concepts in terms of their correlation with keratoconus. To highlight the effectiveness of the knowledge-based filtering process, we can compare the concepts in Table 2 with the statistically correlated concepts in Table 1. Five of those 15 statistically correlated concepts are kept in Table 2(a), whereas two are kept in Table 2(b). This comparison suggests that the knowledge structure is effective in filtering out concepts that are not closely related to the treatment or diagnosis scenarios.

Table 2 Concepts that treat or diagnose keratoconus

Hypernym/hyponym expansion

The goal of knowledge-based query expansion is to append specialized terms that appear in relevant documents but not in the original query. The scenario-specific concepts derived in the previous subsection represent one subset of such specialized terms. Another set of highly relevant terms contains the hypernyms/hyponyms of the key concept c_key. For example, corneal ectasia, a hypernym of keratoconus, is frequently mentioned by documents regarding keratoconus, treatment. Therefore, our technique should also expand those concepts that are close to c_key in the hypernym/hyponym hierarchy.

To append hypernyms/hyponyms of the key concept to the original query, we again refer to the UMLS knowledge source. The Metathesaurus component defines not only the concepts but also the hypernym/hyponym relationships among them. For example, Fig. 3 shows the hypernyms (parents), hyponyms (children), and siblings of the concept keratoconus, where the siblings of a concept are those concepts that share a parent with it. Through empirical study (discussed later), we have found that expanding the direct parents, direct children, and siblings of the key concept generates the best retrieval performance, compared to expanding parents/children that are two or more levels away. Therefore, in the rest of our discussion, we focus on expanding only the direct parents/children and siblings.

Fig. 3. The direct parents, direct children, and siblings for keratoconus
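A sketch of how these hierarchy neighbors might be collected, assuming the Metathesaurus hypernym/hyponym relationships have been loaded into a hypothetical parents_of map (concept → set of direct parents):

```python
def hierarchy_neighbors(concept, parents_of):
    """Direct parents, direct children, and siblings of `concept`.

    Siblings are concepts (other than `concept` itself) that share
    at least one direct parent with it.
    """
    parents = parents_of.get(concept, set())
    children = {c for c, ps in parents_of.items() if concept in ps}
    siblings = {c for c, ps in parents_of.items()
                if c != concept and ps & parents}
    return parents | children | siblings
```

For keratoconus, this would collect the concepts shown in Fig. 3.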

Weight adjustment for expansion terms

To match a query and a document using the Vector Space Model (VSM), we represent both the query and the document as vectors. Each term in the query becomes a dimension in the query vector and receives a weight that quantifies the importance of this term in the entire query. Under this model, any additional term appended to the original query needs to be assigned a weight. An appropriate weighting scheme for these additional terms is important because “underweighting” will make the additional terms insignificant compared to the original query, leading to minor changes in the ranking of the retrieval results. Conversely, “overweighting” will make the additional terms improperly significant and cause a “topic drift” away from the original query.

In the past, researchers have proposed weighting schemes for these additional terms based on the following intuition: the weight for an additional term c_a should be proportional to its correlation with the original query terms. In our problem, the weight w_a for c_a is proportional to its correlation with the key concept c_key, i.e.,

$$ {\it w}_a = co(c_{\rm a} ,c_{{\rm key}} ) \cdot {\it w}_{\rm key} $$
(1)

In Eq. (1), the correlation between c_a and c_key, co(c_a, c_key), is derived using the methods described in Section 3.1, and w_key denotes the weight assigned to the key concept c_key. In Section 4.1, we further explain how w_key is determined according to a common weighting scheme. Given that co(c_a, c_key) lies in [0, 1], the weight that c_a receives will not exceed that of c_key. Using this equation, we compute the weights for the terms that statistically correlate with keratoconus (Table 1) and for those that treat keratoconus (Table 2(a)). We list these weights in Table 3(a) and (b), respectively, assuming the weight of the key concept keratoconus (i.e., w_key) is 1.

Table 3 Weights for sample expansion concepts

Weight boosting

In our experiments, we will compare the retrieval effectiveness of knowledge-based query expansion with that of statistical expansion. Since the knowledge-based method applies a filtering step to derive a subset of all statistically-related terms, the impact of this subset on retrieval effectiveness will be smaller than that of the entire set of statistically-related terms. Therefore, weight adjustments are needed to compensate for the filtering. For instance, in our example of keratoconus, treatment, the “cumulative weight” of all terms in Table 3(b) is obviously smaller than that of the terms in Table 3(a). To increase the impact of the terms derived by the knowledge-based method, we can “boost” their weights by multiplying by a linear factor β, so that the cumulative weight of those terms is comparable to that of the statistically-related terms. We refer to β as the boosting factor. With this factor, we alter Eq. (1), which assigns the weight for any additional term c_a, as follows:

$$ {\it w}_a = \beta \cdot co(c_a ,c_{{\rm key}} ) \cdot {\it w}_{{\rm key}} $$
(2)

We derive β based on the following intuition. We quantify the cumulative weight for both the statistical expansion terms (e.g., those in Table 3(a)) and the knowledge-based expansion terms (e.g., those in Table 3(b)). The former cumulative weight will be larger than the latter. We define β to be the former divided by the latter. In this way, the cumulative weight for the knowledge-based expansion terms equals that of the statistical expansion terms after boosting.

More specifically, we quantify the cumulative weight of a set of expansion terms using the length of the “expansion vector” composed of these terms. Here, we define the vector length according to standard vector space notation: let \(V^{{\rm KB}} = \langle {\it w}_1^{{\rm KB}} , \ldots ,{\it w}_k^{{\rm KB}} \rangle\) be the augmenting vector consisting solely of terms derived by the knowledge-based method, where \({\it w}_i^{{\rm KB}} \;(1 \le i \le k)\) denotes the weight of the ith term in knowledge-based expansion (Eq. (1)). Likewise, let \(V^{{\rm stat}} = \langle {\it w}_1^{{\rm stat}} , \ldots ,{\it w}_l^{{\rm stat}} \rangle\) be the augmenting vector consisting of all statistically-related terms. The process of deriving \(\{ {\it w}_1^{{\rm KB}} , \ldots ,{\it w}_k^{{\rm KB}} \}\) yields k < l; consequently, \(\{ {\it w}_1^{{\rm KB}} , \ldots ,{\it w}_k^{{\rm KB}} \} \subset \{ {\it w}_1^{{\rm stat}} , \ldots ,{\it w}_l^{{\rm stat}} \}\). Let \(|V^{{\rm KB}}|\) be the length of the vector \(V^{{\rm KB}}\), i.e.,

$$ |V^{{\rm KB}} | = \sqrt {({\it w}_1^{{\rm KB}} )^2 + ({\it w}_2^{{\rm KB}} )^2 + \cdots + ({\it w}_k^{{\rm KB}} )^2 } $$
(3)

Likewise, let \(|V^{{\rm stat}}|\) represent the length of the vector \(V^{{\rm stat}}\), computed as in Eq. (3). We define the boosting factor for \(V^{{\rm KB}}\) to be

$$ \beta = \frac{{|V^{{\rm stat}} |}}{{|V^{{\rm KB}} |}} $$
(4)

In our experiments, we will study the effects of boosting by comparing retrieval results with and without it. Furthermore, we are interested in studying the effects of different levels of boosting, to gain insight into the “optimal” boosting level. This motivates us to introduce a boosting-level-controlling factor α to refine Eq. (4):

$$ \beta _r = 1 + \alpha \cdot \left( {\frac{{|V^{{\rm stat}} |}}{{|V^{{\rm KB}} |}} - 1} \right) $$
(5)

where β_r is the refined boosting factor. The parameter α, ranging within [0, 1], controls the boosting scale. From Eq. (5), we note that β_r = 1 when α = 0, which represents no boosting; β_r increases as α increases, and as α reaches 1, β_r becomes \(\frac{{|V^{{\rm stat}} |}}{{|V^{{\rm KB}} |}}\). (In our experiments, we also evaluated settings with α > 1. As the results will show, the retrieval effectiveness is usually suboptimal compared to an α value within [0, 1].) Thus, we can use α to experimentally study the sensitivity of retrieval results with regard to boosting.
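The following sketch puts Eqs. (1)-(5) together, computing boosted weights for the knowledge-based expansion terms from the two candidate sets. The data layout (dicts mapping concepts to cooccurrence values) is our own; it assumes the filtered, knowledge-based set is a nonempty subset of the statistical set.

```python
from math import sqrt

def vector_length(weights):
    """Length of an expansion vector, as in Eq. (3)."""
    return sqrt(sum(w * w for w in weights))

def boosted_weights(kb_candidates, stat_candidates, w_key, alpha=1.0):
    """Weights for knowledge-based expansion terms, per Eqs. (1)-(5).

    `kb_candidates` and `stat_candidates` map each concept to its
    cooccurrence co(c, c_key); the former is the filtered subset of
    the latter and is assumed nonempty.
    """
    # Eq. (1): unboosted weights, proportional to cooccurrence
    kb = {c: co * w_key for c, co in kb_candidates.items()}
    stat = {c: co * w_key for c, co in stat_candidates.items()}
    # Eqs. (4)-(5): refined boosting factor; alpha = 0 means no boosting
    ratio = vector_length(stat.values()) / vector_length(kb.values())
    beta_r = 1 + alpha * (ratio - 1)
    # Eq. (2): boosted weights for the knowledge-based expansion terms
    return {c: beta_r * w for c, w in kb.items()}
```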

Experimental results

In this section, we experimentally evaluate the effectiveness of knowledge-based query expansion on two standard medical corpora. Our main focus is to compare the results of our technique with those of statistical expansion. We start with the experimental setup and then present the results under selected settings.

Experiment setup

Testbeds

A testbed for a retrieval experiment consists of three components: (1) a corpus (or a document collection), (2) a set of benchmark queries, and (3) relevance judgments indicating which documents are relevant for each query. Our experiment is based on the following two testbeds:

OHSUMED (Hersh et al., 1994). This testbed has been widely used in medical information retrieval research. OHSUMED consists of (1) a corpus, (2) a query set, and (3) relevance judgments for each query.

  • Corpus. The corpus consists of the abstracts of 348,000 MEDLINE articles from 1988 to 1992. Each document contains a title, an abstract, a set of Medical Subject Headings (MeSH), author information, publication type, source, a MEDLINE identifier, and a document ID. The MeSH headings are expert-assigned indexing terms drawn from a subset of UMLS concepts. In our experiment, we keep only the title and abstract to represent each document. We discard the MeSH headings in order to simulate a typical information retrieval setting in which no expert-assigned indexing terms are available.

  • Query set. The query set consists of 106 queries. Each query contains a patient description, an information request, and a query ID. Since we are interested in short, general queries, we use the information-request portion to represent each query. Among the 106 queries, we identified a total of 57 queries that are scenario-specific. In Table 4, we categorize these 57 queries based on the scenario(s) each query mentions and list the corresponding query IDs. (The full text of each query is shown in Liu and Chu (2006).) Note that a query mentioning multiple distinct scenarios appears multiple times in this table, once per scenario.

    Table 4 IDs of OHSUMED queries mentioning each scenario
  • Relevance judgments. For a given OHSUMED query, a document is judged by experts as definitely relevant (DR), partially relevant (PR), or irrelevant, or it is not judged at all (Hersh et al., 1994). In our experiments, we restrict retrieval to the 14,430 judged documents and count both the DR and PR documents as relevant answers when measuring the precision and recall of a particular retrieval method.

The McMaster Clinical HEDGES Database (Montori et al., 2003; Wilczynski et al., 2001; Wilczynski and Haynes, 2003; Wong et al., 2003). This testbed was originally constructed for the task of medical document classification instead of free-text query answering. As a result, adaptation is needed for our study. We will first describe the original dataset, and then explain how we adapted it to make it a usable testbed for our experimental evaluation.

  • Original dataset. The McMaster Clinical HEDGES Database contains 48,000 PubMed articles published in 2000. Each article was classified into the following scenario categories: treatment, diagnosis, etiology, prognosis, clinical prediction guide of a disease, economics of a healthcare issue, or review of a healthcare topic. Consensus about the classification was drawn among six human experts (Wilczynski et al., 2001). When the experts classified each article, they had access to the hardcopies of the full text. However, to construct a testbed for our retrieval system, we were only able to download the title and abstract of each article from the PubMed system. (The full text of each article is typically unavailable through PubMed.)

  • Construction of scenario-specific queries. Since the McMaster Clinical HEDGES Database was constructed to test document classification, it does not contain a query set. Using the following procedure, we constructed a set of 55 scenario-specific queries and derived relevance judgments for these queries by adapting the existing document classifications:

  • Step 1. We identified all the disease/symptom concepts in the OHSUMED query set, based on their semantic type information (defined by UMLS), and used them as the key concepts in constructing our scenario-specific queries for the McMaster testbed. In this selection, we manually filtered out eight concepts (out of an original 90) that we considered too general to make a scenario-specific query, e.g., infection, lesion, and carcinoma. After this step, we obtained 82 key concepts.

  • Step 2. For each key concept identified in Step 1, we constructed four scenario-specific queries, namely the treatment, diagnosis, etiology, and prognosis of a disease/symptom. For example, for the concept breast cancer, we constructed the queries breast cancer treatment, breast cancer diagnosis, breast cancer etiology, and breast cancer prognosis. We restricted our study to these four scenarios because our current knowledge source only covers these four scenarios.

  • Step 3. For each query generated in Step 2, we generated relevance judgments by applying the following simple criterion: a document is considered relevant to a given query if (1) experts have classified the document into the category of the query's scenario and (2) the document mentions the query's key concept. This criterion has been our best choice for automating the generation of relevance judgments on a relatively large scale; however, it may misidentify irrelevant documents as relevant. After we identified the relevant documents for each query, we further filtered out queries based on the intuition that a query with too few relevance judgments will lead to less reliable retrieval results (especially in terms of precision/recall). For example, for a query with only one relevant document, two similar retrieval systems may obtain completely different precision/recall results if one ranks the relevant document on top and the other accidentally ranks it out of the top-10. To implement this intuition, we filtered out queries with fewer than five relevant documents, leaving 55 queries. (A sketch of this criterion follows the list.)
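Below is a minimal sketch of the relevance criterion and query filter from Step 3. The field names (categories, text, key_concept, scenario) are illustrative, not the actual schema of the McMaster database.

```python
def relevant_docs_for(query, documents, min_judgments=5):
    """Relevant documents for a constructed McMaster query.

    A document is relevant if the experts classified it into the
    query's scenario category and it mentions the query's key concept
    (assumed here to be stored in lowercase). Queries with fewer than
    `min_judgments` relevant documents are discarded (None is returned).
    """
    relevant = [doc for doc in documents
                if query.scenario in doc.categories
                and query.key_concept in doc.text.lower()]
    return relevant if len(relevant) >= min_judgments else None
```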

Due to space limitations, we show the 55 McMaster queries together with the scenarios identified for each query in the extended version of this paper (Liu and Chu, 2006).

VSM and indexing

In Information Retrieval studies, indexing typically refers to the step of converting free-text documents and queries to representations comprehensible to a query-document similarity computation model, e.g., the Vector Space Model (VSM) (Salton and McGill, 1983) or a probabilistic retrieval model (Callan et al., 1992). In our study, we focused on experimental evaluation using the stem-based VSM (Salton and McGill, 1983), a VSM that is extensively applied in similar studies.

Using a stem-based VSM, both a query and a document are represented as vectors of word stems. Given a piece of free text, we first removed common stop words such as “a,” “the,” etc., and then derived word stems from the text using the Lovins stemmer (Lovins, 1968). We further applied the \(tf \cdot idf\) weighting scheme (more specifically, the \(atc \cdot atc\) scheme (Salton and Buckley, 1988)) to assign weights to stems in documents and in the query before expansion. (This weighting process yields w_key, the weight of the key concept in Eq. (1).)

Under the stem-based VSM, all terms expanded to a given query need to be in the word-stem format. Thus, for expansion concepts derived from procedures in Sections 3.2 and 3.3, we applied the following procedure to identify the corresponding word stems: For each expansion concept, we first looked up its string forms in UMLS. We further removed stop words and used the Lovins stemmer to convert the string forms into word stems. Lastly, we assigned weights to these expansion word stems using the method described in Section 3.4.
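The indexing pipeline can be sketched as follows. We substitute NLTK's Porter stemmer for the Lovins stemmer (which lacks an equally common implementation), use a small illustrative stop-word list, and compute an augmented tf·idf with cosine normalization that approximates, but does not exactly reproduce, the atc·atc scheme.

```python
import math
import re
from collections import Counter

from nltk.stem import PorterStemmer  # stand-in for the Lovins stemmer

STOP_WORDS = {"a", "an", "and", "in", "is", "of", "or", "the", "to"}
stemmer = PorterStemmer()

def to_stems(text):
    """Lowercase, drop stop words, and stem the remaining tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stemmer.stem(t) for t in tokens if t not in STOP_WORDS]

def stem_vector(text, doc_freq, n_docs):
    """An augmented tf-idf stem vector with cosine normalization."""
    counts = Counter(to_stems(text))
    if not counts:
        return {}
    max_tf = max(counts.values())
    vec = {}
    for stem, tf in counts.items():
        aug_tf = 0.5 + 0.5 * tf / max_tf            # augmented term frequency
        idf = math.log(n_docs / (1 + doc_freq.get(stem, 0)))
        vec[stem] = aug_tf * idf
    length = math.sqrt(sum(w * w for w in vec.values()))
    return {s: w / length for s, w in vec.items()} if length else vec
```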

Retrieval performance

In the following, we study the performance improvement of knowledge-based expansion compared to that of statistical expansion. We first study the improvements for selected expansion sizes, then study the sensitivity of boosting for selected query scenarios.

The retrieval performance is measured using the following three metrics (a sketch of their computation follows the list):

  • avgp—11-point precision average (precision averaged over the 11 standard recall points (Salton and McGill, 1983))

  • p@10—precision in top-10 retrieved documents

  • p@20—precision in top-20 retrieved documents
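The sketch below shows how these metrics can be computed for a single query, given the ranked list of retrieved document IDs and the set of relevant document IDs; avgp uses interpolated precision (the maximum precision at or beyond each of the 11 standard recall points).

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def eleven_point_avgp(ranked, relevant):
    """Precision averaged over recall levels 0.0, 0.1, ..., 1.0."""
    hits, points = 0, []
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    interpolated = []
    for level in (r / 10 for r in range(11)):
        precisions = [p for recall, p in points if recall >= level]
        interpolated.append(max(precisions) if precisions else 0.0)
    return sum(interpolated) / 11
```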

Expansion sizes

For a given expansion size s, we used both knowledge-based expansion and statistical expansion to append the top-s stems with the heaviest weights. For knowledge-based expansion, no weight boosting was applied at this stage.

We computed the three metrics for both methods on the OHSUMED and McMaster testbeds and averaged the results over the queries in each testbed. Table 5 shows the performance comparison of the two methods on both testbeds under the various metrics. The first row in each subtable shows the performance of statistical expansion, whereas the second row shows that of knowledge-based expansion and its percentage of improvement over statistical expansion.

Table 5 Performance comparison of the two expansion methods under various expansion sizes

In these tables, “s=All” means appending all expansion terms that have a nonzero weight (Eq. (2)) to the original query. Using the knowledge-based method, setting “s=All” led to expanding 1717 terms per query on average, with a standard deviation of 1755; using the statistical method, it led to an average of 50,317 terms with a standard deviation of 15,243.

From these experimental results, we observe the following: The performance for knowledge-based expansion generally increases as s increases and usually reaches the peak when s=All. (The only exception is in the case of using the avgp metric on the McMaster testbed, in which the performance of the knowledge-based method roughly remains stable as s increases.) However, the performance of the statistical method degrades as s becomes larger. On the OHSUMED testbed, its performance degrades after s=100 (Table 5(a)) or s=200 (Table 5(b) and (c)); on the McMaster testbed, the performance starts degrading almost immediately after s becomes greater than 20. This is due to the fact that statistical expansion does not distinguish between expansion terms that are scenario-specific and those that are not. As a result, as more terms are appended to the original query, the negative effect of including those non-scenario-specific terms begins to accumulate and after a certain point, the performance drops. In contrast, the knowledge-based method appends scenario-specific terms only, and consequently, the performance of the knowledge-based method keeps increasing as more “useful” terms are appended.

We have also compared the statistical expansion method with no expansion to understand the general effectiveness of query expansion on the scenario-specific queries we chose. Due to space limitations, we have not included the results of this comparison. In general, statistical expansion consistently outperforms the no-expansion method by more than 5%, which represents a significant improvement. In other words, the statistical expansion method that we compare against already generates reasonably good retrieval results.

The effectiveness of weight boosting

In the next experiments, we multiplied the weights of knowledge-based expansion terms by a boosting factor (Eq. (2)). The boosting factor β is computed using Eq. (5) under the settings α=0.25, 0.5, 0.75, 1, 1.25, and 1.5. Tables 6–11 show the effects of different boosting amounts on the performance of knowledge-based query expansion, under the three metrics and for the two testbeds. Each cell in these tables shows (1) the performance of knowledge-based expansion and (2) the percentage of improvement of knowledge-based expansion over statistical expansion under the same expansion size. In these tables, thick-bordered cells represent the best performance in that column (i.e., under the same expansion size); shaded cells represent the best performance in that row (i.e., under the same boosting factor). The best performance in the entire table is highlighted in the shaded and thick-bordered cell.

Table 6 Weight boosting for the OHSUMED testbed, measured by avgp
Table 7 Weight boosting for the OHSUMED testbed, measured by p@10
Table 8 Weight boosting for the OHSUMED testbed, measured by p@20
Table 9 Weight boosting for the McMaster testbed, measured by avgp
Table 10 Weight boosting for the McMaster testbed, measured by p@10
Table 11 Weight boosting for the McMaster testbed, measured by p@20

The following observations can be made from these results:

  • For the OHSUMED testbed, the best performance within each column (the thick-bordered cells) generally falls in the range from α=0.5 to α=1.25. This indicates that boosting helps improve the performance of knowledge-based expansion. In particular, boosting introduces significant improvements under the p@10 and p@20 metrics. We note that setting α=0.5 or 0.75 generally yields the best boosting effect for the avgp metric, while setting α=1 or 1.25 yields better performance for the p@10 and p@20 metrics.

  • For the McMaster testbed, however, boosting seems to be less effective: The best performance within each column falls in the range from α=0 to α=0.75.

  • For both testbeds, if we fix the boosting factor, the best performance within each row (the shaded cells) is generally achieved by making the expansion size s as large as possible (with the exception of the avgp metric measured on McMaster). This is consistent with the results reported in the previous experiments.

Sensitivity of performance improvements with query scenarios

We now study how knowledge-based expansion performs for different query scenarios. For the OHSUMED testbed, we grouped the 57 queries according to their scenarios, and further selected the five largest groups of scenarios, namely treatment, diagnosis, pathophysiology of a disease, differential diagnosis of a symptom/disease, and complications of a disease/medication. We skipped the remaining scenarios because each of these scenarios has too few queries to derive reliable statistics. (The number of queries that belong to each scenario can be easily counted from Table 4.) Similarly, we grouped the 55 McMaster queries based on the four scenarios they belong to: namely treatment, diagnosis, etiology, and prognosis of a disease.

We average the performance of knowledge-based expansion within each group of queries and show the avgp, p@10, and p@20 results in Tables 12–14. Each cell in these tables shows (1) the performance of knowledge-based expansion averaged over the corresponding group of queries, under the corresponding boosting setting (α), and (2) the percentage of improvement of knowledge-based expansion over statistical expansion under the same settings. For example, the shaded cell in Table 12 shows that among the 35 treatment OHSUMED queries, under the boosting setting of α=0.75, knowledge-based expansion achieves an average avgp of 0.474, a 6.0% improvement over the statistical method measured within the same group of queries.

Table 12 Performance improvements for selected scenarios measured by avgp for the OHSUMED testbed. Expansion size s=All
Table 13 Performance improvements for selected scenarios measured by p@10 for the OHSUMED testbed. Expansion size s=All
Table 14 Performance improvements for selected scenarios measured by p@20 for the OHSUMED testbed. Expansion size s=200

To derive the results in Tables 12–14, we set the expansion size s=All, All, and 200, respectively; for the results in Tables 15–17, we set the expansion size s=20, All, and 50. These settings are based on our observations in the previous subsection, where the knowledge-based method tends to perform best with these expansion sizes under the corresponding evaluation metrics.

Table 15 Performance improvements for selected scenarios measured by avgp for the McMaster testbed. Expansion size s=20
Table 16 Performance improvements for selected scenarios measured by p@10 for the McMaster testbed. Expansion size s=All
Table 17 Performance improvements for selected scenarios measured by p@20 for the McMaster testbed. Expansion size s=50

These results generally suggest that knowledge-based expansion performs differently for queries with different scenarios. More specifically, the method yields larger improvements in scenarios such as treatment, differential diagnosis, and diagnosis, and smaller improvements in scenarios such as complication, pathophysiology, etiology, and prognosis. An explanation lies in the different knowledge structures for these scenarios. The knowledge structures (i.e., the fragments of the UMLS Semantic Network such as Fig. 2) for the latter four scenarios were originally missing in UMLS and were acquired from experts. (We present the details of this knowledge acquisition process in Section 5.) These acquired structures have more semantic types marked as relevant than those for the former three scenarios. As a result, when handling queries with the latter four scenarios, the knowledge-based method keeps more concepts during the filtering step. The expansion result then resembles that of statistical expansion, leading to nearly equivalent performance between the two methods and smaller improvements. We believe that a refined clustering and ranking of the knowledge structures for these four scenarios (i.e., complication, pathophysiology, etiology, and prognosis) would increase the improvements in retrieval performance.

Discussion of results

Choice of α for weight boosting

Our experimental results in Tables 6–11 suggest that weight boosting is helpful in improving retrieval performance. Further, the results shown in Tables 12–17 suggest that the effect of weight boosting is sensitive to the query scenario. Certain query scenarios, such as treatment and diagnosis, are associated with more compact knowledge structures, which lead to significantly fewer expansion concepts under our knowledge-based method than under statistical expansion. In these scenarios, setting α between 0.75 and 1.25, which represents more aggressive weight boosting, achieves noticeable improvements. In other scenarios associated with less compact knowledge structures, e.g., complication, the difference between the set of expansion concepts produced by our method and that produced by statistical expansion is insignificant. As a result, the cumulative weights of the two sets of expansion concepts are close to each other. For such scenarios, our experimental data suggest a more conservative weight boosting, with \(\alpha \in [0,0.5].\)

Comparison with previous knowledge-based query expansion studies

Our research differs from most knowledge-based query expansion studies (Hersh et al., 2000; Guo et al., 2004; Plovnick and Zeng, 2004) in the baseline method used for comparison. Most existing studies compare only against a baseline with no query expansion. Such studies expand the synonyms, hypernyms, and hyponyms of the original query concepts, and usually report an insignificant improvement (Guo et al., 2004) or even degraded performance (Hersh et al., 2000) compared to no expansion. In contrast, our study compares against statistical expansion, which, in our experimental setup, improves over no expansion by at least 5%.

In Aronson and Rindflesch's study (Aronson and Rindflesch, 1997), the researchers applied the UMLS Metathesaurus to automatically expand synonyms to the original query. In one particular setup, their approach achieved a 5% improvement over a previous study (Srinivasan, 1996) which applied statistical expansion on the same testbed. This result indicates the value of human knowledge in query expansion, and generally aligns with the observation in our experiments. We note that the difference between their research and ours is that their approach is limited to expanding synonyms only, and is not scenario-specific as we have presented in Section 1.

Knowledge acquisition

The quality of our knowledge-based method largely depends upon the quality and completeness of the domain-specific knowledge source. The knowledge structure in the UMLS knowledge base is not specifically designed for scenario-specific retrieval. As a result, we discovered some frequently asked scenarios (e.g., etiology or complications of a disease) that are either undefined in UMLS, or defined but with incomplete knowledge. Therefore, we developed a methodology to acquire knowledge and to supplement the UMLS knowledge source. The methodology consists of the following two steps:

  1. Acquire knowledge for undefined scenarios to supplement the UMLS knowledge source.

  2. Refine the knowledge of the scenarios defined in the UMLS knowledge source (including the knowledge supplemented by Step 1).

Knowledge acquisition methodology

Knowledge acquisition for undefined scenarios

For an undefined scenario, we present to medical experts an incomplete relationship graph as shown in Fig. 4. Edges in this relationship graph are labeled with one of the undefined scenarios, e.g., “etiology.” The experts will fill in the question marks with existing UMLS semantic types that fit the relationship. For example, because viruses are related to the etiology of a wide variety of diseases, the semantic type “Virus” will replace one of the question marks in Fig. 4. This new relationship graph (etiology of diseases) will be appended to the UMLS Semantic Network, and can be used for queries with the “etiology” scenario.

Fig. 4. A sample template to acquire knowledge for previously undefined scenarios
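One simple way to record the acquired knowledge is a map from each scenario to its set of relevant semantic types, to which newly acquired relationship graphs are appended. The entries below are illustrative examples only, not the full acquired lists reported in the Appendix.

```python
# Semantic types judged relevant to each scenario (examples only)
scenario_types = {
    "treatment": {"Therapeutic or Preventive Procedure",
                  "Medical Device",
                  "Pharmacologic Substance"},
}

def add_acquired_scenario(scenario, expert_types):
    """Append a newly acquired relationship graph for a scenario."""
    scenario_types[scenario] = set(expert_types)

# e.g., after interviewing the experts about the etiology scenario:
add_acquired_scenario("etiology", {"Virus", "Bacterium"})
```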

Knowledge refinement through relevance judgments

A relationship graph for a given scenario (either previously defined by UMLS or newly acquired from Step 1) may fail to include all relevant semantic types. A hypothetical example of this incompleteness would be a missing treats relationship between Therapeutic or Preventive Procedure and Disease or Syndrome. Our basic idea in amending this incompleteness is to exploit the “implicit” knowledge embedded in the relevance judgments of a standard IR testbed. Such a testbed typically provides a set of benchmark queries and, for each query, a prespecified set of relevant documents. To amend the knowledge structure for a certain scenario, e.g., treatment, we focus on sample queries that are specific to this scenario, e.g., keratoconus treatment. We then study the content of documents that are marked as relevant to these queries. From the content, we can identify concepts that are directly relevant to the query's scenario. If the semantic types for those concepts are missing in the knowledge structure, we refine the knowledge structure by adding the corresponding semantic types. For example, consider a hypothetical case where the type Therapeutic or Preventive Procedure is missing from the knowledge structure of Fig. 2. If, by studying the sample query keratoconus treatment, we find several “Therapeutic or Preventive Procedure” concepts, such as penetrating keratoplasty and epikeratoplasty, appearing in relevant documents, we can identify Therapeutic or Preventive Procedure as a relevant semantic type and append it to Fig. 2.

Given that a typical benchmark query has a long list of relevant documents, it is labor-intensive to study the content of every relevant document. One way to accelerate this process is to first apply the incomplete knowledge structure to perform knowledge-based query expansion and run retrieval tests based on that expansion. An incomplete knowledge structure leads to an “imperfect” query expansion, which, in turn, fails to bring certain relevant documents to the top of the ranked list. Comparing this ranked list with the gold standard and identifying the missing relevant documents gives us pointers to the incomplete knowledge. For example, failure to include Therapeutic or Preventive Procedure in the knowledge structure of Fig. 2 prevents us from expanding concepts such as penetrating keratoplasty into the sample query keratoconus, treatment. As a result, documents focusing on penetrating keratoplasty will be ranked unfavorably low. After we identify such documents, we can discover the missing expansion concepts contributing to the low rankings and refine the knowledge structure as described above.
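This refinement step can be partially automated along the lines of the following sketch, which ranks the semantic types of concepts found in relevant documents that were ranked unfavorably low. The helpers concepts_in and semantic_types_of are hypothetical, and the support threshold is our own choice; the returned candidates would still be confirmed by a domain expert.

```python
from collections import Counter

def suggest_missing_types(missed_docs, concepts_in, semantic_types_of,
                          known_types, min_support=3):
    """Rank semantic types possibly missing from a scenario's structure.

    `missed_docs` are relevant documents that the expanded query failed
    to rank highly; types already in `known_types` are skipped, and the
    rest are ranked by the number of missed documents they appear in.
    """
    support = Counter()
    for doc in missed_docs:
        types_in_doc = set()
        for concept in concepts_in(doc):
            types_in_doc |= semantic_types_of(concept) - known_types
        support.update(types_in_doc)
    return [t for t, n in support.most_common() if n >= min_support]
```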

Knowledge acquisition process

We chose the 57 scenario-specific queries (Table 4) in the OHSUMED testbed to apply our proposed knowledge-acquisition method because of the following considerations:

  • The OHSUMED queries were collected from physicians treating patients in a clinical setting. Therefore, the OHSUMED query scenarios should be representative of clinical practice, and the knowledge acquired for these scenarios should be broadly applicable.

  • The knowledge-acquisition methodology also requires exploring relevance judgments for a set of benchmark queries. OHSUMED is the largest testbed for medical free-text retrieval that has relevance judgments for knowledge refinement.

We have identified 12 OHSUMED scenarios whose knowledge structures are missing in UMLS. We applied the two-step knowledge-acquisition method to acquire the knowledge structures for these 12 undefined scenarios and to refine the knowledge structures for all scenarios. During the first step of the acquisition process, we interviewed two intern doctors with M.D. degrees at the UCLA School of Medicine. During the interview, we first described the meaning of the relationship graphs, as in Fig. 4. Afterwards, we presented the entire list of UMLS semantic types to the experts so that appropriate semantic types could be filled into the question marks. We communicated the results from one expert to the other until they reached a consensus for each scenario. For the second step of knowledge acquisition, we performed retrieval tests on the OHSUMED testbed using queries expanded by the knowledge-based method and queries expanded with all statistically-related concepts. We focused on 12 queries where the statistical method outperformed the knowledge-based method in terms of precision in the top-10 results. We then applied the method presented in the previous section to study the content of the top-ranked documents and augmented the knowledge structure for the corresponding scenario with appropriate semantic types.

Knowledge acquisition results

The acquisition results are shown in Table 18. Due to space constraints, we only provide a statistical summary of the results. The Appendix presents the results in full detail.

Table 18 Knowledge acquisition results

The scenarios in the first three rows, i.e., treatment, diagnosis, and prevention, are originally defined by UMLS. The first column in these rows shows the number of semantic types marked as relevant for each scenario (i.e., the number of semantic types that experts have filled into the blank rectangles of Fig. 4). The second column for these rows is “N/A” because there was no need to acquire knowledge structures from domain experts for these scenarios. The third column shows the number of semantic types added during knowledge refinement (the second step of knowledge acquisition). For example, for the diagnosis scenario, two additional semantic types, Laboratory or Test Result and Biologically Active Substance, were added because of the study on Query #97: Iron deficiency anemia, which test is the best. The absence of these two types had prevented the knowledge-based method from expanding two critical concepts into the original query: serum ferritin and Fe iron, each belonging to one of the two semantic types. From the relevance judgment set, we noted that missing these two concepts led to the low ranking of three relevant documents that heavily use them.

Starting from the fourth row, we list the scenarios for which we needed to acquire knowledge structures from domain experts. The first column for these scenarios is “N/A” because these scenarios are originally undefined in UMLS. The second column shows the number of semantic types that experts filled into the structure template of Fig. 4. The third column shows the number of additional semantic types from knowledge refinement (the second step of knowledge acquisition), and the last column shows the total number of semantic types after knowledge acquisition.

The proposed knowledge-acquisition method proved efficient and effective on the OHSUMED testbed. We finished communicating with domain experts to acquire the knowledge structures for the 12 scenarios in less than 20 h, and spent an additional 20 h refining the knowledge structures by exploring the relevance judgments. We applied the augmented knowledge source in our knowledge-based query expansion experiments, where it proved effective in improving the retrieval performance of the knowledge-based method over the statistical expansion method.

Study of the relevancy of expansion concepts by domain experts

Through experiments on the two standard medical text retrieval testbeds, we have observed that under most retrieval settings knowledge-based query expansion outperforms statistical expansion. Our conjecture is that knowledge-based query expansion selects expansion concepts more specific to the original query's scenario than statistical expansion does. To verify this conjecture, we asked domain experts to manually evaluate the relevancy of expansion concepts.

The basic idea for this study is the following: for each query in a given retrieval testbed, we apply the two query expansion methods to generate two sets of expansion concepts. We then prepare an evaluation form that inquires about the relevancy of each expansion concept to the original query. In this form, we present the query's text and ask domain experts to judge the relevancy based on the query's scenario(s). For each concept, we provide four scales of relevancy: relevant, somewhat relevant, irrelevant, or do not know. A concept is marked as somewhat relevant if it is indirectly related to the original query or conditionally relevant in certain clinical cases. We blind the experts to the method used to generate each concept, reducing any bias an expert might have toward a particular method.

To implement this idea, we chose the 57 scenario-specific queries in the OHSUMED testbed. We applied the two expansion methods and derived from each method the 40 expansion concepts with the highest weights. We presented the evaluation form consisting of these concepts to three medical experts, all intern doctors at the UCLA School of Medicine. We asked them to make judgments only on those queries that belong to their areas of expertise, e.g., oncology, urology, etc. On average, each expert judged the expansion concepts for 15 queries. Thus, for each expansion method, we obtained 1600 expansion concepts, each classified into one of the four categories.

Fig. 5. Relevancy of expansion concepts created by statistical expansion

Fig. 6. Relevancy of expansion concepts created by knowledge-based expansion

Figures 5 and 6 summarize the results of this human subject study as histograms over the four scales of relevancy, one histogram per expansion method. We note that 56.9% of the expansion concepts derived by the knowledge-based method are judged as either relevant or somewhat relevant, whereas only 38.8% of the expansion concepts from statistical expansion are judged similarly, a 46.6% relative improvement. This result validates that knowledge-based query expansion derives expansion concepts more relevant to the original query's scenario(s) than statistical expansion does, and thus yields improved retrieval results for scenario-specific queries.


Conclusion

Scenario-specific queries represent a special type of query that is frequently used in medical free-text retrieval. In this research, we have proposed a knowledge-based query expansion method to improve the retrieval performance for such queries. We have made the following contributions:

  • We have developed a methodology that exploits the knowledge structures in the UMLS Semantic Network and the UMLS Metathesaurus to identify concepts that are specifically related to the scenario(s) in the query. Appending such identified concepts to the query results in scenario-specific expansion.

  • We have developed an efficient and effective methodology for knowledge acquisition to supplement and refine the knowledge source.

  • We have performed extensive experimental evaluation of the retrieval performance of knowledge-based query expansion by comparing with that of statistical expansion. Our experimental studies reveal that:

  • Knowledge provided by UMLS is useful in creating scenario-specific query expansion, leading to improvements of over 5% compared with statistical expansion in the majority of cases studied. Such improvements are significant since statistical expansion itself outperforms the no-expansion method by at least 5% in our experimental setup.

  • Since knowledge-based expansion tends to expand fewer terms into the original query, boosting the weights of these terms is necessary to generate improvements over the statistical method.

  • Because the knowledge structures defined for different query scenarios exhibit different characteristics, the performance improvements of the knowledge-based expansion method differ for these scenarios.

The focus of this research is to support scenario-specific queries in the medical domain. Scenario-specific queries can appear in other domains as well. In extending our research to other domains, we note that the quality of domain knowledge is important to the performance of our method. In certain domains where such knowledge is not readily available, the success of our approach depends on the knowledge acquisition process which is resource-intensive. This represents a limitation of our current approach in terms of extensibility across different domains, and is a worthy topic for future research.

Appendix: Knowledge acquisition results

We first consulted medical experts to acquire the set of semantic types relevant to each scenario that was previously undefined by UMLS (e.g., “etiology of disease”). Table 19 lists the acquisition results. Due to space limitations, we only provide the ID of each semantic type; the definitions can be found in the “SRDEF” table of UMLS. We further performed knowledge refinement through relevance judgments and present the refinement results in Table 20. The second column in that table shows the additional semantic types added to the corresponding scenario during this step.

Table 19 The set of semantic types relevant to each scenario, acquired by consulting medical experts

Table 20 The set of semantic types relevant to each scenario, acquired from knowledge refinement by exploring relevance judgments