A retrospective study of a hybrid document-context based retrieval model
Introduction
Various retrieval models based on a variety of mathematical frameworks have been developed and investigated over the past several decades (Dominich, 2000). For example, Salton, Wong, and Yang (1975) and Wong, Ziarko, and Wong (1985) developed retrieval models based on vector spaces. The Binary Independence Model (BIM) (Robertson & Sparck-Jones, 1976), the logistic regression model (Cooper, Gey, & Dabney, 1992), the 2-Poisson model (Harter, 1975) and its later practical approximation (Robertson & Walker, 1994), and the language modelling approach (Lafferty & Zhai, 2001; Lavrenko & Croft, 2001; Ponte & Croft, 1998) are based on probability theory. The fuzzy retrieval model (e.g., Miyamoto, 1990) and the extended Boolean model (Salton, Fox, & Wu, 1983) are based on fuzzy set theory (Zadeh, 1965). These models take a system point of view of how to retrieve documents that are relevant enough to satisfy a user's information need. Alternatively, an information retrieval system can be thought of as simulating the human user making relevance decisions in the retrieval process (Bollmann & Wong, 1987). In this view, documents are ranked by their relevance to the user's information need in terms of preferences (Yao & Wong, 1991).
In this work, we simulate human relevance decision making by developing a novel retrieval model that explicitly models a human relevance decision at each location in a document. The relevance decision at a given location is based on the context at that location, so the relevance decision preference (or context score) at that location is estimated from its context. Although using contexts in documents to explore term co-occurrence relationships for query expansion is not new, to the best of our knowledge no prior retrieval model explicitly models context/window features by incorporating the locations of terms inside a document to re-weight the query terms. By re-weighting the query terms using their contexts in documents, the model assigns context-dependent term weights, which are aggregated into the final document similarity score.
A document context is essentially a concordance or a keyword in context (KWIC) (Kupiec, Pedersen, & Chen, 1995). Fig. 1 shows some example document contexts containing a query term from the title query, "Hubble Telescope Achievements". The contexts were extracted from a raw (unprocessed) document. During retrieval, unlike in Fig. 1, all terms are stemmed and stop words are removed. It should be noted from Fig. 1 that even in a relevant document, not all contexts are relevant.
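As a concrete illustration (not the authors' code), a minimal sketch of extracting such fixed-size document contexts around each query-term occurrence might look as follows; the function name, window size, and whitespace tokenization are our own assumptions:

```python
# Hypothetical sketch: extract one KWIC-style context window per
# query-term occurrence; each context is identified by the term's location.
def extract_contexts(doc_tokens, query_terms, half_window=5):
    """Return (location, window) pairs, one per query-term occurrence."""
    contexts = []
    for pos, token in enumerate(doc_tokens):
        if token in query_terms:
            lo = max(0, pos - half_window)
            hi = min(len(doc_tokens), pos + half_window + 1)
            contexts.append((pos, doc_tokens[lo:hi]))
    return contexts

doc = "the hubble space telescope returned sharp images after the repair".split()
print(extract_contexts(doc, {"telescope"}, half_window=2))
```

In practice the tokens would first be stemmed and stop-word filtered, as the paper describes.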
Our novel model uses successful existing retrieval models and techniques to estimate the relevance decision preferences (or context scores) of document contexts containing a query term at the center. The relevance decision preferences are defined as log-odds estimated using smoothing techniques, and they are combined using aggregation operators. More specifically, we use smoothing (Chen & Goodman, 1996; Zhai & Lafferty, 2004) to solve the problem of zero probabilities (Ponte & Croft, 1998) when estimating the term distributions in relevant documents, similar to the language models (Lafferty & Zhai, 2001; Lavrenko & Croft, 2001; Ponte & Croft, 1998). We calculate the probability of relevance of a particular document context similarly to the BIM (Robertson & Sparck-Jones, 1976). To calculate the document score for ranking, the document-context log-odds are combined using different evidence aggregation operators based on the extended Boolean model (Salton et al., 1983) and some fuzzy (aggregation) operators (Dombi, 1982; Paice, 1984; Yager, 1988). Our proposed retrieval model is therefore a hybrid of various past successful retrieval models and techniques.
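To make the two estimation steps concrete, the following is a minimal, hypothetical sketch (not the authors' implementation): a BIM-style log-odds score per context, with simple additive smoothing standing in for the cited smoothing techniques, and a generalized-mean (p-norm style) aggregation in the spirit of the extended Boolean model. All function names, the smoothing constant, and the vocabulary estimate are our own assumptions:

```python
import math

def context_log_odds(context, rel_counts, irr_counts, rel_total, irr_total,
                     alpha=0.5):
    """BIM-style log-odds of a context, with additive smoothing to avoid
    zero probabilities in the relevant/irrelevant term distributions."""
    vocab = len(rel_counts) + len(irr_counts) + 1  # crude vocabulary-size estimate
    score = 0.0
    for term in context:
        p_rel = (rel_counts.get(term, 0) + alpha) / (rel_total + alpha * vocab)
        p_irr = (irr_counts.get(term, 0) + alpha) / (irr_total + alpha * vocab)
        score += math.log(p_rel / p_irr)
    return score

def generalized_mean(scores, p=2.0):
    """p-norm style aggregation: p=1 is the average, large p approaches max."""
    if not scores:
        return 0.0
    shifted = [max(s, 0.0) for s in scores]  # keep values non-negative for the p-norm
    return (sum(s ** p for s in shifted) / len(shifted)) ** (1.0 / p)
```

A term more frequent in relevant than irrelevant documents contributes a positive log-odds, and the choice of p interpolates between averaging all context scores and taking only the best one.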
In predictive experiments, a major difficulty in developing novel retrieval models is determining whether effectiveness is limited by the underlying model or by poor parameter estimation. Instead of predictive experiments, we propose to evaluate our novel retrieval model with retrospective experiments that use relevance information (e.g., the TREC relevance judgments), similar to the retrospective experiments of Robertson and Sparck-Jones (1976), Sparck-Jones et al. (2000), and Hiemstra and Robertson (2001). The purposes of the retrospective experiments are to:
- (a) evaluate the potential of the underlying novel retrieval model by observing the best effectiveness that it can attain;
- (b) reveal the (near) optimal performance of the model and provide a yardstick for future (predictive) experiments. Under the probability ranking principle (Robertson, 1977), full relevance information can enable the model to obtain optimal performance (Hiemstra & Robertson, 2001);
- (c) identify the crucial factors (e.g., the size of the context) affecting the performance of the model when using the context of query terms in a document. We gather statistical data on these factors for analysing and designing the model to operate in predictive experiments;
- (d) show whether the model obeys the probability ranking principle (Robertson, 1977); and
- (e) examine the relevance decision principles in Kong, Luk, Lam, Ho, and Chung (2004) and determine which is the most suitable for simulating the human user making relevance decisions.
The problem of estimating parameters with limited or no relevance information is left for future work, since it is not yet known whether the proposed model merits further investigation. When considering the terms in relevant documents, we discard terms whose document frequency equals one. This avoids finding identifiers (e.g., a document id) that uniquely identify relevant documents, which would guarantee high precision whenever relevance information is present. Instead, we want the model to exploit the term distributions in relevant and irrelevant documents for retrieval.
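A minimal sketch of this filtering step, under our own assumptions about the data layout (documents as token lists; the function name is hypothetical):

```python
from collections import Counter

def filter_singleton_terms(docs):
    """Return the set of terms with document frequency > 1, so that unique
    identifiers (e.g., a document id) cannot act as perfect predictors of
    relevance in the retrospective experiments."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term at most once per document
    return {term for term, n in df.items() if n > 1}
```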
We emphasize that our document-context based model, as presented in this paper, is a descriptive model, even though it could become a normative one. A descriptive model describes how a decision is made, while a normative model specifies how the (optimal) decision should be made. Our document-context based model is descriptive here because it does not feed back any effectiveness performance information (e.g., MAP) to the system for performance optimization (e.g., query optimization (Buckley & Harman, 2003) or model parameter optimization). For instance, our retrieval model directly estimates the probabilities without any effectiveness feedback on whether the estimates are good for document ranking. Also, the retrieval process of our model is a one-pass re-ranking process using the proposed ranking formula (discussed in detail in Section 2), which describes how the relevance decision is made.
One may argue that if we know the relevance information, then the retrieval effectiveness must be good and the experiments are pointless. However, as mentioned above, we are not finding identifiers of relevant documents (terms with document frequency equal to one are ignored). The descriptive model does not optimize the query or the model parameters using effectiveness results from previous runs. Moreover, retrieval performance is not guaranteed to be good even when the relevance information is known: in Hiemstra and Robertson (2001), the performances of the retrospective experiments are similar to those of the predictive experiments (Robertson & Walker, 1999), because terms in the relevant documents may also appear in the irrelevant documents. By using the relevance information, we are not manipulating or restricting the term distributions/occurrences in the documents, but using existing probabilistic methods to estimate them. Furthermore, we tested our model on different document collections (TREC-2, TREC-6, TREC-7, TREC-2005 and NTCIR-5) to show that it is reliable. Finally, the retrospective experiments also provide an important clue to the potential of the retrieval model, because an applicable model should perform well in the presence of relevance information. The use of relevance information can reveal the (near) optimal performance, and relevance information can be estimated using various techniques such as pseudo relevance feedback.
We will not examine the time-efficiency of our retrieval model or retrieval system because:
- (a) it is already very challenging to design and develop a highly effective retrieval model;
- (b) once an effective retrieval model is developed, we have enough information to design and develop (novel) index structures to support it; and
- (c) the time-efficiency problem may diminish in significance over time as computers continually become more powerful.
We leave making our retrieval model more time-efficient to future investigation.
The rest of the paper is organized as follows. Section 2 presents the details of our document-context based retrieval model. Section 3 shows the results of the model-oriented experiments, which test the model extensively using one data collection, TREC-6. Section 4 shows the results of the scope-oriented experiments, which test the model across different data collections and with another language. Section 5 discusses related work. Finally, Section 6 concludes and describes future work.
Section snippets
Document-context based retrieval model
In this section, we introduce our document-context based retrieval model that ranks documents on the basis of the contexts of the query terms in documents (i.e., document contexts). A document context is uniquely identified by the location where the query term occurs in the document. Therefore, assigning different term weights to the same query term in different contexts can be thought of as assigning different term weights to the same query term in different locations in the document. Hence,
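The location-based view described above can be sketched, under our own assumptions (hypothetical names; any per-context scorer and aggregation operator can be plugged in): each query-term occurrence defines a context, each context receives a score, and the document score aggregates the per-location scores.

```python
# Hypothetical end-to-end sketch: rank a document by aggregating
# per-location context scores, one per query-term occurrence.
def rank_document(doc_tokens, query_terms, score_context,
                  half_window=5, aggregate=max):
    scores = []
    for pos, token in enumerate(doc_tokens):
        if token in query_terms:
            window = doc_tokens[max(0, pos - half_window):pos + half_window + 1]
            scores.append(score_context(window))
    # Documents with no query-term occurrence rank below all others.
    return aggregate(scores) if scores else float("-inf")
```

The same query term thus receives different weights at different locations, because each occurrence is scored through its own context window.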
Model-oriented experiments
In this section, we present the results of the model-oriented experiments, which extensively investigate the factors affecting the effectiveness of the model using the TREC-6 ad-hoc collection. This collection contains 556,077 English documents. We use the TREC-6 title (short) queries 301–350 in the experiments. Title queries are used because they have few (1–4) query terms, similar in length to web queries. All the terms in the documents and queries are stemmed using the Porter
Scope-oriented experiments
In this last set of experiments, we test the reliability of the proposed model by testing it on different data collections (the ad-hoc retrieval of TREC-2, TREC-6, TREC-7 and the robust-track retrieval of TREC-2005) and another language (Chinese NTCIR-5). Similar to the experiments in Section 3, title (short) queries in each of the collections are used, as they are commonly found in web search. The performance on TREC-6 has been evaluated in the previous section (Section 3) and the TREC-7
Related work
Vechtomova and Robertson (2000) presented a method of combining corpus-derived data on word co-occurrences with the probabilistic model of information retrieval. Significant collocates are selected using a window-based technique around the node terms. Mutual information (MI) scores and Z statistics were used to filter significant associated collocates. The notion of context in our proposed model is very similar to the window-based technique used by Vechtomova and Robertson. However, our model
Summary and future work
In summary, we proposed a novel hybrid document-context retrieval model which uses existing successful techniques to explore the effectiveness of incorporating term locations inside a document into our retrieval model. We used the log-odds of the well-known BIM retrieval model (Robertson & Sparck-Jones, 1976) as the starting point for deriving our document-context based model. We extended the existing probabilistic model from the document level to the document-context level, in
Acknowledgements
This work is supported by the CERG Project # PolyU 5183/03E. Robert thanks the Center for Intelligent Information Retrieval, University of Massachusetts (UMASS), for hosting him while he developed, in part, the basic IR system during his leave at UMASS.
References (62)
- A general class of fuzzy operators, the DeMorgan class of fuzzy operators and fuzziness measures induced by fuzzy operators. Fuzzy Sets and Systems (1982).
- Generalized means as model of compensative connectives. Fuzzy Sets and Systems (1984).
- On structuring probabilistic dependencies in stochastic language modeling. Computer Speech and Language (1994).
- A mathematical model of a weighted Boolean retrieval system. Information Processing & Management (1979).
- Fuzzy sets. Information and Control (1965).
- Bai, J., Song, D., Bruza, P., Nei, J. Y., & Cao, G. (2005). Query expansion using term relationships in language models...
- Bollmann, P., & Wong, S. K. M. (1987). Adaptive linear information retrieval models. In Proceedings of ACM SIGIR 1987...
- Bruza, P., & Song, D. (2003). A comparison of various approaches for using probabilistic dependencies in language...
- Buckley, C., & Harman, D. (2003). Reliable information access final workshop report. In Workshop for reliable...
- Modelling parsing constraints with high-dimensional semantic space. Language and Cognitive Processes (1997).
- Explorations in context space: words, sentences, discourse. Discourse Processes.
- Some inconsistencies and misidentified modeling assumptions in probabilistic information retrieval. ACM Transactions on Information Systems (TOIS).
- A unified mathematical definition of classical information retrieval. Journal of the American Society for Information Science.
- Extended Boolean models. In Information Retrieval: Data Structures & Algorithms.
- A probabilistic approach to automatic keyword indexing (part 1). Journal of the American Society for Information Science.
- Theory of probability.
- Probability: deductive and inductive problems. Mind.
- Efficient passage ranking for document databases. ACM Transactions on Information Systems (TOIS).
- Probabilistic relevance models based on document and query generation.