Pseudo-relevance feedback has a history in information retrieval dating back thirty years (Croft and Harper 1979). Despite this legacy, it remains perhaps one of the least understood concepts in modern information retrieval. Anecdotally, practitioners dismiss the strategy as ineffective, citing query drift or inappropriate expansion terms. However, this attitude is challenged every year as new SIGIR and TREC publications demonstrate the effectiveness of pseudo-relevance feedback.

In ‘A Generative Theory of Relevance’, Victor Lavrenko analyzes in depth both the theory and the effectiveness of pseudo-relevance feedback. The theoretical analysis, grounded in probabilistic modeling, draws connections to classic approaches such as Robertson’s Probability Ranking Principle as well as recent approaches such as topic modeling. The experiments exploring the empirical performance of pseudo-relevance feedback are rigorous, covering ad hoc and cross-lingual retrieval tasks, topic detection and tracking, and semi-structured data analysis. The breadth and thoroughness of the theoretical and experimental discussions make this book an essential read for both the information retrieval theoretician and the practitioner.

To be fair, Lavrenko does not present this material as a study of pseudo-relevance feedback in general. In fact, I do not think I read the phrase once in the entire book. Instead, Lavrenko presents and evaluates a very specific retrieval framework: relevance modeling. Nevertheless, at its core, relevance modeling is a very clever and effective query expansion technique.

The first part of this book deals with the theoretical foundation of Lavrenko’s framework. Queries and documents are defined in a purposefully abstract way. A corpus is a collection of retrievable items, be they text documents or images; a query is an expression of the user’s information need, be it a keyword or an example document. In a particular retrieval setting, we may observe text documents in our collection or keyword queries from the user; however, the framework supports arbitrary media and languages. The basis of Lavrenko’s framework is a corpus of latent documents. These latent documents can be thought of as extremely detailed versions of their observed counterparts (i.e. those documents we are interested in ranking). Whereas an observed document may consist of several paragraphs, a latent document consists of much more: text, images, videos, translations, and other information. In this framework, both observed documents and queries are deterministic corruptions of latent documents. For example, a multimedia latent document may be transformed into an observed image document by dropping the text components of the representation; the same multimedia latent document may be transformed into a text query by dropping the non-text components and much of the remaining text. It is worth noting that nothing about the framework requires that the latent space be well represented in the set of observable documents: the latent space may consist of millions of documents while the observed collection consists of only several thousand.
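As a toy illustration of this setup, consider the following sketch. The field names and corruption functions are invented here for concreteness; they do not come from the book, whose treatment is purely probabilistic.

    # A latent document carries every representation at once.
    latent = {
        'text': 'a tabby cat sleeping on a windowsill in the afternoon sun',
        'image_features': [0.12, 0.87, 0.45],  # stand-in for visual features
        'translation_fr': 'un chat tigre dormant sur un rebord de fenetre',
    }

    def corrupt_to_image_document(latent):
        # An observed image document: every non-visual component is dropped.
        return {'image_features': latent['image_features']}

    def corrupt_to_text_query(latent, keep=3):
        # An observed query: non-text components and most of the text are dropped.
        return latent['text'].split()[:keep]

Both corruptions are deterministic; all of the probabilistic machinery lives in the generative process assumed to have produced the latent documents themselves.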

Relevance, in Lavrenko’s framework, is grounded in judgments on observed documents, and a user’s information need is represented by the set of latent documents that produced the observed relevant documents. A relevance model is defined as the ‘point estimate of the generative process for the relevant population’. In other words, given observed relevance data (e.g. a keyword query), a relevance model can be thought of as the mean latent document relevant to the query. Because Lavrenko assumes that a generative process produced the latent documents, he can import methods from language modeling to determine the relevance of these documents. In particular, the normalized query likelihood scores of the latent documents are used as the weights when constructing the mean latent document. How does one score unobserved documents? Well, one (trivial) latent document space could merely consist of the observed documents themselves. However, we can also be more creative in our definition of this space. For example, assume that our latent space consists of an extremely large corpus of text documents and our observed documents are shorter news articles. Given a text query, we will have a very rich relevance model, potentially estimated from millions of documents. Or consider the example where our latent space consists of a corpus of images with long captions and our observed documents are bitmap images. Given a text query, we will have a relevance model over both the text representations from the captions and the visual representations from the images. Lavrenko presents a number of well-motivated metrics for ranking observed documents given a relevance model.
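To make the estimation concrete, here is a minimal bag-of-words sketch of this construction; the function names and smoothing choice are mine rather than the book’s, and a real relevance model involves considerably more care.

    from collections import Counter

    def query_likelihood(query, doc_tf, coll_tf, coll_len, mu=2000):
        # P(query | document) under a Dirichlet-smoothed unigram model.
        doc_len = sum(doc_tf.values())
        score = 1.0
        for term in query:
            p_coll = coll_tf.get(term, 0.5) / coll_len  # crude unseen-term mass
            score *= (doc_tf.get(term, 0) + mu * p_coll) / (doc_len + mu)
        return score

    def relevance_model(query, docs, coll_tf, coll_len):
        # The 'mean latent document': each document's term distribution,
        # weighted by its normalized query likelihood.
        weights = [query_likelihood(query, d, coll_tf, coll_len) for d in docs]
        total = sum(weights) or 1.0
        rm = Counter()
        for d, w in zip(docs, weights):
            doc_len = sum(d.values()) or 1
            for term, tf in d.items():
                rm[term] += (w / total) * (tf / doc_len)
        return rm  # a distribution over terms, summing to one

Ranking observed documents against this distribution (e.g. by cross-entropy) then recovers the query expansion behavior described above.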

Conceptually, there are similarities between this framework and Latent Semantic Analysis: both assume documents embedded in some latent space. However, Lavrenko’s framework assumes that documents come from a higher-dimensional space rather than a lower-dimensional one. This distinction is nicely discussed in Chapter 4, where a very convincing argument is made against using dimensionality reduction techniques for information retrieval tasks. Dimensionality reduction techniques (e.g. topic modeling, LSI) throw away information in order to generalize. Information retrieval, especially ad hoc information retrieval, requires that a system handle queries at arbitrary granularities; committing to a single granularity is therefore potentially dangerous. In fact, this behavior was noticed in early TREC runs (Dumais 1995). Lavrenko’s framework, instead of making parametric assumptions about the structure of the data, takes a nonparametric approach, essentially using nearest-neighbor estimates in the latent space. We know that the error of a nonparametric estimate vanishes as the number of sample points (documents, in this case) grows; the error of a parametric estimate, in contrast, will only be small for those topics which the model captures well. Because information retrieval tasks often consider huge spaces of documents, nonparametric methods are therefore more appealing.
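Concretely, the estimate in question has the general shape of a kernel density estimate with one kernel per document. Reconstructing the notation loosely (this is my paraphrase, not a quotation), for query terms $w_1, \ldots, w_k$ and a collection $C$:

    p(w_1, \ldots, w_k) \;=\; \frac{1}{|C|} \sum_{d \in C} \prod_{i=1}^{k} p(w_i \mid d)

A parametric topic model would replace the sum over the $|C|$ documents with a sum over a fixed, much smaller number of topics; the nonparametric form above instead grows in fidelity as the collection grows.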

The second half of the book is devoted to an empirical evaluation of Lavrenko’s framework on tasks including ad hoc retrieval, cross-lingual retrieval, handwriting retrieval, image retrieval, video retrieval, structured search, and topic detection and tracking. For each of these tasks, Lavrenko carefully describes its relationship to the more general framework and demonstrates the effectiveness of his approach. Further evidence of this effectiveness can be found in the large body of publications that have used variants of Lavrenko’s framework. By the end of the book, the reader is comfortable enough with the techniques to apply them to new domains.

Despite its many strengths, Lavrenko’s work has a few shortcomings. First, although he has essentially written a monograph about pseudo-relevance feedback, Lavrenko rarely discusses his approach within this context; readers must make the connection on their own. Second, Lavrenko seems so set on making a theoretical argument that he is not explicit enough about the core principles behind his framework. These principles may be lost on a reader with a weaker mathematical background. Third, although Lavrenko’s method essentially reduces to a nonparametric estimation task, the book only discusses the nonparametric estimation literature in passing. Why is the chosen kernel appropriate? What are the implications of using a single bandwidth instead of the adaptive bandwidths found in many nonparametric methods? Finally, although Lavrenko spends some time providing theoretical arguments against topic modeling and its relatives, there is no empirical comparison. This slightly weakens his case.
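To make the bandwidth question concrete, consider a toy one-dimensional contrast between a fixed-bandwidth kernel density estimate and an adaptive one. This is generic kernel density estimation written for illustration, not code from the book.

    import math

    def gaussian(u):
        return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

    def kde_fixed(x, samples, h=0.5):
        # One bandwidth h shared by every kernel, as with a single
        # global bandwidth.
        return sum(gaussian((x - s) / h) for s in samples) / (len(samples) * h)

    def kde_adaptive(x, samples, k=2):
        # Each sample receives its own bandwidth: the distance to its
        # k-th nearest neighbor, so kernels widen in sparse regions.
        total = 0.0
        for i, s in enumerate(samples):
            dists = sorted(abs(s - t) for j, t in enumerate(samples) if j != i)
            h = max(dists[min(k, len(dists)) - 1], 1e-9) if dists else 1.0
            total += gaussian((x - s) / h) / h
        return total / len(samples)

With a fixed bandwidth, dense regions of the space are over-smoothed and sparse regions under-smoothed; adaptive schemes trade extra computation for a better fit in both, which is why the question deserves more than a passing mention.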

At the end of the day, Lavrenko makes nice theoretical and empirical contributions to the state of the art. But what design principles should an information retrieval researcher, regardless of theoretical approach, take away from this book? There are two. First, when performing pseudo-relevance feedback, weight feedback documents by their initial retrieval score. It is surprising that, in the thirty years of pseudo-relevance feedback work preceding this book, no one seems to have tried this. Second, when performing cross-domain retrieval, scores of parallel documents in the source domain allow you to perform pseudo-relevance feedback in the target domain. This observation is extremely powerful because it allows one to avoid training a translation system for cross-lingual retrieval.
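A minimal sketch of the first principle follows; the helper names are hypothetical, and any first-pass retrieval function returning scored term lists could stand in for the retrieve argument.

    from collections import Counter

    def prf_expansion_terms(query, retrieve, k=10, n_terms=5):
        # Pseudo-relevance feedback that weights each feedback document
        # by its initial retrieval score rather than treating the top-k
        # documents uniformly.
        ranked = retrieve(query)[:k]  # [(list_of_terms, score), ...]
        total = sum(score for _, score in ranked) or 1.0
        weighted = Counter()
        for terms, score in ranked:
            doc_len = len(terms) or 1
            for term, tf in Counter(terms).items():
                weighted[term] += (score / total) * (tf / doc_len)
        # Keep the heaviest expansion terms, excluding the original query.
        return [t for t, _ in weighted.most_common() if t not in query][:n_terms]

The expanded query is then simply the original terms plus these weighted expansion terms; the score weighting is what distinguishes this from uniform top-k feedback.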

Pseudo-relevance feedback is still an active area of research. Extensions of Lavrenko’s work include estimation from large, non-target corpora (Diaz and Metzler 2006), multi-word expansion (Metzler and Croft 2007), and performance prediction (Cronen-Townsend et al. 2002). More generally, pseudo-relevance feedback research has focused both on theoretical aspects (Diaz 2008) as well as empirical performance (Collins-Thompson 2008). ‘A Generative Theory of Relevance’ presents a very nice, balanced introduction to the area.