Global term weights in distributed environments

https://doi.org/10.1016/j.ipm.2007.09.003

Abstract

This paper examines the estimation of global term weights (such as IDF) in information retrieval scenarios where a global view on the collection is not available. In particular, the two options of either sampling documents or of using a reference corpus independent of the target retrieval collection are compared using standard IR test collections. In addition, the possibility of pruning term lists based on frequency is evaluated.

The results show that very good retrieval performance can be reached when just the most frequent terms of a collection – an “extended stop word list” – are known and all terms which are not in that list are treated equally. However, the list cannot always be fully estimated from a general-purpose reference corpus; some “domain-specific stop words” need to be added. A good way of achieving this is to mix estimates from small samples of the target retrieval collection with ones derived from a reference corpus.

Introduction

In information retrieval applications, there are certain scenarios where a global view on the document collection is not available. One example of such a scenario is an adaptive filtering or routing task where a user’s long-term query is matched against a stream of incoming documents that are initially unknown.

Another important scenario lacking a global view is that of distributed IR where documents are stored and indexed in a distributed fashion. In classical distributed retrieval (Callan, 2000), results delivered by different databases need to be merged by a central broker, which may not have a global view on the collection.

A related, more strictly distributed scenario is that of peer-to-peer information retrieval (P2PIR) where each peer is connected to a number of other peers and routes incoming queries to one or more of its neighbours. Results are merged at the peer where the query was issued; this peer almost never has a global view on the collection. Because of the rapidly growing interest in P2PIR, this field is the main focus of the present work.

What challenges does the lack of global information introduce for information retrieval systems? In most of today’s text retrieval systems, some form of term weighting is performed that captures the importance of terms w.r.t. a given query or document. Term weighting often consists of a local and a global component. The local component measures the extent to which a term is relevant w.r.t. a query or document and is often related to its frequency within that query or document. The global component measures the amount of information that the term conveys in general and is often inversely proportional to its overall frequency – e.g. IDF, i.e. inverse document frequency.

The lack of global information about a document collection is hence a problem for the global component of weighting schemes: in order to compute IDF, one needs to know all the documents in the collection.
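To make the dependence on the whole collection concrete, the following minimal sketch computes IDF from document frequencies. The form log(N / df) is a common textbook variant and is assumed here; the excerpt does not state which variant the paper uses. The toy documents and the function names are illustrative only.

```python
import math
from collections import Counter

def document_frequencies(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # each document counts at most once per term
    return df

def idf(term, df, n_docs):
    """Plain IDF, log(N / df); assumes the term occurs in at least one document."""
    return math.log(n_docs / df[term])

# Toy collection (made up for illustration): every document must be seen
# before N and the document frequencies are known.
docs = [
    ["peer", "network", "query"],
    ["peer", "routing"],
    ["peer", "query", "index"],
    ["peer", "index"],
]
df = document_frequencies(docs)
idf_peer = idf("peer", df, len(docs))        # term in every document
idf_routing = idf("routing", df, len(docs))  # term in a single document
```

A term that occurs in every document receives weight zero, while a term confined to a single document receives the largest weight. Both values change whenever documents that the node has not seen are added.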

In this work, we examine the two major solutions to this problem: either to compute global term weights from a sample of the whole collection or to use a reference corpus, i.e. a large collection of documents that contains a representative sample of language. It is not within the scope of this paper to discuss what “representative” means exactly here, but the impact of the choice of a reference corpus is examined later. In the latter case, term weights are computed once and for all from the reference corpus and then used for retrieval with completely different target collections (henceforth called “retrieval collections”) as if they were valid globally.
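The two options can be sketched with a single estimator that is fed different document sets. This is an illustration under stated assumptions, not the paper's procedure: the additive smoothing (which keeps terms unseen in the estimation set at a finite weight), the sample size, and both toy corpora are made up.

```python
import math
import random
from collections import Counter

def estimate_idf(docs, smoothing=0.5):
    """Return a function term -> IDF estimated from `docs` alone.

    The additive smoothing is an assumption of this sketch; it gives
    terms that never occur in `docs` a finite, maximal weight.
    """
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    def idf(term):
        return math.log((n + 2 * smoothing) / (df[term] + smoothing))

    return idf

# Hypothetical retrieval collection about P2P search.
retrieval_collection = [
    ["the", "peer", "routes", "query"],
    ["the", "peer", "index"],
    ["the", "peer", "network"],
    ["a", "query", "stream"],
]
# Hypothetical general-language reference corpus.
reference_corpus = [
    ["the", "weather", "is", "fine"],
    ["the", "market", "fell"],
    ["a", "peer", "reviewed", "study"],
]

rng = random.Random(0)
sample = rng.sample(retrieval_collection, k=2)  # option 1: sample the collection

idf_sample = estimate_idf(sample)
idf_reference = estimate_idf(reference_corpus)  # option 2: reference corpus
idf_full = estimate_idf(retrieval_collection)   # global view, for comparison
```

In this toy data, "peer" is as frequent as "the" in the retrieval collection, so the full-collection estimate weights them equally. The reference corpus, where "peer" is rare, assigns it a higher weight than "the".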

One advantage of reference corpora is that estimates derived from them never need to be updated, even when the retrieval collection changes. Another advantage is that they allow for trivial results merging in a distributed setting: since global term weights never change, scores for documents remain comparable throughout the whole distributed system.
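Why shared weights make merging trivial can be illustrated as follows. In this sketch every node scores its local documents with the same weight table, so the merging node only has to interleave the ranked lists by score. The weights, the default for unlisted terms, the tf·idf sum used for scoring, and the node contents are all assumptions made for illustration.

```python
import heapq

# Made-up global weights, identical on every node (e.g. from a reference corpus).
SHARED_IDF = {"peer": 1.2, "query": 0.7, "index": 2.0}
DEFAULT_IDF = 2.5  # assumed weight for terms missing from the table

def score(query, doc_tokens):
    """Sum of tf * idf over the query terms; deliberately simple."""
    return sum(doc_tokens.count(t) * SHARED_IDF.get(t, DEFAULT_IDF) for t in query)

def local_results(query, docs):
    """Rank one node's documents, best first, as (score, doc_id) pairs."""
    scored = [(score(query, tokens), doc_id) for doc_id, tokens in docs.items()]
    return sorted(scored, reverse=True)

def merge(result_lists, k):
    """Interleave per-node lists (each sorted best first) into a global top-k."""
    return list(heapq.merge(*result_lists, reverse=True))[:k]

# Two hypothetical nodes holding different documents.
node_a = {"a1": ["peer", "query"], "a2": ["index"]}
node_b = {"b1": ["peer", "peer", "index"], "b2": ["query"]}

query = ["peer", "index"]
top = merge([local_results(query, node_a), local_results(query, node_b)], k=3)
top_ids = [doc_id for _, doc_id in top]
```

No score normalisation step is needed here because every score was computed with the same weights. If each node used its own locally estimated weights, the raw scores would not be directly comparable.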

This is also the case when using what is called “centralised sampling” in this paper, e.g. collecting samples of the retrieval collection on a distinguished node of a P2P network. However, this involves costly updates and message overhead when distributing new weight estimates to all peers. With “distributed sampling” on the other hand – e.g. each peer samples documents from its own neighbourhood – this problem does not arise, but results merging is much more complicated, because global term weights depend on the location where they are computed.

The rest of this paper is organised as follows: Section 2 reviews related work on the estimation of global term weights, and Section 3 details the mathematical background needed for this estimation task. In Section 4, the results of experiments in ad hoc retrieval are presented and discussed before Section 5 concludes.

Section snippets

Related work

Depending on the application, various ways of estimating global term weights have been developed. Estimating global term weights from just a sample of the collection was initially used with dynamic collections, where the collection changed so quickly that recomputing global term weights seemed too costly. In Viles and French (1995b), the effect of not updating IDF when adding new documents to a collection was studied and it was found that doing so did not degrade effectiveness seriously at

Weighting schemes

Generally, in term weighting, rare terms are treated as informative. Hence, all variants of global term weights make use of term frequency information. There are two sorts of frequency commonly used: document frequency (DF), which denotes the number of documents within a collection that are indexed by the term, and collection frequency (CF) which refers to the overall number of a term’s occurrences within the collection.
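The difference between the two frequency types can be seen in a short sketch. The toy documents and the function name are illustrative assumptions.

```python
from collections import Counter

def df_and_cf(docs):
    """Return (document frequency, collection frequency) counters."""
    df, cf = Counter(), Counter()
    for tokens in docs:
        cf.update(tokens)       # every occurrence counts
        df.update(set(tokens))  # each document counts once per term
    return df, cf

docs = [
    ["peer", "peer", "query"],
    ["peer", "index"],
    ["query", "query", "query"],
]
df, cf = df_and_cf(docs)
```

Here "peer" and "query" both occur in two documents, so their DF is equal, but "query" has the higher CF because it is repeated three times within one document.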

In this paper, we consider one representative example of each of those two

Setup

The evaluation was done using IR test collections of various sizes and degrees of topical homogeneity. The basic approach consists in first estimating global term weights either from a reference corpus or from a sample of the respective test collection and then using these for ad hoc retrieval runs. These runs were then compared to runs that used weights computed from the full test collection.

Retrieval runs are evaluated using the relevance judgments provided with each collection and mean average precision (MAP) as a performance measure. The

Conclusions

In this paper, the estimation of global term weights from a reference corpus in a distributed IR setting was compared to using a sample of the retrieval collection or the whole collection itself.

The results showed that a general-language reference corpus (such as the BNC) may fail to identify words that appear very frequently in the domain of the retrieval collection, but not in general language. A domain-adapted reference corpus – if available – can help to avoid this problem.
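The idea from the abstract, a pruned list of frequent terms estimated by mixing a reference corpus with a small sample of the retrieval collection, can be sketched as follows. The mixing rule (a linear interpolation of relative document frequencies), the negative-log weight, and the choice of default weight for unlisted terms are assumptions of this sketch; the excerpt does not specify how the paper combines the estimates.

```python
import math
from collections import Counter

def relative_df(docs):
    """Fraction of documents in `docs` that contain each term."""
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return {term: count / n for term, count in df.items()}

def pruned_weights(reference_docs, sample_docs, list_size, mix=0.5):
    """Build an extended stop word list with weights, plus one default weight.

    Assumes 0 < mix < 1 and list_size >= 1. Terms outside the list all
    share `default`, here set to the weight of the rarest listed term.
    """
    ref = relative_df(reference_docs)
    smp = relative_df(sample_docs)
    mixed = {
        t: mix * smp.get(t, 0.0) + (1 - mix) * ref.get(t, 0.0)
        for t in set(ref) | set(smp)
    }
    # Keep only the most frequent terms; ties are broken alphabetically.
    frequent = sorted(mixed, key=lambda t: (-mixed[t], t))[:list_size]
    weights = {t: -math.log(mixed[t]) for t in frequent}
    default = max(weights.values())
    return weights, default

# Hypothetical general-language reference corpus and domain sample.
reference_docs = [["the", "weather"], ["the", "market"], ["a", "study"]]
sample_docs = [["the", "peer", "query"], ["the", "peer", "index"]]

weights, default = pruned_weights(reference_docs, sample_docs, list_size=2)
```

In this toy data the reference corpus never contains "peer", so a list built from it alone would miss that term. Mixing in the small sample places "peer" on the list next to "the", and every other term receives the same default weight.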

Sampling is

Hans Friedrich Witschel was born on December 14, 1977, in Freiburg, Germany. He completed his diploma in computer science at the University of Leipzig in March 2004, and his thesis was on “terminology extraction”. During the period April 2004 to September 2005, he worked as Scientific Assistant in a project on peer-to-peer information retrieval. In October 2005, he joined the University of Leipzig as a Ph.D. student and his thesis was on peer-to-peer information retrieval.

References (26)

  • C. Tang et al. pSearch: Information retrieval in structured overlays. ACM SIGCOMM Computer Communication Review (2003)
  • S. Boneh et al. Estimating the prediction function and the number of unseen species in sampling with replacement. Journal of the American Statistical Association (1998)
  • Brants, T., & Chen, F. (2003). A system for new event detection. In Proceedings of SIGIR’03 (pp....
  • Callan, J. (1996). Document filtering with inference networks. In Proceedings of SIGIR’96 (pp....
  • J. Callan. Distributed information retrieval
  • Carmel, D., Cohen, D., Fagin, R., Farchi, E., Herscovici, M., Maarek, Y. S. (2001). Static index pruning for...
  • Chowdhury, A. R. (2001). On the design of reliable efficient information systems. Ph.D. thesis, Illinois Institute of...
  • Cuenca-Acuna, F. M., Peery, C., Martin, R. P., & Nguyen, T. D. (2003). PlanetP: Using gossiping to build content...
  • B. Efron et al. Estimating the number of unseen species: How many words did Shakespeare know? Biometrika (1976)
  • French, J. C., Powell, A. L., Callan, J., Viles, C. L., Emmitt, T., Prey, K. J., et al. (1999). Comparing the...
  • W. Gale et al. Good-Turing frequency estimation without tears. Journal of Quantitative Linguistics (1995)
  • Karbhari, P., Ammar, M., Dhamdhere, A., Raj, H., Riley, G., & Zegura, E. (2004). Bootstrapping in Gnutella: A...
  • Klemm, F., & Aberer, K. (2005). Aggregation of a term vocabulary for peer-to-peer information retrieval: A DHT stress...

Cited by (10)

  • A hybrid approach for estimating document frequencies in unstructured P2P networks. Information Systems (2011)

    Citation excerpt: However, this approach is clearly not appropriate for large-scale systems. In [33], the authors examine the estimation of global term weights (such as the document frequency of a term, i.e. the number of documents a term occurs in for a given collection) in information retrieval scenarios where a global view of the collection is not available. Two alternatives are studied: either sampling documents or using a reference corpus independent of the target retrieval collection.

  • The aboutness of words. Journal of the Association for Information Science and Technology (2017)
  • Estimating global statistics for unstructured P2P search in the presence of adversarial peers. SIGIR 2014 - Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval (2014)
  • Decentralized probabilistic text clustering. IEEE Transactions on Knowledge and Data Engineering (2012)
  • Relevance weighting using within-document term statistics. International Conference on Information and Knowledge Management, Proceedings (2011)