Pattern Recognition Letters

Volume 33, Issue 12, 1 September 2012, Pages 1623-1631

Memory-restricted latent semantic analysis to accumulate term-document co-occurrence events

https://doi.org/10.1016/j.patrec.2012.05.002

Abstract

This paper addresses a novel adaptive problem of obtaining a new type of term-document weight. In our problem, an input is given by a long sequence of co-occurrence events between terms and documents, namely, a stream of term-document co-occurrence events. Given a stream of term-document co-occurrences, we learn unknown latent vectors of terms and documents such that their inner product adaptively approximates the target query-based term-document weights resulting from accumulating co-occurrence events. To this end, we propose a new incremental dimensionality reduction algorithm for adaptively learning a latent semantic index of terms and documents over a collection. The core of our algorithm is its partial updating style, where only a small number of latent vectors are modified for each term-document co-occurrence, while most other latent vectors remain unchanged. Experimental results on small and large standard test collections demonstrate that the proposed algorithm can stably learn the latent semantic index of terms and documents, showing an improvement in the retrieval performance over the baseline method.

Highlights

► The input of our task is a long sequence of term-document co-occurrence events.
► The goal of our task is to learn term-document weights given the input stream.
► The weight between a term and a document is proportional to their co-occurrence rate.
► We propose a dimensionality reduction that approximates the target term-document weights.
► Experimental results show that our algorithm gradually learns the target weights.

Introduction

In this paper, we address the novel task of learning query-based term-document weights, often referred to as query-based weights. In our problem, the weight between a term and a document is not provided explicitly; it is obtained indirectly from term-query and document-query weights. Instead of explicitly stating the target query-based term-document weights, we are given a long sequence of term-document co-occurrence events, referred to as a stream of term-document co-occurrence events. Each co-occurrence event is described by a set of terms and a set of documents, and serves as evidence of a tighter relationship between them. Given such a stream, the objective is to gradually accumulate the co-occurrence events and learn a query-based term weighting metric for documents such that the weight of a term in a document is proportional to their co-occurrence rate.

A typical scenario for accumulating term-document co-occurrence events is presented in Algorithm 1, where a search engine continuously processes user queries online. In this scenario, each term-document co-occurrence event is defined for a single retrieval: a term and a document are considered to have co-occurred if the document is retrieved in the top-ranked results by a query that includes the term.

Algorithm 1: Brief description of accumulating term-document co-occurrences

  input: m terms and n documents in collection C

  1. Initialization: W_ij = 0 for 1 ≤ i ≤ m and 1 ≤ j ≤ n;
  2. Querying: query Q is provided by a user;
  3. Retrieval: obtain the top retrieved documents F for query Q;
  4. Update: update the term weights for the given F and Q:
       for t_i ∈ Q do
         for d_j ∈ F do
           W_ij ← W_ij + Δ
         end
       end
  Iterate Steps 2-4 until learning is stopped.
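To make the loop concrete, the following is a minimal Python sketch of Algorithm 1. The retrieval function passed in as retrieve_top_docs and the increment DELTA are placeholders for illustration, not details specified by the paper.

    # Minimal sketch of Algorithm 1: naive accumulation of term-document
    # co-occurrence events in a dense m-by-n matrix W.
    import numpy as np

    m, n = 1000, 5000          # numbers of terms and documents in collection C
    DELTA = 1.0                # weight increment per co-occurrence (assumed value)

    W = np.zeros((m, n))       # Step 1: initialization, W_ij = 0

    def process_query(query_terms, retrieve_top_docs):
        """Steps 2-4: one querying/retrieval/update iteration."""
        top_docs = retrieve_top_docs(query_terms)   # Step 3: top-ranked documents F
        for i in query_terms:                       # for t_i in Q
            for j in top_docs:                      # for d_j in F
                W[i, j] += DELTA                    # W_ij <- W_ij + Delta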

An obvious way to accumulate term-document co-occurrence events is simply to store all weight values directly in an m × n term-document matrix, where each entry W_ij holds the accumulated weight between the ith term and the jth document. However, when the numbers of terms and documents are very large, this matrix becomes extremely high dimensional and is intractable to store and manipulate directly.
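A rough back-of-the-envelope comparison illustrates the problem. The term count of 10^5 below is a hypothetical figure, while 158,240 is the size of the AP sub-collection used in the experiments later in the paper:

    # Dense matrix vs. latent vectors, assuming 4-byte float entries and
    # (hypothetically) 100-dimensional latent vectors.
    m, n = 10**5, 158_240
    dense_bytes = m * n * 4                 # full W: ~63 GB
    latent_bytes = (m + n) * 100 * 4        # latent vectors: ~103 MB
    print(f"dense W: {dense_bytes / 1e9:.1f} GB; latent vectors: {latent_bytes / 1e6:.1f} MB")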

To achieve better retrieval efficiency, we propose a novel algorithm called memory-restricted latent semantic analysis that effectively approximates the target query-based term-document weights. Instead of maintaining a large-scale term-document matrix, our algorithm manages only low-dimensional latent vectors of terms and documents, and encodes the target weight of a term in a document as the inner product of their latent vectors. In the proposed method, we first define the target query-based weights of terms in documents, obtained by accumulating co-occurrence events. To restrict the memory capacity further, we then introduce a partial-update criterion to be minimized, under which only a small number of latent vectors, called focused latent vectors, are modified for a given term-document co-occurrence event. Finally, we derive a fixed-point iteration that incrementally updates the set of focused latent vectors for each co-occurrence event, as sketched below.
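The snippets shown here do not include the exact fixed-point update, so the following Python sketch substitutes a single stochastic-gradient step per event to illustrate the partial-update idea: only the focused latent vectors, i.e., those of terms in the query and documents in the retrieved set, are modified, while all other vectors stay fixed. The learning rate and the random initialization are assumptions, not the authors' settings.

    # Illustration of memory-restricted partial updates (a gradient-step
    # stand-in for the paper's fixed-point iteration, not the exact rule).
    import numpy as np

    m, n, k = 1000, 5000, 100                # terms, documents, latent dimension
    rng = np.random.default_rng(0)
    U = rng.normal(scale=0.01, size=(m, k))  # term latent vectors
    V = rng.normal(scale=0.01, size=(n, k))  # document latent vectors

    def partial_update(terms, docs, delta=1.0, lr=0.05):
        """Process one co-occurrence event (T^N, F^N): push each focused
        inner product U[i] @ V[j] upward; all other vectors are untouched."""
        for i in terms:
            for j in docs:
                u_old = U[i].copy()
                U[i] += lr * delta * V[j]    # gradient of delta*(u . v) w.r.t. u
                V[j] += lr * delta * u_old   # gradient of delta*(u . v) w.r.t. v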

Experimental results on small and large information retrieval (IR) test collections show that the proposed algorithm gradually and incrementally learns the target query-based term weighting metric from co-occurrence events, thereby improving the retrieval performance.

Section snippets

Related work

Dimensionality reduction has found extensive application in diverse areas, such as information retrieval (Dumais et al., 1988, Deerwester et al., 1990, Dumais, 1992, Bartell et al., 1992, Bartell et al., 1995, Berry et al., 1995, Hofmann, 1999, Xu et al., 2003, Wei and Croft, 2006, Wang et al., 2011), computer vision (Levy and Lindenbaum, 2000, Brand, 2002), collaborative filtering (Hofmann, 1999, Hofmann, 2003, Koren et al., 2009), and data mining. Existing works have investigated singular

Target query-based term-document matrix

To describe our algorithm, we first need to define the target query-based term-document matrix. Suppose that $a_{ij}^{N}$ is the target query-based weight of the $i$th term in the $j$th document obtained after processing a total of $N$ term-document co-occurrence events. Let $q^{N}$ be the $N$th co-occurrence event, and let $T(q^{N})$ and $F(q^{N})$ be the sets of terms and documents, respectively, that appear in the co-occurrence event $q^{N}$. For convenience, we also write $T(q^{N})$ and $F(q^{N})$ as $T^{N}$ and $F^{N}$, respectively.

Now,
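The snippet breaks off at this point. Consistent with Algorithm 1, one natural reconstruction of the accumulation (not necessarily the paper's exact equation) is

$$a_{ij}^{N} = a_{ij}^{N-1} + \Delta \cdot \mathbb{1}\!\left[t_i \in T^{N}\right] \cdot \mathbb{1}\!\left[d_j \in F^{N}\right], \qquad a_{ij}^{0} = 0,$$

so that $a_{ij}^{N}$ grows in proportion to the number of events in which $t_i$ and $d_j$ co-occur.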

Test set

The proposed algorithm is evaluated on one small test collection, the MED dataset, and on a sub-collection of the TIPSTER dataset in TREC, named AP (i.e., Associated Press), which consists of 158,240 documents.

Automatic generation of queries

One critical problem we encountered when evaluating our algorithms was that each test collection contained only a small number of queries. To obtain a sufficient number of real queries, we generated queries automatically for each collection. Our procedure for generating queries was:

  • 1.

Conclusion

In this paper, we addressed the novel problem of incrementally learning query-based term-document weights, given a stream of term-document co-occurrence events, in a memory-restricted manner. We formulated the learning problem by first assuming that each term or document has a low-dimensional latent vector and by approximately projecting the target term-document weights into the inner-product space of the latent vectors of terms and documents. We proposed an effective fixed-point algorithm that

Acknowledgement

The work of the second author was supported by IT Consilience Creative Program of MKE and NIPA (C1515-1121-0003).

References (39)

  • Bartell, B.T., Cottrell, G.W., Belew, R.K., 1992. Latent semantic indexing is an optimal special case of...
  • B.T. Bartell et al.

    Representing documents using an explicit model of their similarities

    J. Amer. Soc. Inform. Sci.

    (1995)
  • M.W. Berry

    Large scale sparse singular value computations

    Int. J. Supercomput. Appl.

    (1992)
  • M.W. Berry et al.

    Using linear algebra for intelligent information retrieval

    SIAM Rev.

    (1995)
  • M.W. Berry et al.

    Matrices, vector spaces, and information retrieval

    SIAM Rev.

    (1999)
  • M.W. Berry et al.

    Algorithms and applications for approximate nonnegative matrix factorization

    Comput. Statist. Data Anal.

    (2006)
  • D.M. Blei et al.

    Latent Dirichlet allocation

    J. Machine Learn. Res.

    (2003)
  • Brand, M., 2002. Incremental singular value decomposition of uncertain data with missing values. In: Proc. 7th European...
  • J. Dean et al.

MapReduce: Simplified data processing on large clusters

    Comm. ACM

    (2008)
  • S. Deerwester et al.

    Indexing by latent semantic analysis

    J. Amer. Soc. Inform. Sci.

    (1990)
  • Dumais, S.T., 1992. LSI meets TREC: A status report. In: Proc. 1st Text REtrieval Conf., TREC-1, pp....
  • Dumais, S.T., Furnas, G.W., Landauer, T.K., Deerwester, S., Harshman, R., 1988. Using latent semantic analysis to improve...
  • Hiemstra, D., Robertson, S., Zaragoza, H., 2004. Parsimonious language models for information retrieval. In: Proc. 27th...
  • Hofmann, T., 1999. Probabilistic latent semantic indexing. In: Proc. 22nd Annual Internat. ACM SIGIR Conf. on Research...
  • Hofmann, T., 2003. Collaborative filtering via Gaussian probabilistic latent semantic analysis. In: Proc. 26th Annual...
  • N. Jardine et al.

    The use of hierarchic clustering in information retrieval

    Inform. Storage Retriev.

    (1971)
  • Y. Koren et al.

    Matrix factorization techniques for recommender systems

    Computer

    (2009)
  • Kurland, O., Lee, L., 2004. Corpus structure, language models, and ad hoc information retrieval. In: Proc. 27th Annual...
  • Lavrenko, V., Croft, W.B., 2001. Relevance based language models. In: Proc. 24th Annual International ACM SIGIR Conf....