Pattern Recognition Letters

Volume 36, 15 January 2014, Pages 62-73
Partial-update dimensionality reduction for accumulating co-occurrence events

https://doi.org/10.1016/j.patrec.2013.08.032

Highlights

  • The input of our task is given by a long sequence of co-occurrence events.

  • The goal of our task is to learn similarity metrics given the input stream.

  • The target similarity between two objects is proportional to their co-occurrence rate.

  • We propose a dimensionality reduction that approximates target inter-object similarities.

  • Experimental results show that our algorithm gradually learns the target similarities.

Abstract

This paper addresses a novel problem in learning similarities. In our problem, an input is given by a long sequence of co-occurrence events among objects, namely a stream of co-occurrence events. Given a stream of co-occurrence events, we learn unknown latent vectors of objects such that their inner product adaptively approximates the target similarities resulting from accumulating co-occurrence events. Toward this end, we propose a new incremental algorithm for dimensionality reduction. The core of our algorithm is its partial updating style, where only a small number of latent vectors are modified for each co-occurrence event, while most other latent vectors remain unchanged. Experimental results using both synthetic and real data sets demonstrate that, in contrast to some existing methods, the proposed algorithm can stably and gradually learn target similarities among objects without being trapped by the collapsing problem.

Introduction

In this paper, we address the novel task of learning similarity metrics among objects. In our problem, target similarities are not explicitly stated. Instead, we have a long sequence of co-occurrence events among objects, referred to as a stream of co-occurrence events. Given a stream of co-occurrence events, the goal of the problem is to gradually accumulate the co-occurrence events and learn a similarity metric such that the similarity value between two objects is likely to be proportional to their co-occurrence rate.

A typical scenario for accumulating co-occurrence events is presented in Algorithm 1, where a search engine continuously processes user queries in an online manner. In this scenario, each co-occurrence event is defined for a single retrieval. Two documents are considered co-occurrent if they are co-retrieved by the same query or they are co-located in a set of top-retrieved documents.
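The scenario above can be sketched as a generator that emits one co-occurrence event per processed query. This is an illustrative sketch only; the function names and the `retrieve` callback are hypothetical, and the paper's Algorithm 1 is not reproduced here.

```python
# Illustrative sketch (hypothetical names): one co-occurrence event per query,
# linking the top-k retrieved documents, which are then considered co-occurrent.
def cooccurrence_stream(queries, retrieve, k=5):
    """Yield, for each query, the set of co-retrieved document ids."""
    for q in queries:
        top_docs = retrieve(q)[:k]   # ids of the k top-retrieved documents
        yield set(top_docs)          # every pair in this set co-occurs once
```

A search engine would call this once per incoming query, so the stream can grow without bound while each event stays small.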

An obvious way of accumulating co-occurrence events is simply to store all similarity values directly in an n×n inter-object similarity matrix, where each entry is assigned a similarity value sim_ij between the two objects. However, when the number of objects is very large, this matrix becomes prohibitively expensive to store and manipulate.
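For concreteness, the naive baseline can be sketched as follows; it is not part of the proposed method, and the raw-count accumulation shown is an assumption about how events would be tallied.

```python
import numpy as np

# Naive baseline sketch: accumulate every co-occurrence event directly into
# an n x n similarity matrix. Memory cost grows as O(n^2), which is exactly
# what the proposed method avoids.
def accumulate(events, n):
    sim = np.zeros((n, n))
    for docs in events:               # docs: set of co-occurring object indices
        for i in docs:
            for j in docs:
                if i != j:
                    sim[i, j] += 1.0  # raw co-occurrence count (assumption)
    return sim
```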

To achieve better efficiency, we want to impose an extreme restriction on the available memory capacity, which is much smaller than that required for maintaining a full inter-object similarity matrix. To achieve this goal, we propose a novel algorithm called partial-update dimensionality reduction that effectively approximates inter-object similarities. Instead of maintaining a large-scale inter-object matrix, our algorithm manages only low-dimensional latent vectors of objects and indirectly stores the target similarity between two objects as the inner product of their latent vectors. In our proposed method, we first define the target inter-object similarities obtained by accumulating co-occurrence events. To further restrict the memory capacity, we then propose a partial update criterion to be minimized, which modifies only a small number of latent vectors, called focused latent vectors, that are relevant to a given co-occurrence event. Finally, we obtain a fixed-point iteration that incrementally updates the set of focused latent vectors for each co-occurrence event.
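The partial-update idea can be sketched as below. This is a hedged gradient-style illustration, not the paper's actual fixed-point iteration; the learning rate and the per-event target increment of 1 are assumptions for the example.

```python
import numpy as np

# Hedged sketch of the partial-update style: each object keeps a d-dimensional
# latent vector, and for each event only the "focused" vectors of the objects
# involved are nudged so that their inner products move toward the target;
# all other latent vectors remain untouched.
def partial_update(Z, event, lr=0.1):
    """Z: (n, d) latent matrix; event: set of object indices; updates in place."""
    focused = sorted(event)
    for i in focused:
        for j in focused:
            if i == j:
                continue
            err = 1.0 - Z[i] @ Z[j]   # residual toward an assumed target of 1
            Z[i] += lr * err * Z[j]   # only focused vectors are modified
    return Z
```

Note how the cost per event depends only on the (small) number of focused objects, not on the total number of objects n.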

Experimental results with both synthetic data and realistic IR test collections show that the proposed algorithm gradually and incrementally learns the similarity metric from co-occurrence events, which helps to improve the original similarity metric.

The organization of this paper is as follows. Section 2 reviews previous studies on the learning of inter-object similarity and discusses their weaknesses. Section 3 presents the proposed partial updating algorithm in detail, while Section 4 presents the experimental results. Finally, Section 5 provides our conclusions and future work.

Section snippets

Yu’s method

The most relevant work to our proposed algorithm is Yu’s method (Yu et al., 1985). Yu proposed an approximation method for adaptively learning similarities among objects. For each object, Yu’s method introduced a one-dimensional latent vector, called the latent position, which is randomly initialized before learning. For each co-occurrence event, Yu’s method performs the moving procedure on latent positions as follows: given a co-occurrence event, their latent positions are all moved slightly
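The moving procedure described above can be sketched as follows. This is a hedged illustration of a Yu-style update, assuming (as the snippet suggests) one-dimensional latent positions pulled slightly together on each event; the step size and the use of the group mean are assumptions.

```python
# Hedged sketch of a Yu-style moving procedure: each object has a 1-D latent
# position, randomly initialized before learning; on each co-occurrence event
# the positions of the involved objects are all moved slightly toward their
# common center (assumed here to be the group mean).
def yu_move(pos, event, step=0.05):
    center = sum(pos[i] for i in event) / len(event)
    for i in event:
        pos[i] += step * (center - pos[i])   # small move toward the center
    return pos
```

Repeated events between the same objects shrink their distance toward zero, which is how such one-dimensional schemes can suffer from the collapsing problem mentioned in the abstract.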

Target inter-object similarities

To describe our algorithm, we first need to define target inter-object similarities. Suppose that sim_ij^N is the target similarity between the ith object and the jth object obtained after processing the total number N of co-occurrence events. Let q_N be the Nth co-occurrence event and F(q_N) be the set of objects that are linked with co-occurrence event q_N. In an example scenario for Algorithm 1, q_N and F(q_N) correspond to a specific query and the set of top-retrieved documents for the query,
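Under the definitions above, one simple reading of sim_ij^N is a running co-occurrence count over the first N events. This sketch assumes exactly that unweighted-count interpretation, which the truncated snippet does not fully confirm.

```python
from collections import Counter
from itertools import combinations

# Hedged sketch: target similarity sim_ij^N as a running count of how often
# objects i and j appear together in F(q_1), ..., F(q_N).
def target_similarities(events):
    sim = Counter()
    for linked in events:                       # linked corresponds to F(q_N)
        for i, j in combinations(sorted(linked), 2):
            sim[(i, j)] += 1                    # one increment per joint event
    return sim
```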

Experiment

In this section, we first compare Yu’s method with the proposed partial-update dimensionality reduction in a synthetic experiment and then we evaluate the proposed method using real IR test collections. All experiments are based on our example scenario in Algorithm 1, where objects are assumed to be co-occurrent if two documents are co-retrieved for a query.

Conclusions and future work

This paper addresses the novel problem of incrementally learning inter-object similarities, given a stream of co-occurrence events, in a partial-update manner. We cast the learning problem by first assuming that each object has a low-dimensional latent vector and by approximately projecting target similarities into the inner-product space among the latent vectors of objects. We propose an effective fixed-point algorithm that incrementally updates the low-dimensional latent vectors based on a

References (41)

  • S.-H. Na et al.

    Adaptive document clustering based on query-based similarity

    Information Processing and Management

    (2007)
  • S.-H. Na et al.

    Parsimonious translation models for information retrieval

    Information Processing and Management

    (2007)
  • Ando, R.K., Lee, L., 2001. Iterative residual rescaling. In: SIGIR ’01: Proceedings of the 24th Annual International...
  • Bartell, B.T., Cottrell, G.W., Belew, R.K., 1992. Latent semantic indexing is an optimal special case of...
  • B.T. Bartell et al.

    Representing documents using an explicit model of their similarities

    Journal of the American Society for Information Science

    (1995)
  • M.W. Berry

    Large scale sparse singular value computations

    International Journal of Supercomputer Applications

    (1992)
  • M.W. Berry et al.

    Using linear algebra for intelligent information retrieval

    SIAM Review

    (1995)
  • M.W. Berry et al.

    Matrices, vector spaces, and information retrieval

    SIAM Review

    (1999)
  • M.W. Berry et al.

    Algorithms and applications for approximate nonnegative matrix factorization

    Computational Statistics and Data Analysis

    (2006)
  • D.M. Blei et al.

    Latent Dirichlet allocation

    The Journal of Machine Learning Research

    (2003)
  • Brand, M., 2002. Incremental singular value decomposition of uncertain data with missing values. In: Proceedings of the...
  • Brauen, T., 1971. Document vector modifications. In: Salton, G. (Ed.), The SMART Retrieval System – Experiments in...
  • J. Dean et al.

    MapReduce: simplified data processing on large clusters

    Communications of the ACM

    (2008)
  • S. Deerwester et al.

    Indexing by latent semantic analysis

    Journal of the American Society for Information Science

    (1990)
  • Dumais, S.T., 1992. LSI meets TREC: a status report. In: Proceedings of the 1st Text REtrieval Conference, TREC-1, pp....
  • Dumais, S.T., Furnas, G.W., Landauer, T.K., Deerwester, S., Harshman, R., 1988. Using latent semantic analysis to...
  • P. Hall et al.

    Merging and splitting eigenspace models

    IEEE Transactions on Pattern Analysis and Machine Intelligence

    (2000)
  • Hiemstra, D., Robertson, S., Zaragoza, H., 2004. Parsimonious language models for information retrieval. In:...
  • Hofmann, T., 1999. Probabilistic latent semantic indexing. In: Proceedings of the 22nd Annual International ACM SIGIR...
  • Hofmann, T., 2003. Collaborative filtering via Gaussian probabilistic latent semantic analysis. In: Proceedings of the...