Partial-update dimensionality reduction for accumulating co-occurrence events
Introduction
In this paper, we address the novel task of learning similarity metrics among objects. In our problem, target similarities are not explicitly given. Instead, we observe a long sequence of co-occurrence events among objects, referred to as a stream of co-occurrence events. Given such a stream, the goal is to gradually accumulate the co-occurrence events and learn a similarity metric such that the similarity between two objects tends to be proportional to their co-occurrence rate.
A typical scenario for accumulating co-occurrence events is presented in Algorithm 1, where a search engine continuously processes user queries in an online manner. In this scenario, each co-occurrence event is defined for a single retrieval: two documents are considered co-occurrent if they are co-retrieved by the same query or co-located in the same set of top-retrieved documents.
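The Algorithm-1 scenario above can be sketched as follows. This is an illustrative toy, not the paper's implementation: the term-overlap ranker and all names (`top_k_docs`, `event_stream`, the sample documents) are our assumptions.

```python
# Hypothetical sketch: a search engine processes queries online and emits one
# co-occurrence event per retrieval, namely the set of top-k documents
# returned for that query. The scoring function is a toy term-overlap ranker.

def top_k_docs(query_terms, docs, k=3):
    """Rank documents by term overlap with the query; return the top-k ids."""
    ranked = sorted(docs, key=lambda d: -len(query_terms & docs[d]))
    return set(ranked[:k])

def event_stream(queries, docs, k=3):
    """Yield one co-occurrence event (a set of document ids) per query."""
    for q in queries:
        yield top_k_docs(q, docs, k)

docs = {
    "d1": {"neural", "ranking"},
    "d2": {"neural", "network"},
    "d3": {"sparse", "index"},
}
events = list(event_stream([{"neural"}, {"index"}], docs, k=2))
# each element of `events` is one co-occurrence event, e.g. {"d1", "d2"}
```

Every set yielded by `event_stream` is then fed to the accumulation step described next.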
An obvious way to accumulate co-occurrence events is simply to store all similarity values directly in an inter-object similarity matrix, where each entry holds the similarity value between the corresponding pair of objects. However, this object-to-object matrix becomes very high dimensional when the number of objects is large, making it intractable to store and manipulate.
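The naive baseline above can be made concrete with a short sketch. The count-based accumulation rule and the names here are illustrative assumptions; the point is that the number of stored entries grows quadratically with the number of objects, which is what the proposed method avoids.

```python
# Naive accumulation into an explicit inter-object similarity matrix.
# With n objects this needs O(n^2) entries in the worst case.

from collections import defaultdict
from itertools import combinations

def accumulate(events):
    """Store a similarity value for every pair of co-occurring objects."""
    sim = defaultdict(float)           # (i, j) -> accumulated co-occurrence
    for event in events:               # event: set of co-retrieved object ids
        for i, j in combinations(sorted(event), 2):
            sim[(i, j)] += 1.0         # one simple accumulation rule
    return sim

sim = accumulate([{"d1", "d2", "d3"}, {"d1", "d2"}])
# sim[("d1", "d2")] == 2.0; the other pairs accumulate 1.0
```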
To achieve better efficiency, we impose an extreme restriction on the available memory capacity, which is much smaller than that required to maintain a full inter-object similarity matrix. To this end, we propose a novel algorithm, called partial-update dimensionality reduction, that effectively approximates inter-object similarities. Instead of maintaining a large-scale inter-object matrix, our algorithm manages only low-dimensional latent vectors of objects and stores the target similarity between two objects indirectly as the inner product of their latent vectors. In our proposed method, we first define the target inter-object similarities obtained by accumulating co-occurrence events. To further restrict the memory capacity, we then propose a partial-update criterion to be minimized, which modifies only a small number of latent vectors, called focused latent vectors, that are relevant to the given co-occurrence event. Finally, we derive a fixed-point iteration that incrementally updates the set of focused latent vectors for each co-occurrence event.
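The partial-update idea can be sketched in a few lines. This is not the paper's fixed-point rule: the gradient-style update, the dimensionality, step size, and target value below are all our illustrative assumptions. The two points it demonstrates are that similarities are stored only implicitly as inner products, and that each event touches only the focused latent vectors of the objects it involves.

```python
# Sketch of partial-update dimensionality reduction (illustrative, not the
# paper's exact algorithm). Each object keeps a low-dimensional latent
# vector; a co-occurrence event updates only the vectors of its own objects.

import random
from itertools import combinations

DIM, STEP = 8, 0.05
random.seed(0)
vec = {}  # object id -> low-dimensional latent vector

def latent(obj):
    """Lazily create a small random latent vector for a new object."""
    if obj not in vec:
        vec[obj] = [random.gauss(0.0, 0.1) for _ in range(DIM)]
    return vec[obj]

def partial_update(event, target=1.0):
    """Nudge only the focused latent vectors so that inner products of
    co-occurring objects move toward the target similarity."""
    for i, j in combinations(event, 2):
        vi, vj = latent(i), latent(j)
        err = target - sum(a * b for a, b in zip(vi, vj))
        for d in range(DIM):       # gradient step on (v_i . v_j - target)^2
            vi[d], vj[d] = vi[d] + STEP * err * vj[d], vj[d] + STEP * err * vi[d]
```

After repeated events linking two objects, their inner product approaches the target, while the vectors of uninvolved objects are never touched.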
Experimental results on both synthetic data and realistic IR test collections show that the proposed algorithm gradually and incrementally learns the similarity metric from co-occurrence events, improving on the original similarity metric.
The organization of this paper is as follows. Section 2 reviews previous studies on learning inter-object similarity and discusses their weaknesses. Section 3 presents the proposed partial-update algorithm in detail, while Section 4 reports the experimental results. Finally, Section 5 provides our conclusions and future work.
Section snippets
Yu’s method
The work most relevant to our proposed algorithm is Yu’s method (Yu et al., 1985), an approximation method for adaptively learning similarities among objects. For each object, Yu’s method introduces a one-dimensional latent vector, called the latent position, which is randomly initialized before learning. For each co-occurrence event, Yu’s method performs a moving procedure on the latent positions as follows: given a co-occurrence event, the latent positions of the involved objects are all moved slightly
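A minimal sketch of this moving procedure follows. Because the snippet above is truncated, the direction of the move (toward the centroid of the involved objects) and the step size are our assumptions, and all names are illustrative.

```python
# Sketch of a Yu-style moving procedure on one-dimensional latent positions.
# Assumption (not confirmed by the truncated text): positions of
# co-occurring objects are pulled slightly toward their common centroid.

import random

random.seed(1)
position = {}  # object id -> one-dimensional latent position

def move(event, step=0.1):
    """Pull the latent positions of co-occurring objects toward their mean."""
    for obj in event:
        position.setdefault(obj, random.random())   # random initialization
    center = sum(position[o] for o in event) / len(event)
    for obj in event:
        position[obj] += step * (center - position[obj])
```

Repeatedly co-occurring objects thus drift together on the line, so their distance reflects their accumulated co-occurrence rate.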
Target inter-object similarities
To describe our algorithm, we first need to define the target inter-object similarities. Suppose that the target similarity between the ith object and the jth object is the value obtained after processing the total number N of co-occurrence events, and consider the Nth co-occurrence event together with the set of objects linked by that event. In the example scenario of Algorithm 1, these correspond to a specific query and the set of top-retrieved documents for the query,
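The snippet above truncates the formal definition. One natural instantiation, written with notation we introduce here purely for illustration (it is not necessarily the paper's), counts the events in which both objects appear:

```latex
% Illustrative notation: e_n is the n-th co-occurrence event and O(e_n) is
% the set of objects linked by that event.
s_{ij}^{(N)} \;=\; \sum_{n=1}^{N} \mathbb{1}\!\left[\, i \in O(e_n) \,\wedge\, j \in O(e_n) \,\right]
```

Under this reading, the target similarity is simply the accumulated co-occurrence count after N events; other accumulation rules (e.g. normalized or decayed counts) would fit the same framework.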
Experiment
In this section, we first compare Yu’s method with the proposed partial-update dimensionality reduction on synthetic data, and then evaluate the proposed method using real IR test collections. All experiments are based on the example scenario of Algorithm 1, where two documents are assumed to be co-occurrent if they are co-retrieved for the same query.
Conclusions and future work
This paper addresses the novel problem of incrementally learning inter-object similarities from a stream of co-occurrence events in a partial-update manner. We cast the learning problem by first assuming that each object has a low-dimensional latent vector and by approximately projecting the target similarities into the inner-product space of those latent vectors. We propose an effective fixed-point algorithm that incrementally updates the low-dimensional latent vectors based on a
References (41)
- et al., Adaptive document clustering based on query-based similarity, Information Processing and Management (2007)
- et al., Parsimonious translation models for information retrieval, Information Processing and Management (2007)
- Ando, R.K., Lee, L., 2001. Iterative residual rescaling. In: SIGIR ’01: Proceedings of the 24th Annual International...
- Bartell, B.T., Cottrell, G.W., Belew, R.K., 1992. Latent semantic indexing is an optimal special case of...
- et al., Representing documents using an explicit model of their similarities, Journal of the American Society for Information Science (1995)
- Large scale sparse singular value computations, International Journal of Supercomputer Applications (1992)
- et al., Using linear algebra for intelligent information retrieval, SIAM Review (1995)
- et al., Matrices, vector spaces, and information retrieval, SIAM Review (1999)
- et al., Algorithms and applications for approximate nonnegative matrix factorization, Computational Statistics and Data Analysis (2006)
- et al., Latent Dirichlet allocation, The Journal of Machine Learning Research (2003)