Introduction
In this paper, we address the novel task of learning query-based term-document weights (query-based weights for short). In our problem, the weight of a term in a document is not provided explicitly; it must be obtained indirectly from term-query and document-query relationships. Rather than being given the target query-based term-document weights, we observe a long sequence of term-document co-occurrence events, referred to as a stream of term-document co-occurrence events. Each co-occurrence event consists of a set of terms and a set of documents and serves as evidence of a tighter relationship between them. Given such a stream, the objective is to gradually accumulate the co-occurrence events and learn a query-based term weighting metric for documents such that the weight of a term in a document is proportional to their co-occurrence rate.
A typical scenario for accumulating term-document co-occurrence events is presented in Algorithm 1, where a search engine continuously processes user queries online. In this scenario, each term-document co-occurrence event is defined over a single retrieval: a term and a document are considered to have co-occurred if the document is retrieved among the top-ranked results for a query that includes the term.
Algorithm 1: Brief description of accumulating term-document co-occurrence events
input: m terms and n documents in the collection
1. Initialization: Wij ← 0 for 1 ⩽ i ⩽ m and 1 ⩽ j ⩽ n;
2. Querying: Query Q is provided by a user;
3. Retrieval: Obtain the set D of top retrieved documents for query Q;
4. Update term weights for the given D and Q:
   for ti ∈ Q do
       for dj ∈ D do
           Wij ← Wij + Δ
       end
   end
Iterate Steps 2–4 until learning is stopped.
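The accumulation loop of Algorithm 1 can be sketched as follows. The event representation and the increment Δ are illustrative, and a sparse dictionary stands in for the Wij matrix:

```python
from collections import defaultdict

def accumulate(events, delta=1.0):
    """Accumulate term-document co-occurrence events into weights W.

    Each event is a (query_terms, retrieved_docs) pair: the terms of a
    query Q and the top-ranked documents retrieved for Q.
    """
    W = defaultdict(float)              # (term, doc) -> weight, init 0
    for query_terms, retrieved_docs in events:
        for t in query_terms:           # for ti in Q
            for d in retrieved_docs:    # for dj in top retrieved docs
                W[(t, d)] += delta      # Wij <- Wij + delta
    return W

events = [({"cat", "food"}, {"doc1"}),
          ({"cat"}, {"doc1", "doc2"})]
W = accumulate(events)
# ("cat", "doc1") has co-occurred twice, so its weight is 2.0
```

Storing W sparsely already avoids materializing zero entries, but, as discussed next, even the nonzero part grows without bound as queries keep arriving.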
An obvious way to accumulate term-document co-occurrence events is simply to store all term weights directly in an m × n term-document matrix, where each entry Wij holds the weight between term ti and document dj. However, when the numbers of terms and documents are very large, this matrix becomes extremely high dimensional, making it costly to store and manipulate.
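A rough back-of-the-envelope comparison makes the storage gap concrete. The collection sizes and latent dimension below are hypothetical, chosen only for illustration:

```python
def dense_entries(m, n):
    # dense term-document matrix: one weight per (term, document) pair
    return m * n

def latent_entries(m, n, k):
    # k-dimensional latent vectors: one vector per term and per document
    return (m + n) * k

# illustrative sizes: 1M terms, 10M documents, latent dimension k = 100
m, n, k = 10**6, 10**7, 100
print(dense_entries(m, n))       # 10_000_000_000_000 entries
print(latent_entries(m, n, k))   # 1_100_000_000 entries
```

Under these assumed sizes, the dense matrix needs roughly four orders of magnitude more entries than the latent-vector representation, and the latent side grows only linearly in m + n.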
To achieve better retrieval efficiency, we propose a novel algorithm, called memory-restricted latent semantic analysis, that effectively approximates the target query-based term-document weights. Rather than maintaining a large-scale term-document matrix, our algorithm manages only low-dimensional latent vectors of terms and documents, and indirectly stores the target weight of a term in a document as the inner product of their latent vectors. In the proposed method, we first define the target query-based weights of terms in documents, obtained by accumulating co-occurrence events. To restrict memory usage further, we then propose a partial-update criterion to be minimized, which modifies only a small number of latent vectors, called focused latent vectors, that are relevant to a given term-document co-occurrence event. Finally, we derive a fixed-point iteration that incrementally updates the set of focused latent vectors for each co-occurrence event.
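The core idea can be sketched in a few lines: store Wij implicitly as the inner product of latent vectors and, for each event, touch only the focused latent vectors. The projection-style update below is a simplified stand-in for the paper's fixed-point iteration, and all names and constants are hypothetical:

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

class LatentWeights:
    """Sketch: Wij is never stored; it is read off as dot(u_i, v_j)."""

    def __init__(self, k, seed=0):
        self.k = k
        self._rnd = random.Random(seed)
        self.term = {}   # term -> latent vector u_i
        self.doc = {}    # doc  -> latent vector v_j

    def _vec(self, table, key):
        # lazily create a small random latent vector on first access
        if key not in table:
            table[key] = [self._rnd.uniform(-0.1, 0.1) for _ in range(self.k)]
        return table[key]

    def weight(self, t, d):
        # implicit Wij = inner product of the two latent vectors
        return dot(self._vec(self.term, t), self._vec(self.doc, d))

    def update(self, query_terms, retrieved_docs, delta=1.0):
        # one co-occurrence event: only the focused latent vectors
        # (those of terms in Q and of retrieved documents) are modified
        for t in query_terms:
            u = self._vec(self.term, t)
            for d in retrieved_docs:
                v = self._vec(self.doc, d)
                vv = dot(v, v) or 1e-12
                step = delta / vv        # raises dot(u, v) by exactly delta
                for i in range(self.k):
                    u[i] += step * v[i]
```

A usage example: after one event touching the pair ("cat", "doc1"), the implicit weight of that pair increases by Δ = 1, while all vectors of untouched terms and documents remain unchanged.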
Experimental results on small and large information retrieval (IR) test collections show that the proposed algorithm gradually and incrementally learns the target query-based term weighting metric from co-occurrence events, thereby improving the retrieval performance.