1 Introduction

The focus of the information retrieval (IR) field has shifted in recent years from text IR to speech IR. It is only natural that researchers from fields like history, arts or culture request comfortable and easy access to the large audio-visual archives available nowadays. Listening to every audio document is impossible. With the improving quality of automatic speech recognition (ASR) systems, the most frequent approach to this problem is to use ASR to transcribe the speech into text and then apply IR methods to search in the transcripts. To deal with query words that are not present in the searched documents, query expansion techniques are often used. One such method, widely used in the IR field, is relevance feedback. The idea behind this method is that the relevant documents retrieved in the first run of the search are used to enrich the query for a second run. In most cases, the retrieval system does not have feedback from the user and thus does not know which documents are relevant. The blind relevance feedback (BRF) method can then be used, where the system “blindly” selects some documents that it considers relevant and uses them to enrich the query.

This paper presents a comparison of the two most widely used IR methods and also the use of the BRF method. Experiments aimed at a better automatic selection of the relevant documents for the BRF method are presented. Our idea is to apply score normalization techniques originally used in the open-set text-independent speaker identification problem.

2 Information Retrieval Collection

All the experiments were performed on the spoken document retrieval collection used in the Czech task of the Cross-Language Speech Retrieval track organized within the CLEF 2007 evaluation campaign [1]. The collection contains automatically transcribed spontaneous interviews with Holocaust survivors (segmented by a fixed-size window into 22 581 “documents”) and two sets of TREC-like topics - 29 training and 42 evaluation topics. Each topic consists of 3 fields - <title> (T), <desc> (D) and <narr> (N). The training topic set was used for our experiments; the queries were created from all terms in the fields T, D and N, with stop words omitted. All terms were also lemmatized [2].

The mean Generalized Average Precision (mGAP) used in the CLEF 2007 Czech task was adopted as the evaluation measure. The measure (described in detail in [3]) is designed for evaluating retrieval performance on conversational speech data, where the topic shifts in the conversation are not separated as documents. The mGAP measure is based on evaluating the precision of finding the correct beginning of the relevant part of the data.

3 Information Retrieval System

In this paper, we wanted to compare the two most commonly used IR methods in the speech retrieval task. For our experiments, we have selected the vector space retrieval model and the language modeling approach with several smoothing variants. In our previous work, we experimented separately with the use of score normalization methods for blind relevance feedback in the vector space model [4] and in the language modeling environment - the basic query likelihood model [5, 6]. This paper presents experiments with more complex smoothing methods for language modeling - the Dirichlet prior smoothing method [7] and the Two-stage smoothing method presented by Zhai and Lafferty in [8].

3.1 Language Modeling

For the previous experiments [5, 6], the language modeling (LM) approach [9] was used as the information retrieval method, specifically the query likelihood model with a linear interpolation of the unigram language model of the document \(M_{d}\) with a unigram language model of the whole collection \(M_{c}\) (Jelinek-Mercer smoothing). The idea of this method is to create a language model from each document d and then, for each query q, to find the model which most likely generated that query, i.e. to rank the documents according to the probability P(d|q). The final ranking of the documents according to the query is:

$$\begin{aligned} P(d|q)\propto \prod _{t\in q}(\lambda P(t|M_{d}) + (1-\lambda )P(t|M_{c})), \end{aligned}$$
(1)

where t is a term in a query and \(\lambda \) is the interpolation parameter.
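As an illustration, the following minimal Python sketch scores one document under Eq. (1); the dictionaries doc_tf and coll_tf with raw term counts, the lengths doc_len and coll_len, and the toy data below are hypothetical, and the product is computed in the log domain to avoid numerical underflow.

```python
import math

def jm_score(query_terms, doc_tf, doc_len, coll_tf, coll_len, lam=0.1):
    """Query likelihood with Jelinek-Mercer smoothing (Eq. (1)), in the log domain."""
    score = 0.0
    for t in query_terms:
        p_doc = doc_tf.get(t, 0) / doc_len      # P(t|M_d), maximum likelihood estimate
        p_coll = coll_tf.get(t, 0) / coll_len   # P(t|M_c)
        p = lam * p_doc + (1 - lam) * p_coll
        if p > 0:                               # terms unseen in the whole collection are skipped
            score += math.log(p)
    return score

# Hypothetical toy example: rank two documents for the query "holocaust survivor".
coll_tf, coll_len = {"holocaust": 40, "survivor": 25, "camp": 60}, 10000
docs = {"d1": ({"holocaust": 3, "survivor": 1}, 120), "d2": ({"camp": 5}, 80)}
ranking = sorted(docs, reverse=True,
                 key=lambda d: jm_score(["holocaust", "survivor"], *docs[d], coll_tf, coll_len))
print(ranking)  # d1 should rank above d2
```

The documents are then ranked by this score in descending order.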

Dirichlet Prior Method. With the Dirichlet prior smoothing method, Eq. (1) changes to the form:

$$\begin{aligned} P(d|q)\propto \prod _{t\in q}\frac{tf_{t,d_{j}} + \alpha P(t|M_{c})}{L_{d_{j}} + \alpha }, \end{aligned}$$
(2)

where \(\alpha \) is the smoothing parameter, \(tf_{t,d_{j}}\) is the frequency of the term t in the document \(d_j\) and \(L_{d_{j}}\) is the length of the document \(d_j\).

Two-Stage Smoothing Method. The Two-stage smoothing method is a combination of the Dirichlet prior and Jelinek-Mercer smoothing methods. It is defined as:

$$\begin{aligned} P(d|q)\propto \prod _{t\in q}\lambda \frac{tf_{t,d_{j}} + \alpha P(t|M_{c})}{L_{d_{j}} + \alpha } + (1-\lambda )P(t|M_{U}), \end{aligned}$$
(3)

where \(P(t|M_{U})\) is a language model of the user's query environment.
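A corresponding sketch of the two smoothing variants might look as follows; p_coll and p_user are hypothetical callables returning non-zero collection-model and user-environment probabilities \(P(t|M_{c})\) and \(P(t|M_{U})\), and the default parameter values are the ones reported in Sect. 5.

```python
import math

def dirichlet_score(query_terms, doc_tf, doc_len, p_coll, alpha=10000):
    """Dirichlet prior smoothing (Eq. (2)), in the log domain."""
    return sum(math.log((doc_tf.get(t, 0) + alpha * p_coll(t)) / (doc_len + alpha))
               for t in query_terms)

def two_stage_score(query_terms, doc_tf, doc_len, p_coll, p_user, lam=0.99, alpha=5000):
    """Two-stage smoothing (Eq. (3)): the Dirichlet estimate is interpolated with P(t|M_U)."""
    score = 0.0
    for t in query_terms:
        p_dir = (doc_tf.get(t, 0) + alpha * p_coll(t)) / (doc_len + alpha)
        score += math.log(lam * p_dir + (1 - lam) * p_user(t))
    return score
```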

3.2 Vector Space Model

In the vector space model (VSM) [10], the document \(d_j\) and the query q are represented as vectors containing the importance weights of their terms:

$$ d_{j} = (w_{1,j},w_{2,j},...,w_{n,j}), \qquad q = (w_{1,q},w_{2,q},...,w_{n,q}) $$

For the \(w_{i,j}\) we have used the TF-IDF weighting scheme:

$$\begin{aligned} w_{i,j} = tf_{t_{i},d_{j}}\cdot idf_{t_{i}}, \qquad idf_{t_{i}} = \mathrm {log}\frac{N}{n_i}, \end{aligned}$$
(4)

where N is the total number of documents and \(n_i\) is the number of documents containing the term \(t_i\). The similarity of a document \(d_j\) and a query q is then computed using the cosine similarity of vectors:

$$\begin{aligned} sim_{d_{j},q} = \frac{d_{j}\cdot q}{\left\| d_{j}\right\| \left\| q\right\| } = \frac{\sum ^{n}_{i=1}w_{i,j}w_{i,q}}{\sqrt{\sum ^{n}_{i=1}w^{2}_{i,j}}\sqrt{\sum ^{n}_{i=1}w^{2}_{i,q}}}. \end{aligned}$$
(5)

The most similar documents are then considered to be the most relevant.
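A minimal sketch of the TF-IDF weighting (4) and the cosine similarity (5) on sparse vectors could look as follows; df is assumed to be a precomputed document-frequency dictionary and all names are illustrative.

```python
import math

def tfidf_vector(term_counts, df, n_docs):
    """TF-IDF weights (Eq. (4)) for a document or query given as a term -> count mapping."""
    return {t: tf * math.log(n_docs / df[t])
            for t, tf in term_counts.items() if df.get(t, 0) > 0}

def cosine_similarity(vec_d, vec_q):
    """Cosine similarity (Eq. (5)) of two sparse TF-IDF vectors."""
    dot = sum(w * vec_q.get(t, 0.0) for t, w in vec_d.items())
    norm_d = math.sqrt(sum(w * w for w in vec_d.values()))
    norm_q = math.sqrt(sum(w * w for w in vec_q.values()))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0
```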

3.3 Blind Relevance Feedback

Query expansion techniques based on the blind relevance feedback (BRF) method have been shown to improve the results of information retrieval [2]. The idea behind blind relevance feedback is that most of the top retrieved documents are relevant to the query, and the information contained in them can be used to enhance the query and obtain better retrieval results. First, the initial retrieval run is performed and the documents are ranked according to some similarity or likelihood function. Then the top N documents are selected as relevant, and the top k terms from them (according to some term importance weight \(L_t\), for example TF-IDF) are extracted and used to enhance the query. The second retrieval run is then performed with the expanded query.
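The whole procedure can be sketched as follows; term_weight is a hypothetical callable computing the importance weights of all terms in the pseudo-relevant documents (e.g. TF-IDF or the \(L_t\) weight defined below), and the default values of N and k are only illustrative.

```python
def blind_relevance_feedback(query_terms, ranked_docs, term_weight, n_docs=20, k_terms=30):
    """Standard BRF: take the top N documents of the initial run, extract the k terms
    with the highest importance weight from them and append them to the query."""
    pseudo_relevant = [doc for doc, _score in ranked_docs[:n_docs]]
    weights = term_weight(pseudo_relevant)   # {term: importance weight, e.g. TF-IDF or L_t}
    expansion = sorted(weights, key=weights.get, reverse=True)[:k_terms]
    return list(query_terms) + [t for t in expansion if t not in query_terms]
```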

In the standard approach to BRF, the number of documents and the number of terms are defined experimentally in advance and are the same for all queries. In our experiments, we would like to find the best setting of the standard BRF method and then compare it with the use of the score normalization methods.

BRF in Vector Space Model. First, for each document its similarity \(sim_{d_{j},q}\) is computed and the documents are sorted accordingly. For the selection of terms we have used the TF-IDF weight defined in (4).

BRF in Language Modeling. In the language modeling approach, the importance weight \(L_t\) defined in [9] was selected for weighting the terms for the BRF method, where R is the set of relevant documents:

$$\begin{aligned} L_t = \sum _{d \in R} \log {\frac{P(t|M_{d})}{P(t|M_{c})}}. \end{aligned}$$
(6)
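A direct transcription of this weight might look as follows; p_doc and p_coll are hypothetical callables returning the smoothed document-model and collection-model probabilities, so that the ratio is always defined.

```python
import math

def lm_term_weight(candidate_terms, relevant_docs, p_doc, p_coll):
    """Term importance L_t from Eq. (6): the sum over the relevant documents R of
    log(P(t|M_d) / P(t|M_c)); p_doc and p_coll must return smoothed, non-zero probabilities."""
    return {t: sum(math.log(p_doc(d, t) / p_coll(t)) for d in relevant_docs)
            for t in candidate_terms}
```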

4 Score Normalization Methods

In previous work, the score normalization methods were derived for the language modeling IR system [6] and for the vector space system [4]. In the following, the derivation process is summarized for the language modeling environment, since its principle is closest to the open-set text-independent speaker identification (OSTI-SI), and it is then shown how the normalization methods are used in the VSM system.

After the initial run, we have the ranked list of document likelihoods p(d|q). Similarly to OSTI-SI [11], we can define the decision formula:

$$\begin{aligned} p(d_R|q)> p(d_I|q) \rightarrow q \in d_R \quad \text{ else }\quad q \in d_I, \end{aligned}$$
(7)

where \(p(d_R|q)\) is the score given by the relevant document model \(d_R\) and \(p(d_I|q)\) is the score given by the irrelevant document model \(d_I\). By applying Bayes’ theorem, formula (7) can be rewritten as:

$$\begin{aligned} \frac{p(q|d_R)}{p(q|d_I)}> \frac{P(d_I)}{P(d_R)} \rightarrow q \in d_R \quad \text{ else } \quad q \in d_I, \end{aligned}$$
(8)

where \(l(q)=\frac{p(q|d_R)}{p(q|d_I)}\) is the normalized likelihood score and \(\theta =\frac{P(d_I)}{P(d_R)}\) is a threshold that has to be determined. Setting \(\theta \) a priori is a difficult task, since we do not know the prior probabilities \(P(d_I)\) and \(P(d_R)\). As in the OSTI-SI task, the document set can be open - a query belonging to a document not contained in our set can easily occur. A frequently used representation of the normalization process [11] can therefore be modified for the IR task:

$$\begin{aligned} L(q) = \log {p(q|d_R)}- \log {p(q|d_I)}, \end{aligned}$$
(9)

where \(p(q|d_R)\) is the score given by the relevant document and \(p(q|d_I)\) the score given by the irrelevant document. Since the normalization score \(\log {p(q|d_I)}\) of an irrelevant document is not known, there are several possibilities for approximating it:

World Model Normalization (WMN). The unknown model \(d_I\) can be approximated by the collection model \(M_{c}\), created as a language model from all documents in the retrieval collection. This technique was inspired by the World Model normalization [12]. The normalization score of a model \(d_I\) is then defined as:

$$\begin{aligned} \log p(q|d_I) = \log {p(q|M_{c})}. \end{aligned}$$
(10)

Unconstrained Cohort Normalization (UCN). For each document model, a set (cohort) of N similar models \(C = \left\{ d_{1},...,d_{N}\right\} \) is chosen [13]. These are the models most competitive with the document model, i.e. the models which yield the next N highest likelihood scores. The normalization score is given by:

$$\begin{aligned} \log p(q|d_I) = \log p(q|d_{UCN}) = \frac{1}{N}\sum _{n=1}^N \log p(q|d_n). \end{aligned}$$
(11)

Standardizing a Score Distribution. Another solution, called Test normalization (T-norm) [13], is to transform the score distribution into a standard form. Formula (9) then takes the form:

$$\begin{aligned} L(q) = (\log {p(q|d_R)} - \mu (q))/\sigma (q), \end{aligned}$$
(12)

where \(\mu (q)\) and \(\sigma (q)\) are the mean and the standard deviation of the score distribution for the query q.
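The three normalization variants can be sketched as follows for one query, given the list of log-likelihood scores of all documents. This is only an illustration, assuming that \(\mu (q)\) and \(\sigma (q)\) are estimated from the scores of all documents for the query and that the cohort of a document consists of the N documents ranked immediately after it.

```python
import statistics

def wmn(log_scores, log_p_world):
    """World Model Normalization (Eq. (10)): subtract the collection-model log-score."""
    return [s - log_p_world for s in log_scores]

def ucn(log_scores, n_cohort=50):
    """Unconstrained Cohort Normalization (Eq. (11)): subtract from each document's log-score
    the mean log-score of its cohort, i.e. the next N highest-scoring documents."""
    order = sorted(range(len(log_scores)), key=lambda i: -log_scores[i])
    rank_of = {doc: r for r, doc in enumerate(order)}
    normalized = []
    for i, s in enumerate(log_scores):
        start = rank_of[i] + 1
        cohort = [log_scores[j] for j in order[start:start + n_cohort]]
        normalized.append(s - statistics.mean(cohort) if cohort else s)
    return normalized

def t_norm(log_scores):
    """T-norm (Eq. (12)): standardize the score distribution of the query."""
    mu = statistics.mean(log_scores)
    sigma = statistics.pstdev(log_scores)
    return [(s - mu) / sigma if sigma > 0 else 0.0 for s in log_scores]
```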

4.1 Score Normalization in VSM

The likelihood p(d|q) in the normalization formula (9) can be replaced with the similarity \(sim_{d_{j},q}\), but since the likelihoods in (9) are logarithms of probabilities, the subtraction has to be replaced by a ratio:

$$\begin{aligned} l(q) = sim_{d_R,q} / sim_{d_I,q}. \end{aligned}$$
(13)

The actual score normalization methods can then also be rewritten. We have performed our experiments with the UCN and T-norm methods, since they are easily transformed for use in the VSM system. The WMN, on the other hand, requires replacing the “world” model, defined by the collection model \(M_{c}\), with some equivalent in the vector space. The UCN method can be rewritten as:

$$\begin{aligned} sim_{d_I,q} = \frac{1}{N}\sum _{n=1}^N sim_{d_n,q}, \end{aligned}$$
(14)

and the T-norm method will now have the form:

$$\begin{aligned} l(q) = (sim_{d_R,q} - \mu (q)) / \sigma (q). \end{aligned}$$
(15)

Threshold Selection. Even with the scores normalized, we still have to set the threshold for verifying the relevance of each document in the list. Selecting a threshold that separates the relevant from the irrelevant documents in a list of normalized scores is more robust, because the normalization removes the influence of the various query characteristics. Since in former experiments the threshold was successfully defined as a percentage of the normalized score of the best-scoring document, the threshold \(\theta \) is again defined as the ratio k of the best normalized score.
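The selection of the pseudo-relevant documents can then be sketched as follows; the ratio k = 0.6 is only an illustrative value, and the sketch assumes that the best normalized score is positive, otherwise the comparison would have to be adapted.

```python
def select_pseudo_relevant(doc_ids, normalized_scores, k=0.6):
    """Keep every document whose normalized score reaches the threshold
    theta = k * (best normalized score); the number of selected documents
    therefore differs from query to query."""
    theta = k * max(normalized_scores)
    return [d for d, s in zip(doc_ids, normalized_scores) if s >= theta]
```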

5 Experiments

First, we performed experiments with the settings (smoothing parameters) of each presented method to find the best one. Then thorough experiments with the standard blind relevance feedback method (the selection of the number of documents and the number of terms) were performed for each presented method. We found the best parameter settings and selected them as our baseline. Finally, detailed experiments with the score normalization methods were performed.

5.1 IR Methods

According to the experiments, the best value of the parameter \(\lambda \) in Jelinek-Mercer smoothing was \(\lambda =0.1\). The best value of the Dirichlet prior smoothing parameter was \(\alpha =10000\), and for the Two-stage smoothing method the parameters achieving the best results were \(\lambda =0.99\) and \(\alpha =5000\).

Number of Documents for Standard BRF. We experimented with selecting 5, 10, 20, 30, 40, 50 and 100 documents. For all methods except Jelinek-Mercer (JM) smoothing, it seems that the best results are achieved with a higher number of documents. We have found, however, that the number of documents and the number of terms to select depend on each other, so with 40 terms and 100 documents the result of JM is almost the same as the one presented in the results table.

Number of Terms. We have experimented with the number of terms to select for all the methods described in this paper. The number of terms ranged from 5 to 45 in steps of 5 (5, 10, 15, ...). For all methods, the experiments show that the best results are achieved with a moderate number of selected terms - around 30.

Table 1. IR results for all methods (mGAP score) for no blind relevance feedback, with standard BRF and BRF with score normalization.

Score Normalization. In the score normalization methods, the number of documents selected for BRF depends on the threshold \(\theta \), defined as the ratio k of the best normalized score. The final number of selected documents is therefore different for each query. Experiments with different ratio settings (from 0.1 to 0.95 in steps of 0.05) were performed for all the presented methods. In the UCN method, apart from the ratio k, the size of the cohort C has to be set. Experiments with cohort sizes from 5 to 800 in steps of 10 were performed. The ratio k and the cohort size depend on each other directly, because the normalization score in (11) is bigger for a smaller cohort (it is an average of the higher likelihoods).

The final comparison of the vector space model (VSM) and the language modeling methods with Jelinek-Mercer (JM), Dirichlet prior (DP) and Two-stage (TS) smoothing can be seen in Table 1. As the table shows, in all cases the BRF methods achieved a better score than retrieval without BRF. All the score normalization methods achieved a better mGAP score than the standard BRF, except the WMN method with TS smoothing. For all IR methods, the best score was achieved by the UCN score normalization method.

6 Conclusions

We have compared the most widely used information retrieval methods in the speech retrieval environment. We have also compared these methods with the use of the standard blind relevance feedback method. For the standard BRF method, extensive experiments were performed to find the best possible setting, in order to allow a further comparison with the use of the score normalization methods. In all cases, the results were better with the BRF method than without it, and they were also better when the score normalization methods were used to select the documents for BRF than with the standard blind relevance feedback. The Two-stage smoothing method also seems to be the best method for incorporating the blind relevance feedback, since its results show the biggest improvement between retrieval without and with BRF.