Support vector machines: relevance feedback and information retrieval

https://doi.org/10.1016/S0306-4573(01)00037-1

Abstract

We compare support vector machines (SVMs) to the Rocchio, Ide regular, and Ide dec-hi algorithms in information retrieval (IR) of text documents using relevance feedback. It is assumed that a preliminary search finds a set of documents that the user marks as relevant or not, after which feedback iterations commence. Particular attention is paid to IR searches where the number of relevant documents in the database is low and the preliminary set of documents used to start the search has few relevant documents. Experiments show that if inverse document frequency (IDF) weighting is not used, because one is unwilling to pay the time penalty needed to obtain these features, then SVMs are better whether one uses term-frequency (TF) or binary weighting. SVM performance is marginally better than Ide dec-hi if TF-IDF weighting is used and a reasonable number of relevant documents is found in the preliminary search. If the preliminary search is so poor that one has to search through many documents to find even one relevant document, then SVM is preferred.

Introduction

Our problem is that of relevancy feedback within the context of information retrieval (IR). There is a set of documents that a user wants to retrieve from a database. Some of the articles are relevant, some not. It is important to understand that relevancy is relative to the perception of the user; that is, document Dj may be relevant to user Uk but not to user Up.

The user is assumed to present a preliminary (or initial) query to the system, in which case the system returns a set of ranked documents that the user examines. Although many documents may be retrieved by the system, the system presents only one screen of documents at a time. In our case, we assume that ten documents are returned on the initial screen with enough information for the user to gauge whether a document is relevant or not. An optimal preliminary query would return only relevant documents, and all the user has to do is scroll through the screens to find them all. In actuality, depending on the quality of the initial query, many documents may be returned but few may be relevant. The initial query may be a Boolean query, such as conjunctions or disjunctions of keywords, or it could be a sophisticated query in the form of a question.

In our case, we ignore the exact nature of the preliminary query and assume that the return of documents from this initial query is poor (three or fewer relevant documents from the full screen of ten documents). However, our technique will work no matter how many documents are returned from the initial query. We believe that a hard test of an IR system with relevancy feedback occurs when the number of relevant documents returned in the initial query is low. We assume that if the documents returned in the initial screen are all relevant, then the user will just scroll to the next screen, while if no relevant documents are returned, the user continues to scroll through the screens until at least one relevant document appears on a screen, and then the first feedback iteration begins. Thus if there are between one and nine relevant documents returned on the initial screen, the user marks the relevant documents (unmarked documents are taken as non-relevant), the system goes through a first feedback iteration, and another set of the top ten ranked documents is returned. These feedback iterations continue until the user terminates the procedure.

We first concentrate on the case where between one and nine (inclusive) relevant documents are returned in the initial screen. Our method is based on the use of support vector machines (SVMs) (Drucker, Wu, & Vapnik, 1999; Joachims, 1998; Vapnik, 1998), with comparisons to other IR relevancy feedback techniques: Rocchio (1971), Ide regular, and Ide dec-hi (Salton & Buckley, 1990; Harman, 1992). These algorithms will be examined in detail later, but suffice it to say for now that all except Ide dec-hi use all the relevant and non-relevant documents on the first screen, while Ide dec-hi uses all the relevant documents and only the top-ranked non-relevant document.

Recall that we are paying particular attention to the case where the initial retrieval is poor. As anyone who has done IR or web searches will attest, it is rather discouraging to be told that a search has found thousands of documents when in fact most of the documents on the first screen (the highest-ranked documents) are not relevant to the user. Our typical user is hypothesized as preferring to mark the top ten returned documents as relevant or not and going through a series of feedback iterations, rather than examining many screens to mark all the relevant documents.

Summarizing: in the initial preliminary search we obtain either (1) no relevant documents, (2) one to nine relevant documents, or (3) all relevant documents. In case (1), we are forced to go to succeeding screens until we get one screen with at least one relevant document; all the documents on that last screen and the previous screens are used in the first feedback iteration. In case (2), we mark the relevant documents on the first screen (documents left unmarked on the first screen are non-relevant) and go through feedback iterations. In case (3), we go to the next screen. We will concentrate on the situation where the number of relevant documents returned on the first screen is low (three or fewer) and could be zero. Please distinguish between the preliminary query, which returns the first set of documents, and the first feedback iteration, which starts with that initial set of documents marked by the user.

After the first feedback iteration and on all subsequent iterations we will examine only the first screen no matter how many of the returned documents are relevant (even if none). We then track performance as a function of feedback iteration.
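To make this protocol concrete, the following is a minimal sketch of the loop just described. It is not from the paper: the ranking function `rank` (standing in for Rocchio, Ide, or SVM ranking) and the user-judgment function `mark` are hypothetical placeholders, and documents are assumed to be hashable identifiers.

```python
SCREEN = 10  # documents shown per screen

def feedback_loop(all_docs, preliminary_ranking, rank, mark, iterations=10):
    seen, relevant, nonrelevant = set(), [], []

    def judge(screen):
        # The user marks each document on the screen as relevant or not.
        seen.update(screen)
        for doc, is_rel in zip(screen, mark(screen)):
            (relevant if is_rel else nonrelevant).append(doc)

    # Preliminary phase: scroll one screen at a time until at least one
    # relevant document has been marked.
    for start in range(0, len(preliminary_ranking), SCREEN):
        judge(preliminary_ranking[start:start + SCREEN])
        if relevant:
            break

    # Feedback phase: on every iteration only the top screen is examined,
    # no matter how many of the returned documents are relevant.
    for _ in range(iterations):
        unseen = [d for d in all_docs if d not in seen]
        judge(rank(relevant, nonrelevant, unseen)[:SCREEN])

    return relevant
```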

In Section 1.2 we discuss SVMs, the Rocchio algorithm, and the Ide algorithms. Section 2 discusses the various options for the terms in the document vectors, namely binary term weights, term frequency, and inverse document frequency. We discuss the concepts of stemming and the use of stop lists in Section 3. In Section 4 we describe the performance metrics of precision, recall, and coverage ratio and argue that coverage ratio is the best metric for comparing performance. In Section 5 we describe the test set, and in Section 6 we describe our experiments using a random set of relevant and non-relevant documents. In Section 7, however, we describe a set of experiments where the preliminary documents are determined from a keyword search. Finally, we reach our conclusions in Section 8.

One difference between our study and others is the simultaneous tracking of performance as a function of feedback iteration and the use of SVMs. Although there have been many studies of the use of SVMs in text retrieval (Drucker et al., 1999; Joachims, 1998; Vapnik, 1982, 1998), most studies emphasize finding the method that optimizes performance after one feedback iteration.

SVMs have been studied in the context of the IR filtering problem (Dumais, Platt, Heckerman, & Sahami, 1998; Joachims, 1998). Both relevancy feedback and filtering are classification problems in that documents (in our case) are assigned to one of two classes (relevant or not). However, in the filtering situation we usually have a marked set of documents, termed the training set, and use that set to train a classifier; performance is then judged on a separate test set. In a sense, filtering could be considered relevancy feedback with only one feedback iteration. The problems with using filtering rather than many iterations of relevancy feedback are that (1) one has to mark “many” documents in the training set to obtain reasonable classification rates on the test set; (2) how many is “many” depends on the problem and is not known in advance; and (3) since which documents are to be considered relevant is user dependent, every user must construct a different training set. Multiple iterations of feedback could be considered an attempt to maximize performance on a test set that includes all documents except the ones already marked. In that sense, relevancy feedback is similar to what is termed active learning (Schohn & Cohen, 2000; Tong & Koller, 2000), in that we try to maximize test performance using the smallest number of documents in the training set. However, the important difference between relevancy feedback and active learning is that active learning may force the user to mark many more non-relevant documents than IR feedback does, and our supposition is that the user wants to see the maximum number of relevant documents at each feedback iteration. Additionally, in active learning we are interested in maximizing performance on the entire test set; in our case we are interested in maximizing performance on the next ten documents retrieved.

IR and relevancy feedback have a long history. In the Rocchio (1971) algorithm formulation we have a set of documents, each document represented by a vector Dj whose size is the size of the vocabulary of words in all the documents after pruning some words. An element of this vector indicates some property of a particular word that occurs in the article – zero if it does not appear, “1” if it appears and we are using binary features, and the number of occurrences of that word in the article if we are using term-frequency (TF) weighting. The preliminary query (not necessarily returned by a Rocchio feedback iteration) contains N total documents. If the preliminary search realizes between one and nine relevant documents (and the resultant number of non-relevant documents), then N is ten. If there are no relevant documents on the first screen, then we search subsequent screens until there is at least one relevant document – in this case N is a multiple of ten. We will ignore the case of ten relevant documents returned on the preliminary search because then the preliminary query is very good and one just goes to subsequent screens to retrieve more relevant documents.
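As a concrete illustration of these weighting options (a hypothetical sketch; the four-word vocabulary and toy document are invented), the binary and TF representations of a document can be built as follows:

```python
import numpy as np

vocabulary = ["market", "oil", "price", "soccer"]   # hypothetical vocabulary
words = "oil price rose as the oil market tightened".split()

counts = np.array([words.count(w) for w in vocabulary], dtype=float)

tf_vector = counts                          # term-frequency (TF) weighting
binary_vector = (counts > 0).astype(float)  # binary weighting
# TF-IDF would further scale each element by log(n_docs / doc_freq), which
# requires a pass over the whole collection to obtain document frequencies --
# the time penalty mentioned in the abstract.
```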

The first feedback iteration using Rocchio after the initial (non-Rocchio) search forms the following query:

$$Q_1=\frac{\beta}{R}\sum_{\mathrm{Relevant}}D_i-\frac{\gamma}{N-R}\sum_{\mathrm{Non\text{-}relevant}}D_i.$$

β and γ are constants used to assign the relative importance of the relevant and non-relevant documents to the query. R is the number of relevant documents retrieved in the preliminary query and N is the total number of documents retrieved in the preliminary query. Negative elements of the vector are clipped to zero (Rocchio, 1971). To implement the first iteration we form the dot product of this first query against all the documents ($Q_1\cdot D_j$), where the documents are those not yet marked as relevant or non-relevant. We then rank the dot products from high to low, present the ten largest for the user to mark as relevant or non-relevant, and continue to the next iteration.

After the first feedback iteration, we form subsequent iterations:

$$Q_j=\alpha Q_{j-1}+\frac{\beta}{R}\sum_{\mathrm{Relevant}}D_i-\frac{\gamma}{N-R}\sum_{\mathrm{Non\text{-}relevant}}D_i,$$

where α represents the relative importance of the prior query. We have a number of concerns with the Rocchio algorithm that we feel will make it problematic to use, mainly that it depends on three constants (α, β, γ). Most studies of the Rocchio algorithm vary the three constants to determine their optimum values. However, we feel that is unfair – a separate validation set should have been used. Furthermore, even given a validation set, one does not have time to search for that optimum set of constants. Thus, we set α, β, and γ to 8, 16, and 4, respectively, a choice that seemed to work well elsewhere (Buckley, Salton, & Allen, 1994). Since all the negative elements of the resultant query are set to zero, and since we assume that most of the documents returned in the first iteration will be non-relevant, many elements of the query may be set to zero, making for very poor performance on the next iteration.
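A minimal sketch of this update, assuming document vectors are numpy arrays and using the constants above (this is not the authors' code). For the first feedback iteration, q_prev is simply the zero vector, which reduces the update to the Q1 formula:

```python
import numpy as np

def rocchio_update(q_prev, relevant, nonrelevant, alpha=8.0, beta=16.0, gamma=4.0):
    # relevant / nonrelevant: non-empty lists of marked document vectors;
    # len(nonrelevant) plays the role of N - R in the formula above.
    R, NR = len(relevant), len(nonrelevant)
    q = (alpha * q_prev
         + (beta / R) * np.sum(relevant, axis=0)
         - (gamma / NR) * np.sum(nonrelevant, axis=0))
    return np.clip(q, 0.0, None)  # negative query elements are clipped to zero
```

Ranking then amounts to computing Qj · D for every unmarked document and presenting the ten highest-scoring documents to the user.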

Schapire, Singer, and Singhal (1998) investigate a modified Rocchio algorithm and boosting as applied to text filtering. Although their investigation was not in the relevancy feedback domain, they did show that a modified Rocchio algorithm could do much better than the original. However, their algorithm requires multiple passes over the documents and is problematic for large databases. Joachims (1997) compared Rocchio and naïve Bayes in text categorization, while Salton and Buckley (1990) examine Rocchio and probabilistic feedback, but only for one feedback iteration.

The Ide regular algorithm (Salton & Buckley, 1990; Harman, 1992) is of the following format:

$$Q_j=Q_{j-1}+\sum_{\mathrm{Relevant}}D_i-\sum_{\mathrm{Non\text{-}relevant}}D_i,$$

where for the first feedback iteration the Q on the right-hand side is zero and, as usual, all negative elements of the resultant query are set to zero.

The Ide dec-hi algorithm has basically the same form as Ide regular, except that the last summation has only one term, namely the highest-ranked non-relevant document.
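Under the same assumptions as the Rocchio sketch above (document vectors as numpy arrays; again a sketch, not the paper's code), the two Ide updates look like this:

```python
import numpy as np

def ide_regular(q_prev, relevant, nonrelevant):
    # Sum all marked relevant vectors, subtract all marked non-relevant ones.
    q = q_prev + np.sum(relevant, axis=0) - np.sum(nonrelevant, axis=0)
    return np.clip(q, 0.0, None)  # negatives set to zero, as with Rocchio

def ide_dec_hi(q_prev, relevant, top_nonrelevant):
    # Only the single highest-ranked non-relevant document is subtracted.
    q = q_prev + np.sum(relevant, axis=0) - top_nonrelevant
    return np.clip(q, 0.0, None)
```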

The final technique is based on the use of SVMs. SVMs can best be understood in the context of Fig. 1, where the black diamonds represent the relevant document vectors D in high-dimensional space and the empty diamonds represent the non-relevant documents; in this figure it is assumed that the document vectors are linearly separable.

When SVMs are constructed, two hyperplanes are formed (the solid lines), one going through one or more examples of the non-relevant vectors and one going through one or more examples of the relevant vectors. Vectors lying on the hyperplanes are termed support vectors and in fact define the two hyperplanes. If we define the margin as the orthogonal distance between the two hyperplanes, then an SVM maximizes this margin. Equivalently, the optimal hyperplane (the dashed line halfway between the support hyperplanes) is such that the distance to the nearest vector is maximum.

Before we introduce the key concepts, it should be noted that if we followed typical convention we would use lower-case bold characters for vectors and upper-case bold characters for matrices. However, the common convention in IR seems to be to use upper-case bold D as the document vector. The key concepts we want to use are the following. There are two classes, $y_i\in\{-1,1\}$, where +1 is assigned to a document if it is relevant and −1 if it is non-relevant, and there are N labeled training examples:

$$(D_1,y_1),\ldots,(D_N,y_N),\quad D_i\in\mathbb{R}^d,$$

where d is the dimensionality of the vectors.

If the two classes are linearly separable, then one can find an optimal weight vector Q* that describes an optimal separating hyperplane such that the distance from the separating hyperplane to the closest vector of any class is maximum. These conditions are as follows:

$$Q^*\cdot D_i-b\geqslant 1\quad\text{if } y_i=1,$$
$$Q^*\cdot D_i-b\leqslant -1\quad\text{if } y_i=-1,$$

or equivalently

$$y_i(Q^*\cdot D_i-b)\geqslant 1,$$

where b is the bias.

Training examples that satisfy the equality are termed support vectors. The support vectors define two hyperplanes, one that goes through the support vectors of one class and one that goes through the support vectors of the other class. The orthogonal distance between the two hyperplanes defines the margin, and this margin is maximized when the norm of the weight vector Q* is minimum. Vapnik (1998) has shown that we may perform this minimization by maximizing the following function with respect to the variables $\alpha_j$:

$$W(\alpha)=\sum_{i=1}^{N}\alpha_i-\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j(D_i\cdot D_j)y_iy_j,$$

subject to the constraints $\sum_{i=1}^{N}\alpha_iy_i=0$ and $\alpha_i\geqslant 0$, where it is assumed there are N training examples, $D_i$ is one of the training vectors, and $\cdot$ represents the dot product. If $\alpha_i>0$, then it can be shown that $y_i(Q^*\cdot D_i-b)=1$, and the $D_i$ corresponding to a nonzero $\alpha_i$ is a support vector. For an unknown vector $D_j$, classification then corresponds to finding

$$F(D_j)=\mathrm{sgn}(Q^*\cdot D_j-b),\quad\text{where}\quad Q^*=\sum_{i=1}^{r}\alpha_iy_iD_i$$

and the sum is over the r support vectors taken from the training set. The advantage of the linear representation is that Q* can be calculated after training, so classification amounts to computing the dot product of this optimum weight vector with the input vector.
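As an aside, the claim above that maximizing the margin is equivalent to minimizing the norm of Q* follows from a standard derivation (not spelled out in the paper): for support vectors $D_+$ and $D_-$ lying on the two hyperplanes,

$$Q^*\cdot D_+-b=1,\qquad Q^*\cdot D_--b=-1,$$

so $Q^*\cdot(D_+-D_-)=2$, and the orthogonal distance between the two hyperplanes is

$$\text{margin}=\frac{Q^*\cdot(D_+-D_-)}{\|Q^*\|}=\frac{2}{\|Q^*\|}.$$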

For the non-separable case, training errors are allowed and we now must minimize

$$\|Q^*\|^2+C\sum_{i=1}^{N}\xi_i$$

subject to the constraints

$$y_i(Q^*\cdot D_i-b)\geqslant 1-\xi_i,\qquad \xi_i\geqslant 0,$$

where $\xi_i$ is a slack variable that allows training examples to exist in the region (margin) between the two hyperplanes that go through the support points of the two classes. We can equivalently maximize W(α), but the constraint is now $0\leqslant\alpha_i\leqslant C$ instead of $\alpha_i\geqslant 0$. Maximizing W(α) is a quadratic program in α subject to constraints and may be solved using quadratic programming techniques, some of which are particular to SVMs (Joachims, 1998). Details for solving this problem may also be obtained from Vapnik (1982, 1998), Cristianini and Shawe-Taylor (2000), and Cherkassky and Mulier (1998).
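The paper solves this quadratic program with SVMlight (see the Acknowledgements); purely as an illustrative stand-in, the sketch below uses scikit-learn's SVC with a linear kernel on hypothetical toy vectors to show how Q* is recovered from the dual variables:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 0.0, 1.0], [1.0, 0.0, 2.0],   # marked relevant (+1)
              [0.0, 2.0, 0.0], [0.0, 1.0, 1.0]])  # marked non-relevant (-1)
y = np.array([1, 1, -1, -1])

svm = SVC(kernel="linear", C=1.0)  # C bounds the dual variables: 0 <= alpha_i <= C
svm.fit(X, y)

# Q* = sum_i alpha_i * y_i * D_i over the support vectors; scikit-learn
# stores the signed products alpha_i * y_i in dual_coef_.
q_star = (svm.dual_coef_ @ svm.support_vectors_).ravel()
```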

Linear SVMs' execution speeds are very fast, and there is only one parameter to tune (C), which is a bound on the largest value of α. In most learning algorithms, if there are many more examples of one class than another, the algorithm will tend to correctly classify the class with the larger number of examples, thereby driving down the error rate. Since SVMs minimize the error rate by trying to separate the patterns in high-dimensional space, the result is that SVMs are relatively insensitive to the relative numbers of each class. For example, new training examples that lie “behind” the support vectors of their own class will not change the optimum hyperplane, since their values of α are zero.

In our model of relevancy feedback, after construction of Q* and b, we calculate Q*·Di−b for all documents not seen so far and rank them from high to low (assuming the relevant documents are of class +1) and return the top ten to the user for marking. (Strictly speaking, the bias term b is not needed since the rankings will remain the same whether the bias is subtracted from the dot product or not). Q*·Dj−b represents the proportional distance from the optimal separating hyperplane to the vector Dj. The documents not used in the training set may be inside or outside the margin (since they were not used to generate the present support vectors). Those documents on the class +1 side of the optimal hyperplane and furthest from the optimal hyperplane are the top ranked documents. Some of these top-ten ranked documents may in fact be non-relevant and in the next feedback iteration these newly marked vectors (in addition to those marked in previous feedback iterations) are used to construct a new set of support vectors. We contrast this with active learning (Schohn & Cohen, 2000; Tong & Koller, 2000) where the emphasis will be to take vectors in the next feedback iteration from within the margin. We don't want to do this in relevancy feedback, as many of the points within the margin will be non-relevant and not useful to the user. If the top-ten ranked documents are outside the margin and are all relevant, then in the next feedback iteration the support vectors will not change. If any of the top ten documents are within the margin, the next set of support vectors will be different from the present set.
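A minimal sketch of this ranking step, assuming q_star as computed in the previous block and a matrix whose rows are the unseen document vectors (the bias b is omitted since, as noted above, it does not change the ordering):

```python
import numpy as np

def top_ten(q_star, unseen_matrix):
    scores = unseen_matrix @ q_star   # proportional distance from the hyperplane
    return np.argsort(-scores)[:10]   # indices of the ten highest-ranked documents
```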

Solving the previous set of equations is done using SVMlight (see the Acknowledgements section). There are not many candidate vectors to consider as support points; in general the number of potential support points is ten times the iteration number, and the training time is usually under three seconds, although in some cases the algorithm takes longer to converge (up to 30 s).

Joachims (1998) looked at SVMs in text categorization and compared this to naïve Bayes, C4.5, k-nearest neighbor, and Rocchio. Although not a relevancy feedback study, it discusses the issue of whether all or just some of the features should be used (features are the elements of the vectors). Although reducing the number of features does improve performance on some algorithms (k-nearest neighbor, C4.5 and Rocchio), it does not for naïve Bayes and SVM. It is our contention that we cannot waste time searching for the best set of features and so we use the full set of features in our study.

Other relevant IR studies include that of Buckley, Salton, and Allen (1993), who examined IR within the context of a routing problem. They use the Rocchio algorithm modified so that the last term in the equation defining the new query includes not only the non-relevant documents marked on the present screen but also all unseen documents, which are assumed to be non-relevant. The same three authors (1994) also examined the use of locality information to improve performance. We also mention the study of incremental feedback (Aalbersberg, 1992), in which only the top-ranked document is returned to the user to be marked as relevant or not; this is another example of text categorization. Finally, it should be pointed out that all the techniques discussed here are vector techniques. Harman (1992) compares many models, including probabilistic models (as opposed to vector-space models), and is the only paper we could find that tracks performance as a function of iteration.

Term weighting issues

We discuss the issue of the term ti in the document vector Dj. In the IR field this term is called the term weight; in the machine learning field it is called a feature; and in linear algebra it is the ith element of the vector Dj. ti states something about word i in document Dj. If this word is absent, ti is zero. If the word is present, then there are several options. One option is that the term weight is a count of the number of times this word occurs in this document

Stemming and stop lists

Full stemming is the removal of all suffixes of a word. For example, “builder”, “building”, and “builds” will all be reduced to their common root “build”. There could also be partial stemming such as changing plural forms to their singular. Stemming reduces the size of the document vectors. One performance issue will be the effect of stemming on retrieval accuracy. But there are other performance issues such as retrieval speed and size of the inverted index. Buckley et al. (1993, p. 68),

Performance metrics

There are too many ways to assess the effectiveness of the feedback process to discuss here in detail. References are Lewis (1995), Tague-Sutcliffe (1992), Saracevic (1975), Mizzaro (1997), and Korfhage (1997). However, traditionally recall and precision are used. Let R be the number of relevant documents in the collection, nRel be the number of relevant documents actually retrieved in a feedback iteration and N be the total number of documents returned in the feedback iteration (typically, N

A test set

A set of documents labeled by topic can be used to simulate the relevance feedback environment. For a test set we use the Reuters corpus of news articles (www.research.att.com/∼lewis). This is a database of over 11,000 articles. Each article is delimited by SGML tags to indicate (among other items) the beginning and end of the article and the topic(s) assigned to that article. Some articles have multiple topics. Processing of the database proceeded as follows:

  • 1. Eliminate articles that have no

Experimental results

In Fig. 2, we assume one relevant document is retrieved on the initial search, for two cases: one where the topic has high visibility (33%) and one with low visibility (1.4%). For each case, we show the results of two algorithms: SVM using binary features and Rocchio using TF-IDF, averaged over ten experiments.

Let us discuss the high visibility case first (v=33%). Since the two algorithms have almost identical performance, we have used one label to identify the top two graphs. Recall that

Preliminary search based on keywords

The results reported above used a random set of relevant and non-relevant documents in the preliminary search. This has the advantage of allowing averaging over multiple experiments, but the disadvantage that the preliminary set of documents is not retrieved in a realistic manner, as it would be in a keyword search. Thus, if a good keyword search finds a relevant document in the first screen, even if the topic has low visibility, the resultant performance may be better than using the

Conclusions

We have analyzed the performance of SVM-based algorithms and compared them to Rocchio, Ide regular, and Ide dec-hi. In the first set of experiments we picked a random set of preliminary documents with a small, sometimes zero, number of relevant documents presented in the first screen. Based on these experiments, we can generally state that if the initial search is very poor and the visibility of the topic is low, then SVMs are superior to the other techniques investigated. In the second set of

Acknowledgements

Thanks go to Vladimir Vapnik for his insights and to Thorsten Joachims, who supplied the code for the support vector machine optimization problem. The code may be retrieved from: www-ai.informatik.uni-dortmund.de/thorsten/svm_light.html.

References

  • G. Salton et al. (1988). Term-weighting approaches in automatic text retrieval. Information Processing and Management.
  • Aalbersberg, J. I. (1992). Incremental relevance feedback. In Proceedings of the fifteenth annual international ACM...
  • Buckley, C., Salton, G., & Allen, J. (1993). Automatic retrieval with locality information using SMART. In Proceedings...
  • Buckley, C., Salton, G., & Allen, J. (1994). The effect of adding relevance information in a relevance feedback...
  • Cherkassky, V., & Mulier, F. (1998). Learning from data. New York:...
  • N. Cristianini et al. (2000). Support vector machines.
  • H. Drucker et al. (1999). Support vector machines for spam categorization. IEEE Transactions on Neural Networks.
  • Dumais, S., Platt, J., Heckerman, D., & Sahami, M. (1998). Inductive learning algorithms and representations for text...
  • Harman, D. (1992). Relevance feedback revisited. In Proceedings of the fifteenth international SIGIR conference on research...
  • Joachims, T. (1997). A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. Proceedings...
  • Joachims, T. (1998). Text categorization with support vector machines: learning with many relevant features. In European conference...
  • R.R. Korfhage (1997). Information storage and retrieval.

    A much shorter version of this paper was presented at the 2001 International Conference on Machine Learning.
