Elsevier

Neurocomputing

Volume 168, 30 November 2015, Pages 829-845

Modeling user preferences in content-based image retrieval: A novel attempt to bridge the semantic gap

https://doi.org/10.1016/j.neucom.2015.05.041

Highlights

  • A novel method for image retrieval has been proposed based on a Generalized Linear Model.

  • The model aims to bridge the semantic gap between low level features and user preferences.

  • A drastic dimension reduction of feature vector is achieved by using a distance matrix.

  • A broad set of experiments has been carried out for different databases.

  • A new evaluation procedure has been proposed based on the empirical cumulative distribution functions of the relevant and non-relevant retrieved images.

Abstract

This paper is concerned with content-based image retrieval from a stochastic point of view. The semantic gap problem is addressed in two ways. First, a dimensionality reduction is applied using the (pre-calculated) distances among images. The dimension of the reduced vector is the number of preferences that we allow the user to choose from, in this case, three levels. Second, the conditional probability distribution of the random user preference, given this reduced feature vector, is modeled using a proportional odds model. A new model is fitted at each iteration. The score used to rank the image database is based on the estimated probability function of the random preference. Additionally, some memory is incorporated in the procedure by weighting the current and previous scores. Also, a novel evaluation procedure is proposed in this work based on the empirical cumulative distribution functions of the relevant and non-relevant retrieved images. Good experimental results are achieved in very different experimental setups and on different databases.
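The evaluation idea mentioned above, comparing the empirical cumulative distribution functions (ECDFs) of the rank positions of relevant and non-relevant retrieved images, can be sketched as follows. This is an illustrative reading with made-up rank positions, not the authors' exact procedure:

```python
import numpy as np

def ecdf(values, grid):
    """Empirical cumulative distribution function evaluated on a grid."""
    values = np.sort(np.asarray(values))
    return np.searchsorted(values, grid, side="right") / len(values)

# Illustrative rank positions (1 = best) of images in a ranked result list.
relevant_ranks = [1, 2, 3, 5, 8, 13]
nonrelevant_ranks = [4, 6, 7, 9, 10, 11, 12, 14, 15]

grid = np.arange(1, 16)
F_rel = ecdf(relevant_ranks, grid)
F_non = ecdf(nonrelevant_ranks, grid)

# A retrieval is good when the ECDF of relevant ranks dominates (lies
# above) that of non-relevant ranks at every cut-off position.
gap = np.max(F_rel - F_non)  # a Kolmogorov-Smirnov-style separation
print(round(gap, 3))
```

The larger the gap between the two curves, the earlier the relevant images appear in the ranking relative to the non-relevant ones.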

Introduction

Content-based image retrieval (CBIR) is the process by which a system automatically selects a set of images from a possibly very large collection that match a user's preference, expressed either in words or as a visual query (showing the system one or a small sample of images that meet the user's intention). Most collections are not semantically annotated with textual labels, so the selection of relevant images is based only on visual features. Indeed, in most cases, low level visual features related to color, texture, etc., and less commonly mid-level features based on regions, are extracted and compared with the query. As is widely acknowledged, the main challenge in the design of these systems is how to bridge the semantic gap between the low level representation, mostly in the form of a vector of numerical features, and the high level semantic representation of the user's intention, expressed either textually or as a visual query. This goal was already introduced in publications as early as [6], [38], [37], or [27].

The following review of previous works will focus on certain issues of CBIR systems, namely feature vector dimensionality reduction, query movement and/or expansion, combination of subspaces of features and image ranking; these issues are directly related to our contribution which will be explained in detail in Section 2.

Some ideas previously applied by other researchers to reduce the semantic gap involve reducing the dimensionality of the feature vector, on the assumption that the dimensions of the reduced space have some kind of semantic significance. Common approaches rely on linear transformations, mostly Principal Component Analysis (PCA), such as [32], retaining a certain amount of variation. Other ideas use Support Vector Machines (SVM), such as [35] or [36], which are known to cope well with a high dimensionality relative to the available data set size (the "curse of dimensionality"). A less common approach to dimensionality reduction uses a non-linear transformation based on the projection onto subspaces of smaller dimension defined by the nearest neighbors of each point [34]. In contrast, other authors prefer to first classify the training set to extract a small set of representatives and use some sort of distance to the elements of this set as features. The advantage is that a large reduction of the dimensionality can be achieved without greatly compromising the system's effectiveness, the work proposed in [26] being highly relevant. Other examples are [18], in which a Radial Basis Function is used to capture the topological structure of the semantic space, and [13], in which the structure is learned during the relevance feedback process. The main drawback of some of these approaches is that the semantic meaning of some features, or groups of them, is completely hidden, which may cause what the user perceives as erratic behavior during the feedback process.
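The variance-retention criterion behind the PCA approaches cited above can be sketched with a few lines of numpy. The data, the 95% default and the dominant-direction setup are illustrative assumptions, not taken from any of the cited works:

```python
import numpy as np

def pca_reduce(X, var_ratio=0.95):
    """Project feature vectors onto the principal components that
    retain a given fraction of the total variance."""
    Xc = X - X.mean(axis=0)
    # SVD of the centred data matrix gives the principal directions.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = (s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(np.cumsum(explained), var_ratio) + 1)
    return Xc @ Vt[:k].T  # reduced (n_images, k) representation

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))   # 200 images, 64 low-level features
X[:, :4] *= 10.0                 # a few directions dominate the variance
Z = pca_reduce(X, var_ratio=0.90)
print(Z.shape)
```

With a few dominant directions, most of the variance is captured by far fewer components than the original 64.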

The search carried out to satisfy the query is helped by the user's feedback through interaction using a graphical user interface, an approach known as relevance feedback which has been routinely used in recent years, [28] being a classical widely cited work.

Regarding the feedback process, there are several ways of incorporating the information provided by the user. Roughly, these can be classified as techniques based on query re-weighting, query expansion and query movement. Two recent compact summaries of these classifications, and a comparative study, are [29], [24]. The first group (query re-weighting) changes the weights assigned to each feature, or group of features, based on the user's choices. A typical example can be seen in [8]. This is most commonly done by altering the weights of a pseudo-Euclidean metric used to calculate distances between the query image and all the images in the database, as in [36]. Query expansion, on the other hand, proceeds by adding to the original query further images taken from those the user marks as positive during the feedback process. Finally, query movement acts by changing the original query proposed by the user, on the understanding that it was intrinsically ambiguous and will be refined by the users themselves through their own choices during the feedback process. A substantial difference between query expansion and query movement, pointed out by [21], is an assumption underlying query movement: that the relevant images form a uni-modal cluster in the feature space used. This can leave out entire collections of images that would be classified as relevant if shown to the user (false negatives). By contrast, query expansion techniques admit a multimodal query, which is usually the case, especially for complex semantic requests, but they have the drawback of yielding a higher number of false positives (images returned by the system as relevant, but which are not). An interesting example of a clever combination of both methods is [21].
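A classic query re-weighting heuristic of the kind described above gives more weight to features on which the relevant images agree, i.e. features with low variance across the positive examples. The following sketch is a generic illustration of the idea, not the scheme of any particular cited work; the data and the inverse-variance rule are assumptions:

```python
import numpy as np

def reweight(relevant, eps=1e-6):
    """Weight each feature by the inverse of its variance over the
    images the user marked as relevant: consistent features count more."""
    w = 1.0 / (np.var(relevant, axis=0) + eps)
    return w / w.sum()

def weighted_distance(x, q, w):
    """Weighted (pseudo-Euclidean) distance between image x and query q."""
    return np.sqrt(np.sum(w * (x - q) ** 2))

rng = np.random.default_rng(1)
relevant = rng.normal(size=(5, 8))  # 5 relevant images, 8 features
relevant[:, 2] = 0.7                # feature 2 is consistent across them
w = reweight(relevant)
q = relevant.mean(axis=0)           # centroid as the (moved) query point
d = weighted_distance(rng.normal(size=8), q, w)
```

After re-weighting, the metric is dominated by the features the user's positive examples agree on, so the next ranking emphasizes them.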

Both query expansion and query movement can be considered as ways of learning the user's preferences. Differences can be established according to how this information is used across successive iterations. Most methods only use the choices of the last iteration, assuming that former ones are implicitly incorporated into the current state, but more complex methods may take into account the whole history of the search (user's log). Interesting examples are [15], [31].

An important point in CBIR systems is how to rank the images in the database to show them to the user. Ranking by distance to the query is the most obvious choice, but if the query is multi-objective (which is always true in query expansion techniques) some global measure of ranking must be used. There are examples based on post-retrieval clustering [23] or on rank aggregation [25]. An experimental comparison of some of these methods can be found in [17].
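As a concrete example of global ranking from a multi-objective query, one simple rank aggregation scheme is the Borda count: each partial ranking awards points by position and the points are summed. This is a generic illustration of rank aggregation, not necessarily the method used in [25]; the rankings below are made up:

```python
import numpy as np

def borda_aggregate(rankings):
    """Combine several rankings of the same n images by Borda count:
    each ranking awards n - position points to every image."""
    n = len(rankings[0])
    points = np.zeros(n)
    for r in rankings:               # r[pos] = image id at rank position pos
        for pos, img in enumerate(r):
            points[img] += n - pos
    return np.argsort(-points)       # aggregated ranking, best first

# Three hypothetical rankings of five images (ids 0..4), e.g. one per
# query image in an expanded query.
rankings = [[0, 1, 2, 3, 4], [1, 0, 2, 4, 3], [0, 2, 1, 3, 4]]
agg = borda_aggregate(rankings)
print(agg.tolist())  # → [0, 1, 2, 3, 4]
```

The aggregated order rewards images that rank consistently high under all the partial rankings, which is the property a multi-objective query needs.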

Finally, a less treated but important point in CBIR systems is the system's ability to rank the images and show them to the user in a reasonable time. This is compulsory if several iterations must be performed to attain a result of sufficient quality. Obviously, the key points to be considered are the computational cost of the evaluation of the similarity index chosen for a given image, the cost of ranking and the total number of images in the database. In our experience, the most important point is the database size and, close to this, the evaluation cost. Many of the published experiments work with small databases (around 1000 images), or medium-size ones (up to 100,000 images) with the highly relevant exception of [9] which evaluates its algorithm in a 100-million image database using a similarity caching system.

The main differences between the previously cited works and the current work in each of the aforementioned issues are as follows:

The relationship between low-level features and high-level preferences (reduction of the semantic gap) will be approached by using generalized linear models, GLMs for short [20]. The use of GLMs requires either a relatively large number of images evaluated by the user, or a reduction of the dimensionality of the low level feature vector. We have opted for the second approach: indeed, a significant reduction of the dimension of the feature vector is achieved by a new procedure that relies on a previously evaluated matrix containing the distance between every pair of images in the database. Once the dimensionality has been reduced, the GLMs can be applied. In particular, a cumulative proportional odds model will be used.
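One way to read the dimensionality reduction described above is that each image's high-dimensional feature vector is replaced by a short vector of distances derived from the pre-calculated distance matrix, one coordinate per user-preference level. The following sketch uses the mean distance to each labelled group; this is our illustrative reading under stated assumptions (random symmetric distance matrix, mean aggregation), not the paper's exact procedure:

```python
import numpy as np

# D[i, j] = pre-calculated distance between images i and j (assumed
# symmetric with zero diagonal); random here purely for illustration.
rng = np.random.default_rng(2)
A = rng.random((100, 100))
D = (A + A.T) / 2.0
np.fill_diagonal(D, 0.0)

def reduced_features(D, relevant, neutral, nonrelevant):
    """Replace each image's feature vector by its mean distance to the
    images in each of the three user-preference groups."""
    groups = (relevant, neutral, nonrelevant)
    return np.stack([D[:, list(g)].mean(axis=1) for g in groups], axis=1)

# User feedback: images 0-1 relevant, 2 neutral, 3-4 non-relevant.
Z = reduced_features(D, relevant={0, 1}, neutral={2}, nonrelevant={3, 4})
print(Z.shape)  # one 3-dimensional vector per image in the database
```

With only three predictors, a GLM can be fitted reliably from the handful of images a user evaluates per iteration.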

Regarding the feedback process, what we change at each iteration are the coefficients of a generalized linear model that links a weighted Mahalanobis distance to the query components with the probability of each image being similar to the query; this can be seen as a sophisticated form of query re-weighting. Our system does not carry out query movement (the query keeps all the original images), but it does perform query expansion (the images marked as relevant, as well as other images given by the model in successive iterations, are added to the query), with the particularity that the images visited (seen by the user, but not explicitly marked) are also classified and used for the current iteration.

With respect to the ranking procedure, we decided that, since the images can be classified into three categories (relevant, neutral and non-relevant), a good ranking can be built by a weighted addition of the probabilities of belonging to the first two classes.

Finally, in order to accelerate the search we use a pre-calculated table of distances; model fitting with the provided data and model evaluation on the whole database are not critical, since the generalized linear model is expressed as a simple formula.
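The pre-calculated table of distances can be built once at indexing time, after which each retrieval iteration reduces to table look-ups. A minimal numpy sketch, assuming Euclidean distances over the low-level feature vectors (the feature data is random for illustration):

```python
import numpy as np

def pairwise_distances(X):
    """Full Euclidean distance table, computed once at indexing time
    via the expansion ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y."""
    sq = np.sum(X ** 2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.sqrt(np.maximum(D2, 0.0))  # clip tiny negatives from rounding

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 16))   # 50 images, 16 low-level features
D = pairwise_distances(X)

# At query time, the distances from image 7 to every other image are
# a single row read instead of 50 distance evaluations.
row = D[7]
```

The table costs O(n^2) memory, so for very large databases approximate or cached schemes (as in the 100-million image system of [9]) replace the full table.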

Section snippets

Methodology

As stated before, we are concerned with the retrieval of images within large databases by using stochastic modeling. In particular, the random preference of the user given the low level features of the image is the event to be modeled. This in turn involves the choice of appropriate low level features, the reduction of their dimensionality so that a sound model can be fitted and the ranking of the results based on the evaluation of the model and taking into account the user's feedback.

Although

Databases description

Firstly, the different collections we work with and the low level features used to characterize each image in the collections will be described. Table 2 summarizes the sizes of the different databases:

    Our Own collection:

    This first collection has been specially built for evaluation purposes and has been assembled using some images obtained from the web and others taken by the authors. These pictures are classified as belonging to 28 different categories such as flowers, horses,

First experiment: measuring the semantic gap

The procedure has been tested over our Own collection and the Wikipedia2011 collection. The first collection is indexed by feature vector FA and the second collection is indexed by both feature vectors, FA and FB (see Section 2.1).

The results in Table 8, Table 9 reflect the intrinsic visual difficulty of each topic. This difficulty is related to the distance between low level features and high level preferences. The procedure will run up to 10 iterations and the last ranking will be evaluated

Comparison and discussion

We have compared our algorithm with the one proposed by Lucas and Giacinto [26], as both algorithms aim to achieve a very drastic dimensionality reduction by projecting the feature space into a dissimilarity space. In order to perform the comparison between our approach and the cited algorithm (from now on, NBB procedure), we have reproduced experiment 4 (database Wikipedia2011 and FB low level feature vector). The results are shown in Table 13: column 6 shows the AP values for each selected

Conclusions and further work

In this paper, a new relevance feedback procedure has been proposed that relies on the proportional odds model to bridge the semantic gap between the low level features, used to describe each image, and random user preferences. This is the heart of our approach. The use of these kinds of models requires a small number of low level preferences. The preferences given by the user have been used to define a low dimensional feature vector for the whole database. This low level feature vector has

Acknowledgments

This work has been partially supported by projects MCYT TEC2009-12980, DPI2013-45742-R, DPI2013-47279-C2-1-R and TIN2013-47090-C3-1-P from the Spanish Government.

Esther de Ves was born in Almansa (Spain). She received the M.S. degree in Physics and the Ph.D. in Computer Science from the University of Valencia, in 1993 and 1999, respectively. Since 1994 she has been with the Department of Computer Science of the University of Valencia, where she is an Assistant Professor. Her current interests are in the areas of texture analysis and multimedia database retrieval.

References (38)

  • Thomas Deselaers et al., Features for image retrieval: an experimental comparison, Inf. Retr. (2008)
  • Yubing Dong, Baice Li, Combined automatic weighting and relevance feedback method in content-based image retrieval, in:...
  • Ruben Granados, Joan Benavent, Xaro Benavent, Esther de Ves, Ana García-Serrano, Multimodal information approaches for...
  • Michael Grubinger, Analysis and evaluation of visual information systems performance (Ph.D. thesis), School of Computer...
  • Rob Hess, An open-source SIFT library, in: Proceedings of the International Conference on Multimedia MM '10, ACM, New...
  • S.C.H. Hoi, Wei Liu, Shih-Fu Chang, Semi-supervised distance metric learning for collaborative image retrieval, in:...
  • Torsten Hothorn et al., Implementing a class of permutation tests: the coin package, J. Stat. Softw. (2008)
  • Lu Hui, Huang Xiang-Lin, Yang Li-Fang, Liu Min, A relevance feedback system for CBIR with long-term learning, in: 2010...
  • Wei Liu, Yujing Ma, Wenhui Li, Wei Wang, Yan Liu, A cbir framework: dimension reduction by radial basis function, in:...

Guillermo Ayala was born in Cartagena (Spain), in 1962. He graduated in Mathematics (1985) and received his Ph.D. in Statistics (1988), both at the University of Valencia. He was a scholarship holder at CMU S. Juan de Ribera (Burjasot, Spain) from 1979 to 1987. He is currently a Professor at the Department of Statistics and Operations Research in the University of Valencia, Spain. His research interests are in the areas of medical image analysis and applications of stochastic geometry in computer vision.

Xaro Benavent-Garcìa was born in Valencia (Spain). She received the M.S. degree in Computer Science from the Polytechnic University of Valencia, in 1994, and the Ph.D. in Computer Science from the University of Valencia in 2001. Since 1996 she has been with the Department of Computer Science of the University of Valencia, where she is an Assistant Professor. Her current interests are in the areas of image database retrieval and multimodal fusion algorithms.

Juan Domingo was a scholarship holder at CMU S. Juan de Ribera (Burjasot, Spain) from 1983 to 1988. He graduated with a degree in physics, in 1988, from the University of Valencia. He received the M.Sc. degree in information technology, in 1991, from the University of Edinburgh and the Ph.D. degree in computing, in 1993, from the University of Valencia, where he is currently a Titular Professor in the Department of Informatics. His research interests include medical image analysis, database image retrieval and intelligent control.

Esther Dura received the M.Eng. degree in computing science from Universidad de Valencia, Valencia, Spain, in 1998, the M.Sc. degree in artificial intelligence from the University of Edinburgh, Edinburgh, UK, in 1999, and the Ph.D. degree in computing and electrical engineering from Heriot-Watt University, Edinburgh, UK, in 2003. Her thesis investigated the use of computer vision and image processing techniques for the classification and reconstruction of objects and textured seafloors from sidescan sonar images. From 2003 to 2006, she was a Postdoctoral Researcher working on remote sensing and biomedical applications at the Departments of Electrical and Computing Engineering and the Department of Biomedical Engineering, Duke University, Durham, NC. Her research interests include pattern recognition, computer and image processing techniques for remote sensing, image retrieval, and medical applications. She is currently working as an Associate Professor at the Department of Computer Science, Universidad de Valencia.
