Elsevier

Neurocomputing

Volume 168, 30 November 2015, Pages 829-845

Modeling user preferences in content-based image retrieval: A novel attempt to bridge the semantic gap

https://doi.org/10.1016/j.neucom.2015.05.041

Highlights

  • A novel method for image retrieval has been proposed based on a Generalized Linear Model.

  • The model aims to bridge the semantic gap between low level features and user preferences.

  • A drastic dimension reduction of feature vector is achieved by using a distance matrix.

  • A broad set of experiments has been carried out for different databases.

  • A new evaluation procedure has been proposed based on the empirical cumulative distribution functions of the relevant and non-relevant retrieved images.

Abstract

This paper is concerned with content-based image retrieval from a stochastic point of view. The semantic gap problem is addressed in two ways. First, a dimensionality reduction is applied using the (pre-calculated) distances among images. The dimension of the reduced vector is the number of preferences that we allow the user to choose from, in this case, three levels. Second, the conditional probability distribution of the random user preference, given this reduced feature vector, is modeled using a proportional odds model. A new model is fitted at each iteration. The score used to rank the image database is based on the estimated probability function of the random preference. Additionally, some memory is incorporated in the procedure by weighting the current and previous scores. Also, a novel evaluation procedure is proposed in this work based on the empirical cumulative distribution functions of the relevant and non-relevant retrieved images. Good experimental results are achieved in very different experimental setups and on different databases.
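The evaluation idea mentioned above, comparing the empirical cumulative distribution functions (ECDFs) of the rank positions of relevant and non-relevant retrieved images, can be sketched as follows. This is an illustrative reading with made-up rank positions, not the authors' exact procedure:

```python
import numpy as np

def ecdf(values, grid):
    """Empirical cumulative distribution function evaluated on a grid."""
    values = np.sort(np.asarray(values))
    return np.searchsorted(values, grid, side="right") / len(values)

# Illustrative rank positions (1 = best) of images in a ranked result list.
relevant_ranks = [1, 2, 3, 5, 8, 13]
nonrelevant_ranks = [4, 6, 7, 9, 10, 11, 12, 14, 15]

grid = np.arange(1, 16)
F_rel = ecdf(relevant_ranks, grid)
F_non = ecdf(nonrelevant_ranks, grid)

# A retrieval is good when the ECDF of relevant ranks dominates (lies
# above) that of non-relevant ranks at every cut-off position.
gap = np.max(F_rel - F_non)  # a Kolmogorov-Smirnov-style separation
print(round(gap, 3))
```

The larger the gap between the two curves, the earlier the relevant images appear in the ranking relative to the non-relevant ones.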

Introduction

Content-based image retrieval (CBIR) is the process by which a system automatically selects a set of images from a possibly very large collection that match a user's preference, expressed either in words or as a visual query (showing the system one or a small sample of images that meet the user's intention). Most collections are not semantically annotated with textual labels, so the selection of relevant images is based only on visual features. Indeed, in most cases, low level visual features related to color, texture, etc., and less commonly mid-level features based on regions, are extracted and compared with the query. As is widely acknowledged, the main challenge in the design of these systems is how to bridge the semantic gap between the low level representation, mostly in the form of a vector of numerical features, and the high level semantic representation of the user's intention, expressed either textually or as a visual query. This goal was already introduced in publications as early as [6], [38], [37], or [27].

The following review of previous works will focus on certain issues of CBIR systems, namely feature vector dimensionality reduction, query movement and/or expansion, combination of subspaces of features and image ranking; these issues are directly related to our contribution which will be explained in detail in Section 2.

Some ideas previously applied by other researchers to reduce the semantic gap involve reducing the dimensionality of the feature vector, on the assumption that the dimensions of the reduced space have some kind of semantic significance. Common approaches rely on linear transformations, mostly Principal Component Analysis (PCA), such as [32], retaining a certain amount of variation. Other ideas use Support Vector Machines (SVM), such as [35] or [36], which are known to cope well with a high dimensionality relative to the available data set size (the "curse of dimensionality"). A less common approach to dimensionality reduction uses a non-linear transformation based on the projection onto subspaces of smaller dimension defined by the nearest neighbors of each point [34]. In contrast, other authors prefer to first classify the training set to extract a small set of representatives and use some sort of distance to the elements of this set as features. The advantage is that a large reduction of the dimensionality can be achieved without greatly compromising the system's effectiveness, the work proposed in [26] being highly relevant. Other examples are [18], in which a Radial Basis Function is used to capture the topological structure of the semantic space, and [13], in which the structure is learned during the relevance feedback process. The main drawback of some of these approaches is that the semantic meaning of some features, or groups of them, is completely hidden, which may cause what the user perceives as erratic behavior during the feedback process.
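The variance-retention criterion behind the PCA approaches cited above can be sketched with a few lines of numpy. The data, the 95% default and the dominant-direction setup are illustrative assumptions, not taken from any of the cited works:

```python
import numpy as np

def pca_reduce(X, var_ratio=0.95):
    """Project feature vectors onto the principal components that
    retain a given fraction of the total variance."""
    Xc = X - X.mean(axis=0)
    # SVD of the centred data matrix gives the principal directions.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = (s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(np.cumsum(explained), var_ratio) + 1)
    return Xc @ Vt[:k].T  # reduced (n_images, k) representation

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))   # 200 images, 64 low-level features
X[:, :4] *= 10.0                 # a few directions dominate the variance
Z = pca_reduce(X, var_ratio=0.90)
print(Z.shape)
```

With a few dominant directions, most of the variance is captured by far fewer components than the original 64.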

The search carried out to satisfy the query is helped by the user's feedback through interaction using a graphical user interface, an approach known as relevance feedback which has been routinely used in recent years, [28] being a classical widely cited work.

Regarding the feedback process, there are several ways of incorporating the information provided by the user. Roughly, these can be classified as techniques based on query re-weighting, query expansion and query movement. Two recent compact summaries of these classifications, and a comparative study, are [29], [24]. The first group (query re-weighting) changes the weights assigned to each feature, or group of features, based on the user's choices. A typical example can be seen in [8]. This is most commonly done by altering the weights of a pseudo-Euclidean metric used to calculate distances between the query image and all the images in the database, as in [36]. Query expansion, on the other hand, proceeds by adding to the original query further images taken from those the user marks as positive during the feedback process. Finally, query movement acts by changing the original query proposed by the user, on the understanding that it was intrinsically ambiguous and will be refined by the users themselves through their own choices during the feedback process. A substantial difference between query expansion and query movement, pointed out by [21], is an assumption underlying query movement: that the relevant images form a uni-modal cluster in the feature space used. This can leave out entire collections of images that would be classified as relevant if shown to the user (false negatives). By contrast, query expansion techniques admit a multimodal query, which is usually the case, especially for complex semantic requests, but they have the drawback of yielding a higher number of false positives (images returned by the system as relevant, but which are not). An interesting example of a clever combination of both methods is [21].
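A classic query re-weighting heuristic of the kind described above gives more weight to features on which the relevant images agree, i.e. features with low variance across the positive examples. The following sketch is a generic illustration of the idea, not the scheme of any particular cited work; the data and the inverse-variance rule are assumptions:

```python
import numpy as np

def reweight(relevant, eps=1e-6):
    """Weight each feature by the inverse of its variance over the
    images the user marked as relevant: consistent features count more."""
    w = 1.0 / (np.var(relevant, axis=0) + eps)
    return w / w.sum()

def weighted_distance(x, q, w):
    """Weighted (pseudo-Euclidean) distance between image x and query q."""
    return np.sqrt(np.sum(w * (x - q) ** 2))

rng = np.random.default_rng(1)
relevant = rng.normal(size=(5, 8))  # 5 relevant images, 8 features
relevant[:, 2] = 0.7                # feature 2 is consistent across them
w = reweight(relevant)
q = relevant.mean(axis=0)           # centroid as the (moved) query point
d = weighted_distance(rng.normal(size=8), q, w)
```

After re-weighting, the metric is dominated by the features the user's positive examples agree on, so the next ranking emphasizes them.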

Both query expansion and query movement can be considered as ways of learning the user's preferences. Differences can be established according to how this information is used across successive iterations. Most methods only use the choices of the last iteration, assuming that former ones are implicitly incorporated into the current state, but more complex methods may take into account the whole history of the search (user's log). Interesting examples are [15], [31].

An important point in CBIR systems is how to rank the images in the database to show them to the user. Ranking by distance to the query is the most obvious choice, but if the query is multi-objective (which is always true in query expansion techniques) some global measure of ranking must be used. There are examples based on post-retrieval clustering [23] or on rank aggregation [25]. An experimental comparison of some of these methods can be found in [17].
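As a concrete example of global ranking from a multi-objective query, one simple rank aggregation scheme is the Borda count: each partial ranking awards points by position and the points are summed. This is a generic illustration of rank aggregation, not necessarily the method used in [25]; the rankings below are made up:

```python
import numpy as np

def borda_aggregate(rankings):
    """Combine several rankings of the same n images by Borda count:
    each ranking awards n - position points to every image."""
    n = len(rankings[0])
    points = np.zeros(n)
    for r in rankings:               # r[pos] = image id at rank position pos
        for pos, img in enumerate(r):
            points[img] += n - pos
    return np.argsort(-points)       # aggregated ranking, best first

# Three hypothetical rankings of five images (ids 0..4), e.g. one per
# query image in an expanded query.
rankings = [[0, 1, 2, 3, 4], [1, 0, 2, 4, 3], [0, 2, 1, 3, 4]]
agg = borda_aggregate(rankings)
print(agg.tolist())  # → [0, 1, 2, 3, 4]
```

The aggregated order rewards images that rank consistently high under all the partial rankings, which is the property a multi-objective query needs.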

Finally, a less treated but important point in CBIR systems is the system's ability to rank the images and show them to the user in a reasonable time. This is compulsory if several iterations must be performed to attain a result of sufficient quality. Obviously, the key points to be considered are the computational cost of the evaluation of the similarity index chosen for a given image, the cost of ranking and the total number of images in the database. In our experience, the most important point is the database size and, close to this, the evaluation cost. Many of the published experiments work with small databases (around 1000 images), or medium-size ones (up to 100,000 images) with the highly relevant exception of [9] which evaluates its algorithm in a 100-million image database using a similarity caching system.

The main differences between the previously cited works and the current work in each of the aforementioned issues are as follows:

The relationship between low-level features and high-level preferences (reduction of the semantic gap) will be approached by using generalized linear models, GLMs for short [20]. The use of GLMs requires either a relatively large number of images evaluated by the user, or a reduction of the dimensionality of the low level feature vector. We have opted for the second approach: indeed, a significant reduction of the dimension of the feature vector is achieved by a new procedure that relies on a previously evaluated matrix containing the distance between every pair of images in the database. Once the dimensionality has been reduced, the GLMs can be applied. In particular, a cumulative proportional odds model will be used.
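One way to read the dimensionality reduction described above is that each image's high-dimensional feature vector is replaced by a short vector of distances derived from the pre-calculated distance matrix, one coordinate per user-preference level. The following sketch uses the mean distance to each labelled group; this is our illustrative reading under stated assumptions (random symmetric distance matrix, mean aggregation), not the paper's exact procedure:

```python
import numpy as np

# D[i, j] = pre-calculated distance between images i and j (assumed
# symmetric with zero diagonal); random here purely for illustration.
rng = np.random.default_rng(2)
A = rng.random((100, 100))
D = (A + A.T) / 2.0
np.fill_diagonal(D, 0.0)

def reduced_features(D, relevant, neutral, nonrelevant):
    """Replace each image's feature vector by its mean distance to the
    images in each of the three user-preference groups."""
    groups = (relevant, neutral, nonrelevant)
    return np.stack([D[:, list(g)].mean(axis=1) for g in groups], axis=1)

# User feedback: images 0-1 relevant, 2 neutral, 3-4 non-relevant.
Z = reduced_features(D, relevant={0, 1}, neutral={2}, nonrelevant={3, 4})
print(Z.shape)  # one 3-dimensional vector per image in the database
```

With only three predictors, a GLM can be fitted reliably from the handful of images a user evaluates per iteration.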

Regarding the feedback process, what we change at each iteration are the coefficients of a generalized linear model that links a weighted Mahalanobis distance to the query components with the probability of each image being similar to the query; this can be seen as a sophisticated form of query re-weighting. Our system does not carry out query movement (the query keeps all the original images), but it does perform query expansion (the images marked as relevant, as well as other images given by the model in successive iterations, are added to the query), with the particularity that the images visited (seen by the user, but not explicitly marked) are also classified and used for the current iteration.

With respect to the ranking procedure, we decided that, since the images can be classified into three categories (relevant, neutral and non-relevant), a good ranking can be built by a weighted addition of the probabilities of belonging to the first two classes.

Finally, in order to accelerate the search we use a pre-calculated table of distances; model fitting with the provided data and model evaluation on the whole database are not critical, since the generalized linear model is expressed as a simple formula.
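The pre-calculated table of distances can be built once at indexing time, after which each retrieval iteration reduces to table look-ups. A minimal numpy sketch, assuming Euclidean distances over the low-level feature vectors (the feature data is random for illustration):

```python
import numpy as np

def pairwise_distances(X):
    """Full Euclidean distance table, computed once at indexing time
    via the expansion ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y."""
    sq = np.sum(X ** 2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.sqrt(np.maximum(D2, 0.0))  # clip tiny negatives from rounding

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 16))   # 50 images, 16 low-level features
D = pairwise_distances(X)

# At query time, the distances from image 7 to every other image are
# a single row read instead of 50 distance evaluations.
row = D[7]
```

The table costs O(n^2) memory, so for very large databases approximate or cached schemes (as in the 100-million image system of [9]) replace the full table.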

Section snippets

Methodology

As stated before, we are concerned with the retrieval of images within large databases by using stochastic modeling. In particular, the random preference of the user given the low level features of the image is the event to be modeled. This in turn involves the choice of appropriate low level features, the reduction of their dimensionality so that a sound model can be fitted and the ranking of the results based on the evaluation of the model and taking into account the user's feedback.

Although

Databases description

Firstly, the different collections we work with and the low level features used to characterize each image in the collections will be described. Table 2 summarizes the sizes of the different databases:

    Our Own collection:

    This first collection has been specially built for evaluation purposes and has been assembled using some images obtained from the web and others taken by the authors. These pictures are classified as belonging to 28 different categories such as flowers, horses,

First experiment: measuring the semantic gap

The procedure has been tested over our Own collection and the Wikipedia2011 collection. The first collection is indexed by feature vector FA and the second collection is indexed by both feature vectors, FA and FB (see Section 2.1).

The results in Table 8, Table 9 reflect the intrinsic visual difficulty of each topic. This difficulty is related to the distance between low level features and high level preferences. The procedure will run up to 10 iterations and the last ranking will be evaluated

Comparison and discussion

We have compared our algorithm with the one proposed by Lucas and Giacinto [26], as both algorithms aim to achieve a very drastic dimensionality reduction by projecting the feature space into a dissimilarity space. In order to perform the comparison between our approach and the cited algorithm (from now on, NBB procedure), we have reproduced experiment 4 (database Wikipedia2011 and FB low level feature vector). The results are shown in Table 13: column 6 shows the AP values for each selected

Conclusions and further work

In this paper, a new relevance feedback procedure has been proposed that relies on the proportional odds model to bridge the semantic gap between the low level features, used to describe each image, and random user preferences. This is the heart of our approach. The use of these kinds of models requires a small number of low level preferences. The preferences given by the user have been used to define a low dimensional feature vector for the whole database. This low level feature vector has

Acknowledgments

This work has been partially supported by projects MCYT TEC2009-12980, DPI2013-45742-R, DPI2013-47279-C2-1-R and TIN2013-47090-C3-1-P from the Spanish Government.

Esther de Ves was born in Almansa (Spain). She received the M.S. degree in Physics and the Ph.D. in Computer Science from the University of Valencia, in 1993 and 1999, respectively. Since 1994 she has been with the Department of Computer Science of the University of Valencia, where she is an Assistant Professor. Her current interests are in the areas of texture analysis and multimedia database retrieval.

References (38)

  • Thomas Deselaers et al., Features for image retrieval: an experimental comparison, Inf. Retr. (2008)
  • Yubing Dong, Baice Li, Combined automatic weighting and relevance feedback method in content-based image retrieval, in:...
  • Ruben Granados, Joan Benavent, Xaro Benavent, Esther de Ves, Ana García-Serrano, Multimodal information approaches for...
  • Michael Grubinger, Analysis and evaluation of visual information systems performance (Ph.D. thesis), School of Computer...
  • Rob Hess, An open-source SIFT library, in: Proceedings of the International Conference on Multimedia MM '10, ACM, New...
  • S.C.H. Hoi, Wei Liu, Shih-Fu Chang, Semi-supervised distance metric learning for collaborative image retrieval, in:...
  • Torsten Hothorn et al., Implementing a class of permutation tests: the coin package, J. Stat. Softw. (2008)
  • Lu Hui, Huang Xiang-Lin, Yang Li-Fang, Liu Min, A relevance feedback system for CBIR with long-term learning, in: 2010...
  • Wei Liu, Yujing Ma, Wenhui Li, Wei Wang, Yan Liu, A cbir framework: dimension reduction by radial basis function, in:...

Guillermo Ayala was born in Cartagena (Spain), in 1962. He graduated in Mathematics (1985) and received his Ph.D. in Statistics (1988), both at the University of Valencia. He was a scholarship holder at CMU S. Juan de Ribera (Burjasot, Spain) from 1979 to 1987. He is currently a Professor at the Department of Statistics and Operations Research in the University of Valencia, Spain. His research interests are in the areas of medical image analysis and applications of stochastic geometry in computer vision.

Xaro Benavent-Garcìa was born in Valencia (Spain). She received the M.S. degree in Computer Science from the Polytechnic University of Valencia, in 1994, and the Ph.D. in Computer Science from the University of Valencia in 2001. Since 1996 she has been with the Department of Computer Science of the University of Valencia, where she is an Assistant Professor. Her current interests are in the areas of image database retrieval and multimodal fusion algorithms.

Juan Domingo was a scholarship holder at CMU S. Juan de Ribera (Burjasot, Spain) from 1983 to 1988. He graduated with a degree in physics, in 1988, from the University of Valencia. He received the M.Sc. degree in information technology, in 1991, from the University of Edinburgh and the Ph.D. degree in computing, in 1993, from the University of Valencia, where he is currently a Titular Professor in the Department of Informatics. His research interests include medical image analysis, database image retrieval and intelligent control.

Esther Dura received the M.Eng. degree in computing science from Universidad de Valencia, Valencia, Spain, in 1998, the M.Sc. degree in artificial intelligence from the University of Edinburgh, Edinburgh, UK, in 1999, and the Ph.D. degree in computing and electrical engineering from Heriot-Watt University, Edinburgh, UK, in 2003. Her thesis investigated the use of computer vision and image processing techniques for the classification and reconstruction of objects and textured seafloors from sidescan sonar images. From 2003 to 2006, she was a Postdoctoral Researcher working on remote sensing and biomedical applications at the Departments of Electrical and Computing Engineering and the Department of Biomedical Engineering, Duke University, Durham, NC. Her research interests include pattern recognition, computer and image processing techniques for remote sensing, image retrieval, and medical applications. She is currently working as an Associate Professor at the Department of Computer Science, Universidad de Valencia.
