A graph-based image annotation framework

https://doi.org/10.1016/j.patrec.2007.10.018

Abstract

Automatic image annotation is crucial for keyword-based image retrieval because it can be used to improve the textual description of images. In this paper, we propose a unified framework for image annotation, which contains two kinds of learning processes and incorporates three kinds of relations among images and keywords. In addition, we propose improvements to its components: a reinforced image-to-image relation, a combined word-to-word relation, and a progressive learning method. Experiments on the Corel dataset demonstrate their effectiveness. We also show that many existing image annotation algorithms can be formulated within this framework, and we present an experimental comparison among these algorithms to evaluate their performance comprehensively.

Introduction

With the advent of digital imagery, the number of images has been growing rapidly and there is an increasing need for effectively indexing and searching these images. Systems using non-textual (image) queries have been proposed, but many users find it hard to represent their queries with abstract image features. Most users prefer textual queries, i.e. keyword-based image search, which is typically achieved by manually providing image annotations and searching over these annotations with a textual query. However, manual annotation is an expensive and tedious procedure, so automatic image annotation is necessary for efficient image retrieval.

Many algorithms have been proposed for automatic image annotation. In the most straightforward approach, each semantic keyword or concept is treated as an independent class and corresponds to one classifier. Methods like linguistic indexing of pictures (Li and Wang, 2003), image annotation using SVM (Cusano and Schettini, 2003) and the Bayes point machine (Chang et al., 2003) fall into this category. Other methods try to learn a relevance model associating images and keywords. The early work of Duygulu et al. (2002) applied a translation model (TM) to translate a set of blob tokens (obtained by clustering image regions) into a set of annotation keywords. Jeon et al. (2003) viewed image annotation as analogous to the cross-lingual retrieval problem and proposed the cross-media relevance model (CMRM). Lavrenko et al. (2003) proposed the continuous-space relevance model (CRM), which assumes that every image is divided into regions and each region is described by a continuous-valued feature vector. Given a training set of annotated images, a joint probabilistic model of image features and words is estimated, from which the probability of generating a word given the image regions can be predicted. Compared with the CMRM, the CRM directly models continuous features, so it does not rely on clustering and consequently avoids granularity issues. Feng et al. (2004) proposed another relevance model in which a multiple-Bernoulli model is used to generate words instead of the multinomial one in the CRM. More recently, several efforts have considered word correlation in the annotation process, such as the coherent language model (CLM) (Jin et al., 2004), correlated label propagation (CLP) (Kang et al., 2006), annotation refinement using random walk (Wang et al., 2006) and a WordNet-based method (Jin et al., 2005). Graph-based methods have also attracted much attention. Pan et al. (2004) first proposed a graph-based automatic captioning method, in which images, annotations and regions are treated as three types of nodes of a mixed media graph on which image annotation is performed. In our previous work (Liu et al., 2006), we proposed an NSC-based method that calculates image similarities on visual features and propagates annotations from training images to similar test images.

As these algorithms appear quite different from one another, it is not easy to answer questions such as which models are better, what the connections among them are, and how they should be utilized. In this paper, we conduct a formal study of these issues and find that previous work can be cast as two kinds of learning processes, which together integrate three kinds of relations, as shown in Fig. 1: the image-to-image relation, the word-to-word relation, and the image-to-word relation.

We propose a unified framework for image annotation in which annotation is performed through two graph learning processes. The first process (referred to as "basic image annotation") obtains preliminary annotations for each untagged image. It is a learning process on an image-based graph, whose nodes are images and whose edges are relations between images. The second process (referred to as "annotation refinement") refines the candidate annotations obtained from the first process. It is a learning process on a word-based graph, whose nodes are words and whose edges are relations between words.

The proposed framework allows us to analyze and understand some previous work more clearly, and offers potential research guidance. In this paper, we propose three improvements on different parts of the framework. First, considering the intra-relations among training images and among test images, we propose a reinforced inter-relation between training and test images. Second, a combined word correlation is designed as a comprehensive estimate, which considers not only the statistical distribution of words in the training dataset but also a visual-content-based measurement in the context of the Web (see the sketch after this paragraph). Third, a progressive learning method is proposed to perform image annotation in a greedy manner, relaxing the traditional assumption of word independence for an image to conditional independence. To evaluate these improvements, we carry out several experiments on the benchmark Corel dataset. We also give a systematic comparison with related work. Our scheme achieves strong performance, and the experimental results are consistent with the theoretical analysis under the proposed framework.
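To make the combined word correlation concrete, the following minimal Python sketch estimates the corpus-based part from word co-occurrence in the training annotations and mixes it with a web-derived, visual-content-based matrix. The matrix `S_web` and the mixing weight `alpha` are our assumptions for illustration; the paper's exact combination rule is not reproduced in this excerpt.

```python
import numpy as np

def cooccurrence_similarity(annotations, vocab):
    """Corpus-based word-to-word relation: row-normalized co-occurrence
    counts over the training annotations (one word list per image)."""
    idx = {w: j for j, w in enumerate(vocab)}
    C = np.zeros((len(vocab), len(vocab)))
    for words in annotations:
        for a in words:
            for b in words:
                if a != b:
                    C[idx[a], idx[b]] += 1.0
    rows = C.sum(axis=1, keepdims=True)
    return np.divide(C, rows, out=np.zeros_like(C), where=rows > 0)

def combined_word_relation(S_corpus, S_web, alpha=0.5):
    """Hypothetical linear combination of the corpus-based relation and
    a web-based, visual-content-derived relation (both |V| x |V|)."""
    return alpha * S_corpus + (1.0 - alpha) * S_web
```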

The rest of the paper is organized as follows. Section 2 introduces the unified framework for image annotation. Improvements based on the proposed framework are described in Section 3. Section 4 presents the implementation of image annotation with the proposed improvements. Section 5 presents experimental comparisons among several related methods and our scheme. Conclusions and future work are given in Section 6.

Section snippets

Image annotation framework based on graph learning

The proposed framework consists of two learning processes, denoted "basic image annotation" and "annotation refinement", and the three kinds of relations mentioned above. In the basic image annotation process, the image-to-image relation and the image-to-word relation are integrated to obtain the candidate annotations. In the annotation refinement process, the word-to-word relation is explored to refine the candidate annotations from the prior process. Both learning processes are performed
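The sketch below illustrates one way these two processes could be chained: candidate scores are obtained by propagating the training images' word distribution through test-to-training image similarities, then smoothed over the word graph. The propagation rules and the parameter `beta` are our assumptions for illustration; the paper's exact learning schemes are not reproduced in this excerpt.

```python
import numpy as np

def basic_annotation(S_II, S_IW):
    """Basic image annotation: integrate the image-to-image relation
    S_II (n_test x n_train) with the image-to-word relation S_IW
    (n_train x n_words) to score candidate words per test image."""
    scores = S_II @ S_IW
    return scores / np.maximum(scores.sum(axis=1, keepdims=True), 1e-12)

def annotation_refinement(candidates, S_WW, beta=0.7):
    """Annotation refinement: smooth the candidate scores over the
    word-based graph S_WW (n_words x n_words), keeping a fraction
    beta of the original scores."""
    return beta * candidates + (1.0 - beta) * candidates @ S_WW

def annotate(S_II, S_IW, S_WW, top_k=5):
    """Return indices of the top_k scoring words for each test image."""
    refined = annotation_refinement(basic_annotation(S_II, S_IW), S_WW)
    return np.argsort(-refined, axis=1)[:, :top_k]
```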

Improvements under the framework

The above analysis of the proposed framework demonstrates that the problem of image annotation can be decomposed into several specific and well-defined sub-problems. Specifically, three kinds of relations and two graph learning processes are involved. We can improve these components individually and expect their combination to enhance the overall performance. In the following, we present our improvements to particular components of the framework.

Basic image annotation

Having obtained the reinforced relation S_ts in Section 3.1, we have the image correlation (S_II) prepared for basic image annotation. The image-to-word relation (S_IW), estimated from the training set, is another key component. Here, we select the multiple-Bernoulli model of Feng et al. (2004) to model the word distribution:

$$S_{IW}(i,j) = P(w_j \mid I_i) = \frac{\mu\,\delta_{w_j,I_i} + N_{w_j}}{\mu + N_T}$$

where S_IW(i, j) indicates the probability of the word w_j given the image I_i, μ is a smoothing parameter estimated by cross-validation, δ_{w_j,I_i} = 1 if the word w_j appears in the annotation of I_i and 0 otherwise, N_{w_j} is the number of training images annotated with w_j, and N_T is the total number of training images.
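The formula above translates directly into code. The sketch below assumes the training annotations are given as lists of words per image, with `mu` a free smoothing parameter to be set by cross-validation.

```python
import numpy as np

def bernoulli_image_to_word(annotations, vocab, mu=1.0):
    """Multiple-Bernoulli image-to-word relation (after Feng et al., 2004):
    S_IW[i, j] = (mu * delta[i, j] + N_w[j]) / (mu + N_T), where delta[i, j]
    is 1 iff word j annotates training image i, N_w[j] is the number of
    training images annotated with word j, and N_T is the number of
    training images."""
    idx = {w: j for j, w in enumerate(vocab)}
    delta = np.zeros((len(annotations), len(vocab)))
    for i, words in enumerate(annotations):
        for w in words:
            delta[i, idx[w]] = 1.0
    N_w = delta.sum(axis=0)          # per-word training frequency
    N_T = float(len(annotations))    # total number of training images
    return (mu * delta + N_w) / (mu + N_T)

# Example: three training images over a five-word vocabulary.
vocab = ["sky", "water", "tree", "grass", "sun"]
S_IW = bernoulli_image_to_word(
    [["sky", "water"], ["tree", "grass"], ["sky", "sun"]], vocab)
```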

Experimental design

To present a fair comparison with previous work, we use the Corel dataset provided by Duygulu et al. (2002) without any modification. The dataset contains 5000 images. Each image is segmented into 1–10 regions, and a 36-dimensional feature vector, covering color, texture and area features as in (Duygulu et al., 2002), is extracted for each region. All regions are clustered into 500 clusters (called blobs). Each image is annotated with 1–5 words, and the total number of words is 371.
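As a rough illustration of this preprocessing, the following sketch quantizes 36-dimensional region features into 500 blobs with k-means. The feature extraction itself (color, texture and area, as in Duygulu et al., 2002) is assumed to have been done elsewhere; the random features here are placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans

def quantize_regions(region_features, n_blobs=500, seed=0):
    """Cluster region feature vectors (one 36-d row per region) into
    n_blobs clusters; each region is then represented by the index of
    its nearest cluster center, i.e. its blob token."""
    km = KMeans(n_clusters=n_blobs, n_init=10, random_state=seed)
    blob_ids = km.fit_predict(region_features)
    return km, blob_ids

# Placeholder: 20,000 synthetic regions with 36-d features.
features = np.random.rand(20000, 36)
model, blobs = quantize_regions(features)
```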

Conclusions and future work

In this paper, we propose a unified framework for automatic image annotation. It comprises two graph learning processes that exploit three kinds of relations. The reinforced image-to-image relation, the combined word-to-word relation, and the progressive learning method are proposed to effectively improve annotation performance. The comprehensive experiments and discussion demonstrate that improving individual components of the framework benefits the overall annotation quality.

In future work, we will

Acknowledgement

This research was supported by the National Natural Science Foundation of China (60605004, 60675003) and the National 863 Project (2006AA01Z315, 2006AA01Z117).

References (19)

  • Blei, D., Jordan, M., 2003. Modeling annotated data. In: Proc. 26th Internat. Conf. on Research and Development in...
  • Budanitsky, A., Hirst, G., 2001. Semantic distance in WordNet: An experimental, application-oriented evaluation of five...
  • Chang, E., et al., 2003. CBSA: Content-based soft annotation for multimodal image retrieval using Bayes point machines. IEEE Trans. CSVT.
  • Cusano, C., Ciocca, G., Schettini, R., 2003. Image annotation using SVM. In: Proc. Internet Imaging IV, SPIE vol. 5304,...
  • Duygulu, P., Barnard, K., de Freitas, J., Forsyth, D.A., 2002. Object recognition as machine translation: Learning a...
  • Feng, S.L., Manmatha, R., Lavrenko, V., 2004. Multiple Bernoulli relevance models for image and video annotation. In:...
  • Jeon, J., Lavrenko, V., Manmatha, R., 2003. Automatic image annotation and retrieval using cross-media relevance...
  • Jiang, J., Conrath, D., 1997. Semantic similarity based on corpus statistics and lexical taxonomy. In: Proc. Internat....
  • Jin, R., Chai, J., Si, L., 2004. Effective automatic image annotation via a coherent language model and active...
