
1 Introduction

In our lives, there are many emotional and memorable moments worth keeping and sharing with others. Services that allow users to upload and share their personal photos are therefore among the notable products of many companies, such as Facebook, Flickr, Instagram, and Google Photos. This shows that sharing photos is one of the greatest demands of users on the Internet.

Online photo services usually allow users to attach memos to their photos and to search their photos more easily using text queries. Currently, the most common way for users to do so is to tag their photos manually, which consumes a lot of time and effort. There are also some proposed methods [3, 5] and smart systems, such as Google Photos and Flickr, that can automatically identify noticeable landmarks or locations related to the photos. However, these automated annotation systems suggest tags that are identical for all users and thus do not reflect one’s own memories, feelings, or characteristics. For example, these systems would recommend phrases like “Eiffel Tower”, “a dog”, or “a cat” rather than “where I first met my lover” or the name of your pet. Therefore, it is necessary to automatically tag users’ photos with personalized captions corresponding to their memories and personal characteristics.

In this paper, we propose a system that suggests appropriate annotations for each photo uploaded by a user using Visual Instance Search. In our system, users first assign personalized annotations to a few photos as initial examples; the system then automatically propagates these annotations to other existing photos in their collections based on the visual similarities among the photos. For each uploaded photo, the system relies on the visual similarities between that photo and the already-annotated photos of the corresponding user to propose a list of suitable annotations, ranked in descending order of similarity. The user can then approve the reasonable annotations for the uploaded photo. In addition, if a user uploads more photos and changes their annotations, the system has more samples for reference and thus tends to adapt better to the user’s interests. As a result, our system is not only able to recommend proper annotations that are unique to each user but also to interactively and incrementally learn and adapt as users change their annotations.

Since the problem of retrieving the images in a collection that are similar to a single query image has been studied for years, there are many different approaches to it. One of them is template matching, i.e., a technique for finding small parts of an image that match a template image [2, 4, 16]. Another popular technique, feature matching, evaluates the similarity of two images by comparing regions that appear to be critical parts of the images [1, 17, 23]. In this paper, we develop our own Visual Instance Search framework using the Bag-of-Words (BoW) model. In the BoW model, each image is represented as a histogram of pre-trained visual words (the codebook). Since BoW allows parts of a query image to appear in a flexible way in the result images, it is a promising approach that is widely used in many Visual Search systems.

As the number of uploaded images grows exponentially, the system faces difficulties adapting to those new images. Since re-training the codebook requires recomputing the Bag-of-Words vectors of users’ existing images and is also computationally expensive, we propose to use a fixed codebook trained on different types of content (e.g., vehicles, animals, buildings) and to use it universally. Because of the variety of these features, the Bag-of-Words vector of any new image can be computed and represented without changing the codebook. Therefore, we train our codebook on the Oxford Buildings Dataset and use this codebook for our system.

Our main contributions in this paper are as follows:

  • We propose and realize a system that can recommend annotations for photos using Visual Instance Search.

  • Our system allows the recommended annotations to be personalized and to vary from user to user.

  • Our system is interactively user-adaptive, i.e., the more a user annotates his/her photos via our system, the more accurate the recommended annotations become.

The rest of this paper is organized as follows. In Sect. 2, we review the background and related work in image retrieval and image classification. The detailed steps of the automatic annotation system and how we use the BoW model are described in Sect. 3. Section 4 presents our experimental results. The conclusion is given in Sect. 5.

2 Background and Related Work

There are many approaches to building an image information retrieval system. Some methods aim at high precision, i.e., a high quality of the top retrieved results, while others focus on high recall, i.e., retrieving all positive results. Among them, the first effective and scalable method is Bag-of-Words, proposed by Sivic and Zisserman [20], which is inspired by the corresponding algorithms used in text retrieval. Before going into the details of the BoW model in Subsect. 2.2, we first introduce some other methods for the image retrieval problem in Subsect. 2.1.

2.1 Different Approaches to the Image Retrieval Problem

One popular method is histogram comparison, which compares two images based on their color histograms. Some early works of this approach, using a cross-bin matching cost for histogram comparison, can be found in [12, 19, 24]. In [12], Peleg et al. represent images as sets of pebbles after normalization. The similarity score is then computed as the matching cost of two sets of pebbles based on their distances.

Another well-known technique is template matching, i.e., seeking a given pattern in an image by comparing the pattern with candidate regions of the same size in the target image. By considering both the pattern and a candidate region as length-N vectors, we can compare these two vectors using different kinds of distance metrics, and one such metric is the Minkowski distance [10]. The major disadvantage of the two methods listed above is that they require the query and target images to share a similar spatial arrangement, which means that components of the given image are not allowed to move freely beyond a certain extent. Bag-of-Words, the method discussed in this paper, is an approach that tolerates flexibility in the structure of the object and thus has a wider range of applications.
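As a concrete illustration of this idea, the minimal Python sketch below (not part of the original system) slides a pattern over a grayscale image and scores every candidate region of the same size with the Minkowski distance; the exhaustive scan and the array names are assumptions made purely for illustration.

```python
import numpy as np

def minkowski_distance(u, v, p=2):
    """Minkowski distance between two flattened patches (p=2 gives the Euclidean distance)."""
    return np.sum(np.abs(u - v) ** p) ** (1.0 / p)

def template_match(image, pattern, p=2):
    """Slide `pattern` over `image` (both 2-D grayscale arrays) and return the
    top-left corner of the candidate region with the smallest Minkowski distance."""
    H, W = image.shape
    h, w = pattern.shape
    t = pattern.astype(np.float64).ravel()
    best_pos, best_dist = None, np.inf
    for y in range(H - h + 1):
        for x in range(W - w + 1):
            patch = image[y:y + h, x:x + w].astype(np.float64).ravel()
            d = minkowski_distance(patch, t, p)
            if d < best_dist:
                best_pos, best_dist = (y, x), d
    return best_pos, best_dist
```

The quadratic sliding-window scan makes the rigidity of the method obvious: the pattern is compared only against regions of exactly its own size and layout, which is the limitation discussed above.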

2.2 Bag-of-Words

Since Bag-of-Words was originally a text retrieval technique, we first introduce some background on BoW for text retrieval before discussing how BoW is used in image retrieval.

Bag-of-Words in Text Retrieval. In text retrieval, a text is represented as a histogram of words, also known as BoW [6]. This scheme is called term frequency weighting, as the value of each histogram bin is equal to the number of times the word appears in the document. However, some words are less informative than others because they appear in almost every document, so a weighting scheme that addresses this problem is needed. This weighting scheme is called inverse document frequency (idf) and is formulated as \(\log (N_{D} / N_{i})\), where \(N_{D}\) is the number of documents in the collection and \(N_{i}\) is the number of documents that contain word i. The overall BoW representation is then weighted by multiplying the term frequency (tf) with the inverse document frequency (idf), giving rise to the tf-idf weighting [6]. In addition, extremely frequent words, the “stop words”, can be removed entirely in order to reduce storage requirements and query time.
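To make the weighting concrete, the following minimal Python sketch (not from the original paper) builds term-frequency histograms with a toy stop-word list and weights them by \(idf = \log (N_{D}/N_{i})\); the stop-word list and function names are illustrative assumptions.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "in"}  # illustrative stop-word list

def bow_histogram(document):
    """Term-frequency histogram of one document, with stop words removed."""
    words = [w for w in document.lower().split() if w not in STOP_WORDS]
    return Counter(words)

def idf_weights(histograms):
    """idf_i = log(N_D / N_i), where N_i counts the documents containing word i."""
    n_docs = len(histograms)
    doc_freq = Counter()
    for h in histograms:
        doc_freq.update(h.keys())          # each document counts a word at most once
    return {w: math.log(n_docs / n) for w, n in doc_freq.items()}

def tf_idf(histogram, idf):
    """Weight each term-frequency bin by its inverse document frequency."""
    total = sum(histogram.values())
    return {w: (c / total) * idf.get(w, 0.0) for w, c in histogram.items()}
```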

Bag-of-Words in Image Retrieval. When applying BoW to image retrieval, a major obstacle is the fact that text documents are naturally broken into words by spaces, dots, hyphens, or commas, whereas there is no such separator in images. Therefore, the concept of a “visual word” is introduced, where each visual word is represented by a cluster obtained by running k-means on the local descriptor vectors [20].
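A minimal sketch of this step, assuming scikit-learn is available, clusters the pooled local descriptors with plain k-means and assigns a new descriptor to its nearest cluster center (visual word). The tiny vocabulary size and the random placeholder descriptors below are purely illustrative; our actual dictionary is far larger and is built with approximate k-means, as described in Sect. 3.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stack the local descriptors of all training images (e.g. 128-D SIFT vectors).
descriptors = np.random.rand(5000, 128).astype(np.float32)  # placeholder data

# Each cluster center is one "visual word"; real systems use vocabularies up to 1M words.
vocabulary = KMeans(n_clusters=100, n_init=1, random_state=0).fit(descriptors)

# Hard assignment: map a new descriptor to the index of its nearest visual word.
new_descriptor = np.random.rand(1, 128).astype(np.float32)
word_id = vocabulary.predict(new_descriptor)[0]
```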

The bigger the vocabulary is, the more distinctive the visual words are; hence, a larger vocabulary helps distinguish images more effectively. Nonetheless, with a bigger vocabulary, slightly different descriptors can be assigned to different visual words, thus not contributing to the similarity of the respective images and causing the drop in performance examined in [9, 15, 18]. Philbin et al. [15] suggest a “soft assignment” method, where each descriptor is assigned to multiple nearest visual words, instead of “hard assignment”, i.e., assigning a local descriptor to only its single nearest visual word. Despite its effectiveness, this method also costs significantly more storage and time.

3 Proposed System

In this section, we present how our system learns to annotate different photos and briefly describe the main steps of our BoW model.

3.1 Learn from Manual Annotations and Automatic Annotation for New Photos

Figure 1 illustrates the overview of our proposed system for automatically recommending personalized annotations for newly uploaded photos. First, a user simply uses his or her smartphone’s camera to capture scenes or objects in real life such as books, dogs, or buildings. The photo is then sent to the annotation server for processing, and the server returns a list of visually similar photos. Each of these photos has an attached list of annotations, and these possible personalized annotations are re-ranked and sent to the user. The user can review and approve these personalized annotations before sharing the photo, along with the approved personalized tags, to social networks such as Facebook, Flickr, or Google Plus.

Figure 2 shows how our system learns to annotate a photo from samples provided by a user in the past. First, a user manually chooses suitable tags for some photos, and these photos, along with the tags, are sent to the server. Subsequently, our server identifies visually similar photos in his or her albums and recommends that the user also apply these changes to them. The user can approve the changes before they take effect in the database. From this point on, our system automatically annotates new photos for that user based on these new configurations.

Fig. 1. Overview of our proposed system to automatically recommend personalized tags.

Fig. 2. Overview of how our system learns to annotate photos from samples provided by users.

3.2 Visual Instance Search Method

Feature Extraction. Many methods have been proposed to detect and extract features from images (the Harris-Affine and Hessian-Affine detectors [8], the Maximally Stable Extremal Region (MSER) detector [7], the edge-based region detector [21], the intensity extrema-based region detector [22], etc.). We choose the Hessian-Affine detector for detecting and extracting features from images. In our version of the BoW model, we use Perd’och’s SIFT implementation, which is shown to perform best on the Oxford Buildings Dataset [13] (Figs. 3 and 4).
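For readers who want to experiment quickly, a rough stand-in for this step could use OpenCV’s SIFT (a DoG-based detector, not the Hessian-Affine detector with Perd’och’s implementation used in our system), so the sketch below only approximates our actual pipeline; the image file name is hypothetical.

```python
import cv2

# NOTE: our system uses a Hessian-Affine detector with Perd'och's SIFT implementation;
# OpenCV's DoG-based SIFT is used here only as an accessible approximation.
image = cv2.imread("photo.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical file name
assert image is not None, "photo.jpg not found"

sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(image, None)
# `descriptors` is an (N, 128) array of local descriptors, one row per keypoint.
```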

Fig. 3. How an Image Retrieval System works.

Fig. 4. Proposed framework.

Dictionary Building. Treating each descriptor as an individual visual word in the dictionary would waste resources and time. To overcome this obstacle, we build the dictionary by treating similar descriptors as one word. In other words, all descriptor vectors are divided into k clusters, each representing a visual word. Many algorithms have been proposed to solve this kind of problem; we use approximate k-means (AKM), proposed by Philbin et al. [14]. Compared to the original k-means, AKM removes the majority of the time spent on exact nearest-neighbor computation while giving only slightly different results. Also, in [14], Philbin et al. show that a dictionary size of 1M gives the best performance on the Oxford Buildings 5K Dataset [11].
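The essence of AKM is to replace the exact nearest-centroid search in each k-means iteration with an approximate one (randomized k-d forests in [14]). The simplified sketch below uses a single SciPy k-d tree as a stand-in for the randomized forest; it is only meant to convey that structure, not to reproduce the implementation of [14].

```python
import numpy as np
from scipy.spatial import cKDTree

def approximate_kmeans(descriptors, k, n_iter=10, seed=0):
    """Simplified AKM-style loop: the assignment step queries a k-d tree built over
    the current centroids (the original AKM uses randomized k-d forests instead)."""
    rng = np.random.default_rng(seed)
    centroids = descriptors[rng.choice(len(descriptors), k, replace=False)].astype(np.float64)
    for _ in range(n_iter):
        tree = cKDTree(centroids)            # index the current centroids
        _, labels = tree.query(descriptors)  # (approximate) nearest-centroid search
        for j in range(k):                   # standard k-means update step
            members = descriptors[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids
```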

Quantization. Subsequently, each 128-dimensional SIFT descriptor needs to be mapped onto the dictionary. Commonly, each descriptor is assigned to the nearest word in the dictionary, so when two descriptors are assigned to different words they are considered totally different. In practice, this hard assignment leads to errors due to variability in the descriptors (e.g., image noise, varying scene illumination, instability in the feature detection process) [15]. To handle this problem, we use soft assignment instead of hard assignment. In particular, each 128-dimensional SIFT descriptor is mapped to its k nearest visual words in the dictionary. Each of these k nearest clusters is assigned a weight calculated with the formula proposed in [15], \(weight = \exp (-\frac{d^2}{2\delta ^2})\), where d is the distance from the cluster center to the descriptor point. Then, by adding all these weights to their corresponding bins, we obtain the BoW representation of an image.

In this work, k and \(\delta ^2\) are chosen to be 3 and 6250, respectively.
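A minimal sketch of this quantization step, assuming the dictionary is available as an array of cluster centers, could look as follows; the helper name and the use of a SciPy k-d tree for the nearest-word search are illustrative choices, not part of our original implementation.

```python
import numpy as np
from scipy.spatial import cKDTree

def soft_assign_bow(descriptors, vocabulary, k=3, delta_sq=6250.0):
    """Build a soft-assigned BoW histogram: each descriptor votes for its k nearest
    visual words with weight exp(-d^2 / (2 * delta^2))."""
    bow = np.zeros(len(vocabulary), dtype=np.float64)
    tree = cKDTree(vocabulary)
    dists, idxs = tree.query(descriptors, k=k)   # distances/indices of the k nearest words
    weights = np.exp(-(dists ** 2) / (2.0 * delta_sq))
    for word_ids, ws in zip(idxs, weights):
        bow[word_ids] += ws                      # accumulate the weights into the histogram
    return bow
```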

tf-idf Weighting Scheme. As mentioned in Sect. 2, tf-idf is a popular weighting scheme used by almost every BoW model. In this section, we show how this scheme is applied in our system.

For a term \(t_{i}\) in a particular document \(d_{j}\), its term frequency \(tf_{i, j}\) is defined as follows:

$$\begin{aligned} tf_{i, j} = \frac{n_{i, j}}{\sum \limits _{k} n_{k, j}} \end{aligned}$$
(1)

where \(n_{i, j}\) is the number of occurrences of the considered term \(t_{i}\) in the document \(d_{j}\). The denominator is the sum of the number of occurrences of all the terms in document \(d_{j}\).

The inverse document frequency \(idf_{i}\) of a term \(t_{i}\) is computed by the following formula:

$$\begin{aligned} idf_{i} = \log {\frac{\left| D\right| }{\left| \{j: t_{i} \in d_{j}\}\right| }} \end{aligned}$$
(2)

where \(\left| D\right| \) is the total number of documents in the corpus and \(\left| \{j: t_{i} \in d_{j}\}\right| \) is the number of documents in which the term \(t_{i}\) appears, i.e. \(n_{i, j} \ne 0\).

The tf-idf weight of a term \(t_{i}\) in a document \(d_{j}\) is then calculated as the product of tf and idf:

$$\begin{aligned} {tfidf}_{i, j} = tf_{i, j} \times idf_{i} \end{aligned}$$
(3)

The tf-idf weight is then used to compute the similarity score between an image \(d_{i}\) and a query q:

$$\begin{aligned} s_{d_{i}, q} = \varvec{{tfidf}_{i}} \cdot \varvec{{tfidf}_{q}} = \sum \limits _{j = 1}^{\left| T\right| } {tfidf}_{i, j} \times {tfidf}_{q, j} \end{aligned}$$
(4)

where \(\left| T\right| \) is the number of terms (visual words) in the dictionary. Finally, by sorting the list of images according to their similarity scores with the query, we obtain the raw ranked list for this query, which is then used in the spatial re-ranking step.
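Putting Eqs. (1)–(4) together, a compact sketch of the scoring step (assuming the raw BoW histograms of the database images and of the query are already computed) might look like the following; the dense matrix layout and function names are illustrative, and a real system would use sparse vectors and an inverted index.

```python
import numpy as np

def tfidf_matrix(bow_counts):
    """bow_counts: (n_images, n_words) raw BoW histograms.
    Returns the tf-idf weighted matrix following Eqs. (1)-(3), plus the idf vector."""
    tf = bow_counts / np.maximum(bow_counts.sum(axis=1, keepdims=True), 1e-12)
    df = np.count_nonzero(bow_counts > 0, axis=0)           # documents containing each word
    idf = np.log(bow_counts.shape[0] / np.maximum(df, 1))   # Eq. (2), guarding unused words
    return tf * idf, idf

def rank_images(bow_counts, query_counts):
    """Rank the database images by the similarity score of Eq. (4)."""
    db_tfidf, idf = tfidf_matrix(bow_counts)
    q_tf = query_counts / max(query_counts.sum(), 1e-12)
    q_tfidf = q_tf * idf
    scores = db_tfidf @ q_tfidf                              # dot products with the query
    return np.argsort(-scores), scores                       # indices in descending score order
```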

4 Experiments and Results

In this section, we first present our experimental results on the Oxford Buildings 5K Dataset to show that our BoW implementation achieves good performance on a standard benchmark. The experiment shows that our version of BoW achieves a mean average precision of 0.844 on the Oxford Buildings 5K Dataset with an average query time of nearly one second. This dataset was constructed by Philbin et al. in 2007 [14]. It consists of 5,062 images of resolution \(1024 \times 768\) belonging to 11 different Oxford buildings. The images for each building were collected from Flickr by searching with text queries. Along with the dataset, there are 55 manually constructed ground-truth queries, 5 for each landmark. For each query, images are classified into 4 groups: (1) Good: the building appears clearly, (2) OK: more than 25% of the building is present, (3) Bad: the building does not appear, and (4) Junk: less than 25% of the building is captured. We use this dataset because of its popularity: it is used by many previous works in this field, so we can easily compare our system with them.
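For reference, a simplified version of the average-precision computation under this protocol (treating Good and OK images as positives and ignoring Junk images) could look like the sketch below; it omits the interpolated precision-recall integration of the official evaluation script.

```python
import numpy as np

def average_precision(ranked_ids, positives, junk):
    """Simplified AP for one query: Good/OK images are positives, Junk images are
    skipped (they neither help nor hurt), everything else counts as negative."""
    hits, rank, precision_sum = 0, 0, 0.0
    for img_id in ranked_ids:
        if img_id in junk:
            continue
        rank += 1
        if img_id in positives:
            hits += 1
            precision_sum += hits / rank   # precision at this recall point
    return precision_sum / max(len(positives), 1)

def mean_average_precision(aps):
    """mAP over a set of queries (e.g. the 55 Oxford queries or our 5 personal queries)."""
    return float(np.mean(aps))
```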

Second, we present and illustrate several typical scenarios of our automatic annotation system with a dataset consisting of our personal photos taken from Facebook. This dataset includes 5 different classes corresponding to 5 personally annotated social events. Two of the classes share a common annotation. Photos in each class share particular attributes such as background, mascots, or logos. As a result, whenever users create or edit the annotation of these common objects, other photos in the same class can also be tagged similarly thanks to these mutual attributes. The details of the 5 classes in the dataset are described below:

Fig. 5. Our personal dataset. Column (a) shows the 5 queries of the 5 classes in the dataset. Columns (b) and (c) show some examples from the returned results of the queries.

  1. #APCS_Party: Photos taken at a party at our university. Photos in this class contain nearly the same group of people and share a similar background and stage decoration.

  2. #First_time_in_Singapore: Photos taken at the Merlion in Singapore. They all contain the Merlion statue.

  3. #Hoi_An_with_family: Photos taken at Hoi An town in Vietnam with one of the authors’ families. The people appearing in them and the background are their common attributes.

  4. #My_favorite_competition: Photos taken at the several times one of the authors took part in the ACM-ICPC, a famous collegiate programming competition. The mutual characteristic of these photos is the logo of the competition.

  5. #My_first_regional: Photos taken at the ICPC regional contest in Phuket, Thailand. The photos all contain the mascot of the competition.

We then performed experiments on 5 different queries corresponding to the 5 classes. These queries and some sample results are given in Fig. 5. In this experiment, each query also takes our system nearly one second on average, and the mean average precision over the 5 queries is 0.749. The detailed results are shown in Table 1.

5 Conclusion

In this paper, we have proposed the idea of using Visual Instance Search to build a system that helps users with 2 different tasks: automatically suggesting personalized annotations for uploaded photos and propagating users’ annotations to similar photos. To realize this, we build our system on the Bag-of-Words model. To evaluate the performance of the system, we used the Oxford Buildings 5K Dataset, a popular benchmark for the Visual Instance Search task. In addition, we have also experimented with and illustrated our system on scenarios taken from personal photos, along with their tags, on Facebook. Our system achieves mAPs of 0.844 and 0.749 on these 2 datasets, respectively, with a processing time of less than 1 second per query. In the future, we believe the system can be developed further into a valuable extension for online photo-sharing services.