1 Introduction

Most modern image search systems simply retrieve images that have the properties specified by a given query. This works well in many cases, but the results are often not quite what we need. Suppose, for example, we have found an image that is perfect except for the color of one object. In this case, it would be desirable to perform arithmetic operations such as [an image of a blue car] - ‘blue’ + ‘red’ to obtain the ideal image. Such a system would extend the possibilities of image search. For instance, it could lead to more interactive systems that search for images through verbal interactions with voice assistants.

To address this problem, we propose an image search system based on multimodal analogy. A visual-semantic embedding model, which forms the core of the system, enables us to perform analogical reasoning over images by using words to specify properties to be added to or subtracted from the current results.

Additionally, searching for an image specified by the aforementioned arithmetic operations requires an appropriate similarity measure. We therefore propose a measure based on the difference between additive and subtractive queries, and we demonstrate its effectiveness experimentally.

2 Multimodal Learning

2.1 Visual-Semantic Embedding Models

To perform arithmetic operations on images and text, both must be represented in a shared vector space. Much research has been done on learning joint embeddings of images and text. A well-known approach is to learn a function that maps image and word embeddings into a common vector space [1,2,3]. The learned image-text embeddings are often called visual-semantic embeddings, since the training process captures semantic relationships between images and text.

2.2 Multimodal Linguistic Regularities

Kiros et al. [1] reported that multimodal linguistic regularities can be found in an image-text embedding space, although the main focus of their work was image captioning. They qualitatively investigated the properties of the multimodal vector space, and their results indicate that linguistic regularities [4] carry over to the joint space.

Fig. 1. Image-text encoder of the visual-semantic embedding model

They also proposed a visual-semantic embedding learned with an image-text encoder. The image-text encoder (Fig. 1) consists of a convolutional neural network (CNN) [5] and a long short-term memory (LSTM) network [6]; the CNN takes images as input, and the LSTM takes sentences. In the training phase, the network is optimized to minimize a pairwise ranking loss:

$$\begin{aligned} \min _{\varvec{\theta }} \ &\sum _{\varvec{X}} \sum _k \max \{\, 0,\ \alpha - S(\varvec{X}, \varvec{V}) + S(\varvec{X}, \varvec{V}_k)\, \} \ + \\ &\sum _{\varvec{V}} \sum _k \max \{\, 0,\ \alpha - S(\varvec{V}, \varvec{X}) + S(\varvec{V}, \varvec{X}_k)\, \}, \end{aligned}$$
(1)

where \(\max \{\cdot ,\cdot \}\) returns the larger of its arguments, \(\varvec{\theta }\) denotes the model parameters, \(\alpha \) is a margin, and cosine similarity is used as the scoring function \(S(\cdot ,\cdot )\). \(\varvec{V}_k\) and \(\varvec{X}_k\) are contrastive embeddings for the image embedding \(\varvec{X}\) and the sentence embedding \(\varvec{V}\), respectively. Intuitively, the loss trains the network to assign high scores to correct image-text pairs and low scores to incorrect ones.
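To make the objective concrete, the following is a minimal PyTorch sketch of the pairwise ranking loss (1). It is not the authors' implementation: the function name, the margin value, and the batching scheme (treating all non-matching pairs in a mini-batch as contrastive examples) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(X, V, alpha=0.2):
    """X: (N, D) image embeddings, V: (N, D) sentence embeddings for N
    matching image-sentence pairs; alpha is the margin in (1)."""
    X = F.normalize(X, dim=1)                  # cosine similarity becomes a dot product
    V = F.normalize(V, dim=1)
    scores = X @ V.t()                         # scores[i, j] = S(X_i, V_j)
    positives = scores.diag().view(-1, 1)      # S(X_i, V_i) for the correct pairs

    # Treat every non-matching sentence as a contrastive V_k for image X_i,
    # and every non-matching image as a contrastive X_k for sentence V_j.
    cost_img = (alpha - positives + scores).clamp(min=0)       # rank sentences given an image
    cost_txt = (alpha - positives.t() + scores).clamp(min=0)   # rank images given a sentence

    # The correct pairs themselves must not contribute to the loss.
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    return cost_img.masked_fill(mask, 0).sum() + cost_txt.masked_fill(mask, 0).sum()
```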

3 Similarity Measure Based on the Difference Vector

Given a base query image and words that specify additive/subtractive properties, Kiros et al. [1] use the following similarity measure:

$$\begin{aligned} S(\varvec{X},\ q_\text {img} - q_\text {sub} + q_\text {add}) , \end{aligned}$$
(2)

where \(q_\text {img}\), \(q_\text {add}\), and \(q_\text {sub}\) are vector representations of the queries in the multimodal vector space and \(S(\cdot ,\cdot )\) is cosine similarity. With the similarity measure (2), we search for the vector \(\varvec{X}\) that is closest to \(q_\text {img} - q_\text {sub} + q_\text {add}\) with respect to cosine similarity.

A desirable target \(q_\text {img} - q_\text {sub} + q_\text {add}\) can be viewed as “\(q_\text {img} + \text {difference}\),” where the difference vector has both a direction and a magnitude. The similarity measure (2) constrains both the direction and the magnitude of the difference from the base image. However, the arithmetic “\(\text {base} - \text {sub} + \text {add}\)” is only qualitative: it is generally hard to specify the magnitude of the difference vector by giving only additive/subtractive words.
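For reference, retrieval with measure (2) can be sketched as follows. This is a minimal illustration under our own assumptions: the image embeddings and the query vectors \(q_\text {img}\), \(q_\text {add}\), \(q_\text {sub}\) are assumed to be precomputed in the joint space, and the tensor names and shapes are ours.

```python
import torch
import torch.nn.functional as F

def rank_by_composed_query(images, q_img, q_add, q_sub):
    """Rank candidates by S(X, q_img - q_sub + q_add), i.e. measure (2).
    images: (N, D) image embeddings; q_img, q_add, q_sub: (D,) query vectors."""
    target = q_img - q_sub + q_add
    scores = F.cosine_similarity(images, target.unsqueeze(0), dim=1)
    return scores.argsort(descending=True)     # indices of the best matches first
```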

In our method, instead of (2), we use the following measure:

$$\begin{aligned} S(\varvec{X} - q_\text {img},\ q_\text {add} - q_\text {sub}). \end{aligned}$$
(3)

With this similarity measure, we search for \(\varvec{X}\) whose difference from the base image is similar to “\(\text {add} - \text {sub}\)” (see Fig. 2).
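Under the same assumptions as the sketch above, our measure (3) only changes what is compared: the difference of each candidate from the base image is scored against the word difference.

```python
import torch
import torch.nn.functional as F

def rank_by_difference(images, q_img, q_add, q_sub):
    """Rank candidates by S(X - q_img, q_add - q_sub), i.e. measure (3)."""
    diff_img = images - q_img.unsqueeze(0)     # difference of each candidate from the base image
    diff_txt = (q_add - q_sub).unsqueeze(0)    # direction specified by the additive/subtractive words
    scores = F.cosine_similarity(diff_img, diff_txt, dim=1)
    return scores.argsort(descending=True)
```

Because cosine similarity ignores vector length, only the direction of the change from the base image is constrained, which matches the qualitative nature of the additive/subtractive word query.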

Fig. 2. Relation between our similarity measure and the previous one

Fig. 3. Examples in which our measure performs better than the previous method

4 Experiments

We conducted experiments in the same manner as [1] to compare our results with the previous method. We used the Microsoft COCO dataset [7] to train the encoder; it contains about 83,000 images, each accompanied by five descriptive sentences.

Figures 3 and 4 both show comparisons between (2) and (3) on multimodal analogical reasoning tasks. Figure 3 illustrates examples in which our measure performs better than the previous one in terms of ranking order. Note that our measure also performs well when the query words are not visually obvious, as in the first example in Fig. 3, whereas the previous approach struggles to find plausible images.

On the other hand, Fig. 4 shows cases where our method performs poorly and the system returns irrelevant images. We speculate that this is most likely due to insufficient training data, which prevents the network from learning the necessary semantic relationships, although the cause requires further investigation.

Taking all these results into consideration, we consider our measure to be more suitable for searching for images based on multimodal analogical reasoning.

Fig. 4. Examples in which our measure returns irrelevant images

5 Conclusion

We proposed an image search system based on multimodal analogy that allows us to perform analogical reasoning over images and text. Our difference-based similarity measure yields reasonably better results than the previous method on qualitative analogical reasoning tasks. The system provides a flexibility that would be useful not only in traditional web image search but also when searching for images through verbal interactions with voice assistants, which are attracting increasing attention.