Abstract
We propose an image search system based on multimodal analogy, enabled by a visual-semantic embedding model. It allows us to perform analogical reasoning over images by specifying properties to be added or subtracted with words, as in [an image of a blue car] - ‘blue’ + ‘red’. The system consists of two main parts: (i) an encoder that learns image-text embeddings and (ii) a similarity measure between embeddings in a multimodal vector space. As the encoder, we adopt the CNN-LSTM encoder proposed in [1], which has been reported to capture multimodal linguistic regularities. We also introduce a new similarity measure based on the difference between the additive and subtractive queries. It gives reasonably better results than the previous approach on qualitative analogical reasoning tasks.
1 Introduction
Most modern image search systems simply return images that have the properties specified by a given query. This works most of the time. However, it is often the case that the search results are not quite right for our needs. Suppose, for example, we have an image that is perfect except for the color of one object. In this case, it is desirable to be able to perform arithmetic operations such as [an image of a blue car] - ‘blue’ + ‘red’ to obtain the ideal image. Such a system could extend the possibilities of image search. It could also lead to a more interactive system, for instance, one that searches for images through verbal interactions with voice assistants.
To address this problem, we propose an image search system based on multimodal analogy. A visual semantic embedding model, which forms the core of the system, enables us to perform analogical reasoning over images by specifying properties to be added to/subtracted from the current results with words.
Additionally, in order to search for an image specified by the aforementioned arithmetic operation, it is necessary to introduce an appropriate similarity measure. We therefore propose a measure based on the difference between additive and subtractive queries. We show the effectiveness of the measure by experiment.
2 Multimodal Learning
2.1 Visual-Semantic Embedding Models
To perform arithmetic operations on images and text, both need to be represented in a shared vector space. Much research has been done on learning joint embeddings of images and text. A well-known approach is to learn a function that maps both image and word embeddings into a common vector space [1,2,3]. The learned image-text embeddings are often called visual-semantic embeddings, since semantic relationships between images and text are captured during training.
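As a purely illustrative sketch of such a mapping, the snippet below projects hypothetical CNN image features and word vectors into a shared space with two linear maps and scores them by cosine similarity. All dimensions and weights here are stand-ins for learned quantities, not the models of [1,2,3]:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 4096-d CNN features, 300-d word vectors,
# projected into a shared 1024-d space (all values illustrative).
D_IMG, D_TXT, D_JOINT = 4096, 300, 1024

# Linear maps into the joint space (random here; trained jointly in practice).
W_img = rng.standard_normal((D_JOINT, D_IMG)) * 0.01
W_txt = rng.standard_normal((D_JOINT, D_TXT)) * 0.01

def embed_image(cnn_features: np.ndarray) -> np.ndarray:
    """Project CNN features into the joint space and L2-normalize."""
    v = W_img @ cnn_features
    return v / np.linalg.norm(v)

def embed_text(word_vector: np.ndarray) -> np.ndarray:
    """Project a text vector into the joint space and L2-normalize."""
    x = W_txt @ word_vector
    return x / np.linalg.norm(x)

img = embed_image(rng.standard_normal(D_IMG))
txt = embed_text(rng.standard_normal(D_TXT))
score = float(img @ txt)  # cosine similarity in the joint space
```

Because both embeddings are L2-normalized, their dot product directly gives the cosine similarity used as the scoring function below.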
2.2 Multimodal Linguistic Regularities
Kiros et al. [1] reported that multimodal linguistic regularities can be found in an image-text embedding space, although the main focus of their work was image captioning. They qualitatively investigated properties of the multimodal vector space, and their results indicate that linguistic regularities [4] carry over to the joint space.
They also proposed a visual-semantic embedding learned with an image-text encoder. The image-text encoder (Fig. 1) consists of a convolutional neural network (CNN) [5] and a long short-term memory (LSTM) network [6]. The CNN and the LSTM take images and sentences as input, respectively. In the training phase, the network is optimized to minimize a pairwise ranking loss:

$$ \min _{\theta } \sum _{\varvec{X}} \sum _{k} \max \{0,\, \alpha - S(\varvec{X}, \varvec{V}) + S(\varvec{X}, \varvec{V}_k)\} + \sum _{\varvec{V}} \sum _{k} \max \{0,\, \alpha - S(\varvec{V}, \varvec{X}) + S(\varvec{V}, \varvec{X}_k)\} \quad (1) $$
where \(\max \{\cdot ,\cdot \}\) returns the larger value, \(\theta \) denotes the model parameters, \(\alpha \) is a margin, and cosine similarity is used as a scoring function \(S(\cdot ,\cdot )\). \(\varvec{V}_k\) and \(\varvec{X}_k\) are, respectively, contrastive embeddings for image embeddings \(\varvec{X}\) and sentence embeddings \(\varvec{V}\). Intuitively, the loss function trains the network to assign high scores to correct pairs of images and text, while it gives incorrect pairs low scores.
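As a minimal numpy sketch (not the authors' implementation), this sum-of-hinges loss can be computed over a batch in which the k-th sentence and k-th image form the correct pair and all other pairings serve as contrastive examples:

```python
import numpy as np

def pairwise_ranking_loss(V: np.ndarray, X: np.ndarray, alpha: float = 0.2) -> float:
    """Pairwise ranking loss over L2-normalized sentence embeddings V and
    image embeddings X, each of shape [n, d]. Row k of V and row k of X
    form the correct pair; all other rows act as contrastive examples."""
    S = V @ X.T                   # cosine similarities (rows are unit-norm)
    pos = np.diag(S)              # scores of the correct pairs
    # Hinge terms: fix a sentence and vary the image, and vice versa.
    cost_im = np.maximum(0.0, alpha - pos[:, None] + S)
    cost_s = np.maximum(0.0, alpha - pos[None, :] + S)
    mask = 1.0 - np.eye(S.shape[0])   # exclude the correct pairs themselves
    return float((cost_im * mask).sum() + (cost_s * mask).sum())
```

With perfectly matched pairs and a margin below 1, the loss is zero; shuffling the image rows against the sentences yields a positive loss, which is the signal the encoder is trained to minimize.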
3 Similarity Measure Based on the Difference Vector
Given a base query image and words that specify additive/subtractive properties, Kiros et al. [1] use the following similarity measure:

$$ S(\varvec{X},\; q_\text {img} - q_\text {sub} + q_\text {add}) \quad (2) $$
where \(q_\text {img}\), \(q_\text {add}\), \(q_\text {sub}\) are vector representations of the queries in the multimodal vector space and \(S(\cdot ,\cdot )\) is cosine similarity. With similarity measure (2), we try to find the vector \(\varvec{X}\) that is closest to \(q_\text {img} - q_\text {sub} + q_\text {add}\) with respect to cosine similarity.
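Assuming the query embeddings are available as numpy vectors, measure (2) can be sketched as:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def baseline_similarity(X: np.ndarray, q_img: np.ndarray,
                        q_add: np.ndarray, q_sub: np.ndarray) -> float:
    """Measure (2): cosine between a candidate embedding X and the
    composed query q_img - q_sub + q_add."""
    return cosine(X, q_img - q_sub + q_add)
```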
A desirable target \(q_\text {img} - q_\text {sub} + q_\text {add}\) can be viewed as “\(q_\text {img} + \text {difference}\),” where the difference vector has both a direction and a magnitude. The similarity measure in (2) constrains both the direction and the magnitude of the difference from the base image. However, the arithmetic “\(\text {base} - \text {sub} + \text {add}\)” is only qualitative: it is usually hard to specify the magnitude of the difference vector merely by giving additive/subtractive words.
In our method, instead of (2), we use the following measure:

$$ S(\varvec{X} - q_\text {img},\; q_\text {add} - q_\text {sub}) \quad (3) $$
With this similarity measure, we try to find \(\varvec{X}\) such that the difference from the base image is similar to “\(\text {add} - \text {sub}\)” (see Fig. 2).
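A corresponding sketch of measure (3), again assuming numpy query vectors:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def difference_similarity(X: np.ndarray, q_img: np.ndarray,
                          q_add: np.ndarray, q_sub: np.ndarray) -> float:
    """Measure (3): cosine between the candidate's offset from the base
    image (X - q_img) and the word difference (q_add - q_sub). Only the
    direction of the change is constrained, not its magnitude."""
    return cosine(X - q_img, q_add - q_sub)
```

Note that scaling the candidate's offset from the base image leaves measure (3) unchanged, since cosine similarity ignores magnitude; this is exactly the relaxation motivated above.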
4 Experiments
We conduct experiments in the same manner as [1] to compare our results with the previous method (see Footnote 1). We used the Microsoft COCO dataset [7] to train the encoder. The dataset contains about 83,000 images, each accompanied by five descriptive sentences.
Figures 3 and 4 both show comparisons of (2) and (3) on multimodal analogical reasoning tasks. Figure 3 illustrates examples in which our measure outperforms the previous one in terms of ranking order. Note that our measure also performs well when the query words are not visually obvious, as shown in the first example in Fig. 3, whereas the previous approach struggles to find plausible images.
On the other hand, Fig. 4 shows cases where our measure performs poorly: the system returns irrelevant images. We speculate that this is most likely due to insufficient training data, which prevents the network from learning the relevant semantic relationships; however, the cause requires further investigation.
Taking all these results into consideration, we consider our measure to be more suitable for image search based on multimodal analogical reasoning.
5 Conclusion
We proposed an image search system based on multimodal analogy that allows us to perform analogical reasoning over images and text. Our difference-based similarity measure gives reasonably better results than the previous method on qualitative analogical reasoning tasks. The system provides a flexibility that would be useful both in traditional web image search and when searching for images through verbal interaction with voice assistants, which have been gaining more and more attention recently.
Notes
- 1.
We reproduced their results using code available on https://github.com/ryankiros/visual-semantic-embedding.
References
1. Kiros, R., Salakhutdinov, R., Zemel, R.S.: Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539 (2014)
2. Dong, J., Li, X., Snoek, C.G.M.: Word2VisualVec: cross-media retrieval by visual feature prediction. arXiv preprint arXiv:1604.06838 (2016)
3. Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M.A., Mikolov, T.: DeViSE: a deep visual-semantic embedding model. In: Advances in Neural Information Processing Systems (NIPS) (2013)
4. Mikolov, T., Yih, W.T., Zweig, G.: Linguistic regularities in continuous space word representations. In: Proceedings of Conference of the North American Chapter of the Association for Computational Linguistics (NAACL) (2013)
5. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: Proceedings of International Conference on Learning Representations (ICLR) (2015)
6. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. (1997)
7. Chen, X., Fang, H., Lin, T.-Y., Vedantam, R., Gupta, S., Dollar, P., Zitnick, C.L.: Microsoft COCO captions: data collection and evaluation server. arXiv preprint arXiv:1504.00325 (2015)
Acknowledgement
M. Maruyama was supported by JSPS Kakenhi Grant Number JP26330249.
Copyright information
© 2017 Springer International Publishing AG
Cite this paper
Ota, K., Shirai, K., Miyao, H., Maruyama, M. (2017). Interactive Image Search System Based on Multimodal Analogy. In: Stephanidis, C. (eds) HCI International 2017 – Posters' Extended Abstracts. HCI 2017. Communications in Computer and Information Science, vol 714. Springer, Cham. https://doi.org/10.1007/978-3-319-58753-0_83
Print ISBN: 978-3-319-58752-3
Online ISBN: 978-3-319-58753-0