1 Introduction

Matching visual data and natural language is an important challenge for multimedia, as it enables a large variety of applications ranging from retrieval and visual question answering to image and video captioning [1, 3, 7, 12, 13]. One of the core challenges in this scenario is enabling cross-modal retrieval, i.e. the retrieval of visual items given textual queries, and vice versa.

Current cross-modal retrieval methods often rely on the construction of a common multi-modal embedding space into which data from the two modalities (i.e. images and text) are projected [4, 9, 14]. Retrieval is then carried out by measuring distances in the joint space, which should be low for matching text-image pairs and higher for non-matching pairs. While this approach leads to very good results, it is not the only possible solution.

Here, we envision a different approach and address the problem of retrieving images and captions as a translation from the image domain to the textual domain, and vice versa. In the first direction, an image i (usually represented with a feature vector x) is converted into a textual representation \(\tilde{s}\) of its content; in the opposite direction, a sentence s is converted into an image feature \(\tilde{x}\) which reflects its meaning.

Fig. 1. Instead of relying on a joint embedding space, we address the problem of cross-modal retrieval as that of learning a translation between the textual and visual domains, with a reconstruction objective which keeps the overall process cycle-consistent

Figure 1 visually describes the idea: a learnable architecture translates textual data into a suitable representation in the visual domain, and visual features back into the textual domain. The overall architecture is trainable end-to-end: generated visual features are required to be realistic with respect to positive and negative image samples, and a cyclic constraint is imposed to guarantee that the forward and backward translations are feasible at the same time and consistent with each other.

2 Cycle-Consistent Retrieval

We introduce a cycle-consistent text and image retrieval network which performs a translation between the textual and the visual domains. Under the model, input captions can be translated into appropriate image features, and image vectors can be translated back into the textual domain. Exploiting this translation capability, a reconstruction constraint ensures that the reconstructed text is similar to the original one. The overall architecture is shown in Fig. 2.

From Text to Image (txt2img). The first part of the architecture consists of a visual-semantic model which transforms a sentence s into a meaningful vector in the image feature space, \(\tilde{x}\). Words are represented as one-hot vectors and embedded with a linear embedding, which can either be learned end-to-end together with the model or pre-trained using another word embedding model, such as Word2Vec [10], GloVe [11] or FastText [2]. The embedded words are then consumed by a GRU layer.
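As a concrete reference, the following is a minimal PyTorch-style sketch of such a text encoder, assuming a single GRU layer whose final hidden state is taken as the predicted image vector and an L2 normalization that is convenient for the cosine similarity used below; the class and parameter names are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Txt2Img(nn.Module):
    """Sketch of the txt2img encoder: embedded words are consumed by a GRU
    and the final hidden state is used as the predicted image vector."""
    def __init__(self, vocab_size, word_dim=300, img_dim=2048):
        super().__init__()
        # the embedding can be learned end-to-end or initialized from
        # Word2Vec / GloVe / FastText vectors
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.gru = nn.GRU(word_dim, img_dim, batch_first=True)

    def forward(self, captions):
        # captions: (batch, seq_len) tensor of word indices
        emb = self.embed(captions)        # (batch, seq_len, word_dim)
        _, hidden = self.gru(emb)         # hidden: (1, batch, img_dim)
        x_tilde = hidden.squeeze(0)       # predicted image feature vector
        # L2 normalization (an assumption, convenient for cosine similarity)
        return F.normalize(x_tilde, dim=-1)
```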

We train this model with a cost function which encourages the generated image vector to be close to that of an image described by the same caption. To this aim, we define a similarity function in the image feature space (e.g. the cosine similarity) and apply the hinge-based triplet ranking loss commonly used in image-text retrieval [4, 9].
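A sketch of this objective, in the spirit of the sum-over-negatives formulation used in [4, 9], is shown below; it assumes L2-normalized vectors so that dot products equal cosine similarities, and a margin value chosen for illustration only.

```python
import torch

def triplet_ranking_loss(x_tilde, x, margin=0.2):
    """Hinge-based triplet ranking loss over cosine similarities.
    x_tilde: image vectors generated from captions, shape (batch, dim)
    x:       CNN features of the matching images,   shape (batch, dim)
    Both are assumed L2-normalized; matching pairs share the same index."""
    scores = x_tilde @ x.t()                    # pairwise cosine similarities
    diag = scores.diag().view(-1, 1)            # similarities of matching pairs
    # caption i should be closer to image i than to any other image j
    cost_im = (margin + scores - diag).clamp(min=0)
    # image j should be closer to caption j than to any other caption i
    cost_s = (margin + scores - diag.t()).clamp(min=0)
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    return cost_im.masked_fill(mask, 0).sum() + cost_s.masked_fill(mask, 0).sum()
```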

Fig. 2. Architecture of our model

From Image to Text (img2txt). While sentences can be projected into an image feature space, the second component of the model translates image vectors x into the textual space by generating a textual description \(\tilde{s}\). This roughly corresponds to an image captioning model in which the image is treated as the first input of an LSTM-based recurrent model.

At each iteration, the hidden state is linearly projected to the dimensionality of the vocabulary, and a softmax activation is used to produce a probability distribution over the vocabulary. For each input image vector, the model generates the corresponding textual representation \(\tilde{s}\), composed of the words produced at each time-step of the LSTM.
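The sketch below illustrates one plausible instantiation of this decoder, assuming that the image vector is linearly mapped to the word-embedding size before being fed as the first LSTM input and that the hidden size is 512; these choices, like the class name, are assumptions for illustration rather than details taken from the paper.

```python
import torch
import torch.nn as nn

class Img2Txt(nn.Module):
    """Sketch of the img2txt decoder: an LSTM that receives the image vector
    as its first input and emits a distribution over the vocabulary per step."""
    def __init__(self, vocab_size, img_dim=2048, word_dim=300, hidden_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, word_dim)  # image vector -> LSTM input size (assumed)
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.lstm = nn.LSTM(word_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)  # linear projection to vocabulary logits

    def forward(self, x, captions):
        # x: (batch, img_dim) image vectors; captions: (batch, seq_len) words (teacher forcing)
        first = self.img_proj(x).unsqueeze(1)         # image as the first LSTM input
        inputs = torch.cat([first, self.embed(captions)], dim=1)
        hiddens, _ = self.lstm(inputs)
        return self.out(hiddens)                      # a softmax over these logits gives word probabilities
```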

Closing the Loop. The \(\texttt {txt2img}\) and \(\texttt {img2txt}\) models defined above realize the forward and backward translations between the image and the textual domains. Due to the diversity and high dimensionality of raw images, directly translating to and from the image domain would be intractable; therefore, both models operate in the space of image feature vectors extracted from a CNN.

The mapping between the two spaces is regularized with a cycle-consistency criterion, which requires the forward and backward translations to be feasible at the same time. In practice, we require that the projection of a generated image vector back into the textual space be similar to the text from which the vector originated, i.e.

$$\begin{aligned} \texttt {img2txt}(\texttt {txt2img}(s)) \approx s. \end{aligned}$$
(1)

The similarity constraint imposed by Eq. 1 could be realized by taking into account the semantics of both sentences, either by evaluating a machine translation metric or by defining a network in charge of learning the similarity between two sentences. To keep the model simple and concentrate on evaluating the regularization power of the proposal, we realize Eq. 1 by computing the negative log-likelihood of the generated words with respect to the words in s.
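Combining the two modules above, the reconstruction term of Eq. 1 can be sketched as a standard cross-entropy (i.e. negative log-likelihood) between the back-translated caption and the original one; this minimal version assumes teacher forcing, ignores padding, and relies on the illustrative Txt2Img and Img2Txt classes sketched earlier.

```python
import torch.nn.functional as F

def cycle_consistency_loss(txt2img, img2txt, captions):
    """Sketch of Eq. 1: translate a caption into the image feature space,
    translate it back, and score the generated words with the negative
    log-likelihood of the original words."""
    x_tilde = txt2img(captions)                  # text -> image feature space
    logits = img2txt(x_tilde, captions[:, :-1])  # back to text, teacher-forced
    # negative log-likelihood of the original caption under the reconstruction
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           captions.reshape(-1))
```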

Implementation Details. To encode input images, we extract feature vectors from the average pooling layer of a ResNet-152 [5], thus obtaining an image feature dimensionality of 2048. To encode image captions, since we do not project images and corresponding captions into a joint embedding space, we set the output size of the GRU to the same size as the image features (i.e. 2048). The dimensionality of word embeddings is set to 300. All experiments have been performed using the Adam optimizer [8] with an initial learning rate of \(2 \times 10^{-4}\).
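For reference, image features of this kind can be obtained roughly as follows with torchvision, keeping a ResNet-152 up to its average pooling layer; the exact preprocessing and weight-loading arguments depend on the library version and are not specified in the paper.

```python
import torch
import torchvision.models as models

# Keep ResNet-152 up to (and including) its average pooling layer -> 2048-d vectors.
# On older torchvision versions use `pretrained=True` instead of `weights=...`.
resnet = models.resnet152(weights="IMAGENET1K_V1")
feature_extractor = torch.nn.Sequential(*list(resnet.children())[:-1])  # drop the final fc layer
feature_extractor.eval()

with torch.no_grad():
    images = torch.randn(4, 3, 224, 224)       # placeholder batch of preprocessed images
    x = feature_extractor(images).flatten(1)   # (4, 2048) image feature vectors
```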

3 Experimental Results

We show preliminary evaluation results for the proposed approach, employing the rank-based performance metric R@K (\(K=1, 5, 10\)) for text and image retrieval. In particular, R@K computes the percentage of queries for which at least one correct result is found among the top-K retrieved sentences, in the case of text retrieval, or the top-K retrieved images, in the case of image retrieval.
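For clarity, the metric can be computed as in the following sketch, which assumes a square similarity matrix with the ground-truth match of each query on the diagonal (i.e. one correct item per query, a simplification with respect to datasets that provide five captions per image).

```python
import numpy as np

def recall_at_k(similarities, ks=(1, 5, 10)):
    """R@K: percentage of queries whose correct match (assumed to lie on the
    diagonal of the similarity matrix) appears among the top-K retrieved items."""
    n = similarities.shape[0]
    ranks = np.empty(n)
    for i in range(n):
        order = np.argsort(similarities[i])[::-1]  # items sorted by decreasing score
        ranks[i] = np.where(order == i)[0][0]      # rank of the correct item
    return {k: 100.0 * np.mean(ranks < k) for k in ks}
```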

As a baseline, we consider the \(\texttt {txt2img}\) model alone, which removes the cycle-consistency regularizer and is therefore well suited to evaluate the claims of the proposal regarding the role of the cycle-consistent constraint. This baseline is also practically equivalent to a visual-semantic embedding model in which the visual projector is the identity function.

Table 1. Experimental results of our model on the Flickr8K and Flickr30K datasets using different word embeddings

Table 1 reports the results of our model on the Flickr8K [6] and Flickr30K [15] datasets using different word embedding strategies, together with those of the \(\texttt {txt2img}\) model alone. It can be observed that the performance of the complete model is always superior to that of the baseline, thus confirming the importance of translating back to the textual space and demonstrating the promise of the proposed solution.