1 Introduction

Matching visual data and natural language is an important challenge for multimedia, as it enables a large variety of applications ranging from retrieval and visual question answering to image and video captioning [1, 3, 7, 12, 13]. One of the core challenges in this scenario is enabling cross-modal retrieval, i.e. the retrieval of visual items given textual queries, and vice versa.

Current cross-modal retrieval methods often rely on the construction of a common multi-modal embedding space into which data from the two modalities (i.e. images and text) are projected [4, 9, 14]. Retrieval is then carried out by measuring distances in the joint space, which should be low for matching text-image pairs and higher for non-matching pairs. While this approach leads to very good results, it is not the only possible solution.

Here, we envision a different approach and address the problem of retrieving images and captions as a translation from the image domain to the textual domain, and vice versa. In the first direction, an image i (usually represented with a feature vector x) is converted into a textual representation \(\tilde{s}\) of its content; in the opposite direction, a sentence s is converted into an image feature \(\tilde{x}\) which reflects its meaning.

Fig. 1. Instead of relying on a joint embedding space, we address the problem of cross-modal retrieval as that of learning a translation between the textual and visual domains, with a reconstruction objective which keeps the overall process cycle-consistent

Figure 1 visually describes the idea: a learnable architecture translates textual data into a suitable representation in the visual domain, and visual features back into the textual domain. The overall architecture is trainable end-to-end: generated visual features are required to be realistic with respect to positive and negative image samples, and a cyclic constraint is imposed to guarantee that the forward and backward translations are feasible at the same time and consistent with each other.

2 Cycle-Consistent Retrieval

We introduce a cycle-consistent text and image retrieval network which performs a translation between the textual and the visual domains. Under the model, input captions can be translated into appropriate image features, and image vectors can be translated back into the textual domain. Exploiting this translation capability, a reconstruction constraint ensures that the reconstructed text is similar to the original one. The overall architecture is shown in Fig. 2.

From Text to Image (txt2img). The first part of the architecture consists of a visual-semantic model which transforms a sentence s into a meaningful vector in the image feature space, \(\tilde{x}\). Words are represented as one-hot vectors and embedded with a linear embedding, which can either be learned end-to-end together with the model or pre-trained using another word embedding model, such as Word2Vec [10], GloVe [11] or FastText [2]. The embedded words are then consumed by a GRU layer.
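As a concrete reference, the following is a minimal PyTorch-style sketch of such a text encoder, assuming a single GRU layer whose final hidden state is taken as the predicted image vector and an L2 normalization that is convenient for the cosine similarity used below; the class and parameter names are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Txt2Img(nn.Module):
    """Sketch of the txt2img encoder: embedded words are consumed by a GRU
    and the final hidden state is used as the predicted image vector."""
    def __init__(self, vocab_size, word_dim=300, img_dim=2048):
        super().__init__()
        # the embedding can be learned end-to-end or initialized from
        # Word2Vec / GloVe / FastText vectors
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.gru = nn.GRU(word_dim, img_dim, batch_first=True)

    def forward(self, captions):
        # captions: (batch, seq_len) tensor of word indices
        emb = self.embed(captions)        # (batch, seq_len, word_dim)
        _, hidden = self.gru(emb)         # hidden: (1, batch, img_dim)
        x_tilde = hidden.squeeze(0)       # predicted image feature vector
        # L2 normalization (an assumption, convenient for cosine similarity)
        return F.normalize(x_tilde, dim=-1)
```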

We train this model with a cost function which encourages the generated image vector to be close to that of an image described by the same caption. To this aim, we define a similarity function in the image feature space (e.g. the cosine similarity) and apply the hinge-based triplet ranking loss commonly used in image-text retrieval [4, 9].
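A sketch of this objective, in the spirit of the sum-over-negatives formulation used in [4, 9], is shown below; it assumes L2-normalized vectors so that dot products equal cosine similarities, and a margin value chosen for illustration only.

```python
import torch

def triplet_ranking_loss(x_tilde, x, margin=0.2):
    """Hinge-based triplet ranking loss over cosine similarities.
    x_tilde: image vectors generated from captions, shape (batch, dim)
    x:       CNN features of the matching images,   shape (batch, dim)
    Both are assumed L2-normalized; matching pairs share the same index."""
    scores = x_tilde @ x.t()                    # pairwise cosine similarities
    diag = scores.diag().view(-1, 1)            # similarities of matching pairs
    # caption i should be closer to image i than to any other image j
    cost_im = (margin + scores - diag).clamp(min=0)
    # image j should be closer to caption j than to any other caption i
    cost_s = (margin + scores - diag.t()).clamp(min=0)
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    return cost_im.masked_fill(mask, 0).sum() + cost_s.masked_fill(mask, 0).sum()
```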

Fig. 2. Architecture of our model

From Image to Text (img2txt). While sentences can be projected into an image feature space, the second component of the model translates image vectors x into the textual space by generating a textual description \(\tilde{s}\). This roughly corresponds to an image captioning model in which the image is treated as the first input of an LSTM-based recurrent model.

At each iteration, the hidden state is linearly projected to the dimensionality of the vocabulary, and a softmax activation is used to produce a probability distribution over the vocabulary. For each input image vector, the model generates the corresponding textual representation \(\tilde{s}\), composed of the words produced at each time-step of the LSTM.
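The sketch below illustrates one plausible instantiation of this decoder, assuming that the image vector is linearly mapped to the word-embedding size before being fed as the first LSTM input and that the hidden size is 512; these choices, like the class name, are assumptions for illustration rather than details taken from the paper.

```python
import torch
import torch.nn as nn

class Img2Txt(nn.Module):
    """Sketch of the img2txt decoder: an LSTM that receives the image vector
    as its first input and emits a distribution over the vocabulary per step."""
    def __init__(self, vocab_size, img_dim=2048, word_dim=300, hidden_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, word_dim)  # image vector -> LSTM input size (assumed)
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.lstm = nn.LSTM(word_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)  # linear projection to vocabulary logits

    def forward(self, x, captions):
        # x: (batch, img_dim) image vectors; captions: (batch, seq_len) words (teacher forcing)
        first = self.img_proj(x).unsqueeze(1)         # image as the first LSTM input
        inputs = torch.cat([first, self.embed(captions)], dim=1)
        hiddens, _ = self.lstm(inputs)
        return self.out(hiddens)                      # a softmax over these logits gives word probabilities
```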

Closing the Loop. The \(\texttt {txt2img}\) and \(\texttt {img2txt}\) models defined above realize the forward and backward translations between the image and the textual domains. Due to the diversity and high dimensionality of raw images, directly translating to and from the image domain would be intractable; therefore, both models operate in the space of image feature vectors extracted from a CNN.

The mapping between the two spaces is regularized with a cycle-consistency criterion, which requires the forward and backward translations to be feasible at the same time. In practice, we require that the projection of a generated image vector back into the textual space be similar to the text from which the vector originated, i.e.

$$\begin{aligned} \texttt {img2txt}(\texttt {txt2img}(s)) \approx s. \end{aligned}$$
(1)

The similarity constraint imposed by Eq. 1 could be realized by taking into account the semantics of both sentences, either by evaluating a machine translation metric or by defining a network in charge of learning the similarity between two sentences. To keep the model simple and concentrate on evaluating the regularization power of the proposal, we realize Eq. 1 by computing the negative log-likelihood of the generated words with respect to the words in s.
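Combining the two modules above, the reconstruction term of Eq. 1 can be sketched as a standard cross-entropy (i.e. negative log-likelihood) between the back-translated caption and the original one; this minimal version assumes teacher forcing, ignores padding, and relies on the illustrative Txt2Img and Img2Txt classes sketched earlier.

```python
import torch.nn.functional as F

def cycle_consistency_loss(txt2img, img2txt, captions):
    """Sketch of Eq. 1: translate a caption into the image feature space,
    translate it back, and score the generated words with the negative
    log-likelihood of the original words."""
    x_tilde = txt2img(captions)                  # text -> image feature space
    logits = img2txt(x_tilde, captions[:, :-1])  # back to text, teacher-forced
    # negative log-likelihood of the original caption under the reconstruction
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           captions.reshape(-1))
```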

Implementation Details. To encode input images, we extract feature vectors from the average pooling layer of a ResNet-152 [5], thus obtaining an image feature dimensionality of 2048. To encode image captions, since we do not project images and corresponding captions into a joint embedding space, we set the output size of the GRU to the same size as the image features (i.e. 2048). The dimensionality of word embeddings is set to 300. All experiments have been performed using the Adam optimizer [8] with an initial learning rate of \(2 \times 10^{-4}\).
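For reference, image features of this kind can be obtained roughly as follows with torchvision, keeping a ResNet-152 up to its average pooling layer; the exact preprocessing and weight-loading arguments depend on the library version and are not specified in the paper.

```python
import torch
import torchvision.models as models

# Keep ResNet-152 up to (and including) its average pooling layer -> 2048-d vectors.
# On older torchvision versions use `pretrained=True` instead of `weights=...`.
resnet = models.resnet152(weights="IMAGENET1K_V1")
feature_extractor = torch.nn.Sequential(*list(resnet.children())[:-1])  # drop the final fc layer
feature_extractor.eval()

with torch.no_grad():
    images = torch.randn(4, 3, 224, 224)       # placeholder batch of preprocessed images
    x = feature_extractor(images).flatten(1)   # (4, 2048) image feature vectors
```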

3 Experimental Results

We show preliminary evaluation results for the proposed approach, employing the rank-based performance metric R@K (\(K=1, 5, 10\)) for text and image retrieval. In particular, R@K computes the percentage of queries for which at least one correct result is found among the top-K retrieved sentences, in the case of text retrieval, or the top-K retrieved images, in the case of image retrieval.
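For clarity, the metric can be computed as in the following sketch, which assumes a square similarity matrix with the ground-truth match of each query on the diagonal (i.e. one correct item per query, a simplification with respect to datasets that provide five captions per image).

```python
import numpy as np

def recall_at_k(similarities, ks=(1, 5, 10)):
    """R@K: percentage of queries whose correct match (assumed to lie on the
    diagonal of the similarity matrix) appears among the top-K retrieved items."""
    n = similarities.shape[0]
    ranks = np.empty(n)
    for i in range(n):
        order = np.argsort(similarities[i])[::-1]  # items sorted by decreasing score
        ranks[i] = np.where(order == i)[0][0]      # rank of the correct item
    return {k: 100.0 * np.mean(ranks < k) for k in ks}
```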

As a baseline, we consider the \(\texttt {txt2img}\) model alone, which removes the cycle-consistency regularizer and is therefore well suited to evaluate the claims of the proposal regarding the role of the cycle-consistent constraint. This baseline is also practically equivalent to a visual-semantic embedding model in which the visual projector is the identity function.

Table 1. Experimental results of our model on the Flickr8K and Flickr30K datasets using different word embeddings

Table 1 reports the results of our model on the Flickr8K [6] and Flickr30K [15] datasets using different word embedding strategies, together with those of the \(\texttt {txt2img}\) model alone. It can be observed that the performance of the complete model is always superior to that of the baseline, thus confirming the importance of translating back to the textual space and demonstrating the promise of the proposed solution.