Vision-language integration using constrained local semantic features

https://doi.org/10.1016/j.cviu.2017.05.017

Highlights

  • Vision and language integration at two levels, including the semantic level.

  • A semantic signature that adapts its sparsity to the actual visual content of images.

  • CNN-based mid-level features boosting semantic signatures.

  • Top performances on publicly available benchmarks for several tasks.

Abstract

This paper tackles two recent promising issues in the field of computer vision, namely “the integration of linguistic and visual information” and “the use of semantic features to represent the image content”. Semantic features represent images according to visual concepts that are detected in the image by a set of base classifiers. Recent works exhibit competitive performances in image classification and retrieval using such features. We propose to rely on this type of image description to facilitate its integration with linguistic data. More precisely, the contribution of this paper is threefold. First, we propose to automatically determine the most useful dimensions of a semantic representation according to the actual image content. This results in a level of sparsity for the semantic features that is adapted to each image independently. Our model takes into account both the confidence in each base classifier and the global amount of information of the semantic signature, defined in the Shannon sense. This contribution is further extended to better reflect the detection of visual concepts at a local scale. Second, we introduce a new strategy to learn an efficient mid-level representation by CNNs that boosts the performance of semantic signatures. Last, we propose several schemes to integrate a visual representation based on semantic features with linguistic information, leading to the nesting of linguistic information at two levels of the visual features. Experimental validation is conducted on four benchmarks (VOC 2007, VOC 2012, Nus-Wide and MIT Indoor) for classification, three of them for retrieval and two of them for bi-modal classification. The proposed semantic feature achieves state-of-the-art performances on three classification benchmarks and all retrieval ones. Regarding our vision-language integration method, it achieves state-of-the-art performances in bi-modal classification.

Introduction

Real-world applications in computer vision, such as image classification and image retrieval, can greatly benefit from a finer integration of the visual and language cues available in data. Indeed, at the core of these problems, one aims at solving an ill-posed inverse problem that consists in identifying some “causes” of the visual observations (pixels) on the one hand, and matching these causes to a linguistic description (tags, captions, class-labels, textual description, etc.) on the other. Moreover, in a multimedia context, one modality can greatly assist the processing of the other by providing complementary information that helps disambiguate the perceptual context. On the visual side, most recent successful works rely on convolutional neural networks (CNNs) (Chatfield, Simonyan, Vedaldi, Zisserman, 2014, He, Zhang, Ren, Sun, 2014, Oquab, Bottou, Laptev, Sivic, 2014), following a “bottom-up” approach that directly uses the data to learn the best possible representation for a given task. However, in parallel to this tendency, several works proposed an alternative image representation that relies not only on data but also on ad-hoc semantic information that reflects human knowledge (Li, Su, Xing, Fei-fei, 2010, Torresani, Szummer, Fitzgibbon, 2010). These works thus adopt a partially “top-down” scheme that includes language-level information to design semantically grounded image features. More precisely, these so-called semantic features are built from the outputs of a bank of object detectors, also named base classifiers. Such a representation offers a high-level description of images that directly matches language and human understanding. Hence, we argue in this article that it reflects a particular view of the fundamental “causes” of the visual observations, by intimately integrating the visual information of a particular image with that of a large ad-hoc semantic resource. Beyond this first level of integration of visual and linguistic information, we consider other linguistic information linked to an image and propose a method to integrate it with the semantic signature. This second level of integration allows the fusion of a purely linguistic view of the “causes” with the first description.
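As a rough illustration of this construction (a sketch, not the authors' code), the snippet below builds a dense semantic signature by applying a set of linear base classifiers to a mid-level image feature; the sizes, the sigmoid squashing and the random weights are illustrative assumptions only.

```python
# Minimal sketch: a "semantic feature" is the vector of base-classifier
# outputs computed on a mid-level representation of the image.
import numpy as np

def semantic_signature(x, W, b):
    """x: (d,) mid-level feature (e.g. a CNN fc-layer output);
    W: (C, d) one row of weights per concept detector; b: (C,) biases.
    Returns a (C,) vector whose i-th entry reflects the presence of concept i."""
    scores = W @ x + b
    return 1.0 / (1.0 + np.exp(-scores))   # squash to [0, 1] for comparability

# Toy usage with random numbers standing in for trained classifiers.
rng = np.random.default_rng(0)
d, C = 4096, 1000                          # feature size, number of concepts
x = rng.standard_normal(d)
W, b = rng.standard_normal((C, d)), rng.standard_normal(C)
s = semantic_signature(x, W, b)            # dense signature, one value per concept
```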

Another advantage of semantic signatures is that they can easily benefit from improvements of the “bottom-up” works, since the latter can be used as mid-level features to feed the base classifiers. Moreover, semantic features can cope with a wide variety of content, since the set of base classifiers can be freely extended or specialized to a restricted use case. The computational complexity of the semantic features increases at most linearly with the number of concepts considered, so they can be considered scalable in terms of number of recognized classes. Since images are represented in terms of a discrete number of concepts, one can index them using an inverted index and benefit from its efficiency to handle large-scale databases. Semantic features usually exploit all classifier outputs (Torresani et al., 2010), leading to a dense representation of images. However, Ginsca et al. (2015) recently showed that forcing the lowest values to zero improves the performance of these representations in image retrieval. Such a sparsification process is also interesting for image classification, since the compactness resulting from a low number of non-zero values makes the features more computationally efficient at test time. However, setting the exact amount of sparsity remains an open problem.
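For concreteness, here is a minimal sketch of such a sparsification step, under the assumption of a simple top-k rule that zeroes the lowest values; choosing k globally is precisely the open problem addressed next.

```python
# Minimal sketch of fixed sparsification: keep only the k strongest
# classifier outputs and zero the rest. k is a global free parameter here.
import numpy as np

def sparsify_topk(s, k):
    out = np.zeros_like(s)
    idx = np.argsort(s)[-k:]        # indices of the k highest concept scores
    out[idx] = s[idx]
    return out

s = np.array([0.9, 0.1, 0.6, 0.05, 0.4])
sparsify_topk(s, k=2)               # keeps only 0.9 and 0.6, zeroes the others
```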

The contribution of this paper is threefold. First, we introduce a new strategy to learn an efficient mid-level representation by CNNs that boosts the performance of semantic signatures. We then propose a method to adapt the level of sparsity of these semantic features to the actual content of each image. We consider that a given concept should be retained only if we are confident enough in its detection and if it contributes significantly to the amount of information carried by the semantic feature. From these hypotheses, we derive a method named “Content-Based Sparsity” (CBS) that automatically determines an appropriate level of sparsity for each image independently, taking into account its actual visual content.
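The exact CBS criterion is defined later in the paper; as an intuition only, the hedged sketch below keeps a concept when its detector is confident enough and when it carries a non-negligible share of the signature's Shannon information. The thresholds and the per-dimension information measure are illustrative assumptions, not the paper's definition.

```python
# Hedged sketch of a content-based sparsity rule in the spirit of CBS:
# the number of retained dimensions depends on the image, not on a global k.
import numpy as np

def content_based_sparsity(s, conf_thresh=0.5):
    p = s / s.sum()                        # treat the (positive) scores as a distribution
    info = -p * np.log2(p + 1e-12)         # per-dimension Shannon information
    keep = (s >= conf_thresh) & (info >= info.mean())
    return np.where(keep, s, 0.0)          # cluttered images keep more concepts than simple ones
```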

Second, we also address the problem of analyzing the content of images at a local scale with semantic features. Former works such as Object Bank (Li et al., 2010) encoded the spatial location of objects using a pyramid representation; more precisely, they concatenated the base classifier outputs computed at three different scales. Such an approach is feasible when one uses a small number of detectors (200 in Li et al., 2010) but becomes intractable when tens of thousands of detectors are used (Bergamo, Torresani, 2012, Ginsca, Popescu, Le Borgne, Ballas, Vo, Kanellos, 2015). We propose an alternative solution, named “Constrained Local Semantic Features” (CLSF), that is inspired by a popular scheme in the domain of bag-of-features (Grauman, Darrell, 2005, Lazebnik, Schmid, Ponce, 2006) and CNNs (Chatfield, Simonyan, Vedaldi, Zisserman, 2014, He, Zhang, Ren, Sun, 2014), namely the pooling of local regions. While not new in itself, this is, to the best of our knowledge, the first attempt to use such pooling in the context of semantic features. An advantage over concatenation-based approaches is that pooling does not change the size of the feature. Moreover, applying pooling after our CBS scheme is a convenient way to focus only on important information. Indeed, while CBS sparsifies the semantic feature according to the content of each image in the global scheme, in the local scheme it can neglect non-informative regions and retain highly informative ones, which is a desirable property since it avoids introducing noisy information in the final feature.
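To make the difference with concatenation explicit, the sketch below max-pools per-region signatures (here assumed to be already sparsified by CBS) into a single descriptor of unchanged size C; it is a simplified stand-in for CLSF, not its exact definition.

```python
# Minimal sketch: pooling local semantic signatures keeps the feature size C
# constant, whatever the number of regions, unlike concatenation across scales.
import numpy as np

def pool_local_signatures(region_signatures):
    """region_signatures: list of (C,) signatures computed on local regions."""
    S = np.stack(region_signatures)   # (n_regions, C)
    return S.max(axis=0)              # (C,) descriptor, same size as a global signature
```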

Third, we propose a new method to integrate visual and linguistic cues that directly takes advantage of the semantic level of the proposed image signature. We are inspired by Plato’s Theory of Forms, which assumes the existence of inaccessible Forms that constitute the most accurate possible reality. These Forms are the essences or fundamental properties of the things we want to describe, the “causes” mentioned previously. At the perceptual level, one can only access particular patterns of the ideal Form, in the form of a visual or linguistic description. Considering these two views, our approach identifies, in each of them, dimensions that are considered as particular projections of the Form. We then propose several methods to find correspondences between the dimensions of both views. For the dimensions that match, we aggregate the information of each view to obtain a unified description.
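As a hedged illustration of the dimension-matching idea (the matching rule and the aggregation below are simple placeholder choices, not necessarily the scheme proposed in Section 4.4), both views are expressed over the same concept vocabulary and only the dimensions supported by both are fused:

```python
# Hedged sketch: fuse the visual and linguistic views of an image on the
# concept dimensions where both "project" some evidence.
import numpy as np

def integrate_views(visual_sig, text_sig):
    """visual_sig, text_sig: (C,) scores over the same C concepts."""
    matched = (visual_sig > 0) & (text_sig > 0)          # dimensions seen by both views
    return np.where(matched, 0.5 * (visual_sig + text_sig), 0.0)
```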

We validate our work in the contexts of mono-modal and multi-modal documents. First, we focus on images only and conduct experiments on four publicly available benchmarks, addressing a scene classification task with the MIT Indoor benchmark and multi-class object classification with the Pascal VOC 2007, Pascal VOC 2012 and Nus-Wide Object datasets. Results show that the proposed CLSF approach achieves state-of-the-art performances on three benchmarks (VOC 2007, Nus-Wide Object and MIT Indoor) and obtains competitive results on Pascal VOC 2012 compared to the best approaches of the literature. We also validate it in an image retrieval context over three datasets, Pascal VOC 2007, Nus-Wide Object and MIT Indoor, where CLSF outperforms the best methods of the literature. Second, we evaluate the proposed vision-language integration approach on multi-modal documents, through bi-modal classification over two benchmarks (Pascal VOC 2007 and Nus-Wide-10k). Results show that our method achieves state-of-the-art performances on both datasets.

A part of the work presented in this paper is built upon a recently published conference paper (Tamaazousti et al., 2016b). However, the present manuscript differs in many ways from this work; in particular, it includes the following new items:

  • a new strategy for learning a mid-level representation by CNNs that significantly boosts the performance of the semantic representation (Section 4.3);

  • a new method for vision-language integration that directly takes advantage of the semantic level of our image signature (Section 4.4);

  • more detailed experimental results, especially an evaluation on a large-scale and competitive database (Nus-Wide, Chua et al. (2009), in Section 5) as well as in the context of bi-modal classification (Section 6);

  • an in-depth analysis of the various components of our system (Section 7).

Section snippets

Image representations

For the last couple of years, image representations based on fully-connected layers extracted from pre-trained CNNs (Jia, Shelhamer, Donahue, Karayev, Long, Girshick, Guadarrama, Darrell, 2014, Sermanet, Eigen, Zhang, Mathieu, Fergus, LeCun, 2014, Simonyan, Zisserman, 2015) have achieved a breakthrough on many benchmarks. Building on such mid-level representations, we focus on “semantic features” (Li, Su, Xing, Fei-fei, 2010, Torresani, Szummer, Fitzgibbon, 2010) that have the following three desirable

Background theory

In this section, we present in detail the formalism of existing semantic features (Bergamo, Torresani, 2012, Ginsca, Popescu, Le Borgne, Ballas, Vo, Kanellos, 2015, Torresani, Szummer, Fitzgibbon, 2010) as well as the sparsification process (Ginsca, Popescu, Le Borgne, Ballas, Vo, Kanellos, 2015, Tamaazousti, Le Borgne, Hudelot, 2016, Wang, Yang, Yu, Lv, Huang, Gong, 2010). Let us consider an image I and a global feature xI extracted from I. Given a set C of C high-level concepts, C = {c1, …, cC}, a
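The snippet is cut short by the publisher; as a hedged reminder of the standard formalism used in the cited works (the notation of the full paper may differ), the semantic feature of image I stacks the outputs of the C base classifiers applied to xI:

```latex
% Hedged reconstruction of the usual semantic-feature definition.
\[
  \mathcal{C} = \{c_1, \ldots, c_C\}, \qquad
  s_I = \big(f_{c_1}(x_I), \ldots, f_{c_C}(x_I)\big) \in \mathbb{R}^{C},
\]
% where f_{c_i} denotes the base classifier associated with concept c_i.
```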

Proposed approach

We first present (in Section 4.1) the “Content-Based Sparsity” (CBS) scheme that adapts the level of sparsity of the semantic feature to the content of each image. We then present (in Section 4.2) the “Constrained Local Semantic Features” (CLSF), which benefit from CBS applied to each image region to better reflect the visual content at a local scale. Third, we provide (in Section 4.3) an improved version of semantic features that is based on a new diversified mid-level representation.

Experiments: image classification and retrieval

In this section, we evaluate the proposed image representation on image classification (Section 5.2) and retrieval (Section 5.3) tasks, considering the image modality only.

Experiments: bi-modal classification

In this section, we evaluate the proposed “Pure Concept Space” (PCS) approach presented in Section 4.4 on a task of multi-modal (text-image pairs) alignment. More specifically, this evaluation is conducted on “bi-modal classification”, which consists in classifying bi-modal documents (containing pairs of text and image) into different categories. We first describe, in Section 6.1, the comparison methods and the datasets used to evaluate them, then analyze, in Section 6.2, the obtained results.

Analysis

In this section, we conduct an in-depth analysis of the proposed approach. First (in Section 7.1), we evaluate the impact of each proposed contribution on our whole approach. Second (in Section 7.2), we focus on the proposed diversification strategy by analyzing it in depth and comparing it to other strategies of the literature. Then (in Section 7.3), we systematically evaluate the correlation between the semantic features and the mid-level features used to build them. Finally (Section 7.4), we

Conclusions

We introduced a novel method to design semantic features for visual representation that integrates a diversified CNN feature, an adaptive sparsification and a constrained local scheme. In contrast to existing works, which build semantic features on existing mid-level features, we proposed to build ours upon a diversified one obtained through a new CNN learning strategy, which consists in selecting a diversified set of categories and using fine-tuning to model genericness. Moreover, while

Acknowledgments

This work is partially supported by the USEMP FP7 project, funded by the European Commission under contract number 611596.

References (48)

  • A. Bergamo et al.

    Meta-class features for large-scale object categorization on a budget

    Computer Vision and Pattern Recognition, CVPR

    (2012)
  • K. Chatfield et al.

    Return of the devil in the details: delving deep into convolutional nets

    British Machine Vision Conference, BMVC

    (2014)
  • T.-S. Chua et al.

    Nus-wide: a real-world web image database from National University of Singapore

    ACM Conference on Image and Video Retrieval, CIVR

    (2009)
  • J. Costa Pereira et al.

    On the role of correlation and abstraction in cross-modal multimedia retrieval

    Pattern Analysis and Machine Intelligence, PAMI

    (2014)
  • J. Deng et al.

    ImageNet: a large-scale hierarchical image database

    Computer Vision and Pattern Recognition, CVPR

    (2009)
  • C. Doersch et al.

    Mid-level visual element discovery as discriminative mode seeking

    Advances in Neural Information Processing Systems, NIPS

    (2013)
  • P. Dollar et al.

    Fast edge detection using structured forests

    Pattern Analysis and Machine Intelligence, PAMI

    (2015)
  • T. Durand et al.

    Weldon: weakly supervised learning of deep convolutional neural networks

    Computer Vision and Pattern Recognition, CVPR

    (2016)
  • Everingham, M., Van Gool, L., Williams, C. K., Winn, J., Zisserman, A. The pascal visual object classes...
  • M. Everingham et al.

    The pascal visual object classes challenge

    Int. J. Comput. Vision IJCV

    (2010)
  • F. Feng et al.

    Cross-modal retrieval with correspondence autoencoder

    International Conference on Multimedia, ACM MM

    (2014)
  • A. Frome et al.

    Devise: a deep visual-semantic embedding model

    Advances in Neural Information Processing Systems, NIPS

    (2013)
  • P. Gehler et al.

    On feature combination for multiclass object classification

    Computer Vision and Pattern Recognition, CVPR

    (2009)
  • A.L. Ginsca et al.

    Large-scale image mining with FlickR groups

    International Conference on Multimedia Modelling, MM

    (2015)
  • Y. Gong et al.

    A multi-view embedding space for modeling internet images, tags, and their semantics

    Int. J. Comput. Vision IJCV

    (2014)
  • K. Grauman et al.

    The pyramid match kernel: discriminative classification with sets of image features

    International Conference on Computer Vision, ICCV

    (2005)
  • D.R. Hardoon et al.

    Canonical correlation analysis: an overview with application to learning methods

    Neural Comput.

    (2004)
  • K. He et al.

    Spatial pyramid pooling in deep convolutional networks for visual recognition

    European Conference on Computer Vision, ECCV

    (2014)
  • S.J. Hwang et al.

    Learning the relative importance of objects from tagged images for retrieval and cross-modal search

    Int. J. Comput. Vision IJCV

    (2012)
  • Y. Jia et al.

    Caffe: convolutional architecture for fast feature embedding

    ACM International Conference on Multimedia, ACM

    (2014)
  • A. Karpathy et al.

    Deep fragment embeddings for bidirectional image sentence mapping

    Advances in Neural Information Processing Systems, NIPS

    (2014)
  • A. Krizhevsky et al.

    Imagenet classification with deep convolutional neural networks

    Advances in Neural Information Processing Systems, NIPS

    (2012)
  • S. Lazebnik et al.

    Beyond bags of features: spatial pyramid matching for recognizing natural scene categories

    Computer Vision and Pattern Recognition, CVPR

    (2006)
  • L.J. Li et al.

    Object bank: a high-level image representation for scene classification & semantic feature sparsification

    Advances in Neural Information Processing Systems, NIPS

    (2010)