Vision-language integration using constrained local semantic features
Graphical abstract
Introduction
Real-world applications in computer vision, such as image classification and image retrieval, can greatly benefit from a finer integration of the visual and language cues available in data. Indeed, at the core of these problems, one aims to solve an ill-posed inverse problem: identifying the “causes” of the visual observations (pixels) on the one hand, and matching these causes to a linguistic description (tags, captions, class labels, textual descriptions, etc.) on the other. Moreover, in a multimedia context, one modality can greatly assist the processing of the other by providing complementary information that helps disambiguate the perceptual context. On the visual side, most recent successful works rely on convolutional neural networks (CNNs) (Chatfield et al., 2014; He et al., 2014; Oquab et al., 2014), following a “bottom-up” approach that directly uses the data to learn the best possible representation for a given task. In parallel to this trend, however, several works proposed an alternative image representation that relies not only on data but also on ad-hoc semantic information reflecting human knowledge (Li et al., 2010; Torresani et al., 2010). These works thus adopt a partially “top-down” scheme that includes language-level information to design semantically grounded image features. More precisely, these so-called semantic features are built from the outputs of a bank of object detectors, also called base classifiers. Such a representation offers a high-level description of images that directly matches language and human understanding. Hence, we argue in this article that this description reflects a particular view of the fundamental “causes” of the visual observations, by intimately integrating the visual information of a particular image with that of a large ad-hoc semantic resource.
Beyond this first level of integration of visual and linguistic information, we consider other linguistic information linked to an image and propose a method to integrate it into the semantic signature. This second level of integration allows a purely linguistic view of the “causes” to be fused with the first description.
Another advantage of semantic signatures is that they can easily benefit from the improvements of the “bottom-up” works, since the latter can be used as mid-level features to feed the base classifiers. Moreover, semantic features can cope with a wide variety of content, since the number of base classifiers can be freely extended or specialized to a restricted use case. The computational complexity of the semantic features increases at most linearly with the number of concepts considered, so the representation can be considered scalable in the number of recognized classes. Since images are represented in terms of a discrete number of concepts, one can index them with an inverted index and benefit from its efficiency to handle large-scale databases. Semantic features usually exploit all classifier outputs (Torresani et al., 2010), leading to a dense representation of images. However, Ginsca et al. (2015) recently showed that forcing the lowest values to zero improves the performance of these representations in image retrieval. Such a sparsification process is also interesting for image classification, since the compactness resulting from a small number of non-zero values makes the features more computationally efficient at test time. However, setting the exact amount of sparsity remains an open problem.
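As an illustration, the fixed top-k sparsification described above can be sketched as follows (a minimal sketch; the function name, the toy scores and the choice of k are ours, not the authors' code):

```python
import numpy as np

def semantic_feature(classifier_scores, k):
    """Sparsify a semantic signature by keeping only the k highest
    base-classifier outputs and zeroing all the others."""
    scores = np.asarray(classifier_scores, dtype=float)
    sparse = np.zeros_like(scores)
    if k > 0:
        top = np.argsort(scores)[-k:]  # indices of the k largest outputs
        sparse[top] = scores[top]
    return sparse

# Toy signature over 6 concepts; keep the 2 strongest detections.
sig = semantic_feature([0.1, 0.7, 0.05, 0.3, 0.02, 0.6], k=2)
```

The resulting vector has only two non-zero entries, which is what makes inverted-index storage efficient; the open problem mentioned above is precisely that k is fixed here regardless of image content.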
The contribution of this paper is threefold. First, we introduce a new strategy to learn an efficient mid-level representation by CNNs that boosts the performance of semantic signatures. We then propose a method to adapt the level of sparsity of these semantic features to the actual content of each image. We consider that a given concept should be retained only if we are confident enough in its detection and if it weighs heavily in the amount of information carried by the semantic feature. From these hypotheses, we derive a method named “Content-Based Sparsity” (CBS) that automatically determines an appropriate level of sparsity for each image independently, taking its actual visual content into account.
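A minimal sketch of a CBS-style rule follows; the confidence threshold and cumulative-mass criterion used here are illustrative stand-ins for the exact conditions derived in Section 4.1, and both parameter names are our own:

```python
import numpy as np

def content_based_sparsity(scores, conf_thresh=0.5, mass=0.9):
    """Illustrative CBS-style rule: keep a concept only if (i) its
    detection score is confident enough and (ii) it belongs to the
    smallest set of concepts carrying a given fraction of the
    signature's total mass. Assumes non-negative scores with a
    non-zero sum."""
    s = np.asarray(scores, dtype=float)
    order = np.argsort(s)[::-1]                   # concepts, strongest first
    cum = np.cumsum(s[order]) / s.sum()           # cumulative share of mass
    keep_n = int(np.searchsorted(cum, mass)) + 1  # smallest set covering `mass`
    out = np.zeros_like(s)
    for i in order[:keep_n]:
        if s[i] >= conf_thresh:                   # confidence test
            out[i] = s[i]
    return out
```

Note that the number of retained concepts is not fixed: a peaked signature keeps few dimensions while a flat one keeps many, which mimics the per-image adaptivity the paper aims at.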
Second, we also address the problem of analyzing the content of images at a local scale with semantic features. Former works such as Object Bank (Li et al., 2010) encoded the spatial location of objects using a pyramid representation; more precisely, they concatenated the base classifier outputs computed at three different scales. Such an approach is feasible when one uses a small number of detectors (200 in Li et al., 2010) but becomes intractable when tens of thousands of detectors are used (Bergamo and Torresani, 2012; Ginsca et al., 2015). We propose an alternative solution, named “Constrained Local Semantic Features” (CLSF), inspired by a popular scheme in the domain of bag-of-features (Grauman and Darrell, 2005; Lazebnik et al., 2006) and CNNs (Chatfield et al., 2014; He et al., 2014), namely the pooling of local regions. While not new in itself, this is, to the best of our knowledge, the first attempt to use pooling in the context of semantic features. An advantage over the concatenation-based approaches is that pooling does not change the size of the feature. Moreover, using pooling downstream of our CBS scheme is a convenient way to focus only on important information. Indeed, while in a global scheme CBS sparsifies the semantic feature according to the content of each image, in a local scheme it can neglect non-informative regions and emphasize highly informative ones, a desirable property since it avoids introducing noisy information into the final feature.
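The pooling idea behind CLSF can be sketched as follows; max-pooling is assumed here as the aggregation operator, and the toy top-1 sparsifier merely stands in for CBS:

```python
import numpy as np

def keep_top1(scores):
    """Toy per-region sparsifier: keep only the strongest detection
    (a stand-in for the CBS rule of Section 4.1)."""
    s = np.asarray(scores, dtype=float)
    out = np.zeros_like(s)
    out[int(np.argmax(s))] = s.max()
    return out

def clsf(region_scores, sparsify=keep_top1):
    """Sparsify each region's semantic feature, then max-pool across
    regions: the final descriptor keeps the dimensionality of a single
    (global) signature, however many regions are used."""
    return np.stack([sparsify(r) for r in region_scores]).max(axis=0)

# Two regions described over a 3-concept vocabulary.
pooled = clsf([[0.2, 0.9, 0.1], [0.7, 0.1, 0.3]])
```

Unlike concatenation across scales, the output dimensionality here is independent of the number of regions, which is what keeps the scheme tractable with tens of thousands of detectors.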
Third, we propose a new method to integrate visual and linguistic cues that directly takes advantage of the semantic level of the proposed image signature. We are inspired by Plato’s Theory of Forms, assuming the existence of inaccessible Forms that constitute the most accurate possible reality. These Forms are the essences or fundamental properties of the things we want to describe, the “causes” mentioned previously. At the perceptual level, one can only access particular patterns of the ideal Form, in the form of a visual or linguistic description. Considering these two views, our approach identifies dimensions that can be regarded as particular projections of the Form onto each of them. We then propose several methods to find correspondences between the dimensions of the two views. For the dimensions that match, we aggregate the information of each view to obtain a unified description.
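Assuming both views score the same concept vocabulary, the simplest matching-and-aggregation variant (intersecting the active dimensions, then averaging) can be sketched as follows; this is an illustration of the general idea, not the exact correspondence methods of Section 4.4:

```python
import numpy as np

def fuse_views(visual, textual, eps=0.0):
    """Toy matching-and-aggregation rule: when both views score the
    same concept vocabulary, keep the dimensions active in both views
    and average them; dimensions seen by only one view are dropped as
    unmatched."""
    v = np.asarray(visual, dtype=float)
    t = np.asarray(textual, dtype=float)
    matched = (v > eps) & (t > eps)          # dimensions both views agree on
    return np.where(matched, (v + t) / 2.0, 0.0)

# 3-concept vocabulary: only the first concept is active in both views.
fused = fuse_views([0.8, 0.0, 0.4], [0.6, 0.5, 0.0])
```

Requiring agreement between the two views before aggregating is one simple way to retain only dimensions that plausibly project the same underlying “Form”.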
We validate our work in the contexts of mono-modal and multi-modal documents. First, we focus on images only and conduct experiments on four publicly available benchmarks: a scene classification task on the MIT Indoor benchmark and multi-class object classification on the Pascal VOC 2007, Pascal VOC 2012 and Nus-Wide Object datasets. Results show that the proposed CLSF approach achieves state-of-the-art performance on three benchmarks (VOC 2007, Nus-Wide Object and MIT Indoor) and obtains results competitive with the best approaches in the literature on Pascal VOC 2012. We also validate it in an image retrieval context on three datasets (Pascal VOC 2007, Nus-Wide Object and MIT Indoor), where CLSF outperforms the best methods in the literature. Second, we evaluate the proposed vision-language integration approach on multi-modal documents, through bi-modal classification on two benchmarks (Pascal VOC 2007 and Nus-Wide-10k). Results show that our method achieves state-of-the-art performance on both datasets.
Part of the work presented in this paper builds upon a recently published conference paper (Tamaazousti et al., 2016b). However, the present manuscript differs from that work in many ways; in particular, it includes the following new items:
- a new learning strategy for mid-level representation by CNNs that significantly boosts the performance of the semantic representation (Section 4.3);
- a new method for vision-language integration that directly takes advantage of the semantic level of our image signature (Section 4.4);
- more detailed experimental results, in particular an evaluation on a large-scale and competitive database (Nus-Wide, Chua et al., 2009; Section 5) as well as in the context of bi-modal classification (Section 6);
- an in-depth analysis of the various components of our system (Section 7).
Section snippets
Image representations
For the last few years, image representations based on fully-connected layers extracted from pre-trained CNNs (Jia et al., 2014; Sermanet et al., 2014; Simonyan and Zisserman, 2015) have achieved breakthroughs on many benchmarks. Building on such mid-level representations, we focus on “semantic features” (Li et al., 2010; Torresani et al., 2010) that have the three following desirable
Background theory
In this section, we present in detail the formalism of existing semantic features (Bergamo and Torresani, 2012; Ginsca et al., 2015; Torresani et al., 2010) as well as the sparsification process (Ginsca et al., 2015; Tamaazousti et al., 2016; Wang et al., 2010). Let us consider an image I and a global feature extracted from I. Given a set of C high-level concepts, a
Proposed approach
We first present (in Section 4.1) the “Content-Based Sparsity” (CBS) scheme, which adapts the level of sparsity of the semantic feature to the content of each image. We then present (in Section 4.2) the “Constrained Local Semantic Features” (CLSF), which apply CBS to each image region to better reflect the visual content at a local scale. Third, we provide (in Section 4.3) an improved version of semantic features based on a new diversified mid-level representation.
Experiments: image classification and retrieval
In this section we evaluate the proposed image representation on image classification (Section 5.2) and retrieval (Section 5.3) tasks, by considering the image-modality only.
Experiments: bi-modal classification
In this section, we evaluate the proposed “Pure Concept Space” (PCS) approach presented in Section 4.4 on a task of multi-modal (text-image pairs) alignment. More specifically, this evaluation is conducted on “bi-modal classification”, which consists in classifying bi-modal documents (containing pairs of text and image) into different categories. We first describe, in Section 6.1, the comparison methods and the datasets used to evaluate them, then analyze, in Section 6.2, the obtained results.
Analysis
In this section, we conduct an in-depth analysis of the proposed approach. First (in Section 7.1), we evaluate the impact of each proposed contribution on our whole approach. Second (in Section 7.2), we focus on the proposed diversification strategy, analyzing it in depth and comparing it to other strategies in the literature. Then (in Section 7.3), we systematically evaluate the correlation between the semantic features and the mid-level features used to build them. Finally (Section 7.4), we
Conclusions
We introduced a novel method to design semantic features for visual representation that integrates a diversified CNN feature, an adaptive sparsification and a constrained local scheme. In contrast to existing works, which build semantic features on existing mid-level features, we proposed to build ours upon a diversified one obtained through a new CNN learning strategy that consists in selecting a diversified set of categories and using fine-tuning to model genericness. Moreover, while
Acknowledgments
This work is partially supported by the USEMP FP7 project, funded by the European Commission under contract number 611596.
References (48)
- Meta-class features for large-scale object categorization on a budget. Computer Vision and Pattern Recognition (CVPR), 2012.
- Return of the devil in the details: delving deep into convolutional nets. British Machine Vision Conference (BMVC), 2014.
- Nus-wide: a real-world web image database from National University of Singapore. ACM Conference on Image and Video Retrieval (CIVR), 2009.
- On the role of correlation and abstraction in cross-modal multimedia retrieval. Pattern Analysis and Machine Intelligence (PAMI), 2014.
- ImageNet: a large-scale hierarchical image database. Computer Vision and Pattern Recognition (CVPR), 2009.
- Mid-level visual element discovery as discriminative mode seeking. Advances in Neural Information Processing Systems (NIPS), 2013.
- Fast edge detection using structured forests. Pattern Analysis and Machine Intelligence (PAMI), 2015.
- Weldon: weakly supervised learning of deep convolutional neural networks. Computer Vision and Pattern Recognition (CVPR), 2016.
- Everingham, M., Van Gool, L., Williams, C. K., Winn, J., Zisserman, A. The pascal visual object classes...
- The pascal visual object classes challenge. International Journal of Computer Vision (IJCV), 2010.
- Cross-modal retrieval with correspondence autoencoder. ACM International Conference on Multimedia (ACM MM).
- Devise: a deep visual-semantic embedding model. Advances in Neural Information Processing Systems (NIPS).
- On feature combination for multiclass object classification. Computer Vision and Pattern Recognition (CVPR).
- Large-scale image mining with Flickr groups. International Conference on Multimedia Modelling (MMM).
- A multi-view embedding space for modeling internet images, tags, and their semantics. International Journal of Computer Vision (IJCV).
- The pyramid match kernel: discriminative classification with sets of image features. International Conference on Computer Vision (ICCV).
- Canonical correlation analysis: an overview with application to learning methods. Neural Computation.
- Spatial pyramid pooling in deep convolutional networks for visual recognition. European Conference on Computer Vision (ECCV).
- Learning the relative importance of objects from tagged images for retrieval and cross-modal search. International Journal of Computer Vision (IJCV).
- Caffe: convolutional architecture for fast feature embedding. ACM International Conference on Multimedia (ACM MM).
- Deep fragment embeddings for bidirectional image sentence mapping. Advances in Neural Information Processing Systems (NIPS).
- Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems (NIPS).
- Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. Computer Vision and Pattern Recognition (CVPR).
- Object bank: a high-level image representation for scene classification & semantic feature sparsification. Advances in Neural Information Processing Systems (NIPS).
Cited by (7)
- Learning More Universal Representations for Transfer-Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
- Semantic change analysis of Korean verbs based on massive culture corpus data. Personal and Ubiquitous Computing, 2020.
- Building a multimodal entity linking dataset from tweets. LREC 2020 - 12th International Conference on Language Resources and Evaluation, 2020.
- Learning finer-class networks for universal representations. British Machine Vision Conference 2018 (BMVC), 2019.