1 Introduction

In recent years, visual representation has become a vital part of computer vision applications (e.g., image recognition [20, 23, 39] and image retrieval [19, 28, 34]). Methods range from raw feature extraction, to high-level feature statistics such as Bag-of-Words, to today's more complex feature extraction frameworks such as deep neural networks. However, compared to humans, who can easily recognize tens of thousands of objects, these artificial frameworks are clearly inferior [36]. Further, it has been widely recognized that a good visual representation should integrate both low-level visual features, which address the more detailed perceptual aspects, and high-level semantic features, which underlie the more general conceptual aspects of visual data [24, 37, 38, 42]. In other words, a semantic gap remains between the low-level features extracted by current methods and high-level semantics. Although many efforts [8, 9, 31, 43] have been devoted to reducing this semantic gap by combining these two types of visual representation, the gap remains a challenge. In this work, we aim to reduce the semantic gap by finding the correlation between low-level features and high-level concepts to obtain a better visual representation.

In order to capture high-level semantics and cross-category properties, attributes [4] have been proposed as an important intermediate representation. As humans recognize objects by identifying their main attributes as high-level concepts, attribute descriptions are more explainable and can be used to describe objects of unknown categories. Thus, by combining high-level attribute information with low-level features, we can obtain more compact and discriminative features across categories and further reduce the semantic gap. Specifically, the more semantic attributes two concepts share, the more similar the two concepts are; for example, “cat” and “dog” share many attributes, such as “fur” and “has head”, and are conceptually similar. We can then use a dictionary learning scheme in the coding process to efficiently solve this correlation embedding and feature selection problem.

In this paper, we propose a novel binary visual representation learning framework which exploits semantic attributes to reduce the semantic gap and generate binary codes for visual representation. As illustrated in Fig. 1, the proposed framework comprises two stages: an off-line stage and an on-line stage. In the off-line stage, we utilize semantics to find the correlation relationships among concepts. To this end, we project the original objects into the conceptual space and further divide the dictionary bases of the conceptual space into groups using semantic attributes. In the on-line stage, we devise a novel semantic coding approach by exploiting semantics, including the concept-rich dictionaries and the semantic-based concept correlation information obtained in the off-line stage. To achieve this goal, we incorporate group-based sparsity in the dictionary learning process, which efficiently selects salient concepts in the data. We then adapt hashing schemes to generate compact and efficient binary codes. The generated binary codes serve as the final visual representation and are further used in specific tasks.

Fig. 1

The proposed visual representation framework. In the off-line stage, we learn dictionaries in the conceptual space, where each concept corresponds to one dictionary base. Inter-concept correlations are then exploited by leveraging semantic attributes. In the on-line stage, overlapping groups are formed using the correlation information, and group sparsity is imposed to generate features using the dictionaries obtained in the off-line stage. Finally, hashing methods are used to generate a compact binary representation

We summarize our contributions as follows:

  • Semantic Binary Learning Framework. The proposed framework adapts hashing schemes to generate binary codes for visual representation. Extensive experiments on visual retrieval tasks have been conducted against several state-of-the-art hashing methods, and a salient performance boost is observed.

  • Dictionary Learning in Concept Space. The proposed framework learns dictionaries in the conceptual space, in which conceptually similar objects are close to each other.

  • Exploit Concept Correlation. The proposed framework utilizes semantic attribute information to exploit inter-concept correlation, which combines low-level features and high-level semantics in an efficient way and thus reduces the semantic gap.

This paper extends our previous work [30]. Specifically, the procedures of dictionary learning in the concept space and concept correlation exploration are integrated into a unified framework for joint optimization in the proposed model, whereas in [30] they are two separate steps. In addition, we developed an efficient iterative algorithm to solve the complex optimization problem in the proposed model. We conducted extensive experiments with various settings to evaluate the effectiveness of the proposed method on the visual recognition task, and made comprehensive comparisons with several of the latest state-of-the-art approaches on various aspects.

The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 details the proposed semantic binary visual representation learning framework, including the formulation and algorithms. Experimental results and analysis are reported in Section 4, followed by conclusions in Section 5.

2 Related work

2.1 Dictionary learning and visual representation

State-of-the-art visual representation methods comprise two stages: first, local descriptors (e.g., SIFT [18] and HOG [2]) are used to extract features; then, an over-complete representation is encoded from these features. Recently, convolutional neural networks (CNNs) have shown great power in learning visual representations on various tasks such as image classification [25]. CNNs differ from the previous pipeline in that they learn features automatically, whereas the previous pipeline requires a manually designed extraction scheme. In this work, we combine these two approaches to achieve better performance.

In [10], an efficient sparse coding pipeline was proposed in which the dictionary is the feature matrix of all training images. Zhang et al. [41] used group sparsity to select different image pairs in visual retrieval tasks. Yang et al. [33] proposed spatial group sparse coding, which treats regions of an image as a group to achieve region-level image annotation. Gao et al. [6] proposed a multi-layer group sparse coding framework for concurrent image classification and annotation. Wu et al. [29] used trace lasso-based sparsity to form a weakly supervised dictionary learning framework. Similar to these works, our framework also adopts group sparsity to select features of the same group, where a group corresponds to a concept in our work. However, our work further exploits group correlation and imposes overlapping group sparsity to better select variables and thus obtain a better representation.

2.2 Semantic attribute

Semantic attributes have been proposed to better describe visual objects [4, 14, 23]. Ouyang et al. [21] show that semantic attributes, if used in a proper way, are helpful in learning feature representations that improve large-scale object detection. Semantic attributes express high-level meanings that help represent visual objects. By combining low-level features with high-level semantic attribute information, the semantic gap can be reduced and the combined features become more meaningful and explainable.

Many approaches use predictions on attributes as mid-level features for recognizing new object categories [4, 13]. Chiang et al. [1] proposed a new distance metric by jointly incorporating data and attribute similarities. Farhadi et al. [5] used the functionality, superordinate categories, viewpoint, and pose of segments as attributes to improve detection accuracy. Wu et al. [29] learned dictionaries by exploiting visual attribute correlations rather than label priors. Unlike these methods, we utilize attribute information at a different stage of the coding process, where its main role is to group the dictionary bases and thus find the correlation relationships among concepts. Also, different from [29], where visual objects are clustered for each concept and a group sparsity penalty is then imposed, we choose to let the concepts overlap so that the eventual space can capture concept similarity. Our dictionaries and concept correlations are learned in the off-line stage and can be used in the on-line stage to generate visual representations, which shows the generalization ability of our framework. Besides, the proposed method leverages explicit concept correlation: first, concepts are divided into groups, and then binary codes are generated using the group information and hashing methods. Various grouping mechanisms such as [32, 35] could be adopted.

3 The proposed framework

In this section, we elaborate our proposed framework in terms of the off-line and on-line stages.

3.1 OFF-LINE: dictionary learning in the conceptual space

In the off-line stage, we obtain the dictionary bases of the conceptual space and exploit the concept correlations by utilizing attribute correlation information. The dictionary bases and concept correlation information can then be used in the on-line stage to produce the new binary visual representation.

3.1.1 Conceptual space dictionary

Dictionary learning [26] plays a vital role in multimedia applications. In our work, we aim to construct a semantic-enriched dictionary by mapping the original images into the conceptual space, where close objects should be conceptually similar. Suppose we have n samples \(\mathbf {X} = [\mathbf {x}_{\mathbf {1}}, \mathbf {x}_{\mathbf {2}}, {\cdots } , \mathbf {x}_{\mathbf {n}}] \in \mathbb {R}^{m \times n} \). We map the samples from the original space to the conceptual space by the projection:

$$ \| \mathbf{X} - \mathbf{D}_{\mathbf{c}} \mathbf{S} \|, $$
(1)

where \(\mathbf {D}_{\mathbf {c}} \in \mathbb {R}^{m \times k}\) is the bases of the conceptual space, and \(\mathbf {S} \in \mathbb {R}^{k \times n}\) is the new representation in the conceptual space.
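Given a fixed dictionary, the projection in (1) amounts to a least-squares problem. The following is a minimal numpy sketch of this step; the data shapes and names are illustrative, not the paper's actual implementation:

```python
import numpy as np

def project_to_conceptual_space(X, D_c):
    """Solve min_S ||X - D_c S||_F^2 by least squares, as in (1).

    X   : (m, n) original features, one column per sample.
    D_c : (m, k) conceptual-space bases, one column per concept.
    Returns S : (k, n), the representations in the conceptual space.
    """
    S, *_ = np.linalg.lstsq(D_c, X, rcond=None)
    return S

# toy example (hypothetical data): 5 samples lying in the span of D_c
rng = np.random.default_rng(0)
D_c = rng.standard_normal((8, 4))          # m = 8 dims, k = 4 concepts
X = D_c @ rng.standard_normal((4, 5))
S = project_to_conceptual_space(X, D_c)
print(np.allclose(D_c @ S, X))             # reconstruction is exact here
```

Since the toy samples lie exactly in the span of the bases, the reconstruction error is zero; for real features the residual measures how well the conceptual space explains the data.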

3.1.2 Learning bases of concepts

Categories are a natural choice for the conceptual space, where each concept implicitly corresponds to one category. In order to effectively exploit the images and category label information to construct the conceptual space, we obtain the basis representation of each category using the classification hyperplanes learned from logistic regression, which separate the categories from each other:

$$ \underset{\mathbf{d},c}{\min} \frac{1}{2} \mathbf{d}^{\mathbf{T}}\mathbf{d} + C \sum\limits_{i = 1}^{n} \log (\exp(-y^{i} ({\mathbf{X}}_{\mathbf{i}}^{\mathbf{T}} \mathbf{d} + c)) + 1), $$
(2)

where \(y^{i}\) is the category label for each sample \(\mathbf {X}_{\mathbf {i}}\), C is the inverse of the regularization strength (a smaller C specifies stronger regularization), and \(\mathbf {d}\) is the classification hyperplane we need (\(\mathbf {D}_{\mathbf {c}} = [\mathbf {d}_{\mathbf {1}}, \mathbf {d}_{\mathbf {2}}, \cdots , \mathbf {d}_{\mathbf {k}}]\)). For each category, we learn the hyperplane that tells whether a sample belongs to that category. The bases of the conceptual space are then made up of these hyperplanes. In this way, our dictionary has the natural power of generating discriminative features that can easily be used in classification tasks.
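A one-vs-rest sketch of this basis-learning step is shown below, minimizing (2) by plain gradient descent in numpy. The learning rate, iteration count, and toy data are illustrative assumptions; in practice an off-the-shelf logistic regression solver would be used:

```python
import numpy as np

def learn_concept_bases(X, labels, C=1.0, lr=0.1, iters=500):
    """One-vs-rest logistic regression: one hyperplane per category.

    X      : (m, n) training features, one column per sample.
    labels : (n,) integer category labels.
    Returns D_c : (m, k), column j minimising (2) for category j:
        0.5 d^T d + C * sum_i log(1 + exp(-y_i (x_i^T d + c))).
    """
    m, n = X.shape
    classes = np.unique(labels)
    D_c = np.zeros((m, len(classes)))
    for j, cls in enumerate(classes):
        y = np.where(labels == cls, 1.0, -1.0)  # +1 for this concept, -1 otherwise
        d, c = np.zeros(m), 0.0
        for _ in range(iters):
            margin = y * (X.T @ d + c)
            # sig = 1 - sigma(margin); clipping avoids overflow in exp
            sig = 1.0 / (1.0 + np.exp(np.clip(margin, -30, 30)))
            grad_d = d - C * X @ (y * sig)      # gradient of the regularised loss
            grad_c = -C * np.sum(y * sig)
            d -= lr * grad_d / n
            c -= lr * grad_c / n
        D_c[:, j] = d
    return D_c

# toy example (hypothetical data): two well-separated concepts
rng = np.random.default_rng(1)
X = np.hstack([rng.normal(-2, 0.3, (2, 20)), rng.normal(2, 0.3, (2, 20))])
labels = np.array([0] * 20 + [1] * 20)
D_c = learn_concept_bases(X, labels)
```

Each column of the resulting `D_c` is a hyperplane discriminating one category from the rest, so the bases are discriminative by construction.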

3.1.3 Modeling concepts correlation with semantic attributes

Intuitively, co-occurrence statistics carry meaningful information about correlation relationships: the more semantic attributes two concepts share, the more similar the two concepts are. Thus, conceptual correlation relationships can be found through attributes. To utilize these attribute correlations, we propose to group the conceptual dictionary bases by leveraging attribute information. Any reasonable grouping scheme can be adopted in our framework. In the specific case we use in the experiments, a group is formed by selecting the bases that share the same attribute; that is, if there are m attributes, there are m groups, where the members of each group all share the same attribute. The formed groups are defined as:

$$ \mathbf{G}_{\mathbf{i}} = \{ j \mid \mathit{concept}_{j} \text{ has } \mathit{attribute}_{i} \}. $$
(3)

If two conceptual dictionary bases are divided into one group, they are conceptually connected through that attribute. Notice that groups can overlap, because classes often share more than one attribute, which means one concept can be similar to another concept in multiple attribute aspects.
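The grouping rule in (3) can be sketched directly from a binary concept-attribute matrix. The toy concepts and attributes below echo the "cat"/"dog" example from the introduction and are purely illustrative:

```python
import numpy as np

def form_attribute_groups(A):
    """Group concepts by shared attributes, as in (3).

    A : (k, m) binary concept-attribute matrix, A[j, i] = 1 if
        concept j has attribute i.
    Returns a list of m groups; group i holds the indices of all
    concepts sharing attribute i.  Groups may overlap.
    """
    k, m = A.shape
    return [np.flatnonzero(A[:, i]).tolist() for i in range(m)]

# toy example: concepts (cat, dog, car) x attributes (fur, has head, wheels)
A = np.array([[1, 1, 0],    # cat: fur, has head
              [1, 1, 0],    # dog: fur, has head
              [0, 0, 1]])   # car: wheels
groups = form_attribute_groups(A)
print(groups)  # [[0, 1], [0, 1], [2]]
```

Here "cat" and "dog" land in two groups together (fur, has head), so their dictionary bases are overlapping group members, exactly the situation the overlapping sparsity penalty in Section 3.2.3 exploits.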

3.2 ON-LINE: binary visual representation generation

Since the dictionary bases of the conceptual space and the relationships among concepts have been obtained in the off-line stage, in this stage we adopt dictionary learning and hashing methods to produce our semantic binary representation.

3.2.1 Semantic visual coding

The new representation in the conceptual space can be obtained using the projection illustrated in (1), where S is the reconstruction in the conceptual space. However, the conceptual space bases cover all classes in the dataset, many of which are not needed to reconstruct the feature of one specific image. Thus, variable selection methods can be used to impose sparsity on the reconstruction weights of the bases, which generates more explainable features. We use group-based lasso [11] to solve this variable selection problem. The formed groups can be non-overlapping or overlapping, and we obtain our new feature representation by handling these two situations separately. The lasso [27] estimate is defined by

$$ \hat{\beta}^{lasso} = \underset{\beta}{\arg \min} \left\{ \frac{1}{2} \sum\limits_{i = 1}^{N} \left( \mathbf{y}_{\mathbf{i}} - \beta_{0} - \sum\limits_{j = 1}^{p} \mathbf{x}_{\mathbf{i}\mathbf{j}} \beta_{j}\right)^{2} + \lambda \sum\limits_{j = 1}^{p} | \beta_{j} | \right\}. $$
(4)

Here, \(\lambda \) is a tuning parameter that controls the amount of shrinkage: the larger the value of \(\lambda \), the greater the amount of shrinkage. The \(l_{1}\)-norm penalty induces sparsity in the solution. In our work, each conceptual dictionary base may be assigned to one or more groups, so we select variables in two cases: non-overlapping and overlapping.
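The shrinkage behaviour of the \(l_{1}\) penalty in (4) is captured by its proximal operator, elementwise soft-thresholding, sketched here for illustration:

```python
import numpy as np

def soft_threshold(z, lam):
    """Proximal operator of lam * ||.||_1: the elementwise shrinkage
    underlying the lasso estimate in (4).  Entries with magnitude
    below lam are zeroed; the rest are shrunk toward zero by lam."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

z = np.array([3.0, -0.5, 1.2, -2.0])
print(soft_threshold(z, 1.0))   # -0.5 is zeroed out; the rest shrink by 1
```

Larger \(\lambda \) zeroes out more entries, which is exactly how the penalty performs variable selection.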

3.2.2 Visual coding with non-overlapping concept correlation

If the concept correlation is non-overlapping, which means the formed groups are non-overlapping, we can simply impose group sparsity in the dictionary learning process. The problem can be described as follows:

Given a set of groups G which form a partition of \([1,p]\), the group lasso penalty [40] is a norm over \(\mathbb {R}^{p}\) defined as:

$$ \forall w \in \mathbb{R}^{p}, \quad {\Omega}_{group}^{G}(w) = \underset{g \in G}{\sum} \| w_{g} \|. $$
(5)

When the groups in G form a partition of the set of covariates, \({\Omega }_{group}^{G} (w)\) is a norm whose balls have singularities when some \(w_{g}\) are equal to zero.

In our case, we aim to project the original feature into the conceptual space. Derived from (4) and (5), the whole objective function can be written as:

$$ \min_{\mathbf{v}} \| \mathbf{D}_{\mathbf{c}}\mathbf{v} - \mathbf{x} \|^{2}_{2} + \lambda \sum\limits_{i = 1}^{k} {\beta}_{i}^{g} \|\mathbf{v}_{\mathbf{G}_{\mathbf{i}}}\|, $$
(6)

where \(\mathbf {D}_{\mathbf {c}} \in \mathbb {R}^{m \times k}\) contains the bases of the conceptual space, \(\mathbf {x} \in \mathbb {R}^{m \times 1}\) is the original feature, and \(\mathbf {v} \in \mathbb {R}^{k \times 1}\) is the new feature. \(\mathbf {v}\) is divided into k non-overlapping groups \(\mathbf {v}_{\mathbf {G}_{\mathbf {1}}}, \mathbf {v}_{\mathbf {G}_{\mathbf {2}}}, \cdots , \mathbf {v}_{\mathbf {G}_{\mathbf {k}}}\), and \({{\beta }_{i}^{g}}\) denotes the weight of the i-th group.
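One standard way to solve the non-overlapping problem (6) is proximal gradient descent (ISTA), alternating a gradient step on the quadratic term with blockwise soft-thresholding. The sketch below assumes equal group weights \({\beta }_{i}^{g} = 1\) and illustrative toy data; it is not the paper's actual solver:

```python
import numpy as np

def group_lasso_coding(D, x, groups, lam, iters=300):
    """Proximal-gradient sketch for (6):
       min_v ||D v - x||_2^2 + lam * sum_g ||v_g||_2.

    D      : (m, k) conceptual dictionary.
    x      : (m,) input feature.
    groups : list of disjoint index lists partitioning [0, k).
    """
    k = D.shape[1]
    v = np.zeros(k)
    L = 2.0 * np.linalg.norm(D, 2) ** 2           # Lipschitz constant of the gradient
    for _ in range(iters):
        grad = 2.0 * D.T @ (D @ v - x)
        z = v - grad / L                           # gradient step
        for g in groups:                           # blockwise soft-thresholding
            ng = np.linalg.norm(z[g])
            scale = max(0.0, 1.0 - (lam / L) / ng) if ng > 0 else 0.0
            v[g] = scale * z[g]
    return v

# toy example (hypothetical data): x is built from group 0's concepts only
rng = np.random.default_rng(2)
D = rng.standard_normal((10, 6))
groups = [[0, 1, 2], [3, 4, 5]]
x = D[:, :3] @ np.array([1.0, -1.0, 0.5])
v = group_lasso_coding(D, x, groups, lam=0.1)
```

Because the penalty acts on whole groups, bases belonging to irrelevant concept groups are driven toward zero together, which is the concept-selection effect described above.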

3.2.3 Visual coding with overlapping concept correlation

In the case of overlapping concept correlation, where the formed groups overlap, we impose an overlapping group sparsity penalty when learning the visual representation. The variable selection and visual representation learning process can be formalized as follows:

$$ \underset{\mathbf{v}}{\min} \| \mathbf{D}_{\mathbf{c}} \mathbf{v} - \mathbf{x} {\|}_{2}^{2} + \lambda_{1} \| \mathbf{v} \|_{1} + \lambda_{2} \sum\limits_{i = 1}^{k} {{\beta}_{i}^{g}} \| \mathbf{v}_{\mathbf{G}_{\mathbf{i}}}\|, $$
(7)

where \(\mathbf {D}_{\mathbf {c}} \in \mathbb {R}^{m \times k}\) contains the bases of the conceptual space, \(\mathbf {x} \in \mathbb {R}^{m \times 1}\) is the original feature, and \(\mathbf {v} \in \mathbb {R}^{k \times 1}\) is the new feature. The groups \({\mathbf {G}_{\mathbf {i}}}\) may overlap, and \({{\beta }_{i}^{g}}\) is the weight of the i-th group. The obtained new visual representation \(\mathbf {v}\) is not only sparse, but also contains salient conceptual and attribute information. Equation (7) is a standard sparse dictionary learning problem, and in practice it can be efficiently solved by the SLEP software [16].

3.2.4 Binary representation generation

In order to generate compact binary codes, we utilize existing hashing methods. Unsupervised hashing methods such as ITQ [7], LSH [3] and SKLSH [22], and supervised hashing methods such as COSDISH [12], KSH [17] and FastHash [15] are adopted to evaluate our generated features. Taking ITQ as an example, we first apply a PCA projection converting \(\mathbf {V} = [\mathbf {v}_{\mathbf {1}}, \mathbf {v}_{2}, \cdots , \mathbf {v}_{\mathbf {n}}]\), obtained from (6) or (7), to \({\mathbf {V}^{\prime }} \in \mathbb {R}^{l \times n}\), where l is the code length. The objective function is then:

$$ \| \mathbf{B} - \mathbf{V}^{\boldsymbol{\prime }\mathbf{T}}\mathbf{R} {\|}_{F}^{2}, $$
(8)

where \(\mathbf {B} \in \{-1,1\}^{l \times n}\) is the final binary codes, \(\mathbf {R} \in \mathbb {R}^{l \times l}\) is the rotation matrix, and \(\| \cdot \|_{F}\) denotes the Frobenius norm.
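ITQ minimizes (8) by alternating between the codes B (sign quantization) and the rotation R (an orthogonal Procrustes step via SVD). A minimal sketch, assuming the input V is already zero-centered and PCA-reduced to the code length, is:

```python
import numpy as np

def itq(V, n_bits, n_iter=50, seed=0):
    """Minimal ITQ sketch for (8): alternately fix R to update the binary
    codes B, then fix B to update the rotation R by orthogonal Procrustes.

    V : (n, d) zero-centered, PCA-reduced features with d = n_bits.
    Returns B in {-1, 1}^(n, n_bits) and the rotation R.
    """
    rng = np.random.default_rng(seed)
    R, _ = np.linalg.qr(rng.standard_normal((n_bits, n_bits)))  # random orthogonal init
    for _ in range(n_iter):
        B = np.sign(V @ R)                 # fix R, quantize to codes
        B[B == 0] = 1
        U, _, Vt = np.linalg.svd(B.T @ V)  # fix B, Procrustes step for R
        R = (U @ Vt).T
    return B, R

# toy usage (hypothetical data)
V = np.random.default_rng(1).standard_normal((20, 4))
V -= V.mean(axis=0)
B, R = itq(V, n_bits=4)
```

Each alternation cannot increase the quantization loss in (8), so the procedure converges to a locally optimal rotation.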

4 Experiments

4.1 Datasets

Animals with attributes (AWA) [13]

The AWA dataset consists of 30475 images categorized into 50 animal classes, with 85 numeric attribute values per class. Each image is represented by a 4096-dimensional vector from the fc7 layer of the very deep 19-layer CNN pretrained on ILSVRC2014 [25]. We split the dataset into two parts, used separately in the off-line and on-line stages. More specifically, we randomly select two-thirds of each concept (class) as the off-line stage training set, and the remaining one-third as the on-line stage testing set, which gives 15226 images for the off-line stage and 10172 images for the on-line stage.

aPascal & aYahoo dataset [4]

The aPascal dataset contains 12695 visual objects in 20 categories, with 64 semantic attributes to describe objects. The aYahoo dataset contains 2644 visual objects in 12 categories. In our experiment, we use the whole aPascal dataset in the off-line stage to obtain concept dictionaries and exploit concept correlation, and the whole aYahoo dataset in the on-line stage to evaluate our framework. This also helps evaluate the generalization ability of our framework.

4.2 Evaluation settings

For evaluation metrics, we choose the commonly used Average Precision (AP) and mean Average Precision (mAP). For the AWA dataset, we randomly choose 1000 samples in the testing phase, and for the aPascal dataset, we randomly choose 100 samples in the testing phase.
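For reference, AP averages the precision at each rank where a relevant item is retrieved, and mAP averages AP over all queries. A small sketch of the computation (hypothetical ranked list):

```python
import numpy as np

def average_precision(relevant):
    """AP for one query: `relevant` is a boolean sequence over the
    ranked retrieval list (True = same concept as the query)."""
    relevant = np.asarray(relevant, dtype=bool)
    if not relevant.any():
        return 0.0
    hits = np.cumsum(relevant)                       # relevant items seen so far
    precisions = hits / (np.arange(len(relevant)) + 1)
    return float(precisions[relevant].mean())        # average precision at each hit

# mAP is simply the mean of AP over all queries
print(average_precision([True, False, True, False]))  # (1/1 + 2/3) / 2 ≈ 0.8333
```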

For the AWA dataset, we compare our proposed binary visual representation framework with a state-of-the-art deep neural network method (VGG-19 [25]). For the aPascal dataset, we compare our framework with the method provided by [4], which is carefully designed for this dataset. We use several hashing methods in our framework, including the unsupervised hashing methods LSH [3], SKLSH [22] and ITQ [7], and the supervised hashing methods COSDISH [12], KSH [17] and FastHash [15]. All hashing methods are implemented using the source code provided by the corresponding authors. For KSH, following the setting in [17], we randomly sample 2000 points as the training set in the off-line stage and set the number of support vectors to 300. For FastHash, boosted decision trees are used for out-of-sample extension.

4.3 Results and discussions

In this part, we first show how our proposed semantic binary learning framework achieves better performance by comparing it with other popular methods using several popular hashing methods, both unsupervised and supervised. Then we explore the sensitivity to hashing methods and code length. Finally, we show that our proposed attribute-based grouping scheme boosts performance by comparing with representations learned without leveraging attribute information.

4.3.1 Comparison with other methods

We evaluate our proposed framework by comparing it with the deep neural network representation [25] on the AWA dataset, and with the method designed by [4] on the aPascal dataset. Tables 1 and 2 show the mAP results on the AWA dataset using unsupervised and supervised hashing methods, respectively. We observe from the results that our proposed framework achieves obvious improvements in retrieval performance over the other methods on the AWA dataset, boosting the performance of both unsupervised and supervised hashing methods. Taking 48-bit codes as an example, our framework outperforms the CNN representation by around 1.74, 57.80 and 20.39% for the ITQ, LSH and SKLSH hashing methods, and by around 3.31, 3.17 and 2.82% for the COSDISH, KSH and FastHash hashing methods. The underlying reasons are twofold. First, our framework works in the conceptual space, where similar concepts are close to each other. Second, our framework efficiently exploits the concept correlations by using semantic attribute information, thus reducing the semantic gap.

Table 1 Comparisons in terms of mAP on AWA Dataset with unsupervised hashing methods
Table 2 Comparison in terms of mAP on AWA Dataset with Supervised hashing methods

Table 3 reports the mAP performance of our proposed framework and the method designed by [4] on the aYahoo dataset. Figure 2 shows the detailed performance (AP) of individual concepts on the aYahoo dataset. We can see from the results that our proposed framework achieves better performance on all individual concepts for all the hashing methods we use, which consistently validates the advantages of our framework. Notice that the dictionary bases and concept correlation relationships for the aYahoo dataset are not derived from the dataset itself but from the aPascal dataset. This further shows the salient performance and generalization ability of our framework.

Fig. 2

Average precision of individual concepts for visual retrieval on aYahoo dataset

Table 3 Comparison in terms of mAP on aYahoo Dataset

4.3.2 The sensitivity of hashing methods and code length

We can observe from Tables 1 and 3 that the improvement of our proposed framework highly depends on the hashing method adopted. For example, we obtain a 1.74% improvement using ITQ, compared to 57.80% using LSH, for 48 bits on the AWA dataset. We infer that this is because some hashing methods utilize the data better than others, so our framework boosts them less. We also observe that the improvement varies with code length. For example, the mAP improvement on the AWA dataset using SKLSH is 8.56, 30.49, 45.49 and 16.66% for code lengths from 8 to 48 bits, with the maximum improvement at 32 bits. We assume this is because there is a code length best suited for each hashing method to interpret the dataset, which influences the performance boost of our framework.

4.3.3 Effectiveness of semantics

Exploiting concept correlation using semantics is a crucial part of our proposed framework (Semantic). To evaluate the effectiveness of our semantics scheme, we design another framework that does not consider any semantic attribute information (Non-Semantic). For the Non-Semantic method, we treat each concept as a group and then follow the same process as our proposed Semantic framework. Tables 4 and 5 show how semantics-guided concept correlation helps improve performance on the AWA and aYahoo datasets, respectively. As we can see, the Semantic method, which utilizes semantics to find concept correlation, achieves better performance than the Non-Semantic method on both datasets and for all the hashing methods we use. For instance, the Semantic method outperforms the Non-Semantic method by around 17.18, 11.80 and 5.11% for the ITQ, LSH and COSDISH hashing methods on the AWA dataset with 32-bit codes. The reason is that the concepts are related, and exploiting their correlation relationships using semantics reduces the semantic gap. Thus, this experiment shows the effectiveness of our proposed semantic-based concept correlation scheme.

Table 4 Comparison on representations with semantics and without semantics on AWA dataset
Table 5 Comparison on representations with semantics and without semantics on aYahoo dataset

4.3.4 Parameter sensitivity

In this part, we empirically test the parameter sensitivity of our approach. Specifically, we evaluate the effect of \(\lambda \) on visual retrieval over the AWA dataset, tuning \(\lambda \) in the range \([0.005, 0.01, 0.05, 0.1, 0.2, 0.5]\). As we can see from Fig. 3, the sensitivity of the parameter depends on the hashing method our framework adopts. For FastHash and LSH, the parameter \(\lambda \) has nearly no influence. For COSDISH, performance improves as \(\lambda \) rises from 0.1 to 0.5, while for ITQ, performance decreases over the same range. This is because some hashing methods are sensitive to the sparsity of the data while others are not.

Fig. 3

Evaluation of effects of λ on visual retrieval (mAP) over AWA dataset

5 Conclusion

In this paper, we introduced a novel binary visual representation learning framework to effectively exploit inter-concept correlation. Through the designed framework, we generated a concept-enriched dictionary and exploited concept correlations by utilizing semantic attribute information. We proposed to impose overlapping group sparsity on the conceptual dictionaries, which achieves better variable selection. We also showed the generalization ability of our framework by using different datasets in the off-line and on-line stages. In the future, we will investigate automatic ways of grouping the dictionaries and of exploiting the inter-concept correlation.