Abstract
In the real world, it is common for an entity to be represented by multiple modalities, which motivates multi-modal learning, e.g., multi-modal clustering and cross-modal retrieval. Existing methods based on deep neural networks usually learn either a joint factor or multiple similar factors. However, different modalities representing the same content share both common and modality-specific characteristics, and few approaches can fully discover these two characteristics, i.e., consistency and complementarity. In this paper, we propose to learn shared and specific factors for each modality. The consistency can then be explored through the shared factors, and the complementarity can be exploited by combining the shared and specific factors. Finally, a triadic autoencoder with a deep architecture is developed to learn the shared and specific factors. Extensive experiments on cross-modal retrieval and multi-modal clustering clearly demonstrate the effectiveness of our model.
1 Introduction
Various kinds of real-world data appear in multiple modalities. For example, a web page can be described by both images and texts, and an image can be represented either by the image itself or by its associated tags. Since different modalities provide complementary and consistent descriptions of the same concept, multi-modal learning aims to explore the consistency and complementarity characteristics among multiple modalities. It has a wide range of applications, e.g., cross-modal retrieval, which exploits the consistency between different modalities, and multi-modal clustering, which exploits the complementarity among multiple modalities.
Traditional machine learning algorithms concatenate multiple modalities into a single feature set to fit multi-modal data. However, such a concatenation cannot explore the correlation between different kinds of information and ignores the incompatibility of heterogeneous feature sets. Multiple kernel learning methods usually assume that each kernel corresponds to one modality, and various fusion strategies are developed to combine the kernels [1]. Other methods learn a latent space in which different modalities can be compared. Typical examples such as canonical correlation analysis [2], partial least squares [3], and the bilinear model [4] obtain good results in various multi-modal learning tasks.
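As a small illustration of such latent-space methods (not the configuration used later in this paper), the sketch below uses scikit-learn's CCA on synthetic features to project two modalities into a common space where they can be directly compared; all dimensions and data here are placeholders.

```python
# Minimal sketch of latent-space learning with CCA on synthetic data.
# It only illustrates projecting two modalities into a comparable space;
# it is not the configuration evaluated in this paper.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.RandomState(0)
image_feats = rng.rand(100, 128)   # e.g., 128-d image descriptors (placeholder)
text_feats = rng.rand(100, 10)     # e.g., 10-d text topic vectors (placeholder)

cca = CCA(n_components=5)
cca.fit(image_feats, text_feats)
img_latent, txt_latent = cca.transform(image_feats, text_feats)
print(img_latent.shape, txt_latent.shape)  # both (100, 5): directly comparable
```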
Recently, several deep learning methods have been developed for multi-modal learning; they integrate feature learning and multi-modal modeling into a unified framework and have obtained promising results. Some of these methods learn a joint factor shared by multiple modalities [5, 6], while others learn a factor for each modality and force the factors to be similar [7,8,9]. However, none of the above methods can fully excavate the consistency and complementarity among multiple modalities. For example, learning only a joint factor ignores the consistency between modalities, so those methods may not perform cross-modal matching tasks. Besides, learning similar factors may overlook the complementarity, so the data will not be fully represented.
To alleviate the above problem, a novel multi-modal learning method is developed. First, higher-level factors for each modality are learned by feeding the original features into an autoencoder network. The learned factors are then divided into shared and specific parts. Using the shared factors, the consistency can be explored through a typical triplet loss based on the correspondence between modalities. Besides, by combining the shared and specific factors to reconstruct each modality with decoder networks, the complementarity can be exploited via a reconstruction loss. Finally, several stacked modality-friendly models are employed to learn higher-level representations of the different modalities so as to reduce the semantic gap, and a deep architecture is developed accordingly. With the learned shared and specific factors, various multi-modal learning tasks, e.g., cross-modal retrieval and multi-modal clustering, can be performed.
The main contributions are listed as follows.
- We propose a multi-modal learning framework that can explore the consistency and complementarity characteristics simultaneously.
- We verify our model on cross-modal retrieval and multi-modal clustering tasks, and the experimental results clearly demonstrate its effectiveness.
2 Related Work
Multi-modal learning deals with data represented by multiple modalities and has a wide range of applications. Various machine learning algorithms have been developed to explore multi-modal characteristics depending on the learning task. For example, cross-modal retrieval aims at discovering the correlation between different modalities [2,3,4, 10], so modality-specific factors should be removed. As for multi-modal clustering [11, 12], the main challenge lies in mining the complementary information among multiple modalities, so both the correlation and the modality-specific factors should be considered to fully represent the data. Among the various multi-modal learning methods, subspace learning based ones are popular due to their good results and ease of understanding. These methods aim to find a low-dimensional latent space in which different modalities can be compared [2, 13, 14]. Roughly speaking, our model also finds such a space to fully explore the multi-modal characteristics.
Recently, with the resurgence of deep neural networks in 2006, several deep learning methods have been brought into multi-modal learning [9, 15]. Ngiam et al. [15] proposed deep autoencoder models to fuse audio and video modalities for classification and retrieval tasks. Andrew et al. [5] developed a nonlinear extension of canonical correlation analysis to obtain higher correlation. Feng et al. [7] utilized a correspondence autoencoder and a deep Boltzmann machine for cross-modal retrieval. Chang et al. [8] developed a highly nonlinear multi-layer embedding function to capture the complex interactions between heterogeneous data in networks. Huang et al. [6] proposed a multi-label conditional restricted Boltzmann machine to deal with modality completion, fusion and multi-label prediction. Overall, all of the above methods learn either a joint factor or multiple similar factors, and thus cannot capture the multi-modal characteristics, i.e., consistency and complementarity, simultaneously.
Note that image captioning [16] and image-sentence matching [17, 18] are not discussed here because they are beyond the scope of this paper.
3 Model
Taking two typical modalities, i.e., image and text, as an example, our model consists of two parts: a triadic autoencoder network for exploring multi-modal characteristics, and stacked restricted Boltzmann machine (RBM) layers for learning high-level representations.
3.1 Triadic Autoencoder
The learning architecture consists of three subnetworks, each corresponding to a basic autoencoder. The subnetworks are connected by a predefined triplet loss imposed on part of their code layers. More specifically, one subnetwork is fed with the image representation, while the other two subnetworks, which share the same parameters, are fed with a similar and a dissimilar text representation of the current image. With this architecture, the consistency and complementarity among multiple modalities can be modeled well.
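This structure can be sketched in PyTorch as follows; the class name ModalityAutoencoder, the layer sizes and the sigmoid activations are our own illustrative assumptions rather than the paper's exact settings. The two text branches reuse a single network, so their parameters are shared.

```python
# Minimal PyTorch sketch of the triadic structure: one image autoencoder and one
# text autoencoder whose parameters are shared by the similar- and dissimilar-text
# branches. Layer sizes and activations are illustrative assumptions.
import torch
import torch.nn as nn

class ModalityAutoencoder(nn.Module):
    def __init__(self, in_dim, code_dim):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, code_dim), nn.Sigmoid())
        self.decoder = nn.Sequential(nn.Linear(code_dim, in_dim), nn.Sigmoid())

    def forward(self, x):
        code = self.encoder(x)         # code layer: shared + specific nodes
        recon = self.decoder(code)     # reconstruction from the full code layer
        return code, recon

image_net = ModalityAutoencoder(in_dim=128, code_dim=40)   # image branch
text_net = ModalityAutoencoder(in_dim=10, code_dim=40)     # shared by both text branches

# A toy triplet (r, s, t): image, matching text, randomly selected (dissimilar) text.
r, s, t = torch.rand(8, 128), torch.rand(8, 10), torch.rand(8, 10)
x, r_rec = image_net(r)
y, s_rec = text_net(s)   # matching text branch
z, t_rec = text_net(t)   # dissimilar text branch (same parameters as the matching branch)
```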
Formally, given an input triplet \(({\mathbf {r}},{\mathbf {s}},{\mathbf {t}})\), where \({\mathbf {r}}\) and \({\mathbf {s}}\) are a corresponding image-text pair and \({\mathbf {t}}\) is a randomly selected text representation, we model the consistency in the code layers. Suppose the image and text mappings are f and g, respectively. Then the code layers are calculated as \(f({\mathbf {r}};{{\mathbf {W}}_f})\), \(g({\mathbf {s}};{{\mathbf {W}}_g})\) and \(g({\mathbf {t}};{{\mathbf {W}}_g})\), where \({\mathbf {W}}_f\) and \({\mathbf {W}}_g\) are the weight parameters of the subnetworks. Since multiple modalities share common and modality-specific characteristics and the consistency is built on their common parts, we use only part of the code-layer nodes to model the consistency:
where \(L_1\) is a widely used triplet loss and \(\gamma \) is the size of the margin. \({\mathbf {x}}\), \({\mathbf {y}}\) and \({\mathbf {z}}\) are the code layers of \({\mathbf {r}}\), \({\mathbf {s}}\) and \({\mathbf {t}}\), respectively. \({\mathbf {I}}_d\) is a diagonal matrix whose first d diagonal elements are 1 and the rest are 0. Through this matrix, we force the first d nodes to represent the shared factors.
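For concreteness, a margin-based triplet loss consistent with these definitions (the squared Euclidean distance and the hinge form are assumptions here, not necessarily the exact formulation) can be written as

\(L_1 = \max \big (0,\ \gamma + \Vert {\mathbf {I}}_d{\mathbf {x}} - {\mathbf {I}}_d{\mathbf {y}}\Vert _2^2 - \Vert {\mathbf {I}}_d{\mathbf {x}} - {\mathbf {I}}_d{\mathbf {z}}\Vert _2^2\big ).\)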
After obtaining the code layer, we reconstruct each modality. By using all nodes of the code layer to reconstruct a modality, the modality-specific factors can be represented by the nodes other than the shared part. The complementarity can then be explored by combining the shared and specific factors. The reconstruction loss is written as:
where \(L_2\) is the reconstruction loss, \({\mathbf {p}}\) represents an image or a text, \(\varTheta \) denotes the parameters of the encoder and decoder networks, and \(\tilde{{\mathbf {p}}}\) is the reconstructed image or text.
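A typical instantiation, assuming a squared-error reconstruction penalty, is

\(L_2({\mathbf {p}};\varTheta ) = \Vert {\mathbf {p}} - \tilde{{\mathbf {p}}}\Vert _2^2.\)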
Then the final loss for an input triplet \(({\mathbf {r}},{\mathbf {s}},{\mathbf {t}})\) is:
where \(\alpha \) is a parameter balancing the two terms. In summary, minimizing the loss function defined in Eq. 3 enables the triadic autoencoder to explore the consistency and complementarity among multiple modalities simultaneously.
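Continuing the PyTorch sketch above, and assuming a hinge triplet loss on the first d code nodes plus squared-error reconstruction summed over the three inputs (the exact grouping and the values of d, gamma and alpha are assumptions), one training step could look like this:

```python
# One training step of the triadic autoencoder under the assumed loss forms.
# Uses image_net, text_net and the triplet (r, s, t) from the earlier sketch.
import torch
import torch.nn.functional as F

d, gamma, alpha = 20, 1.0, 0.5        # assumed values, not the paper's settings

x, r_rec = image_net(r)
y, s_rec = text_net(s)
z, t_rec = text_net(t)

# Consistency: triplet loss on the first d (shared) code nodes only.
pos = ((x[:, :d] - y[:, :d]) ** 2).sum(dim=1)
neg = ((x[:, :d] - z[:, :d]) ** 2).sum(dim=1)
l1 = F.relu(gamma + pos - neg).mean()

# Complementarity: reconstruct each input from its full (shared + specific) code.
l2 = F.mse_loss(r_rec, r) + F.mse_loss(s_rec, s) + F.mse_loss(t_rec, t)

loss = l1 + alpha * l2
loss.backward()                        # back-propagation through all three branches
```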
3.2 Deep Architecture
Generally, data from multiple modalities consist of heterogeneous feature sets, and these descriptors may differ greatly in their levels of semantic abstraction. Thus, a single autoencoder layer can hardly capture the consistency and complementarity characteristics. To alleviate this problem, a deep architecture is proposed. More specifically, several stacked modality-friendly models are utilized to learn higher-level representations of each modality, which reduces the semantic difference and enables better exploration of the multi-modal characteristics.
In practice, we use several restricted Boltzmann machines (RBMs) to extract high-level features. Briefly, an RBM is an undirected graphical model with a visible layer and a hidden layer; each layer consists of stochastic binary units, and there are no connections between units within the same layer. Since the raw inputs are not binary, extended RBMs are used for the visible layers of the image and text modalities: a Gaussian RBM (GRBM) [19] models the real-valued image feature vectors, and a replicated softmax RBM (RSM) [19] models the discrete sparse word-count vectors of the text. The hidden layers of the GRBM and RSM then serve as the visible layers of basic RBMs. After several stacked RBMs, high-level representations are extracted and fed into the triadic autoencoder network.
For parameter inference, all RBMs can be efficiently learned with the contrastive divergence (CD) approximation algorithm. As for the triadic autoencoder network, each autoencoder can be initialized with an RBM, and the parameters are then optimized with respect to Eq. 3 through the back-propagation algorithm. Note that back-propagation could be applied to the entire network, but for simplicity we only fine-tune the triadic autoencoder network.
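As a rough illustration of contrastive divergence, the following numpy sketch performs one CD-1 update for a basic binary-binary RBM; the layer sizes, learning rate and mini-batch are arbitrary assumptions, and the Gaussian and replicated softmax variants used above differ in their visible-layer models.

```python
# One CD-1 update for a basic binary-binary RBM (numpy sketch; sizes and learning
# rate are arbitrary assumptions, not the paper's settings).
import numpy as np

rng = np.random.RandomState(0)
n_vis, n_hid, lr = 64, 32, 0.01
W = 0.01 * rng.randn(n_vis, n_hid)
b_vis, b_hid = np.zeros(n_vis), np.zeros(n_hid)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

v0 = (rng.rand(16, n_vis) > 0.5).astype(float)   # a mini-batch of binary visible vectors

# Positive phase: hidden probabilities and samples given the data.
h0_prob = sigmoid(v0 @ W + b_hid)
h0_sample = (rng.rand(*h0_prob.shape) < h0_prob).astype(float)

# Negative phase: one Gibbs step (reconstruction of visible, then hidden).
v1_prob = sigmoid(h0_sample @ W.T + b_vis)
h1_prob = sigmoid(v1_prob @ W + b_hid)

# CD-1 gradient approximation and parameter update.
W += lr * (v0.T @ h0_prob - v1_prob.T @ h1_prob) / v0.shape[0]
b_vis += lr * (v0 - v1_prob).mean(axis=0)
b_hid += lr * (h0_prob - h1_prob).mean(axis=0)
```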
4 Experiments
4.1 Datasets
Wiki Dataset: This widely used image-text dataset consists of 2,866 image-text pairs from 10 categories, conventionally split into 2,173 training and 693 testing pairs. As for the features, each text is represented by a 10-dimensional topic vector obtained through a topic model (Latent Dirichlet Allocation) [20], and each image is represented by a 128-dimensional SIFT feature. Similar to [21], we split the dataset into a training set of 1,300 pairs (130 pairs per class) and a testing set of 1,566 pairs.
Pascal VOC Dataset: This dataset is used in various multi-modal learning tasks and consists of 5,011/4,952 training/testing image-tag pairs from 20 categories. As for the features, each tag is represented by a 399-dimensional word frequency vector, and each image is encoded by a 512-dimensional GIST feature. For simplicity, we remove image-tag pairs whose tag features are all zero, as done in [21].
4.2 Tasks and Evaluation Metrics
Since our model aims to explore consistency and complementarity among multiple modalities, we perform two kinds of tasks, i.e., cross-modal retrieval and multi-modal clustering, which depend mainly on the consistency and complementarity respectively.
Cross-modal retrieval: We map the testing images and texts into the code layers and select the first d nodes as their final embeddings. The two kinds of embeddings can then be compared using the Euclidean distance. Finally, two cross-modal retrieval tasks, i.e., image query vs. text database and text query vs. image database, are conducted. As for the metrics, we use mean average precision (MAP) and precision-recall (PR) curves to evaluate the overall performance [22].
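This evaluation protocol can be sketched as follows: each image embedding queries the text embeddings by Euclidean distance, and average precision is computed from the induced ranking. The embeddings, labels and dimensions below are random placeholders, not outputs of our model.

```python
# Sketch of cross-modal retrieval evaluation: rank text embeddings by Euclidean
# distance to each image query and compute mean average precision (MAP).
# Embeddings and labels here are random placeholders.
import numpy as np

rng = np.random.RandomState(0)
d = 20                                     # number of shared code nodes (assumed)
img_emb = rng.rand(50, d)                  # first d code nodes of the image branch
txt_emb = rng.rand(50, d)                  # first d code nodes of the text branch
labels = rng.randint(0, 10, size=50)       # category labels

def average_precision(relevant):
    # relevant: boolean array over the ranked retrieval list
    hits = np.cumsum(relevant)
    precisions = hits / (np.arange(len(relevant)) + 1)
    return (precisions * relevant).sum() / max(relevant.sum(), 1)

aps = []
for q in range(len(img_emb)):
    dists = np.linalg.norm(txt_emb - img_emb[q], axis=1)   # Euclidean distances
    ranking = np.argsort(dists)                            # closest first
    aps.append(average_precision(labels[ranking] == labels[q]))
print("Image query -> Text database MAP:", np.mean(aps))
```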
Multi-modal clustering: We map the testing images and texts into the code layers and concatenate all the code layers as their final embeddings. We then cluster these embeddings with the k-means algorithm. As for the metrics, five widely used measures [12], i.e., accuracy (ACC), normalized mutual information (NMI), F-measure (F1), R-Index (RI) and Entropy, are used for performance evaluation. For the first four metrics, higher values indicate better performance; for Entropy, lower values are better.
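Similarly, the clustering evaluation can be sketched as follows: the full code layers of the two modalities are concatenated, clustered with k-means, and scored here with NMI via scikit-learn (one of the five metrics). The code-layer values and labels are again placeholders.

```python
# Sketch of the clustering evaluation: concatenate the full image and text code
# layers, cluster with k-means, and score with NMI. Embeddings are placeholders.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

rng = np.random.RandomState(0)
img_code = rng.rand(50, 40)                # full code layer of the image branch
txt_code = rng.rand(50, 40)                # full code layer of the text branch
labels = rng.randint(0, 10, size=50)

embeddings = np.concatenate([img_code, txt_code], axis=1)
pred = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(embeddings)
print("NMI:", normalized_mutual_info_score(labels, pred))
```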
4.3 Compared Methods
PLS [3], BLM [4] and CCA [2] are three representative unsupervised methods that use pairwise information for embedding learning. CorrAE: Feng et al. [7] proposed a correspondence autoencoder that learns similar factors for multiple modalities. DCCA: Andrew et al. [5] extended traditional canonical correlation analysis to a deep architecture.
CDFE [13], GMLDA [14] and GMMFA [14] are three typical supervised multi-modal learning methods. They use labels to obtain relatively discriminative subspaces and thus enhance performance. We compare with these methods to further validate the effectiveness of our unsupervised model.
Our method is denoted as LSSF. Besides, the variant of LSSF that does not divide the code layer into shared and specific factors is denoted as BaseF. Comparing with this variant further validates the effectiveness of fully exploring the multi-modal characteristics.
4.4 Cross-Modal Retrieval
For the Wiki dataset, all features are low dimensional and well extracted, so we do not use RBMs for feature preprocessing. For the VOC dataset, we use two RBM layers to extract high-level representations. The parameter d, which determines how many code-layer nodes represent the shared factors, is empirically set to half the size of the code layer.
The MAP results of the image query and text query tasks on the Wiki and VOC datasets are shown in Tables 1 and 2, respectively. Overall, our method outperforms almost all the compared methods on both datasets. Compared with BaseF, our method learns both shared and modality-specific factors, which better matches the characteristics of multi-modal data.
Compared with PLS, CCA and BLM, our method performs much better because we use a triplet loss to model the relation between images and texts, which may be more effective than the pairwise relation. More importantly, consistency and complementarity are considered simultaneously in our model. CorrAE and DCCA learn similar factors for multiple modalities, but neither can fully discover the multi-modal characteristics, so our method also outperforms them.
Finally, the precision-recall curves of the image query and text query tasks on the Wiki and VOC datasets are shown in Figs. 1 and 2, respectively. The results are consistent with the MAP results, which further validates the effectiveness of our method.
4.5 Multi-modal Clustering
The training settings for the clustering task are the same as those for the retrieval task. The clustering results on the Wiki and VOC datasets are shown in Tables 3 and 4, respectively. Overall, our method beats almost all the competing methods on the two datasets, which validates that our model can well discover the complementarity characteristic. Together with the retrieval results, we can conclude that jointly considering the consistency and complementarity characteristics of multi-modal data improves learning performance.
5 Conclusion
In this paper, we have proposed a novel multi-modal learning method. By learning shared and modality-specific factors for each modality through a triadic autoencoder network, our model can explore the consistency and complementarity characteristics among multiple modalities simultaneously. Extensive experiments on cross-modal retrieval and multi-modal clustering validate the proposed method against state-of-the-art methods.
References
Gonen, M., Alpaydin, E.: Multiple kernel learning algorithms. J. Mach. Learn. Res. 12, 2211–2268 (2011)
Kim, T.-K., Kittler, J., Cipolla, R.: Discriminative learning and recognition of image set classes using canonical correlations. IEEE Trans. Pattern Anal. Mach. Intell. 29(6), 1005–1018 (2007)
Rosipal, R., Krämer, N.: Overview and recent advances in partial least squares. In: Saunders, C., Grobelnik, M., Gunn, S., Shawe-Taylor, J. (eds.) SLSFS 2005. LNCS, vol. 3940, pp. 34–51. Springer, Heidelberg (2006). https://doi.org/10.1007/11752790_2
Tenenbaum, J.B., Freeman, W.T.: Separating style and content with bilinear models. Neural Comput. 12(6), 1247–1283 (2000)
Andrew, G., Arora, R., Bilmes, J., Livescu, K.: Deep canonical correlation analysis. In: International Conference on Machine Learning, pp. 1247–1255 (2013)
Huang, Y., Wang, W., Wang, L.: Unconstrained multimodal multi-label learning. IEEE Trans. Multimedia 17(11), 1923–1935 (2015)
Feng, F., Wang, X., Li, R.: Cross-modal retrieval with correspondence autoencoder. In: ACM International Conference on Multimedia, pp. 7–16 (2014)
Chang, S., Han, W., Tang, J., Qi, G.-J., Aggarwal, C.C., Huang, T.S.: Heterogeneous network embedding via deep architectures. In: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 119–128 (2015)
Wang, W., Arora, R., Livescu, K., Bilmes, J.: On deep multi-view representation learning: objectives and optimization. In: arXiv (2016)
Cao, Y., Long, M., Wang, J., Liu, S.: Collective deep quantization for efficient cross-modal retrieval. In: AAAI Conference on Artificial Intelligence (2017)
Yin, Q., Wu, S., Wang, L.: Unified subspace learning for incomplete and unlabeled multi-view data. Pattern Recogn. (2017)
Kumar, A., Daume III, H.: A co-training approach for multi-view spectral clustering. In: International Conference on Machine Learning, pp. 393–400 (2011)
Lin, D., Tang, X.: Inter-modality face recognition. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3954, pp. 13–26. Springer, Heidelberg (2006). https://doi.org/10.1007/11744085_2
Sharma, A., Kumar, A., Daume III, H.: Generalized multiview analysis: a discriminative latent space. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 2160–2167 (2012)
Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., Ng, A.Y.: Multimodal deep learning. In: International Conference on Machine Learning, pp. 689–696 (2011)
Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A.C., Salakhutdinov, R., Zemel, R.S., Bengio, Y.: Show, attend and tell: neural image caption generation with visual attention. In: International Conference on Machine Learning, pp. 2048–2057 (2015)
Nam, H., Ha, J.-W., Kim, J.: Dual attention networks for multimodal reasoning and matching. In: arXiv (2016)
Huang, Y., Wang, W., Wang, L.: Instance-aware image and sentence matching with selective multimodal LSTM. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 2310–2318 (2017)
Srivastava, N., Salakhutdinov, R.: Multimodal learning with deep Boltzmann machines. J. Mach. Learn. Res. 15, 2949–2980 (2014)
Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
Wang, K., He, R., Wang, W., Wang, L., Tan, T.: Learning coupled feature spaces for cross-modal matching. In: IEEE International Conference on Computer Vision, pp. 2088–2095 (2013)
Rasiwasia, N., Pereira, J.C., Coviello, E., Doyle, G., Lanckriet, G.R.G., Levy, R., Vasconcelos, N.: A new approach to cross-modal multimedia retrieval. In: ACM Conference on Multimedia, pp. 251–260 (2010)