Abstract
In the real world, it is common for an entity to be represented by multiple modalities, which motivates multi-modal learning, e.g., multi-modal clustering and cross-modal retrieval. Existing methods based on deep neural networks usually learn either a joint factor or multiple similar factors. However, different modalities representing the same content share both common and modality-specific characteristics, and few approaches can fully discover these two characteristics, i.e., consistency and complementarity. In this paper, we propose to learn shared and specific factors for each modality. The consistency can then be explored through the shared factors, and the complementarity can be exploited by combining the shared and specific factors. Finally, a triadic autoencoder with a deep architecture is developed to learn the shared and specific factors. Extensive experiments on cross-modal retrieval and multi-modal clustering clearly demonstrate the effectiveness of our model.
1 Introduction
Various kinds of real-world data appear in multiple modalities. For example, a web page can be described by both images and texts, and an image can be represented either by the image itself or by its associated tags. Since different modalities provide complementary and consistent descriptions of the same concept, multi-modal learning aims to explore the consistency and complementarity characteristics among multiple modalities. It has a wide range of applications, e.g., cross-modal retrieval, which exploits the consistency between different modalities, and multi-modal clustering, which exploits the complementarity among multiple modalities.
Traditional machine learning algorithms concatenate multiple modalities into a single feature set to fit multi-modal data. However, such a concatenation cannot explore the correlation between different kinds of information and ignores the incompatibility of heterogeneous feature sets. Multiple kernel learning methods usually assume that each kernel corresponds to one modality, and various fusion strategies are developed to combine the kernels [1]. Other methods learn a latent space in which different modalities can be compared. Typical examples such as canonical correlation analysis [2], partial least squares [3], and the bilinear model [4] obtain good results in various multi-modal learning tasks.
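As a small illustration of such latent-space methods (not the configuration used later in this paper), the sketch below uses scikit-learn's CCA on synthetic features to project two modalities into a common space where they can be directly compared; all dimensions and data here are placeholders.

```python
# Minimal sketch of latent-space learning with CCA on synthetic data.
# It only illustrates projecting two modalities into a comparable space;
# it is not the configuration evaluated in this paper.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.RandomState(0)
image_feats = rng.rand(100, 128)   # e.g., 128-d image descriptors (placeholder)
text_feats = rng.rand(100, 10)     # e.g., 10-d text topic vectors (placeholder)

cca = CCA(n_components=5)
cca.fit(image_feats, text_feats)
img_latent, txt_latent = cca.transform(image_feats, text_feats)
print(img_latent.shape, txt_latent.shape)  # both (100, 5): directly comparable
```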
Recently, several deep learning methods have been developed for multi-modal learning; they integrate feature learning and multi-modal modeling into a unified framework and have obtained promising results. Some of these methods learn a joint factor shared by multiple modalities [5, 6], while others learn a factor for each modality and force the factors to be similar [7,8,9]. However, none of the above methods can fully excavate the consistency and complementarity among multiple modalities. For example, learning only a joint factor ignores the consistency between modalities, so those methods may not perform cross-modal matching tasks. Besides, learning similar factors may overlook the complementarity, so the data will not be fully represented.
To alleviate the above problem, a novel multi-modal learning method is developed. First, higher-level factors for each modality are learned by feeding the original features into an autoencoder network. The learned factors are then divided into shared and specific parts. Using the shared factors, the consistency can be explored through a typical triplet loss based on the correspondence between modalities. Besides, by combining the shared and specific factors to reconstruct each modality with decoder networks, the complementarity can be exploited via a reconstruction loss. Finally, several stacked modality-friendly models are employed to learn higher-level representations of the different modalities so as to reduce the semantic gap, and a deep architecture is developed accordingly. With the learned shared and specific factors, various multi-modal learning tasks, e.g., cross-modal retrieval and multi-modal clustering, can be performed.
The main contributions are listed as follows.
- We propose a multi-modal learning framework that can explore the consistency and complementarity characteristics simultaneously.
- We verify our model on cross-modal retrieval and multi-modal clustering tasks, and the experimental results clearly demonstrate its effectiveness.
2 Related Work
Multi-modal learning deals with data represented by multiple modalities and has a wide range of applications. Various machine learning algorithms have been developed to explore multi-modal characteristics depending on the learning task. For example, cross-modal retrieval aims at discovering the correlation between different modalities [2,3,4, 10], so modality-specific factors should be removed. As for multi-modal clustering [11, 12], the main challenge lies in mining the complementary information among multiple modalities, so both the correlation and the modality-specific factors should be considered to fully represent the data. Among the various multi-modal learning methods, subspace learning based ones are popular due to their good results and ease of understanding. These methods aim to find a low-dimensional latent space in which different modalities can be compared [2, 13, 14]. Roughly speaking, our model also finds such a space to fully explore the multi-modal characteristics.
Recently, with the resurgence of deep neural networks in 2006, several deep learning methods have been brought into multi-modal learning [9, 15]. Ngiam et al. [15] proposed deep autoencoder models to fuse audio and video modalities for classification and retrieval tasks. Andrew et al. [5] developed a nonlinear extension of canonical correlation analysis to obtain higher correlation. Feng et al. [7] utilized a correspondence autoencoder and a deep Boltzmann machine for cross-modal retrieval. Chang et al. [8] developed a highly nonlinear multi-layer embedding function to capture the complex interactions between heterogeneous data in networks. Huang et al. [6] proposed a multi-label conditional restricted Boltzmann machine to deal with modality completion, fusion and multi-label prediction. Overall, all of the above methods learn either a joint factor or multiple similar factors, and thus cannot capture the multi-modal characteristics, i.e., consistency and complementarity, simultaneously.
Note that image captioning [16] and image-sentence matching [17, 18] are not discussed here because they are beyond the scope of this paper.
3 Model
Taking two typical modalities, i.e., image and text, as an example, our model consists of two parts: a triadic autoencoder network for exploring multi-modal characteristics, and stacked restricted Boltzmann machine (RBM) layers for learning high-level representations.
3.1 Triadic Autoencoder
The learning architecture consists of three subnetworks, each corresponding to a basic autoencoder. The subnetworks are connected by a predefined triplet loss imposed on part of their code layers. More specifically, one subnetwork is fed with the image representation, while the other two subnetworks, which share the same parameters, are fed with a similar and a dissimilar text representation of the current image. With this architecture, the consistency and complementarity among multiple modalities can be modeled well.
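This structure can be sketched in PyTorch as follows; the class name ModalityAutoencoder, the layer sizes and the sigmoid activations are our own illustrative assumptions rather than the paper's exact settings. The two text branches reuse a single network, so their parameters are shared.

```python
# Minimal PyTorch sketch of the triadic structure: one image autoencoder and one
# text autoencoder whose parameters are shared by the similar- and dissimilar-text
# branches. Layer sizes and activations are illustrative assumptions.
import torch
import torch.nn as nn

class ModalityAutoencoder(nn.Module):
    def __init__(self, in_dim, code_dim):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, code_dim), nn.Sigmoid())
        self.decoder = nn.Sequential(nn.Linear(code_dim, in_dim), nn.Sigmoid())

    def forward(self, x):
        code = self.encoder(x)         # code layer: shared + specific nodes
        recon = self.decoder(code)     # reconstruction from the full code layer
        return code, recon

image_net = ModalityAutoencoder(in_dim=128, code_dim=40)   # image branch
text_net = ModalityAutoencoder(in_dim=10, code_dim=40)     # shared by both text branches

# A toy triplet (r, s, t): image, matching text, randomly selected (dissimilar) text.
r, s, t = torch.rand(8, 128), torch.rand(8, 10), torch.rand(8, 10)
x, r_rec = image_net(r)
y, s_rec = text_net(s)   # matching text branch
z, t_rec = text_net(t)   # dissimilar text branch (same parameters as the matching branch)
```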
Formally, given an input triplet \(({\mathbf {r}},{\mathbf {s}},{\mathbf {t}})\), where \({\mathbf {r}}\) and \({\mathbf {s}}\) are a corresponding image-text pair and \({\mathbf {t}}\) is a randomly selected text representation, we model the consistency in the code layers. Suppose the image and text mappings are f and g, respectively. Then the code layers are calculated as \(f({\mathbf {r}};{{\mathbf {W}}_f})\), \(g({\mathbf {s}};{{\mathbf {W}}_g})\) and \(g({\mathbf {t}};{{\mathbf {W}}_g})\), where \({\mathbf {W}}_f\) and \({\mathbf {W}}_g\) are the weight parameters of the subnetworks. Since multiple modalities share common and modality-specific characteristics and the consistency is built on their common parts, we use only part of the code-layer nodes to model the consistency:
where \(L_1\) is a widely used triplet loss and \(\gamma \) is the size of the margin. \({\mathbf {x}}\), \({\mathbf {y}}\) and \({\mathbf {z}}\) are the code layers of \({\mathbf {r}}\), \({\mathbf {s}}\) and \({\mathbf {t}}\), respectively. \({\mathbf {I}}_d\) is a diagonal matrix whose first d diagonal elements are 1 and the rest are 0. Through this matrix, we force the first d nodes to represent the shared factors.
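For concreteness, a margin-based triplet loss consistent with these definitions (the squared Euclidean distance and the hinge form are assumptions here, not necessarily the exact formulation) can be written as

\(L_1 = \max \big (0,\ \gamma + \Vert {\mathbf {I}}_d{\mathbf {x}} - {\mathbf {I}}_d{\mathbf {y}}\Vert _2^2 - \Vert {\mathbf {I}}_d{\mathbf {x}} - {\mathbf {I}}_d{\mathbf {z}}\Vert _2^2\big ).\)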
After obtaining the code layer, we reconstruct each modality. By using all nodes of the code layer to reconstruct a modality, the modality-specific factors can be represented by the nodes other than the shared part. The complementarity can then be explored by combining the shared and specific factors. The reconstruction loss is written as:
where \(L_2\) is the reconstruction loss, \({\mathbf {p}}\) represents an image or a text, \(\varTheta \) denotes the parameters of the encoder and decoder networks, and \(\tilde{{\mathbf {p}}}\) is the reconstructed image or text.
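A typical instantiation, assuming a squared-error reconstruction penalty, is

\(L_2({\mathbf {p}};\varTheta ) = \Vert {\mathbf {p}} - \tilde{{\mathbf {p}}}\Vert _2^2.\)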
Then the final loss for an input triplet \(({\mathbf {r}},{\mathbf {s}},{\mathbf {t}})\) is:
where \(\alpha \) is a parameter balancing the two terms. In summary, minimizing the loss function defined in Eq. 3 enables the triadic autoencoder to explore the consistency and complementarity among multiple modalities simultaneously.
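Continuing the PyTorch sketch above, and assuming a hinge triplet loss on the first d code nodes plus squared-error reconstruction summed over the three inputs (the exact grouping and the values of d, gamma and alpha are assumptions), one training step could look like this:

```python
# One training step of the triadic autoencoder under the assumed loss forms.
# Uses image_net, text_net and the triplet (r, s, t) from the earlier sketch.
import torch
import torch.nn.functional as F

d, gamma, alpha = 20, 1.0, 0.5        # assumed values, not the paper's settings

x, r_rec = image_net(r)
y, s_rec = text_net(s)
z, t_rec = text_net(t)

# Consistency: triplet loss on the first d (shared) code nodes only.
pos = ((x[:, :d] - y[:, :d]) ** 2).sum(dim=1)
neg = ((x[:, :d] - z[:, :d]) ** 2).sum(dim=1)
l1 = F.relu(gamma + pos - neg).mean()

# Complementarity: reconstruct each input from its full (shared + specific) code.
l2 = F.mse_loss(r_rec, r) + F.mse_loss(s_rec, s) + F.mse_loss(t_rec, t)

loss = l1 + alpha * l2
loss.backward()                        # back-propagation through all three branches
```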
3.2 Deep Architecture
Generally, data from multiple modalities consist of heterogeneous feature sets, and these descriptors may differ greatly in their levels of semantic abstraction. Thus, a single autoencoder layer can hardly capture the consistency and complementarity characteristics. To alleviate this problem, a deep architecture is proposed. More specifically, several stacked modality-friendly models are utilized to learn higher-level representations of each modality, which reduces the semantic difference and enables better exploration of the multi-modal characteristics.
In practice, we use several restricted Boltzmann machines (RBMs) to extract high-level features. Briefly, an RBM is an undirected graphical model with a visible layer and a hidden layer; each layer consists of stochastic binary units, and there are no connections between units within the same layer. Since the raw inputs are not binary, extended RBMs are used for the visible layers of the image and text modalities: a Gaussian RBM (GRBM) [19] models the real-valued image feature vectors, and a replicated softmax RBM (RSM) [19] models the discrete sparse word-count vectors of the text. The hidden layers of the GRBM and RSM then serve as the visible layers of basic RBMs. After several stacked RBMs, high-level representations are extracted and fed into the triadic autoencoder network.
For parameter inference, all RBMs can be efficiently learned with the contrastive divergence (CD) approximation algorithm. As for the triadic autoencoder network, each autoencoder can be initialized with an RBM, and the parameters are then optimized with respect to Eq. 3 through the back-propagation algorithm. Note that back-propagation could be applied to the entire network, but for simplicity we only fine-tune the triadic autoencoder network.
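As a rough illustration of contrastive divergence, the following numpy sketch performs one CD-1 update for a basic binary-binary RBM; the layer sizes, learning rate and mini-batch are arbitrary assumptions, and the Gaussian and replicated softmax variants used above differ in their visible-layer models.

```python
# One CD-1 update for a basic binary-binary RBM (numpy sketch; sizes and learning
# rate are arbitrary assumptions, not the paper's settings).
import numpy as np

rng = np.random.RandomState(0)
n_vis, n_hid, lr = 64, 32, 0.01
W = 0.01 * rng.randn(n_vis, n_hid)
b_vis, b_hid = np.zeros(n_vis), np.zeros(n_hid)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

v0 = (rng.rand(16, n_vis) > 0.5).astype(float)   # a mini-batch of binary visible vectors

# Positive phase: hidden probabilities and samples given the data.
h0_prob = sigmoid(v0 @ W + b_hid)
h0_sample = (rng.rand(*h0_prob.shape) < h0_prob).astype(float)

# Negative phase: one Gibbs step (reconstruction of visible, then hidden).
v1_prob = sigmoid(h0_sample @ W.T + b_vis)
h1_prob = sigmoid(v1_prob @ W + b_hid)

# CD-1 gradient approximation and parameter update.
W += lr * (v0.T @ h0_prob - v1_prob.T @ h1_prob) / v0.shape[0]
b_vis += lr * (v0 - v1_prob).mean(axis=0)
b_hid += lr * (h0_prob - h1_prob).mean(axis=0)
```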
4 Experiments
4.1 Datasets
Wiki Dataset: This widely used image-text dataset consists of 2,866 image-text pairs from 10 categories, conventionally split into 2,173 training and 693 testing pairs. As for the features, each text is represented by a 10-dimensional topic vector obtained through a topic model (Latent Dirichlet Allocation) [20], and each image is represented by a 128-dimensional SIFT feature. Similar to [21], we split the dataset into a training set of 1,300 pairs (130 pairs per class) and a testing set of 1,566 pairs.
Pascal VOC Dataset: This dataset is used in various multi-modal learning tasks and consists of 5,011/4,952 training/testing image-tag pairs from 20 categories. As for the features, each tag is represented by a 399-dimensional word frequency vector, and each image is encoded by a 512-dimensional GIST feature. For simplicity, we remove image-tag pairs whose tag features are all zero, as done in [21].
4.2 Tasks and Evaluation Metrics
Since our model aims to explore consistency and complementarity among multiple modalities, we perform two kinds of tasks, i.e., cross-modal retrieval and multi-modal clustering, which depend mainly on the consistency and complementarity respectively.
Cross-modal retrieval: We map the testing images and texts into the code layers and select the first d nodes as their final embeddings. The two kinds of embeddings can then be compared using the Euclidean distance. Finally, two cross-modal retrieval tasks, i.e., image query vs. text database and text query vs. image database, are conducted. As for the metrics, we use mean average precision (MAP) and precision-recall (PR) curves to evaluate the overall performance [22].
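This evaluation protocol can be sketched as follows: each image embedding queries the text embeddings by Euclidean distance, and average precision is computed from the induced ranking. The embeddings, labels and dimensions below are random placeholders, not outputs of our model.

```python
# Sketch of cross-modal retrieval evaluation: rank text embeddings by Euclidean
# distance to each image query and compute mean average precision (MAP).
# Embeddings and labels here are random placeholders.
import numpy as np

rng = np.random.RandomState(0)
d = 20                                     # number of shared code nodes (assumed)
img_emb = rng.rand(50, d)                  # first d code nodes of the image branch
txt_emb = rng.rand(50, d)                  # first d code nodes of the text branch
labels = rng.randint(0, 10, size=50)       # category labels

def average_precision(relevant):
    # relevant: boolean array over the ranked retrieval list
    hits = np.cumsum(relevant)
    precisions = hits / (np.arange(len(relevant)) + 1)
    return (precisions * relevant).sum() / max(relevant.sum(), 1)

aps = []
for q in range(len(img_emb)):
    dists = np.linalg.norm(txt_emb - img_emb[q], axis=1)   # Euclidean distances
    ranking = np.argsort(dists)                            # closest first
    aps.append(average_precision(labels[ranking] == labels[q]))
print("Image query -> Text database MAP:", np.mean(aps))
```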
Multi-modal clustering: We map the testing images and texts into the code layers and concatenate all the code layers as their final embeddings. We then cluster these embeddings with the k-means algorithm. As for the metrics, five widely used measures [12], i.e., accuracy (ACC), normalized mutual information (NMI), F-measure (F1), R-Index (RI) and Entropy, are used for performance evaluation. For the first four metrics, higher values indicate better performance; for Entropy, lower values are better.
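Similarly, the clustering evaluation can be sketched as follows: the full code layers of the two modalities are concatenated, clustered with k-means, and scored here with NMI via scikit-learn (one of the five metrics). The code-layer values and labels are again placeholders.

```python
# Sketch of the clustering evaluation: concatenate the full image and text code
# layers, cluster with k-means, and score with NMI. Embeddings are placeholders.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

rng = np.random.RandomState(0)
img_code = rng.rand(50, 40)                # full code layer of the image branch
txt_code = rng.rand(50, 40)                # full code layer of the text branch
labels = rng.randint(0, 10, size=50)

embeddings = np.concatenate([img_code, txt_code], axis=1)
pred = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(embeddings)
print("NMI:", normalized_mutual_info_score(labels, pred))
```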
4.3 Compared Methods
PLS [3], BLM [4] and CCA [2] are three representative unsupervised methods that use pairwise information for embedding learning. CorrAE: Feng et al. [7] proposed a correspondence autoencoder that learns similar factors for multiple modalities. DCCA: Andrew et al. [5] extended traditional canonical correlation analysis to a deep architecture.
CDFE [13], GMLDA [14] and GMMFA [14] are three typical supervised multi-modal learning methods. They use labels to obtain relatively discriminative subspaces and thus enhance performance. We compare with these methods to further validate the effectiveness of our unsupervised model.
Our method is denoted as LSSF. Besides, the variant of LSSF that does not divide the code layer into shared and specific factors is denoted as BaseF. Comparing with this variant further validates the effectiveness of fully exploring the multi-modal characteristics.
4.4 Cross-Modal Retrieval
For the Wiki dataset, all features are low dimensional and well extracted, so we do not use RBMs for feature preprocessing. For the VOC dataset, we use two RBM layers to extract high-level representations. The parameter d, which determines how many code-layer nodes represent the shared factors, is empirically set to half the size of the code layer.
The MAP results of the image query and text query tasks on the Wiki and VOC datasets are shown in Tables 1 and 2, respectively. Overall, our method outperforms almost all the compared methods on both datasets. Compared with BaseF, our method learns both shared and modality-specific factors, which better matches the characteristics of multi-modal data.
Compared with PLS, CCA and BLM, our method performs much better because we use a triplet loss to model the relation between images and texts, which may be more effective than the pairwise relation. More importantly, consistency and complementarity are considered simultaneously in our model. CorrAE and DCCA learn similar factors for multiple modalities, but neither can fully discover the multi-modal characteristics, so our method also outperforms them.
Finally, the precision-recall curves of the image query and text query tasks on the Wiki and VOC datasets are shown in Figs. 1 and 2, respectively. The results are consistent with the MAP results, which further validates the effectiveness of our method.
4.5 Multi-modal Clustering
The training settings for the clustering task are the same as those for the retrieval task. The clustering results on the Wiki and VOC datasets are shown in Tables 3 and 4, respectively. Overall, our method beats almost all the competing methods on the two datasets, which validates that our model can well discover the complementarity characteristic. Together with the retrieval results, we can conclude that jointly considering the consistency and complementarity characteristics of multi-modal data improves learning performance.
5 Conclusion
In this paper, we have proposed a novel multi-modal learning method. By learning shared and modality-specific factors for each modality through a triadic autoencoder network, our model can explore the consistency and complementarity characteristics among multiple modalities simultaneously. Extensive experiments on cross-modal retrieval and multi-modal clustering validate the proposed method against state-of-the-art methods.
References
Gonen, M., Alpaydin, E.: Multiple kernel learning algorithms. J. Mach. Learn. Res. 12, 2211–2268 (2011)
Kim, T.-K., Kittler, J., Cipolla, R.: Discriminative learning and recognition of image set classes using canonical correlations. IEEE Trans. Pattern Anal. Mach. Intell. 29(6), 1005–1018 (2007)
Rosipal, R., Krämer, N.: Overview and recent advances in partial least squares. In: Saunders, C., Grobelnik, M., Gunn, S., Shawe-Taylor, J. (eds.) SLSFS 2005. LNCS, vol. 3940, pp. 34–51. Springer, Heidelberg (2006). https://doi.org/10.1007/11752790_2
Tenenbaum, J.B., Freeman, W.T.: Separating style and content with bilinear models. Neural Comput. 12(6), 1247–1283 (2000)
Andrew, G., Arora, R., Bilmes, J., Livescu, K.: Deep canonical correlation analysis. In: International Conference on Machine Learning, pp. 1247–1255 (2013)
Huang, Y., Wang, W., Wang, L.: Unconstrained multimodal multi-label learning. IEEE Trans. Multimedia 17(11), 1923–1935 (2015)
Feng, F., Wang, X., Li, R.: Cross-modal retrieval with correspondence autoencoder. In: ACM International Conference on Multimedia, pp. 7–16 (2014)
Chang, S., Han, W., Tang, J., Qi, G.-J., Aggarwal, C.C., Huang, T.S.: Heterogeneous network embedding via deep architectures. In: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 119–128 (2015)
Wang, W., Arora, R., Livescu, K., Bilmes, J.: On deep multi-view representation learning: objectives and optimization. In: arXiv (2016)
Cao, Y., Long, M., Wang, J., Liu, S.: Collective deep quantization for efficient cross-modal retrieval. In: AAAI Conference on Artificial Intelligence (2017)
Yin, Q., Wu, S., Wang, L.: Unified subspace learning for incomplete and unlabeled multi-view data. Pattern Recogn. (2017)
Kumar, A., Daume III, H.: A co-training approach for multi-view spectral clustering. In: International Conference on Machine Learning, pp. 393–400 (2011)
Lin, D., Tang, X.: Inter-modality face recognition. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3954, pp. 13–26. Springer, Heidelberg (2006). https://doi.org/10.1007/11744085_2
Sharma, A., Kumar, A., Daume III, H.: Generalized multiview analysis: a discriminative latent space. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 2160–2167 (2012)
Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., Ng, A.Y.: Multimodal deep learning. In: International Conference on Machine Learning, pp. 689–696 (2011)
Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A.C., Salakhutdinov, R., Zemel, R.S., Bengio, Y.: Show, attend and tell: neural image caption generation with visual attention. In: International Conference on Machine Learning, pp. 2048–2057 (2015)
Nam, H., Ha, J.-W., Kim, J.: Dual attention networks for multimodal reasoning and matching. In: arXiv (2016)
Huang, Y., Wang, W., Wang, L.: Instance-aware image and sentence matching with selective multimodal LSTM. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 2310–2318 (2017)
Srivastava, N., Salakhutdinov, R.: Multimodal learning with deep Boltzmann machines. J. Mach. Learn. Res. 15, 2949–2980 (2014)
Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
Wang, K., He, R., Wang, W., Wang, L., Tan, T.: Learning coupled feature spaces for cross-modal matching. In: IEEE International Conference on Computer Vision, pp. 2088–2095 (2013)
Rasiwasia, N., Pereira, J.C., Coviello, E., Doyle, G., Lanckriet, G.R.G., Levy, R., Vasconcelos, N.: A new approach to cross-modal multimedia retrieval. In: ACM Conference on Multimedia, pp. 251–260 (2010)