Information Fusion

Volume 91, March 2023, Pages 612-622

Full length article
Instance-wise multi-view representation learning

https://doi.org/10.1016/j.inffus.2022.11.006

Highlights

  • Improving multi-view learning performance from the perspective of model input.

  • Dynamically selecting dimensions for different samples in representation learning.

  • Adopting an alternating learning mechanism to optimize all networks.

  • Conducting experiments on diverse datasets and tasks to verify the effectiveness of the model.

Abstract

Multi-view representation learning aims to integrate information from multiple views of the data to improve task performance. The information contained in multi-view data is usually complex: not only do different views contain different information, but different samples of the same view also contain different information. In multi-view representation learning, most existing methods either simply treat each view/sample with equal importance or set fixed or dynamic weights for different views/samples, which is not fine-grained enough to capture the information carried by the individual dimensions of each sample and causes information redundancy, especially for high-dimensional samples. In this paper, we propose a novel unsupervised multi-view representation learning method based on instance-wise feature selection. A main advantage of instance-wise feature selection is that one can dynamically select, for each sample, the dimensions that favor both view-specific representation learning and view-shared representation learning, thereby improving performance from the perspective of the model input. The proposed method consists of a selector network, a view-specific network and a view-shared network. Specifically, the selector network is used to obtain the selection template, which selects a different number of dimensions conducive to representation learning from each sample to address the sample heterogeneity problem; the view-specific network and the view-shared network are used to extract the view-specific and view-shared representations, respectively. The selector network, view-shared network, and view-specific network are optimized alternately. Extensive experiments on various multi-view datasets with clustering and multi-label classification tasks demonstrate that the proposed method outperforms state-of-the-art multi-view learning methods.

Introduction

In the real world, an object can be characterized by different types of data simultaneously, and each type of data can be regarded as a specific view. Such data are commonly referred to as multi-view data, and they exist widely in practical applications [1]. For example, in web page classification, a web page can often be described by three views: the text content of the web page itself, the corresponding images, and the anchor text of any web page linking to it. In image recognition tasks, an image can be represented by diverse views (visual features), such as color histograms [2], texture descriptors, SIFT [3] and HOG [4]. Different types of data often complement each other, providing more comprehensive information [5]. Describing objects from different perspectives with different features helps people understand objects comprehensively and represent them more accurately [6]. Accordingly, conducting representation learning on multi-view data has the potential to improve generalization performance [7].

Multi-view representation learning aims to learn representations of multi-view data that facilitate the extraction of useful information when developing prediction models [8]. To effectively explore multi-view data, a series of methods have been proposed in recent years [9]. The most naive multi-view representation learning method is to directly concatenate all views into one representation. However, since the features and distributions of different views may differ, simple concatenation may ignore the unique statistical attributes of each view and the relationships between views, and lead to data redundancy [10], [11]. In view of this, Canonical Correlation Analysis (CCA) [12] and its variants [9], [13], [14] have been proposed, which mainly maximize the consistency of multiple views by projecting different views into a common subspace. Nevertheless, CCA-based methods usually ignore the view-specific information of each view, which may degrade the quality of the learned representation. Considering this, some methods have been proposed to explore both the consistency and the complementary information of multi-view data in representation learning [15], [16]. The above methods have effectively promoted the development of multi-view representation learning, but all of them treat each view with equal importance. It should be noted that different views of the data may contain different amounts of information due to sensor or environmental factors. Furthermore, different samples of the same view are usually heterogeneous [5]. To solve these problems, some methods assign fixed or dynamic weights to views or samples. For instance, Geng et al. proposed a method that can dynamically adjust the sample weight in representation learning according to the quality of the sample [5]. Although these methods take view or sample differences into account and are more effective than equal-weighting methods, they cannot accurately capture the information carried by the internal dimensions of each sample, and they cause information redundancy when learning different view representations, especially for high-dimensional samples. That is, they can only deal with the heterogeneity problem at the global level of views or samples, whereas the quality of a sample is directly determined by the information in its internal dimensions (features). Understanding which dimensions (features) are most relevant to an outcome or to a model output is an important first step in improving results and interpretability [17]. In addition, for heterogeneous samples, the representation learned by a model (for prediction) may rely on a different subset of the dimensions for different subgroups within the samples [18]. In essence, it is better to mine information from each sample by directly selecting different dimensions from different samples than by setting weights on views or samples globally.
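To make the distinction concrete, the toy sketch below contrasts global sample weighting with instance-wise dimension selection. The binary masks here are random stand-ins for what a learned selector would produce, so everything in it is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 6))  # 4 samples, 6 dimensions in one view

# Global weighting: one scalar per sample scales every dimension equally,
# so uninformative dimensions are kept (merely down-weighted).
sample_weights = np.array([0.9, 0.4, 0.7, 1.0])
X_weighted = sample_weights[:, None] * X

# Instance-wise selection: a binary template per sample keeps only the
# dimensions judged useful for that particular sample.
masks = rng.integers(0, 2, size=X.shape)  # stand-in for a learned selector
X_selected = masks * X

print(X_weighted.shape, X_selected.shape)  # both (4, 6); X_selected is sparse
```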

To tackle the above issues, we present a novel instance-wise unsupervised multi-view representation learning method. Benefiting from the idea of instance-wise feature selection, the proposed method can dynamically select, for each sample, the dimensions conducive to view-specific and view-shared representation learning to deal with the sample heterogeneity problem, rather than setting a weight for each sample globally. Specifically, we introduce three components into the model: the selector network, the view-specific network and the view-shared network. The view-shared network guides the selector network to obtain the selection template that selects the dimensions conducive to view-shared representation learning. Considering that view-shared representations and view-specific representations are independent (but complementary) [16], we make the view-shared selection template and the view-specific selection template complementary. That is, the dimensions conducive to view-shared representation learning and those conducive to view-specific representation learning are complementary; together, they form the complete sample. In our method, the view-specific network and the view-shared network are also optimized alternately. Comprehensive experiments on various datasets and tasks demonstrate the effectiveness of the proposed method.
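The excerpt does not spell out the gating mechanism, so the following is only a minimal sketch of the three-component design under one plausible choice: a straight-through Bernoulli relaxation for the selection template, with the view-specific template taken as the complement of the view-shared one. All module and variable names are hypothetical.

```python
import torch
import torch.nn as nn

class SelectorNet(nn.Module):
    """Produces a per-sample binary selection template over input dimensions."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, x):
        p = self.net(x)                               # selection probabilities
        # Straight-through Bernoulli: discrete mask forward, gradient via p.
        return (torch.bernoulli(p) - p).detach() + p

class OneViewModel(nn.Module):
    """One view: complementary templates route dimensions to two encoders."""
    def __init__(self, dim, rep_dim):
        super().__init__()
        self.selector = SelectorNet(dim)
        self.shared_enc = nn.Linear(dim, rep_dim)     # view-shared network
        self.specific_enc = nn.Linear(dim, rep_dim)   # view-specific network

    def forward(self, x):
        m_shared = self.selector(x)                   # template for shared dims
        m_specific = 1.0 - m_shared                   # complementary template
        return self.shared_enc(m_shared * x), self.specific_enc(m_specific * x)

model = OneViewModel(dim=64, rep_dim=16)
z_shared, z_specific = model(torch.randn(8, 64))
print(z_shared.shape, z_specific.shape)               # [8, 16] and [8, 16]
```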

The main contributions of this paper are summarized as follows:

  • We provide a new insight into unsupervised multi-view representation learning, improving performance from the perspective of model input.

  • We introduce a strategy that can dynamically select the dimensions conducive to view-specific and view-shared representation learning for different samples. In this way, information is mined from each sample according to its own characteristics, so that every sample plays its unique role in representation learning.

  • We adopt an alternating learning mechanism for selection template learning, view-specific representation learning, and view-shared representation learning within a unified framework, so that they can improve each other adaptively (a minimal training-loop sketch follows this list).

  • Extensive experimental results verify the effectiveness of the proposed method on diverse benchmark datasets for both clustering and multi-label classification tasks.
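The excerpt does not give the concrete objectives, so the skeleton below only illustrates the alternating mechanism itself: one optimizer updates the encoder with the selector frozen, then another updates the selector with the encoder fixed. The networks and losses (a placeholder reconstruction-style term plus a sparsity regularizer) are hypothetical stand-ins.

```python
import torch
import torch.nn as nn

selector = nn.Linear(64, 64)   # stand-in for the selector network
encoder = nn.Linear(64, 16)    # stand-in for the view-specific/shared networks

opt_sel = torch.optim.Adam(selector.parameters(), lr=1e-3)
opt_enc = torch.optim.Adam(encoder.parameters(), lr=1e-3)

x = torch.randn(32, 64)
for step in range(100):
    # Step 1: update the encoder while the selector stays fixed.
    with torch.no_grad():
        mask = torch.sigmoid(selector(x))
    loss_enc = encoder(mask * x).pow(2).mean()        # placeholder objective
    opt_enc.zero_grad()
    loss_enc.backward()
    opt_enc.step()

    # Step 2: update the selector; only opt_sel steps, so the encoder is fixed.
    mask = torch.sigmoid(selector(x))
    loss_sel = encoder(mask * x).pow(2).mean() + 1e-3 * mask.mean()  # + sparsity
    opt_sel.zero_grad()
    loss_sel.backward()
    opt_sel.step()
```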

Section snippets

Multi-view representation learning

For unsupervised multi-view representation learning, a straightforward way is to concatenate all views into a single view and conduct the downstream tasks on this single view. However, this approach ignores the inherent structures and specific statistical properties of the different views [10], [11]. Recently, a large number of multi-view representation learning methods have been proposed and have made great progress. One classical family of multi-view representation methods is CCA-based methods, such …
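As a point of reference for the CCA family discussed above, the snippet below runs classical CCA on two synthetic views that share a low-dimensional signal; the data and dimensions are made up for illustration.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))   # signal shared by both views
X1 = latent @ rng.normal(size=(2, 10)) + 0.1 * rng.normal(size=(200, 10))
X2 = latent @ rng.normal(size=(2, 8)) + 0.1 * rng.normal(size=(200, 8))

cca = CCA(n_components=2)
Z1, Z2 = cca.fit_transform(X1, X2)   # projections into a common subspace

# Canonical correlations close to 1 mean the views agree in the subspace.
print([round(np.corrcoef(Z1[:, k], Z2[:, k])[0, 1], 3) for k in range(2)])
```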

Method

In this section, we introduce the proposed method. Suppose that we have a multi-view dataset $\mathcal{X} = \{X^{(1)}, \ldots, X^{(V)}\}$, where $X^{(v)} \in \mathbb{R}^{n \times d_v}$ is the feature matrix of the $v$-th view, and $V$, $n$ and $d_v$ denote the number of views, the number of samples, and the dimensionality of the feature space of the $v$-th view, respectively. The illustration of the proposed instance-wise multi-view representation learning framework is shown in Fig. 1. As shown in Fig. 1, the proposed method consists of selector network, view-shared …
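To make the notation concrete, a toy container for $\mathcal{X}$ might look as follows; the sample count and per-view dimensionalities are invented for illustration.

```python
import numpy as np

# V views, n samples; view v has d_v dimensions, mirroring X^(v) in R^{n x d_v}.
n, dims = 400, [512, 59, 864]   # hypothetical sizes for three views
rng = np.random.default_rng(0)
X = [rng.normal(size=(n, d)) for d in dims]   # X[v] plays the role of X^(v)

V = len(X)
for v in range(V):
    assert X[v].shape == (n, dims[v])
print(f"V={V}, n={n}, d_v={dims}")
```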

Datasets

We give a brief description of the four multi-view clustering datasets and the three multi-view multi-label datasets used in the experiments in Table 1 and Table 2, respectively.

Multi-view clustering datasets. The ORL dataset contains 400 face images from 40 categories; three types of features (intensity, LBP and Gabor) are used as different views. The MSRCV1 [33] dataset contains images of 7 categories, with 30 images per category. Six types of features (CENT, CMT, GIST, HOG, LBP and SIFT) are …

Conclusion

In this paper, we propose a novel instance-wise multi-view representation learning method. Different from existing multi-view representation learning methods, our method can automatically select, from each sample, the data dimensions that are conducive to view-shared and view-specific representation learning, respectively. We introduce the idea of the actor–critic model for selection template learning, which can flexibly select different relevant dimensions for different samples, so that …
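The conclusion mentions an actor–critic model for selection template learning, but the excerpt omits the details, so the sketch below only shows the general shape of such an update: the selector acts as the actor sampling a binary template, a critic baseline reduces the variance of the REINFORCE gradient, and the reward is a placeholder (negative) representation loss. Every network and hyperparameter here is a hypothetical stand-in, not the paper's formulation.

```python
import torch
import torch.nn as nn

dim = 64
actor = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())  # selector (actor)
critic = nn.Linear(dim, 1)                                # baseline estimator
encoder = nn.Linear(dim, 16)                              # fixed stand-in network
opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)

x = torch.randn(32, dim)
p = actor(x)                                   # per-dimension selection probs
mask = torch.bernoulli(p.detach())             # sampled template (the action)
with torch.no_grad():
    reward = -encoder(mask * x).pow(2).mean(dim=1, keepdim=True)  # placeholder

baseline = critic(x)
log_prob = (mask * torch.log(p + 1e-8)
            + (1 - mask) * torch.log(1 - p + 1e-8)).sum(dim=1, keepdim=True)
# REINFORCE with baseline for the actor; squared error for the critic.
loss = (-(log_prob * (reward - baseline).detach()).mean()
        + (reward - baseline).pow(2).mean())
opt.zero_grad()
loss.backward()
opt.step()
```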

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (12101536), Natural Science Foundation of Shandong Province, China (ZR2022QF064), Natural Science Foundation of the Jiangsu Higher Education Institutions of China under Grant (21KJB510040) and Open Fund project of Jiangsu Industrial Perception and Intelligent Manufacturing Equipment Engineering Research Center (ZK21-05-10).

References (38)

  • H. Zhao, Z. Ding, Y. Fu, Multi-view clustering via deep matrix factorization, in: Proceedings of the AAAI Conference on...
  • W. Wang et al., On deep multi-view representation learning
  • C. Xu et al., Multi-view intact space learning, IEEE Trans. Pattern Anal. Mach. Intell. (2015)
  • L. Zhao et al., Co-learning non-negative correlated and uncorrelated features for multi-view data, IEEE Trans. Neural Netw. Learn. Syst. (2020)
  • H. Hotelling, Relations between two sets of variates, Biometrika (1936)
  • F.R. Bach et al., Kernel independent component analysis, J. Mach. Learn. Res. (2002)
  • G. Andrew et al., Deep canonical correlation analysis
  • C. Zhang, Y. Liu, H. Fu, AE2-nets: Autoencoder in autoencoder networks, in: IEEE Conference on Computer Vision and...
  • X. Wu, Q.-G. Chen, Y. Hu, D. Wang, X. Chang, X. Wang, M.-L. Zhang, Multi-view multi-label learning with view-specific...