
Information Sciences, Volume 524, July 2020, Pages 148-164

Joint specific and correlated information exploration for multi-view action clustering

https://doi.org/10.1016/j.ins.2020.03.029

Highlights

  • A unified framework for multi-view action clustering is proposed.

  • A novel bag-of-shared-words model is designed for discriminative feature representation of multi-view actions.

  • A new joint information bottleneck method is presented for multi-view action clustering.

Abstract

Human action clustering is crucial in many practical applications. However, existing action clustering methods work in a single-view manner, ignoring the relationships among different views and failing to discover correct clusters as the viewpoint and position change. To address these challenges, we propose a unified framework for multi-view human action clustering. First, we design a new Bag-of-Shared-Words (BoSW) model to discover the view-shared visual words that preserve the consistency among visual words of different views, yielding a more discriminative feature representation from which the view correlation can be fully explored. Second, we present a novel JOint INformation boTtleneck (JOINT) algorithm to jointly exploit both the view-specific and view-correlated information to improve the action clustering performance. Specifically, JOINT formulates the problem as minimizing an information loss function, which compresses the actions of each view while jointly preserving the complementary view-specific information and the correlated information among views. To solve the proposed objective function, a new sequential procedure is presented that guarantees convergence to a locally optimal solution. Extensive experiments on three challenging multi-view single-person and interactive action datasets demonstrate the superiority of our algorithm.

Introduction

Human action recognition [1], [2] is an active research area in the computer vision community and is critical for various applications such as video summarization, video surveillance and human-computer interaction. However, most recognition methods depend heavily on large volumes of labelled videos, which may make them unsuitable for certain practical applications. Intuitively, it is therefore preferable to learn human action categories in an unsupervised fashion.

To date, numerous human action clustering algorithms [3], [4], [5] have been proposed and applied to many practical applications, such as the automatic annotation of action video databases and fast content-based video retrieval. Although these methods have shown their advantages, two serious problems remain. (1) Difficulty in discovering correct clusters as the view angle and position change. Despite the fact that many powerful feature extraction methods, such as the spatio-temporal interest point [6], have been proposed, the clustering performance may degrade significantly as the view angle changes. For instance, the same action observed from four cameras/views may look quite different, and the extracted visual features corresponding to some views become less discriminative due to limitations such as self-occlusion [7], as shown in Fig. 1. (2) Neglect of the correlated information among multiple views. An action observed from different viewpoints naturally has diverse characteristics; however, abundant correlated information still exists among the views. Intuitively, exploiting this multi-view correlation can improve the action clustering performance. One direct approach is to concatenate multiple views and feed the result into a typical single-view action clustering method. However, this strategy is merely a naive fusion of multi-view action videos and fails to explore the underlying correlation among views.

Given these challenges, many multi-view action recognition methods have been proposed, which usually explore and leverage the relationships among views to improve the recognition accuracy. They can be roughly categorized into two classes. (1) Multi-task strategy-based methods [10], [11], [12], which learn a set of recognition models simultaneously by discovering task relations. For example, Yan et al. [10] propose a multi-task multi-class linear discriminant analysis learning framework for multi-view action recognition that shares discriminative features among multiple views. (2) Feature representation learning-based methods [13], [14], [15], [16], [17], which generally learn discriminative or view-invariant feature representations for action recognition. For instance, the authors of [15] introduce a 4D (3D space + time) spatio-temporal interest point detector and a histogram-of-3D-flow descriptor to obtain a view-invariant feature representation for multi-view action recognition. However, the methods above assume that sufficient labelled multi-view action data are available, which may not be practical for many complex applications. Moreover, it is difficult for human labelers with different backgrounds and prior knowledge to assign specific action types to video samples in a uniform manner.

In this paper, we focus on developing a new multi-view action clustering method that simultaneously addresses the challenges faced by single-view action clustering and multi-view action recognition methods. Multi-view action clustering is truly beneficial in many practical applications, such as surveillance behavior understanding and content-based video retrieval. Taking abnormal behavior analysis of surveillance video at crossroads as an example, it is commonly seen at zebra crossings that some pedestrians play with or talk on their cellphones, as in the action shown in Fig. 1 a), which is quite dangerous and requires a trained real-time system to dispatch a warning. However, it is expensive to collect a large number of manually labeled multi-view action videos at crossroads for training. In this case, we can resort to multi-view action clustering to help human labelers. However, one of the most challenging issues is the difficulty of learning the underlying relationships among views in an unsupervised setting.

To effectively learn and leverage the relationships among views, many multi-view clustering (MVC) methods have been proposed from different perspectives and have shown promising clustering performance. Most of them adopt one of the following two strategies. (1) Combination strategy [18], [19], [20], [21], [22], [23], [24], [25], as shown in Fig. 2 a). For instance, Cai et al. [19] propose a multi-view k-means clustering method that combines different feature representations of the data for clustering. However, this type of MVC method only combines the view-specific features or cluster information and fails to capture the correlations among views. (2) Correlation strategy [26], [27], [28], as shown in Fig. 2 b). For example, Xia et al. [27] introduce a robust multi-view spectral clustering method that learns a shared low-rank transition probability matrix, which is then clustered by the standard Markov chain method. However, this type of MVC method concentrates solely on the correlations across views and ignores the specific information of each view, which may lead to performance degradation.

Recently, many visual feature representation methods [6], [29], [30], [31], [32] have been designed for human action recognition and have shown impressive performance. With the descriptors above, the popular bag-of-visual-words (BoVW) model [33] can be used for the feature representation of action videos. The BoVW model consists of three steps: visual descriptor extraction, vocabulary learning, and feature encoding. However, when confronted with actions from multiple views, constructing the BoVW representations independently for each view neglects the relatedness among them, which limits the descriptive ability of the BoVW model. To address this problem, Liu et al. [34] propose a bag-of-bilingual-words (BoBW) model that transfers knowledge across views by using the bipartite graph partitioning technique. However, the BoBW model faces several severe challenges: (1) it only works well for cross-view (i.e., two-view) modelling; (2) the bilingual words have limited descriptive ability when confronted with large view-angle differences in the cross-view scenario; (3) the model may yield sub-optimal bilingual words when the learned affinity matrix is not optimal for graph partitioning. The affinity matrix between visual words from pairwise views has to be built before applying bipartite graph partitioning; if this matrix fails to reflect the intrinsic similarity between view-specific words, the learned bilingual words may be poor.
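To make the three BoVW steps concrete, the following minimal Python sketch (our own illustration under assumed names and parameters, not the authors' exact pipeline) learns a visual vocabulary with k-means from pooled descriptors and encodes each video as a normalized word histogram; descriptor extraction (e.g., spatio-temporal interest points) is assumed to have been performed elsewhere.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_bovw(descriptor_sets, vocab_size=100, seed=0):
    """Vocabulary learning + feature encoding of a BoVW model.

    descriptor_sets: list of (n_i, d) arrays, one per video, holding
    pre-extracted local descriptors (e.g., from a STIP detector)."""
    pooled = np.vstack(descriptor_sets)                      # pool descriptors from all videos
    vocab = KMeans(n_clusters=vocab_size, n_init=10,
                   random_state=seed).fit(pooled)            # learn the visual vocabulary
    histograms = []
    for desc in descriptor_sets:
        words = vocab.predict(desc)                          # quantize descriptors to words
        hist = np.bincount(words, minlength=vocab_size).astype(float)
        histograms.append(hist / max(hist.sum(), 1.0))       # normalized word histogram
    return vocab, np.vstack(histograms)

# toy usage: five "videos" with random 162-dimensional descriptors
rng = np.random.default_rng(0)
videos = [rng.normal(size=(rng.integers(50, 80), 162)) for _ in range(5)]
_, bovw = build_bovw(videos, vocab_size=20)
print(bovw.shape)                                            # (5, 20)
```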

In this study, a unified framework is proposed for multi-view human action clustering to answer the following two questions.

How can we effectively represent multi-view action videos and fully explore the correlations among views? We design a new bag-of-shared-words (BoSW) model to discover the view-shared visual words that preserve the consistency among visual words of different views, and thereby obtain a much more discriminative feature representation, from which the view correlation can be fully explored. The view correlation is exactly the relationship among the view-specific visual words. Taking the action “Call cellphone” in the M2I dataset [8] as an example, as shown in Fig. 1 a), we may obtain the semantic words “calling to somebody” and “cellphone” from the front and side views respectively, due to self-occlusion. These visual words across views are consistent and describe the same action, and their connections help to obtain discriminative view-shared words, i.e., the semantic shared words “call cellphone”. Unlike existing models such as BoBW, the proposed BoSW model has three key strengths: (1) it is suitable for multi-view modelling; (2) regardless of how the view angle changes, shared visual words can be effectively learned to keep the feature representation discriminative; (3) it can learn highly semantic shared words by using a hierarchical bottom-up method, in which the low-level view-specific visual words are transformed into high-level semantic view-shared words under a hierarchical organization, since the visual words across views actually describe the same action categories. A rough sketch of this word-sharing idea is given below.
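The exact hierarchical bottom-up procedure of BoSW is detailed in Section 4. Purely as an assumed illustration of the word-sharing idea (not the authors' algorithm), the sketch below groups view-specific visual words whose word-video occurrence profiles are similar into shared words, and then re-encodes each video over that shared vocabulary; all function names and parameters are hypothetical.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def learn_shared_words(view_bovw_list, n_shared=50):
    """Group view-specific words into shared words by bottom-up
    (agglomerative) clustering of their word-video occurrence profiles.

    view_bovw_list: list of (n_videos, vocab_size_v) BoVW matrices,
    one per view, rows aligned across views by video."""
    profiles = np.hstack(view_bovw_list).T                   # one row per view-specific word
    norms = np.linalg.norm(profiles, axis=1, keepdims=True)
    profiles = profiles / np.maximum(norms, 1e-12)           # length-normalize the profiles
    grouping = AgglomerativeClustering(n_clusters=n_shared)  # bottom-up (Ward) grouping
    return grouping.fit_predict(profiles)                    # shared-word index per word

def encode_bosw(view_bovw_list, word_to_shared, n_shared=50):
    """Re-encode each video over the shared vocabulary by pooling the
    mass of all view-specific words mapped to the same shared word."""
    stacked = np.hstack(view_bovw_list)                      # (n_videos, total_words)
    bosw = np.zeros((stacked.shape[0], n_shared))
    for j, s in enumerate(word_to_shared):
        bosw[:, s] += stacked[:, j]
    return bosw / np.maximum(bosw.sum(axis=1, keepdims=True), 1e-12)

# toy usage: two views of 4 videos, with 6-word and 5-word vocabularies
rng = np.random.default_rng(0)
views = [rng.random((4, 6)), rng.random((4, 5))]
mapping = learn_shared_words(views, n_shared=3)
print(encode_bosw(views, mapping, n_shared=3).shape)         # (4, 3)
```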

How can we appropriately combine the complementary information of multiple views and exploit the view correlation? We propose a novel JOint INformation boTtleneck (JOINT) algorithm that jointly exploits the complementary information from multiple views and the correlated information among views to improve the action clustering performance, as shown in Fig. 2 c). Specifically, the JOINT method formulates the problem as minimizing an information loss function, which compresses the actions of each view while jointly preserving the complementary view-specific information contained in the BoVW representations of the multiple views and the view-correlated information residing in the learned BoSW representation. To solve the proposed objective function, a new sequential procedure is presented that guarantees convergence to a locally optimal solution; a simplified sketch of such a sequential update follows this paragraph. Unlike most existing multi-view clustering methods, the proposed JOINT method has two major advantages: (1) it jointly explores and exploits the specific and correlated information from multiple views, so that the comprehensive information among views can be fully utilized for action cluster discovery; (2) the relevant/useful information is maximally preserved under the information bottleneck principle.
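The full JOINT objective and its sequential optimization are given in Section 5. For intuition only, the following sketch shows a generic single-view sequential information bottleneck update of the "draw-and-merge" kind, in which each action is drawn out of its cluster and re-assigned to the cluster whose merge loses the least information about the features; JOINT extends this idea to jointly preserve the view-specific (BoVW) and view-correlated (BoSW) information of all views, which is not reproduced here. Everything in the code is an assumption for illustration.

```python
import numpy as np

def js_divergence(p, q, w_p, w_q):
    """Weighted Jensen-Shannon divergence between distributions p and q."""
    m = w_p * p + w_q * q
    def kl(a, b):
        mask = a > 0
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))
    return w_p * kl(p, m) + w_q * kl(q, m)

def sequential_ib(pxy, n_clusters=3, n_sweeps=20, seed=0):
    """Generic single-view sequential IB: a 'draw-and-merge' pass that
    re-assigns each sample to the cluster with the smallest information loss."""
    rng = np.random.default_rng(seed)
    pxy = pxy / pxy.sum()                            # joint distribution p(x, y)
    px = pxy.sum(axis=1)
    py_given_x = pxy / px[:, None]                   # conditional p(y | x)
    labels = rng.integers(n_clusters, size=pxy.shape[0])
    for _ in range(n_sweeps):
        for i in rng.permutation(pxy.shape[0]):
            labels[i] = -1                           # draw sample i out of its cluster
            costs = np.zeros(n_clusters)
            for t in range(n_clusters):
                members = np.flatnonzero(labels == t)
                if members.size == 0:
                    continue                         # merging with an empty cluster is free
                pt = px[members].sum()
                py_given_t = pxy[members].sum(axis=0) / pt
                w = px[i] + pt                       # information loss of the candidate merge
                costs[t] = w * js_divergence(py_given_x[i], py_given_t,
                                             px[i] / w, pt / w)
            labels[i] = int(np.argmin(costs))        # merge into the cheapest cluster
    return labels

# toy usage: 6 actions described by 4 co-occurrence features
toy = np.array([[4, 1, 0, 0], [5, 2, 0, 0], [0, 0, 3, 4],
                [0, 1, 4, 5], [2, 2, 2, 2], [3, 3, 1, 1]], dtype=float)
print(sequential_ib(toy, n_clusters=2))
```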

With the two essential components of the proposed framework above, we can fully explore and exploit the view-specific and view-correlated information to boost the action clustering performance. Experimental results on three challenging multi-view single-person and interactive action datasets, including the IXMAS [35], WVU [9] and M2I [8] datasets, show the superiority of the proposed method.

The major contributions of this paper are summarized as follows:

  • A unified framework is proposed for the problem of multi-view human action clustering.

  • A novel bag-of-shared-words model is designed for discriminative feature representation of actions from multiple views, from which the view correlations can be fully explored.

  • To preserve the individual characteristics as well as the common knowledge among views, we propose a new joint information bottleneck algorithm to jointly exploit the complementary view-specific information and correlated information among views for action clustering.

  • To solve the proposed objective function, a new sequential procedure is presented to guarantee convergence to a local optimal solution.

The rest of this paper is organized as follows. Section 2 revisits preliminary knowledge on information bottleneck theory. In Section 3, we briefly illustrate the proposed framework. Sections 4 and 5 then present the two essential parts of the unified framework, namely the bag-of-shared-words model and the joint information bottleneck algorithm. Section 6 provides the experimental results, which demonstrate the effectiveness of the proposed method. Section 7 concludes the paper and discusses future work.


Revisit: Information bottleneck

Information bottleneck (IB) [36] is an information-theoretic principle that has been successfully used for human action recognition in both supervised [7] and unsupervised [37] fashions. In this section, we focus on the unsupervised aspect (i.e., clustering) of the information bottleneck. We start by defining the concept of mutual information [38], a vital measure that is used frequently in our work.

Definition 1

Let X be a set of data points and Y be its co-occurrence features, and denote X and Y as the corresponding discrete random variables with joint distribution p(x, y). The mutual information between X and Y is then defined as I(X; Y) = Σ_{x ∈ X, y ∈ Y} p(x, y) log [ p(x, y) / ( p(x) p(y) ) ], which quantifies how much information the two variables share.
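As a quick numerical illustration of Definition 1 (our own sketch, with an assumed toy joint distribution), the following snippet computes I(X; Y) directly from a joint probability table.

```python
import numpy as np

def mutual_information(pxy):
    """I(X; Y) = sum_{x,y} p(x,y) * log( p(x,y) / (p(x) p(y)) ), in nats."""
    pxy = pxy / pxy.sum()                          # normalize to a joint distribution
    px = pxy.sum(axis=1, keepdims=True)            # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)            # marginal p(y)
    mask = pxy > 0                                 # skip zero-probability cells
    return float(np.sum(pxy[mask] * np.log((pxy / (px * py))[mask])))

# a mildly dependent 2x2 joint distribution: I(X; Y) > 0
print(mutual_information(np.array([[0.3, 0.2],
                                   [0.1, 0.4]])))
```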

The proposed framework

In this paper, we propose a novel unified framework for multi-view human action clustering. The framework consists of two essential parts, feature representation learning and action clustering, as shown in Fig. 3.

Bag of shared words model

In this part, to address the challenge that the view-specific visual words extracted from different views (e.g., the views in Fig. 1) have limited descriptive ability and may lead to poor clustering performance, we attempt to discover the visual words shared among different views so as to construct a more discriminative BoSW feature representation, as shown in Fig. 4. The construction of the BoSW model mainly consists of three stages: (1) m BoVW models generation. The BoVW models of m views

Joint information bottleneck

In this section, we first briefly formulate the multi-view action clustering problem and then propose the novel JOINT algorithm, followed by its optimization method and complexity analysis. Finally, we discuss the relationship between the proposed method and related work.

Experiments

In this section, we conduct several experiments to validate the effectiveness of our proposed framework.

Conclusion and future work

In this paper, we propose a unified framework for multi-view human action clustering. First, a new BoSW model is designed to discover the view-shared words, from which the view correlations can be fully explored. Furthermore, we propose a novel JOINT algorithm for the problem of multi-view action clustering. JOINT can jointly exploit the complementary view-specific information and the view-correlated information embedded in the BoSW feature representation to improve the action clustering performance.

CRediT authorship contribution statement

Shizhe Hu: Conceptualization, Methodology, Formal analysis, Investigation, Writing - original draft. Xiaoqiang Yan: Validation, Visualization, Supervision, Writing - review & editing. Yangdong Ye: Methodology, Supervision, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. We confirm that neither the entire paper nor any part of its content has been published or has been accepted elsewhere. Moreover, it is not being submitted to any other journal.

Acknowledgments

This work was supported by the National Key R&D Program of China [grant number 2018YFB1201403] and the National Natural Science Foundation of China [grant number 61772475].

References (50)

  • S. Jones et al., Unsupervised spectral dual assignment clustering of human actions in context, Proc. IEEE Computer Vision and Pattern Recognition, 2014.

  • D. Wang et al., Fast robust non-negative matrix factorization for large-scale human action data clustering, Proc. International Joint Conference on Artificial Intelligence, 2016.

  • I. Laptev et al., Space-time interest points, Proc. IEEE International Conference on Computer Vision, 2003.

  • J. Liu et al., Learning human actions via information maximization, Proc. IEEE Computer Vision and Pattern Recognition, 2008.

  • A. Liu et al., Benchmarking a multimodal and multiview and interactive dataset for human action recognition, IEEE Trans. Cybernetics, 2017.

  • S. Ramagiri et al., Real-time multi-view human action recognition using a wireless camera network, Proc. ACM/IEEE International Conference on Distributed Smart Cameras, 2011.

  • Y. Yan et al., Multitask linear discriminant analysis for view invariant action recognition, IEEE Trans. Image Processing, 2014.

  • A. Liu et al., Multiple/single-view human action recognition via part-induced multitask structural learning, IEEE Trans. Cybernetics, 2015.

  • C. Rao et al., View-invariant representation and recognition of actions, International Journal of Computer Vision, 2002.

  • I.N. Junejo et al., Cross-view action recognition from temporal self-similarities, Proc. European Conference on Computer Vision, 2008.

  • M.B. Holte et al., A local 3-d motion descriptor for multi-view human action recognition from 4-d spatio-temporal interest points, J. Sel. Topics Signal Processing, 2012.

  • Y. Kong et al., Deeply learned view-invariant features for cross-view action recognition, IEEE Trans. Image Processing, 2017.

  • D. Wang et al., Dividing and aggregating network for multi-view action recognition, Proc. European Conference on Computer Vision, 2018.

  • A. Kumar et al., Co-regularized multi-view spectral clustering, Proc. Advances in Neural Information Processing Systems, 2011.

  • X. Cai et al., Multi-view k-means clustering on big data, Proc. International Joint Conference on Artificial Intelligence, 2013.