Recognising occluded multi-view actions using local nearest neighbour embedding

https://doi.org/10.1016/j.cviu.2015.06.003

Highlights

  • We propose a robust learning-free algorithm: local nearest neighbour embedding (LNNE).

  • We introduce 3 multi-view fusion scenarios to test the LNNE method.

  • We conduct extensive experiments on two multi-view action datasets with occlusions, where the LNNE method achieves significant performance improvements in all scenarios.

Abstract

Recent advances in multi-sensor technologies and algorithms have driven significant progress in human action recognition, especially for realistic scenarios. However, partial occlusion, a major obstacle in real-world applications, has not received sufficient attention in the action recognition community. In this paper, we extensively investigate how occlusion can be addressed by multi-view fusion. Specifically, we propose a robust representation called local nearest neighbour embedding (LNNE). We then extend the LNNE method to 3 multi-view fusion scenarios. Additionally, we provide a detailed analysis of the proposed voting strategy from a boosting point of view. We evaluate our approach on databases with both synthetic and realistic occlusions, and the LNNE method outperforms the state-of-the-art approaches in all tested scenarios.

Introduction

Human action recognition has received increasing attention during the past decades. It has a wide range of applications such as medical surveillance [1], smart homes [2] and human-machine interaction [3]. However, recognising multiple, complex human actions or activities remains a challenging problem [4]. So far, the majority of action recognition systems are restricted to a finite number of well-defined action categories, and performance is evaluated on actions cropped by detected bounding boxes [5], [8]. In realistic applications, current methods are still very sensitive to common variations, e.g., gender, body size, viewpoint and illumination changes, and occlusions [6], [7]. Among these problems, view variation and occlusion are two major, unavoidable hurdles for action recognition. As pessimistically concluded in [9], monocular computer vision systems are not competent enough for surveillance applications. Fortunately, advances in visual sensing technologies have made it possible to approach the action recognition problem using multi-view or range sensors [10], [11]. Hence, extensive studies have been conducted on view-invariant and transferable representations [6], [13], [14], [16], [17], [18], [19], [20]. Other works also consider multi-descriptor fusion approaches [22], [23]. Nonetheless, only a few techniques, such as [15], tackle the occlusion problem. Therefore, dealing with occlusion remains a pressing research problem in bridging the gap between existing action recognition algorithms and realistic applications [10].

Intuitively, the occlusion problem can be solved by a multi-view system, as shown in Fig. 1. If actions captured from one viewpoint are occluded, the information loss can be compensated by data from other views that are not occluded; the occlusion problem is thus transformed into a view-disparity problem. However, such a strategy raises two main difficulties. The first is how to suppress the intra-class distance caused by viewpoint variations. On this point, it is widely acknowledged that local descriptors are less susceptible to intra-class variations [24], [25], [26], [27], [28], and they are generally fused with holistic representations [29], [30], [31]. The second difficulty is that, in real-world applications, occlusions appear unpredictably in both training and testing data and, as a result, break the consistency of the holistic models between the two datasets.

In order to overcome these problems, this paper is devoted to investigating multi-view methods that can incorporate local descriptors and are robust to occlusions in both training and testing action datasets. Specifically, we adopt dense trajectories (DT) [24], which are further transformed into a robust higher-level representation and then used for multi-view fusion. We summarise our main contributions in the following 3 aspects: (1) we propose a robust learning-free algorithm: local nearest neighbour embedding (LNNE); (2) we introduce 3 multi-view fusion scenarios to test the LNNE method; (3) we conduct extensive experiments on two multi-view action datasets with occlusions, where the LNNE method achieves significant performance improvements in all scenarios.

The remainder of this paper is organised as follows: We introduce related work in Section 2. In Section 3, we describe the LNNE method in detail. We then illustrate the structures of the 3 fusion pipelines in Section 4. Detailed experimental results are presented with analysis and discussion in Sections 5 and 6. Finally, we conclude our work in Section 7.

Section snippets

Background

We review previous work from two main aspects. First, we review the basics of feature embedding techniques, aiming to provide an intuitive and generalised view of embedding, and we discuss their relation to our LNNE method. Second, we compare the proposed fusion scenarios with existing multi-view action recognition scenarios.

Local nearest neighbour embedding

In a typical action recognition task using local representations, an action sequence $x^{(i)}$ consists of an uncertain number $m_i$ of extracted local descriptors, $x^{(i)} = [d^{(1)}, \ldots, d^{(m_i)}]$. The training set is represented by pooling the descriptors from all training action instances into one matrix: given $I$ instances in total, $X = [x^{(1)}, \ldots, x^{(I)}] = [d^{(1)}, \ldots, d^{(n)}]$, where $n = m_1 + \cdots + m_I$. $Y = [y^{(1)}, \ldots, y^{(n)}]$ is a column vector indexing the action label of each training local descriptor.
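To make the pooled representation concrete, below is a minimal sketch (not the authors' code; the function name, shapes, and toy data are illustrative assumptions) that stacks per-instance descriptor arrays into the matrix $X$ and expands instance labels into the per-descriptor label vector $Y$.

```python
import numpy as np

def pool_training_set(instances, labels):
    """Stack per-instance descriptor arrays into one matrix X and expand
    instance labels into a per-descriptor label vector Y.

    instances : list of arrays; instance i has shape (m_i, D)
    labels    : list of action labels, one per instance
    """
    X = np.vstack(instances)                        # (n, D), n = m_1 + ... + m_I
    Y = np.concatenate([np.full(len(inst), lab)     # label repeated m_i times
                        for inst, lab in zip(instances, labels)])
    return X, Y

# Toy usage: 3 instances with varying descriptor counts, D = 4.
rng = np.random.default_rng(0)
insts = [rng.normal(size=(m, 4)) for m in (5, 7, 3)]
X, Y = pool_training_set(insts, labels=[0, 1, 0])
print(X.shape, Y.shape)  # (15, 4) (15,)
```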

Multi-view fusions

Based on the LNNE algorithm, we propose 3 multi-view fusion scenarios for action recognition according to Eq. (6). We discuss the advantages of each scenario and particularly explain how our methods can deal with occlusions. The problem of occlusion is illustrated in Fig. 1. Note that we redefine the training set as a group of individual per-view subsets $[X^{(1)}, X^{(2)}, \ldots, X^{(s)}]$ with label vectors $[Y^{(1)}, Y^{(2)}, \ldots, Y^{(s)}]$. Correspondingly, the query action is also represented over the same ordered views, $[\hat{x}^{(1)}, \hat{x}^{(2)}, \ldots, \hat{x}^{(s)}]$.
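Since Eq. (6) is not reproduced in this snippet, the following hedged sketch shows only the generic structure of such view-wise fusion: each view contributes descriptor-level nearest-neighbour votes that are accumulated into one prediction. Plain majority voting stands in for the actual LNNE rule, and all names are assumptions.

```python
import numpy as np
from collections import Counter

def vote_over_views(query_views, train_views, label_views):
    """query_views[v]: (m_v, D) descriptors of the query seen from view v.
    train_views[v]: (n_v, D) pooled training descriptors for view v.
    label_views[v]: (n_v,) per-descriptor action labels for view v."""
    votes = Counter()
    for q, X, Y in zip(query_views, train_views, label_views):
        # squared Euclidean distance from each query descriptor to training set
        d2 = ((q[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # (m_v, n_v)
        nn = d2.argmin(axis=1)                                # nearest neighbour per descriptor
        votes.update(Y[nn].tolist())                          # each descriptor casts one vote
    return votes.most_common(1)[0][0]                         # fused action label
```

An occluded view simply contributes fewer (or misleading but outnumbered) votes, so the unoccluded views can dominate the fused decision.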

Synthetic database

The first release of IXMAS is a multi-view action dataset. It consists of 11 daily action categories, each performed 3 times by each of 11 actors, giving 11 × 11 × 3 = 363 action instances in total. Each action is captured simultaneously by 5 cameras from different viewpoints. Fig. 3 shows an example of the “check-watch” action from the 5 views.

In order to investigate the occlusion problem in action recognition, we evaluate our

Discussion

From the above results and analysis, our 3 methods demonstrate improved robustness for occluded multi-view action recognition compared to the published state-of-the-art methods. We discuss the advantages and shortcomings of each of our methods in the following:

Conclusion and future work

In this paper, we have proposed an embedding algorithm, LNNE, and 3 fusion scenarios to deal with the occlusion problem in multi-view action recognition. We introduced an odds-ratio term in LNNE, which assigns lower weights to non-discriminative local features that exist in both training and query data. LNNE can also rectify misaligned views to a certain degree in order to fit itself to the fusion scenarios.
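The exact odds-ratio term is given by Eq. (6) of the paper and is not reproduced here; the following is only a speculative illustration of the general idea, under the assumption that votes come from the k nearest training descriptors. A query descriptor whose neighbours are spread evenly across classes (i.e., a non-discriminative feature) then yields low odds-ratio weights for every class.

```python
import numpy as np

def odds_ratio_weights(q, X, Y, k=10, eps=1e-6):
    """Illustrative odds-ratio vote weights for one query descriptor q (D,)
    against pooled training descriptors X (n, D) with labels Y (n,)."""
    d2 = ((q[None, :] - X) ** 2).sum(-1)          # distance to every training descriptor
    knn = Y[np.argsort(d2)[:k]]                   # labels of the k nearest neighbours
    weights = {}
    for c in np.unique(knn):
        p = (knn == c).mean()                     # fraction of neighbours in class c
        weights[c] = (p + eps) / (1.0 - p + eps)  # odds ratio: high only if c dominates
    return weights
```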

All 3 fusion methods outperform the state-of-the-art methods on both

References (44)

  • Q. Wang et al., Visual saliency by selective contrast, IEEE Trans. Circuits Syst. Video Technol. (2013)
  • J.C. Nebel et al., Are current monocular computer vision systems for human action recognition suitable for visual surveillance applications?, Proceedings of the International Symposium on Visual Computing (ISVC) (2011)
  • G. Cheng et al., Object detection in remote sensing images using a discriminatively trained mixture model, J. Photogramm. Remote Sens. (2013)
  • I.N. Junejo et al., Cross-view action recognition from temporal self-similarities, Proceedings of the European Conference on Computer Vision (ECCV) (2008)
  • D. Tran et al., Human activity recognition with metric learning, Proceedings of the European Conference on Computer Vision (ECCV) (2008)
  • A. Farhadi et al., Learning to recognize activities from the wrong view point, Proceedings of the European Conference on Computer Vision (ECCV) (2008)
  • D. Weinland et al., Making action recognition robust to occlusions and viewpoint changes, Proceedings of the European Conference on Computer Vision (ECCV) (2010)
  • F. Zhu et al., Weakly-supervised cross-domain dictionary learning for visual recognition, Int. J. Comput. Vis. (2014)
  • C. Xu et al., Multi-view intact space learning, IEEE Trans. Pattern Anal. Mach. Intell. (2015)
  • Y. Luo et al., Decomposition based transfer distance metric learning for image classification, IEEE Trans. Image Process. (2014)
  • D. Weinland et al., Action recognition from arbitrary views using 3D exemplars, Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2007)
  • Y. Zhang et al., Saliency detection by combining spatial and spectral information, Opt. Lett. (2013)

Yang Long is currently a Ph.D. student with the Department of Electronic and Electrical Engineering, The University of Sheffield, Sheffield, UK, and a visiting student with the Department of Computer Science and Digital Technologies, Northumbria University, Newcastle upon Tyne, UK. He has co-authored one Chinese patent. His current research interests include computer vision, pattern recognition, machine learning, and ontology engineering. He is also a part-time lecturer in C++ programming at Northumbria University.

Fan Zhu is currently a post-doctoral associate in the NYU Multimedia and Visual Computing Lab, Abu Dhabi, UAE. He has authored/co-authored over 10 papers in well-known journals/conferences, such as IJCV, IEEE TNNLS, CVPR, CIKM and BMVC, and two Chinese patents. His research interests include submodular optimization for computer vision, sparse coding, 3D feature learning, dictionary learning and transfer learning. He was awarded the National Distinguished Overseas Self-funded Student of China prize in 2014. He serves as a reviewer for IEEE Transactions on Cybernetics.

Ling Shao is currently a Full Professor with the Department of Computer Science and Digital Technologies, Northumbria University, Newcastle upon Tyne, UK, and an Advanced Visiting Fellow with the University of Sheffield, UK. He has authored or co-authored over 160 academic papers in well-known journals/conferences. His current research interests include computer vision, image/video processing, pattern recognition, and machine learning. Prof. Shao is an Associate Editor of IEEE Transactions on Image Processing, IEEE Transactions on Cybernetics, and several other journals. He is also a fellow of the British Computer Society and the Institution of Engineering and Technology.
