Recognising occluded multi-view actions using local nearest neighbour embedding

https://doi.org/10.1016/j.cviu.2015.06.003

Highlights

  • We propose a robust learning-free algorithm: local nearest neighbour embedding (LNNE).

  • We introduce 3 multi-view fusion scenarios to test the LNNE method.

  • We conduct extensive experiments on two multi-view action datasets with occlusions, where the LNNE method achieves significant performance improvements in all scenarios.

Abstract

Recent advances in multi-sensor technologies and algorithms have driven significant progress in human action recognition, especially for realistic scenarios. However, partial occlusion, a major obstacle in real-world applications, has not received sufficient attention in the action recognition community. In this paper, we extensively investigate how occlusion can be addressed by multi-view fusion. Specifically, we propose a robust representation called local nearest neighbour embedding (LNNE). We then extend the LNNE method to 3 multi-view fusion scenarios. Additionally, we provide a detailed analysis of the proposed voting strategy from a boosting point of view. We evaluate our approach on databases with both synthetic and realistic occlusions, and the LNNE method outperforms the state-of-the-art approaches in all tested scenarios.

Introduction

Human action recognition has received increasing attention during the past decades. It has a wide range of applications such as medical surveillance [1], smart homes [2] and human-machine interaction [3]. However, recognising multiple, complex human actions or activities remains a challenging problem [4]. So far, the majority of action recognition systems are restricted to a finite number of well-defined action categories, and performance is evaluated on actions cropped by detected bounding boxes [5], [8]. In realistic applications, current methods are still very sensitive to common variations, e.g., gender, body size, viewpoint and illumination changes, and occlusions [6], [7]. Among these problems, view variation and occlusion are two major, unavoidable hurdles for action recognition. As pessimistically concluded in [9], monocular computer vision systems are not competent enough for surveillance applications. Fortunately, advances in visual sensing technologies have made it possible to approach the action recognition problem using multi-view or range sensors [10], [11]. Hence, extensive studies have been conducted on view-invariant and transferable representations [6], [13], [14], [16], [17], [18], [19], [20]. Other works also consider multi-descriptor fusion approaches [22], [23]. Nonetheless, only a few techniques, such as [15], tackle the occlusion problem. Therefore, dealing with occlusion remains a pressing research problem in bridging the gap between existing action recognition algorithms and realistic applications [10].

Intuitively, the occlusion problem can be solved by a multi-view system, as shown in Fig. 1. If actions captured from one viewpoint are occluded, the information loss can be compensated by data from other views that are not occluded; the occlusion problem is thus transformed into a view-disparity problem. However, such a strategy raises two main difficulties. The first is how to suppress the intra-class distance caused by viewpoint variations. On this point, it is widely acknowledged that local descriptors are less susceptible to intra-class variations [24], [25], [26], [27], [28], and they are generally fused with holistic representations [29], [30], [31]. The second difficulty is that, in real-world applications, occlusions appear unpredictably in both training and testing data and, as a result, break the consistency of the holistic models between the two datasets.

In order to overcome these problems, this paper is devoted to investigating multi-view methods that can incorporate local descriptors and are robust to occlusions in both training and testing action datasets. Specifically, we adopt dense trajectories (DT) [24], which are further transformed into a robust higher-level representation and then used for multi-view fusion. We summarise our main contributions in the following 3 aspects: (1) we propose a robust learning-free algorithm: local nearest neighbour embedding (LNNE); (2) we introduce 3 multi-view fusion scenarios to test the LNNE method; (3) we conduct extensive experiments on two multi-view action datasets with occlusions, where the LNNE method achieves significant performance improvements in all scenarios.

The remainder of this paper is organised as follows: We introduce related work in Section 2. In Section 3, we describe the LNNE method in detail. We then illustrate the structures of the 3 fusion pipelines in Section 4. Detailed experimental results are presented with analysis and discussion in Sections 5 and 6. Finally, we conclude our work in Section 7.

Section snippets

Background

We review previous work from two main aspects. First, we review the basics of feature embedding techniques, aiming to provide an intuitive and generalised view of embedding, and we discuss their relation to our LNNE method. Second, we compare the proposed fusion scenarios with existing multi-view action recognition scenarios.

Local nearest neighbour embedding

In a typical action recognition task using local representations, an action sequence $x^{(i)}$ consists of an uncertain number $m_i$ of extracted local descriptors, $x^{(i)} = [d^{(1)}, \ldots, d^{(m_i)}]$. The training set is represented by pooling the descriptors from all training action instances into one matrix: given $I$ instances in total, $X = [x^{(1)}, \ldots, x^{(I)}] = [d^{(1)}, \ldots, d^{(n)}]$, where $n = m_1 + \cdots + m_I$. $Y = [y^{(1)}, \ldots, y^{(n)}]$ is a column vector indexing the action label of each training local descriptor.
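To make the pooled representation concrete, below is a minimal sketch (not the authors' code; the function name, shapes, and toy data are illustrative assumptions) that stacks per-instance descriptor arrays into the matrix $X$ and expands instance labels into the per-descriptor label vector $Y$.

```python
import numpy as np

def pool_training_set(instances, labels):
    """Stack per-instance descriptor arrays into one matrix X and expand
    instance labels into a per-descriptor label vector Y.

    instances : list of arrays; instance i has shape (m_i, D)
    labels    : list of action labels, one per instance
    """
    X = np.vstack(instances)                        # (n, D), n = m_1 + ... + m_I
    Y = np.concatenate([np.full(len(inst), lab)     # label repeated m_i times
                        for inst, lab in zip(instances, labels)])
    return X, Y

# Toy usage: 3 instances with varying descriptor counts, D = 4.
rng = np.random.default_rng(0)
insts = [rng.normal(size=(m, 4)) for m in (5, 7, 3)]
X, Y = pool_training_set(insts, labels=[0, 1, 0])
print(X.shape, Y.shape)  # (15, 4) (15,)
```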

Multi-view fusions

Based on the LNNE algorithm, we propose 3 multi-view fusion scenarios for action recognition according to Eq. (6). We discuss the advantages of each scenario and particularly explain how our methods can deal with occlusions. The problem of occlusion is illustrated in Fig. 1. Note that we redefine the training set as a group of individual per-view subsets $[X^{(1)}, X^{(2)}, \ldots, X^{(s)}]$ with label vectors $[Y^{(1)}, Y^{(2)}, \ldots, Y^{(s)}]$. Correspondingly, the query action is also represented over the same ordered views, $[\hat{x}^{(1)}, \hat{x}^{(2)}, \ldots, \hat{x}^{(s)}]$.
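Since Eq. (6) is not reproduced in this snippet, the following hedged sketch shows only the generic structure of such view-wise fusion: each view contributes descriptor-level nearest-neighbour votes that are accumulated into one prediction. Plain majority voting stands in for the actual LNNE rule, and all names are assumptions.

```python
import numpy as np
from collections import Counter

def vote_over_views(query_views, train_views, label_views):
    """query_views[v]: (m_v, D) descriptors of the query seen from view v.
    train_views[v]: (n_v, D) pooled training descriptors for view v.
    label_views[v]: (n_v,) per-descriptor action labels for view v."""
    votes = Counter()
    for q, X, Y in zip(query_views, train_views, label_views):
        # squared Euclidean distance from each query descriptor to training set
        d2 = ((q[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # (m_v, n_v)
        nn = d2.argmin(axis=1)                                # nearest neighbour per descriptor
        votes.update(Y[nn].tolist())                          # each descriptor casts one vote
    return votes.most_common(1)[0][0]                         # fused action label
```

An occluded view simply contributes fewer (or misleading but outnumbered) votes, so the unoccluded views can dominate the fused decision.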

Synthetic database

The first release of IXMAS is a multi-view action dataset. It consists of 11 daily action categories, each performed 3 times by each of 11 actors, giving 11 × 11 × 3 = 363 action instances in total. Each action is captured simultaneously by 5 cameras from different viewpoints. Fig. 3 shows an example of the “check-watch” action from the 5 views.

In order to investigate the occlusion problem in action recognition, we evaluate our

Discussion

From the above results and analysis, our 3 methods demonstrate improved robustness for occluded multi-view action recognition compared to the published state-of-the-art methods. We discuss the advantages and shortcomings of each of our methods in the following:

Conclusion and future work

In this paper, we have proposed an embedding algorithm, LNNE, and 3 fusion scenarios to deal with the occlusion problem in multi-view action recognition. We introduced an odds-ratio term in LNNE, which assigns lower weights to non-discriminative local features that exist in both training and query data. LNNE can also rectify misaligned views to a certain degree in order to fit itself to the fusion scenarios.
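The exact odds-ratio term is given by Eq. (6) of the paper and is not reproduced here; the following is only a speculative illustration of the general idea, under the assumption that votes come from the k nearest training descriptors. A query descriptor whose neighbours are spread evenly across classes (i.e., a non-discriminative feature) then yields low odds-ratio weights for every class.

```python
import numpy as np

def odds_ratio_weights(q, X, Y, k=10, eps=1e-6):
    """Illustrative odds-ratio vote weights for one query descriptor q (D,)
    against pooled training descriptors X (n, D) with labels Y (n,)."""
    d2 = ((q[None, :] - X) ** 2).sum(-1)          # distance to every training descriptor
    knn = Y[np.argsort(d2)[:k]]                   # labels of the k nearest neighbours
    weights = {}
    for c in np.unique(knn):
        p = (knn == c).mean()                     # fraction of neighbours in class c
        weights[c] = (p + eps) / (1.0 - p + eps)  # odds ratio: high only if c dominates
    return weights
```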

All 3 fusion methods outperform the state-of-the-art methods on both

References (44)

  • Q. Wang et al., Visual saliency by selective contrast, IEEE Trans. Circuits Syst. Video Technol. (2013)
  • J.C. Nebel et al., Are current monocular computer vision systems for human action recognition suitable for visual surveillance applications?, Proceedings of the International Symposium on Visual Computing (ISVC) (2011)
  • G. Cheng et al., Object detection in remote sensing images using a discriminatively trained mixture model, J. Photogramm. Remote Sens. (2013)
  • I.N. Junejo et al., Cross-view action recognition from temporal self-similarities, Proceedings of the European Conference on Computer Vision (ECCV) (2008)
  • D. Tran et al., Human activity recognition with metric learning, Proceedings of the European Conference on Computer Vision (ECCV) (2008)
  • A. Farhadi et al., Learning to recognize activities from the wrong view point, Proceedings of the European Conference on Computer Vision (ECCV) (2008)
  • D. Weinland et al., Making action recognition robust to occlusions and viewpoint changes, Proceedings of the European Conference on Computer Vision (ECCV) (2010)
  • F. Zhu et al., Weakly-supervised cross-domain dictionary learning for visual recognition, Int. J. Comput. Vis. (2014)
  • C. Xu et al., Multi-view intact space learning, IEEE Trans. Pattern Anal. Mach. Intell. (2015)
  • Y. Luo et al., Decomposition based transfer distance metric learning for image classification, IEEE Trans. Image Process. (2014)
  • D. Weinland et al., Action recognition from arbitrary views using 3D exemplars, Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2007)
  • Y. Zhang et al., Saliency detection by combining spatial and spectral information, Opt. Lett. (2013)

Yang Long is currently a Ph.D. student with the Department of Electronic and Electrical Engineering, The University of Sheffield, Sheffield, UK, and a visiting student with the Department of Computer Science and Digital Technologies, Northumbria University, Newcastle upon Tyne, UK. He has co-authored one Chinese patent. His current research interests include computer vision, pattern recognition, machine learning, and ontology engineering. He is also a part-time lecturer in C++ programming at Northumbria University.

Fan Zhu is currently a post-doctoral associate in the NYU Multimedia and Visual Computing Lab, Abu Dhabi, UAE. He has authored/co-authored over 10 papers in well-known journals/conferences, such as IJCV, IEEE TNNLS, CVPR, CIKM and BMVC, and two Chinese patents. His research interests include submodular optimization for computer vision, sparse coding, 3D feature learning, dictionary learning and transfer learning. He was awarded the National Distinguished Overseas Self-funded Student of China prize in 2014. He serves as a reviewer for IEEE Transactions on Cybernetics.

Ling Shao is currently a Full Professor with the Department of Computer Science and Digital Technologies, Northumbria University, Newcastle upon Tyne, UK, and an Advanced Visiting Fellow with the University of Sheffield, UK. He has authored or co-authored over 160 academic papers in well-known journals/conferences. His current research interests include computer vision, image/video processing, pattern recognition, and machine learning. Prof. Shao is an Associate Editor of IEEE Transactions on Image Processing, IEEE Transactions on Cybernetics, and several other journals. He is also a fellow of the British Computer Society and the Institution of Engineering and Technology.
