Recognising occluded multi-view actions using local nearest neighbour embedding
Introduction
Human action recognition has received increasing attention over the past decades. It has a wide range of applications, such as medical surveillance [1], smart homes [2] and human-machine interaction [3]. However, recognising multiple, complex human actions or activities remains a challenging problem [4]. So far, the majority of action recognition systems are restricted to a finite number of well-defined action categories, and their performance is evaluated on actions cropped by detected bounding boxes [5], [8]. For realistic applications, current methods are still very sensitive to environmental variations, e.g., gender, body size, viewpoint and illumination changes, and occlusions [6], [7]. Among these problems, view variation and occlusion are two main, unavoidable hurdles for action recognition. As pessimistically concluded in [9], monocular computer vision systems are not competent for surveillance applications. Fortunately, advances in visual sensing technologies have made it possible to tackle action recognition using multi-view or range sensors [10], [11]. Hence, extensive studies have been conducted on view invariance and transferable representations [6], [13], [14], [16], [17], [18], [19], [20]. Other works consider multi-descriptor fusion approaches [22], [23]. Nonetheless, only a few techniques, such as [15], tackle the occlusion problem. Therefore, dealing with occlusion remains a pressing research problem for bridging the gap between existing action recognition algorithms and realistic applications [10].
Intuitively, the occlusion problem can be solved by a multi-view system, as shown in Fig. 1. If actions captured from one viewpoint are occluded, the information loss can be compensated by data from other, unoccluded views; the occlusion problem is thus transformed into a view-disparity problem. However, such a strategy raises two main difficulties. The first is how to suppress the intra-class distance caused by viewpoint variations. Here it is widely acknowledged that local descriptors are less susceptible to intra-class variations [24], [25], [26], [27], [28], and they are generally fused with holistic representations [29], [30], [31]. The second difficulty is that, in real-world applications, occlusions appear unpredictably in both training and testing data and, as a result, break the consistency of the holistic models between the two datasets.
To overcome these problems, this paper investigates multi-view methods that can incorporate local descriptors and are robust to occlusions in both training and testing action datasets. Specifically, we adopt dense trajectories (DT) [24], which are further transformed into a robust higher-level representation and then used for multi-view fusion. We summarise our main contributions in three aspects: (1) we propose a robust learning-free algorithm, local nearest neighbour embedding (LNNE); (2) we introduce 3 multi-view fusion scenarios to test the LNNE method; (3) we conduct extensive experiments on two multi-view action datasets with occlusions, where the LNNE method achieves significant performance improvements in all scenarios.
The following sections are arranged as follows: we introduce related work in Section 2. In Section 3, we describe the LNNE method in detail. We then illustrate the structures of the 3 fusion pipelines in Section 4. Detailed experimental results are presented with analysis and discussions in Sections 5 and 6. Finally, we conclude our work in Section 7.
Section snippets
Background
We review previous works from two main aspects. In the first aspect, we review the basis of feature embedding techniques, and we aim at providing an intuitive and generalised view of embedding. Also, we discuss their relations to our LNNE method. In the second aspect, we compare the proposed fusion scenarios with existing multi-view action recognition scenarios.
Local nearest neighbour embedding
In a typical action recognition task using local representations, an action sequence x(i) consists of an uncertain number mi of extracted local descriptors. The training set is represented by pooling the descriptors from all training action instances into one matrix. Suppose there are I instances in total; a single column vector indexes the corresponding action label of each training local descriptor.
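The pooled representation described above can be sketched as follows. This is a minimal illustration of the pooling step only, not the paper's full pipeline; the function name and toy dimensions are our own.

```python
import numpy as np

def pool_training_descriptors(actions, labels):
    """Pool local descriptors from all training actions into one matrix.

    actions: list of arrays, the i-th of shape (m_i, d) holding the m_i
             local descriptors extracted from action sequence x(i).
    labels:  list of I integer action labels, one per sequence.
    Returns the pooled (sum(m_i), d) descriptor matrix and a vector
    giving the action label of every pooled descriptor.
    """
    X = np.vstack(actions)                        # stack all descriptors
    y = np.concatenate([np.full(a.shape[0], l)    # repeat each label m_i times
                        for a, l in zip(actions, labels)])
    return X, y

# toy example: I = 2 sequences with m_1 = 3 and m_2 = 2 descriptors (d = 4)
acts = [np.random.rand(3, 4), np.random.rand(2, 4)]
X, y = pool_training_descriptors(acts, [0, 1])
print(X.shape, y.tolist())   # (5, 4) [0, 0, 0, 1, 1]
```

Pooling discards the sequence boundaries on purpose: every local descriptor becomes an independent training sample that carries its parent action's label.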
Multi-view fusions
Based on the LNNE algorithm, we propose 3 multi-view fusion scenarios for action recognition according to Eq. (6). We discuss the advantages of each scenario and, in particular, explain how our methods deal with occlusions. The occlusion problem is illustrated in Fig. 1. Note that we redefine the training set as a group of view-specific subsets with corresponding label vectors. Correspondingly, the query action is also represented over the same ordered views.
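The view-wise organisation of the training set lends itself to a late-fusion scheme. The sketch below is a generic nearest-neighbour vote summed over views, assumed here as a stand-in for the paper's Eq. (6), whose exact form is not reproduced in this excerpt; all names are hypothetical.

```python
import numpy as np

def nn_fuse_views(train_views, train_labels, query_views):
    """Late fusion across views by nearest-neighbour class voting.

    train_views:  list of V arrays, one (n_v, d) descriptor matrix per view.
    train_labels: list of V label arrays aligned with the view matrices.
    query_views:  list of V (m_v, d) descriptor arrays for one query action.
    Each view votes with the labels of the nearest training descriptors;
    votes are summed over views, so an occluded view contributes little
    while unoccluded views can still carry the decision.
    """
    classes = np.unique(np.concatenate(train_labels))
    scores = np.zeros(len(classes))
    for Xv, yv, Qv in zip(train_views, train_labels, query_views):
        # pairwise squared distances between query and training descriptors
        d2 = ((Qv[:, None, :] - Xv[None, :, :]) ** 2).sum(-1)
        nn = yv[d2.argmin(axis=1)]   # nearest-neighbour label per descriptor
        for c_idx, c in enumerate(classes):
            scores[c_idx] += (nn == c).sum()
    return classes[scores.argmax()]

# toy single-view example: class 0 clusters near the origin, class 1 near (1, 1)
Xa = np.vstack([np.zeros((5, 2)), np.ones((5, 2))])
ya = np.array([0] * 5 + [1] * 5)
query = [np.zeros((3, 2)) + 0.1]     # query descriptors close to class 0
print(nn_fuse_views([Xa], [ya], query))   # → 0
```

Summing raw vote counts is the simplest fusion rule; the paper's scenarios differ in where in the pipeline the views are combined.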
Synthetic database
The first release of IXMAS is a multi-view action dataset. It consists of 11 daily action categories, each performed 3 times by each of 11 actors. As a result, there are 11 × 11 × 3 = 363 performed action examples in total. Each action is captured simultaneously by 5 cameras from different viewpoints. Fig. 3 provides an example of the “check-watch” action from the 5 views.
In order to investigate the occlusion problem in action recognition, we evaluate our
Discussions
From the above results and analysis, our 3 methods demonstrate improved robustness for occluded multi-view action recognition compared to the published state-of-the-art methods. We discuss the advantages and shortcomings of each of our methods in the following:
Conclusion and future work
In this paper, we have proposed an embedding algorithm, LNNE, and 3 fusion scenarios to deal with the occlusion problem in multi-view action recognition. We introduced an odds-ratio term in LNNE, which assigns lower weights to non-discriminative local features that exist in both training and query data. LNNE can also rectify misaligned views to a certain degree in order to fit the fusion scenarios.
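The intuition behind the odds-ratio term can be illustrated with a distance-ratio weight: a descriptor that is equally close to every class is non-discriminative and should count for little, while a class-specific descriptor should count for more. This is only a sketch of that idea under our own assumptions, not the paper's exact formulation.

```python
import numpy as np

def odds_ratio_weight(q, X, y, c, eps=1e-8):
    """Illustrative odds-ratio style weight for one local descriptor q.

    Compares q's nearest-neighbour distance within class c against its
    nearest-neighbour distance over all other classes. Descriptors that
    are equally close to every class yield weights near 1, while
    descriptors specific to class c yield large weights.
    """
    d2 = ((X - q) ** 2).sum(1)      # squared distance to every training descriptor
    d_in = d2[y == c].min()         # nearest same-class descriptor
    d_out = d2[y != c].min()        # nearest other-class descriptor
    return d_out / (d_in + eps)     # > 1 favours class c

# toy example: one descriptor per class, query near class 0
X = np.array([[0.0, 0.0], [1.0, 1.0]])
y = np.array([0, 1])
q = np.array([0.1, 0.1])
print(odds_ratio_weight(q, X, y, c=0))   # large: q is discriminative for class 0
print(odds_ratio_weight(q, X, y, c=1))   # small: q argues against class 1
```

Down-weighting by such a ratio suppresses background-like descriptors that occur across all classes, which is the role the paper attributes to its odds-ratio term.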
All 3 fusion methods outperform the state-of-the-art methods on both
References (44)
- et al., Modelling and segmenting subunits for sign language recognition based on hand motion analysis, Pattern Recognit. Lett. (2009)
- et al., Free viewpoint action recognition using motion history volumes, Comput. Vis. Image Underst. (2006)
- et al., Efficient highlight removal of metal surfaces, Signal Process. (2014)
- et al., Multi-spectral dataset and its application in saliency detection, Comput. Vis. Image Underst. (2013)
- et al., Human activity recognition from 3D data: a review, Pattern Recognit. Lett. (2014)
- et al., Multi-view action recognition using local similarity random forests and sensor fusion, Pattern Recognit. Lett. (2013)
- et al., Locally nearest neighbor classifiers for pattern classification, Pattern Recognition (2004)
- et al., Multi-camera recognition of people operating home medical devices, Proceedings of the IEEE International Conference on BioMedical Engineering and Informatics (BMEI) (2010)
- et al., Multiview activity recognition in smart homes with spatio-temporal features, Proceedings of the ACM/IEEE International Conference on Distributed Smart Cameras (ICDSC) (2010)
- et al., Saliency detection by multiple-instance learning, IEEE Trans. Cybern. (2013)
- Visual saliency by selective contrast, IEEE Trans. Circuits Syst. Video Technol.
- Are current monocular computer vision systems for human action recognition suitable for visual surveillance applications?, Proceedings of the International Symposium on Visual Computing (ISVC)
- Object detection in remote sensing images using a discriminatively trained mixture model, J. Photogramm. Remote Sens.
- Cross-view action recognition from temporal self-similarities, Proceedings of the European Conference on Computer Vision (ECCV)
- Human activity recognition with metric learning, Proceedings of the European Conference on Computer Vision (ECCV)
- Learning to recognize activities from the wrong view point, Proceedings of the European Conference on Computer Vision (ECCV)
- Making action recognition robust to occlusions and viewpoint changes, Proceedings of the European Conference on Computer Vision (ECCV)
- Weakly-supervised cross-domain dictionary learning for visual recognition, Int. J. Comput. Vis.
- Multi-view intact space learning, IEEE Trans. Pattern Anal. Mach. Intell.
- Decomposition based transfer distance metric learning for image classification, IEEE Trans. Image Process.
- Action recognition from arbitrary views using 3D exemplars, Proceedings of the IEEE International Conference on Computer Vision (ICCV)
- Saliency detection by combining spatial and spectral information, Opt. Lett.
Yang Long is currently a Ph.D. student with the Department of Electronic and Electrical Engineering, The University of Sheffield, Sheffield, UK and a visiting student with the Department of Computer Science and Digital Technologies, Northumbria University, Newcastle upon Tyne, UK. He has co-authored one Chinese patent. His current research interests include computer vision, pattern recognition, machine learning, and ontology engineering. He is also a part-time lecturer of C++ programming at Northumbria University.
Fan Zhu is currently a post-doctoral associate in the NYU Multimedia and Visual Computing Lab, Abu Dhabi, UAE. He has authored/co-authored over 10 papers in well-known journals/conferences, such as IJCV, IEEE TNNLS, CVPR, CIKM and BMVC, and two Chinese patents. His research interests include submodular optimization for computer vision, sparse coding, 3D feature learning, dictionary learning and transfer learning. He has been awarded the National Distinguished Overseas Self-funded Student of China prize in 2014. He serves as a reviewer of IEEE Transactions on Cybernetics.
Ling Shao is currently a Full Professor with the Department of Computer Science and Digital Technologies, Northumbria University, Newcastle upon Tyne, UK and an Advanced Visiting Fellow with the University of Sheffield, UK. He has authored or co-authored over 160 academic papers in well-known journals/conferences. His current research interests include computer vision, image/video processing, pattern recognition, and machine learning. Prof. Shao is an Associate Editor of the IEEE Transactions on Image Processing, the IEEE Transactions on Cybernetics, and several other journals. He is also a fellow of the British Computer Society and the Institution of Engineering and Technology.