Efficient motion estimation methods for fast recognition of activities of daily living
Introduction
The recognition of Activities of Daily Living (ADLs) has drawn significant research attention in the computer vision community, as their monitoring can provide valuable information for applications such as assisted living, remote healthcare, and lifestyle and behavioral profiling. To this end, efforts are being made to design ADL recognition algorithms that are both accurate and computationally efficient. While a plethora of activity recognition methods exist, most overlook the importance of computational and compression efficiency, focusing only on recognition accuracy. The main bottleneck of current State-of-the-Art (SoA) works [1], [2], [3], [4], [5], [6], [7] is the use of computationally expensive Optical Flow (OF) [8] for motion estimation and feature extraction.
This work addresses the issue of computational efficiency by expanding upon our previous work in [9]: computationally costly dense OF is replaced by computationally lighter Block Matching (BM) and MPEG encoded motion vectors for activity recognition. The motion field is post-processed and its results are incorporated in a dense, trajectory-based activity recognition framework [1]. In-depth experiments on benchmark ADL video datasets compare some of the most reliable and popular OF and BM methods, as well as the most common encoded motion vectors, demonstrating that the latter increase computational efficiency at a minimal loss in recognition accuracy, compared to related work.
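For concreteness, block matching estimates one motion vector per block by minimising a dissimilarity measure, typically the Sum of Absolute Differences (SAD), over a search window in the previous frame. The following is a minimal, illustrative sketch of exhaustive (full-search) BM, not the exact implementation evaluated in this work; the function name, block size and search radius are placeholders:

```python
import numpy as np

def block_match(prev, curr, block=8, radius=4):
    """Exhaustive block matching: for each block of `curr`, find the
    displacement (dy, dx) into `prev` that minimises the sum of
    absolute differences (SAD) within +/- `radius` pixels."""
    H, W = curr.shape
    vectors = np.zeros((H // block, W // block, 2), dtype=int)
    for by in range(H // block):
        for bx in range(W // block):
            y, x = by * block, bx * block
            ref = curr[y:y + block, x:x + block].astype(int)
            best, best_mv = None, (0, 0)
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    py, px = y + dy, x + dx
                    if py < 0 or px < 0 or py + block > H or px + block > W:
                        continue  # candidate block falls outside the frame
                    cand = prev[py:py + block, px:px + block].astype(int)
                    sad = np.abs(ref - cand).sum()
                    if best is None or sad < best:
                        best, best_mv = sad, (dy, dx)
            vectors[by, bx] = best_mv
    return vectors
```

The cubic cost in block count and search radius is exactly what fast BM variants (and reuse of encoder-side vectors) aim to avoid.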
Recent works [10], [11], [12], [13] also used motion vectors drawn directly from the MPEG compressed video domain, resulting in a significant computational speedup (~66%) at a small reduction of recognition accuracy (~5%). We extend these works by providing a thorough comparison of very popular video compression standards, applied specifically to the recognition of ADLs, in contrast to more generic works [14], [15], [16]. We also investigate various configuration parameters of MPEG video encoding, such as GOP size and the motion estimation algorithm used, to examine their effect on recognition accuracy and compression efficiency (video quality and bit rate), and to identify those that make a difference in the measured performance. Finally, we use the precomputed MPEG motion vectors to seed and accelerate the BM search. This analysis reveals trade-offs between bit rate (file size), PSNR (video quality) and ADL recognition accuracy, resulting in useful guidelines for practitioners. In short, our contributions are:
- A framework for efficient recognition and coding of human activities in video, exploring the trade-offs at all stages: (1) video compression efficiency, (2) computational efficiency, (3) recognition accuracy.
- We propose and evaluate the use of existing compressed-domain motion vectors in conjunction with BM, for faster recognition at comparable accuracy and with computational savings.
- A thorough Rate-Distortion-based comparison (bit rate vs. video quality) between very popular video compression standards, applied specifically to activities of daily living.
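The second contribution can be sketched as follows: instead of scanning a full search window, the SAD search is restricted to a small neighbourhood around a motion vector already present in the compressed stream. The helper below is hypothetical; the function names and the one-pixel refinement radius are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return np.abs(a.astype(int) - b.astype(int)).sum()

def seeded_search(prev, curr, y, x, seed, block=8, radius=1):
    """Refine a precomputed motion vector `seed` (e.g., read from the
    MPEG stream) with a small local SAD search, instead of scanning a
    full search window around (0, 0)."""
    H, W = prev.shape
    ref = curr[y:y + block, x:x + block]
    best, best_mv = None, seed
    for dy in range(seed[0] - radius, seed[0] + radius + 1):
        for dx in range(seed[1] - radius, seed[1] + radius + 1):
            py, px = y + dy, x + dx
            if py < 0 or px < 0 or py + block > H or px + block > W:
                continue  # candidate block falls outside the frame
            cost = sad(ref, prev[py:py + block, px:px + block])
            if best is None or cost < best:
                best, best_mv = cost, (dy, dx)
    return best_mv
```

With a radius of 1, only 9 candidates are evaluated per block instead of the (2r+1)^2 candidates of a full search, which is where the computational savings come from.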
The rest of this paper is organized as follows: In Section 2 we review related SoA approaches for activity recognition and motion estimation. In Section 3 we present our activity recognition framework in detail, including the motion estimation methods used. The aspects of video encoding examined, namely the effectiveness of different video codecs and various encoding parameters, are detailed in Section 4. Experimental results are presented in detail in Section 5, comparing OF with BM, BM with BM seeded by MPEG vectors, the computational efficiency of all methods, and a joint performance metric. Finally, Section 6 concludes this paper and outlines our plans for future work.
Activity recognition methods
Numerous approaches have been developed in recent years for activity recognition. Our work is closely related to methods based on trajectories of interest points [1], [3], [4], [5], [6], [7], [17], [18], [19], [20], [21], which can be roughly divided into those where interest points are sampled sparsely or densely.
The approaches of the first category extract sparse interest points via standard interest point detectors and track them over time. Messing et al. [17] tracked corner points using the
Action representation
In this section, we describe our activity representation and recognition framework, as well as the motion estimation methods used in our experiments. The overall schema of our framework is depicted in Fig. 1.
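As a rough illustration of the trajectory-based representation (the precise sampling, median filtering and descriptor steps follow [1] and are not reproduced here), dense grid points can be propagated through per-frame motion fields, whichever estimator (OF, BM, or MPEG vectors) produced them. Names and the nearest-neighbour field lookup are illustrative simplifications:

```python
import numpy as np

def track_dense_points(flows, step=8):
    """Propagate a dense grid of points through a sequence of motion
    fields `flows` (list of HxWx2 arrays of (dy, dx) per pixel), as in
    dense-trajectory pipelines. Returns one trajectory per grid point."""
    H, W = flows[0].shape[:2]
    points = [(y, x) for y in range(0, H, step) for x in range(0, W, step)]
    trajectories = [[p] for p in points]
    for flow in flows:
        for traj in trajectories:
            y, x = traj[-1]
            # nearest-neighbour lookup of the motion field at the point
            iy = min(max(int(round(y)), 0), H - 1)
            ix = min(max(int(round(x)), 0), W - 1)
            dy, dx = flow[iy, ix]
            traj.append((y + dy, x + dx))
    return trajectories
```

Descriptors (e.g., HOG/HOF/MBH in [1]) are then computed along each trajectory and encoded, for instance with Fisher vectors, before classification.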
MPEG video encoding
In this section, we examine the effect of MPEG video encoding on activity recognition accuracy. Initially, we compare video codecs so as to choose the most appropriate one for our experiments. Then we explore the trade-offs between activity recognition accuracy, computational efficiency and compression efficiency (video quality vs. bit rate), and examine how different encoding parameters affect recognition accuracy and computational/compression efficiency. Finally,
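Video quality in such Rate-Distortion comparisons is conventionally measured by the Peak Signal-to-Noise Ratio (PSNR) between the original and the encoded/decoded frame. A minimal sketch, assuming 8-bit frames (peak value 255):

```python
import numpy as np

def psnr(original, decoded, peak=255.0):
    """Peak signal-to-noise ratio in dB between an original frame and
    its encoded/decoded version; higher means less distortion."""
    mse = np.mean((original.astype(float) - decoded.astype(float)) ** 2)
    if mse == 0:
        return float('inf')  # identical frames: lossless
    return 10.0 * np.log10(peak ** 2 / mse)
```

Averaging PSNR over all frames of a sequence, and plotting it against bit rate, yields the Rate-Distortion curves used to compare codecs.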
Experimental results
We have performed comprehensive experiments on uncompressed videos of human activities, to compare the effect of different motion estimation and encoding techniques on recognition accuracy and computational efficiency on several benchmark datasets. We also provide comparisons with the SoA to determine the optimal configuration for recognizing ADLs. We revisit this discussion in Section 5.6, where we assess our results in both the compressed and uncompressed video domains using a hybrid metric,
Conclusions and future work
In this work, we proposed a complete framework for efficient recognition, processing and coding of activities of daily living (ADLs), as captured by standard 2D cameras. Our approach follows the SoA [1], [3], [20], where trajectories of tracked visual features are extracted on dense grids and recognition takes place via Fisher vectors and Support Vector Machines. In contrast to these approaches, which are based on dense OF, we use video downsampling, fast block matching motion estimation and
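The downsampling step mentioned above can be as simple as 2x2 block averaging before motion estimation, which cuts the cost of subsequent block matching roughly fourfold. A minimal sketch (illustrative; not necessarily the exact filter used in this work):

```python
import numpy as np

def downsample2x(frame):
    """Halve the spatial resolution of a grayscale frame by averaging
    each non-overlapping 2x2 pixel block."""
    H, W = frame.shape
    f = frame[:H - H % 2, :W - W % 2].astype(float)  # crop odd edges
    return (f[0::2, 0::2] + f[0::2, 1::2] +
            f[1::2, 0::2] + f[1::2, 1::2]) / 4.0
```

The recovered motion vectors are simply scaled back by 2 when mapped onto the full-resolution grid.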
Acknowledgement
This work was funded by the European Commission under the 7th Framework Programme (FP7 2007-2013), Grant Agreement 288199 Dem@Care.
References (62)
- et al., An efficient approach to content-based object retrieval in videos, Neurocomputing, 2011.
- et al., Machine learning and signal processing for human pose recovery and behavior analysis, Signal Process., 2015.
- H. Wang, A. Klaser, C. Schmid, C. Liu, Action recognition by dense trajectories, in: Proceedings of the International...
- I. Laptev, M. Marszalek, C. Schmid, B. Rozenfeld, Learning realistic human actions from movies, in: Proceedings of the...
- et al., Dense trajectories and motion boundary descriptors for action recognition, Int. J. Comput. Vis., 2013.
- H. Wang, C. Schmid, Action recognition with improved trajectories, in: Proceedings of the International Conference on...
- M. Jain, H. Jegou, P. Bouthemy, Better exploiting motion for better action recognition, in: Proceedings of the...
- Y.G. Jiang, Q. Dai, X. Xue, W. Liu, C. Ngo, Trajectory-based modeling of human actions with motion reference points, ...
- et al., Segmentation of moving objects by long term video analysis, IEEE Trans. Pattern Anal. Mach. Intell., 2014.
- Two-frame motion estimation based on polynomial expansion, Image Anal., 2003.
- Recognition of human actions using motion history information extracted from the compressed video, Image Vis. Comput.
- High-speed action recognition and localization in compressed domain videos, IEEE Trans. Circuits Syst. Video Technol.
- Comparison of the coding efficiency of video coding standards - including High Efficiency Video Coding (HEVC), IEEE Trans. Circuits Syst. Video Technol.
- HEVC: the new gold standard for video compression: how does HEVC compare with H.264/AVC?, IEEE Consum. Electron. Mag.
- Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vis.
- Aggregating local image descriptors into compact codes, IEEE Trans. Pattern Anal. Mach. Intell.
- Large displacement optical flow: descriptor matching in variational motion estimation, IEEE Trans. Pattern Anal. Mach. Intell.
- Multitask linear discriminant analysis for view invariant action recognition, IEEE Trans. Image Process.