A hardware friendly algorithm for action recognition using spatio-temporal motion-field patches
Introduction
Gesture perception, or action recognition, is receiving growing attention due to its applications in smart surveillance [1], sign language interpretation [2], advanced user interfaces [3], [4] and intelligent robot control. Compared with static image recognition, action recognition usually requires handling an overwhelmingly large amount of data because an entire video sequence must be analyzed. Moreover, when action recognition is performed against cluttered backgrounds, results are often degraded significantly, so additional measures such as tracking windows or frame-by-frame background estimation are sometimes required. In some cases, it is desirable to build the recognition system directly in VLSI hardware such as ASICs (application-specific integrated circuits) or FPGAs in order to achieve real-time performance, which imposes several further constraints on the algorithms. One important requirement is that background elimination be incorporated into the system so that video sequences can be taken as direct input. Another constraint is that the computation be simple enough to implement on VLSI circuits with either analog or digital technology.
The process of action recognition usually consists of two stages: feature extraction and template matching. In the first stage, feature vectors are generated to represent the actions in videos. In the second stage, the feature vectors are classified using classifiers such as Hidden Markov Models (HMMs) or Support Vector Machines (SVMs) [?]. Most research in action recognition, nevertheless, is devoted to the first stage, namely how to generate good features to represent actions. Those algorithms generally fall into three categories. In the first category [5], [6], particular parts of human bodies are identified at the beginning, and feature vectors are generated by tracking those parts in the spatial and time coordinates. Recognition rates therefore depend highly on the accuracy of distinguishing these specific parts. In the second set of algorithms [7], optical flow estimation is applied to low-resolution video samples; tracking objects in the videos is, nevertheless, a prerequisite for such systems. In the third group, feature vectors are extracted using patches (the so-called bag-of-words) [8], [9], [10], inspired by the latest developments in visual cortex studies and image recognition [11]. Patches in image recognition can be seen as small portions of images that capture local features (see Fig. 1). Algorithms in this category generally include two essential processes: generating prototypes and finding matches between the prototypes and inputs by calculating similarities. In [8], a system built hierarchically on the basis of vision models was proposed; patches in that context are defined as 2-D spatial regions within each frame, and matching is done on a frame basis. Feature vectors of actions are then calculated by finding the best matches across video sequences. A high recognition rate for action recognition has been reported.
However, that system contains six layers owing to the additional processing of time sequences, and pre-processing to eliminate the background is also required. In contrast, [9] extends the definition of patches into a spatio-temporal form that also includes a short time window. First, interest points must be detected using the Harris corner detector. Once interest points are detected, matching is conducted between training and testing samples around those points. To represent the spatio-temporal patches compactly, PCA is applied in advance to reduce the patch size.
Our previous work focused on building hardware-compatible systems that can be expected to achieve real-time performance. In [12], a concept called Directional Edge Displacement (DED) maps was introduced, in which motion in videos is highlighted by taking the difference between time-integrated edge maps and the initial edge maps. Motion fields are then calculated using a block matching algorithm, and those values are integrated along different directions in each frame to represent the characteristics of the motion in a compact form. Finally, the feature vectors generated from each frame are input to HMMs as a time sequence to carry out recognition. An alternative to generating such a time sequence of vectors is the Projected Directional-Motion Histogram (PDMH), which is formed by integrating motion field maps in both the spatial and time domains. Several VLSI chips have already been developed to accelerate the processing using digital [13] as well as analog CMOS technologies [14].
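The DED and PDMH ideas above can be sketched as follows. This is a minimal illustration under our own assumptions: the gradient-based edge detector, the threshold value, and all function names are stand-ins, not the exact filters or parameters used in [12].

```python
import numpy as np

def edge_map(frame, thresh=0.1):
    # Crude gradient-magnitude edge detector (a stand-in for the
    # directional edge filters used in the actual system).
    gy, gx = np.gradient(frame.astype(float))
    return (np.abs(gx) + np.abs(gy)) > thresh

def ded_map(frames, thresh=0.1):
    # Difference between the time-integrated edge map and the initial
    # edge map: keeps edge pixels that appear during the sequence but
    # are absent in the first frame, i.e. edges produced by motion.
    integrated = np.zeros(frames[0].shape, dtype=bool)
    for f in frames:
        integrated |= edge_map(f, thresh)
    return integrated & ~edge_map(frames[0], thresh)

def projected_histograms(motion_map):
    # PDMH-style compaction: integrate the map along each spatial axis
    # to obtain two short histogram vectors instead of a full 2-D map.
    m = motion_map.astype(float)
    return m.sum(axis=0), m.sum(axis=1)
```

Note how static scene content cancels out: edges present in the first frame are subtracted, so only the moving parts survive into the histograms.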
In this paper, we propose a hierarchical model using spatio-temporal patches for action recognition, designed to be compatible with the VLSI architecture developed previously [15]. To limit the complexity of the system, we propose a two-level structure. At the lower level, we introduce the concept of Essential Directional Edge Displacement (EDED) maps to eliminate most of the background noise. At the higher level, where the feature vectors of actions are generated, matching is performed by calculating the similarities between input video sequences and prototype patches extracted from training video samples. Since patches capture local features while discarding absolute position information in the space and time domains, recognition based on patches is robust to several kinds of variation in position and timing. Our approach differs from traditional spatio-temporal patch methods [9], [10] in several respects. First, interest points detected by the Harris corner detector or similar operators are not required in our processing. Instead, we simply select patches that contain enough non-zero motion field values. Since the background is effectively eliminated at the lower level (see Fig. 3(c)) and only the moving parts of the video sequences are captured, patches extracted by this simple criterion are well tuned to local features. Second, to reduce the size of each patch, integration along space and time is adopted rather than computationally demanding methods such as PCA [9]. Third, inspired by recent research on the visual cortex, we calculate the best matches of prototypes over whole video sequences, whereas previous methods compute matches only at detected interest-point locations. Experiments were conducted on a database of gestures with cluttered backgrounds [12].
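The patch-selection criterion and whole-sequence matching described above can be sketched roughly as follows. Patch size, stride, the non-zero-count threshold, and the negative-L1 similarity are all illustrative assumptions; the paper's actual parameters and similarity measure may differ.

```python
import numpy as np

def extract_patches(volume, size=4, step=4, min_nonzero=8):
    # Select spatio-temporal patches from a (T, H, W) motion-field volume,
    # keeping only those with enough non-zero motion values. This replaces
    # interest-point detection with a simple counting criterion.
    T, H, W = volume.shape
    patches = []
    for y in range(0, H - size + 1, step):
        for x in range(0, W - size + 1, step):
            p = volume[:, y:y + size, x:x + size]
            if np.count_nonzero(p) >= min_nonzero:
                patches.append(p.ravel())
    return patches

def best_match_score(prototype, volume, size=4):
    # Slide the prototype over all spatial positions of the whole sequence
    # and keep the best similarity (negative L1 distance here), so no
    # fixed interest-point location is ever required.
    T, H, W = volume.shape
    best = -np.inf
    for y in range(H - size + 1):
        for x in range(W - size + 1):
            p = volume[:, y:y + size, x:x + size].ravel()
            best = max(best, -np.abs(p - prototype).sum())
    return best
```

Taking the maximum over all positions is what makes the resulting feature invariant to where (and, with a temporal sliding window, when) the motion occurs.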
Furthermore, we also tested the algorithm on widely used action databases, namely the Weizmann human database [9] and the KTH human database. The results show that our system achieves robust recognition performance despite its computational simplicity.
Robust features for actions
The proposed architecture employs a hierarchical structure (Fig. 2). At the lower level, global features such as speed and direction are extracted, while at the higher level, invariance to spatio-temporal position is achieved by matching prototype patches against input video sequences.
Experiment
We carried out experiments on three databases. To show robustness against cluttered backgrounds, we tested our algorithm on a gesture perception database. In addition, we conducted experiments on the Weizmann human database and the KTH human database.
Results and discussions
We have tested our algorithm in several respects. Basically, we use confusion matrices to examine the recognition results; in a confusion matrix, each row represents a true label and each column represents a predicted label.
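The confusion matrix convention above (rows = true labels, columns = predicted labels) can be computed with a few lines; the diagonal then gives the correctly classified samples, and its trace divided by the total is the overall accuracy.

```python
import numpy as np

def confusion_matrix(true_labels, pred_labels, n_classes):
    # cm[t, p] counts samples with true label t predicted as label p,
    # so rows are true labels and columns are predicted labels.
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(true_labels, pred_labels):
        cm[t, p] += 1
    return cm

def accuracy(cm):
    # Correct predictions lie on the diagonal.
    return np.trace(cm) / cm.sum()
```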
On the way to VLSI implementation
We tested our system on a computer with a 1.6 GHz Xeon CPU. It took less than 2 min to generate a feature vector for a video sequence from the Weizmann human database. Furthermore, since our algorithm has been designed throughout with VLSI hardware implementation in mind, several time-consuming steps can be accelerated directly once implemented in hardware. For example, the feature extraction at the lower level is particularly computationally expensive
Conclusions
We have proposed a hardware-friendly algorithm and tested it on a gesture database with cluttered backgrounds and on popular action databases with various actions. The hierarchical structure shows a promising ability to recognize actions without additional pre-processing to eliminate the background. Furthermore, we intentionally avoid complex computation by using only summation and Boolean operations, so that implementation on digital chips or FPGAs would be feasible. Future work will focus on reducing
Ruihan Bao received the B.E. degree from Xi'an Jiao Tong University in China, in 2007, the M.E. degree in electronic engineering from University of Tokyo, Japan in 2009. Currently, he is a Ph.D. student at the Department of Electrical Engineering and Information System, University of Tokyo. His research interest includes image processing and video analysis algorithms using VLSIs for real-time performance.
References (32)
- et al., Smart video surveillance: exploring the concept of multiscale spatiotemporal tracking, IEEE Signal Process. Mag. (2005)
- C. Vogler, D. Metaxas, Handshapes and movements: multiple-channel American sign language recognition, in: Gesture-Based...
- H. Touyama, M. Aotsuka, M. Hirose, A pilot study on virtual camera control via steady-state VEP in immersing virtual...
- H. Meng, N. Pears, C. Bailey, A human action recognition system for embedded computer vision application, in: 2007 IEEE...
- D. Ramanan, Learning to parse images of articulated bodies, in: NIPS 2007, NIPS,...
- V. Ferrari, M. Marin-Jimenez, A. Zisserman, Progressive search space reduction for human pose estimation, in: IEEE...
- A.A. Efros, A.C. Berg, E.C. Berg, G. Mori, J. Malik, Recognizing action at a distance, in: ICCV, 2003, pp....
- H. Jhuang, T. Serre, L. Wolf, T. Poggio, A biologically inspired system for action recognition, in: ICCV, 2007, pp....
- P. Dollar, V. Rabaud, G. Cottrell, S. Belongie, Behavior recognition via sparse spatio-temporal features, in: VS-PETS,...
- et al., Unsupervised learning of human action categories using spatial–temporal words, Int. J. Comput. Vision (2008)
- Robust object recognition with cortex-like mechanisms, IEEE Trans. Pattern Anal. Mach. Intell.
- A pixel-parallel self-similitude processing for multiple-resolution edge-filtering analog image sensors, IEEE Trans. Circuits Syst. I: Regular Pap.
Tadashi Shibata was born in Japan in 1948. He received the B.S. degree in electrical engineering and the M.S. degree in material science, both from Osaka University, and the Ph.D. degree from The University of Tokyo.
From 1974 to 1986, he was with Toshiba Corporation, where he worked on VLSI process integration. From 1986 to 1997, he was Associate Professor at Tohoku University, where he studied low-temperature processing and ultra-clean technologies for VLSI fabrication. Since the invention of a new functional device called Neuron MOS Transistor (neuMOS) in 1989, his research interest shifted from devices and materials to circuits and systems. Since 1997, he has been Professor at Department of Electrical Engineering and Information Systems, The University of Tokyo. His current research interest is to develop human-like intelligent computing systems based on the state-of-the-art silicon technology and the biologically inspired as well as psychologically inspired models of the brain.