Pattern Recognition

Volume 81, September 2018, Pages 545-561

Learning structures of interval-based Bayesian networks in probabilistic generative model for human complex activity recognition

https://doi.org/10.1016/j.patcog.2018.04.022

Highlights

  • A family of Bayesian network-based probabilistic generative models is presented to address diversity and uncertainty in complex activity recognition.

  • The network structure in our improved model is learned from empirical data to characterize the inherent structural variability in complex activities.

  • A new complex hand activity dataset, dedicated to complex activity recognition, is made publicly available.

  • Experimental results suggest our improved model outperforms existing state-of-the-art methods by a large margin.

Abstract

Complex activity recognition is challenging due to the inherent uncertainty and diversity in how a complex activity is performed. Normally, each instance of a complex activity has its own configuration of atomic actions and their temporal dependencies. In our previous work, we proposed an atomic action-based Bayesian model that constructs Allen’s interval relation networks to characterize complex activities in a probabilistic generative way: by introducing latent variables from the Chinese restaurant process, our approach is able to capture all possible styles of a particular complex activity as a unique set of distributions over atomic actions and relations. However, a major limitation of our previous models is their fixed network structures, which may lead to an overtrained or undertrained model owing to unnecessary or missing links in a network. In this work, we present an improved model in which the network structure is automatically learned from empirical data, allowing it to characterize complex activities with structural varieties. In addition, a new dataset of complex hand activities has been constructed and made publicly available, which is much larger than any existing dataset. Empirical evaluations on benchmark datasets as well as our in-house dataset demonstrate the competitiveness of our approach.

Introduction

A complex activity consists of a set of temporally-composed events of atomic actions, which are the lowest-level events that can be directly detected from sensors. In other words, a complex activity is usually composed of multiple atomic actions occurring consecutively and concurrently over a duration of time. Modeling and recognizing complex activities remains an open research question, as it faces several challenges: First, understanding complex activities calls for not only the inference of atomic actions, but also the interpretation of their rich temporal dependencies. Second, individuals often have diverse styles of performing the same complex activity. As a result, a complex activity recognition model should be capable of capturing and propagating the underlying uncertainties over atomic actions and their temporal relationships. Third, a complex activity recognition model should also tolerate errors introduced at the atomic-action level, due to sensor noise or low-level prediction errors.

Currently, much research focuses on semantic-based complex activity modeling, as semantic representations have been credited with promising performance and a desirable capacity for human-understandable reasoning [8]. Chang et al. [9] focused on detecting complex events in videos by considering a zero-shot setting where no training data is supplied and evaluating the semantic correlation of each event of interest. Such semantic-based models are capable of representing rich temporal relations, but they often lack the expressive power to capture uncertainty. Many semantic-based models, such as context-free grammar (CFG) [38] and Markov logic network (MLN) [22], [29], are used to represent complex activities and can handle rich temporal relations. Yet the formulae and their weights in these models (e.g. CFG grammars and MLN structures) need to be manually encoded, which is difficult to scale up and becomes almost impossible in practical scenarios where temporal relations among activities are intricate. Although a number of semantic-based approaches have been proposed for learning temporal relations, such as stochastic context-free grammars [42] and Inductive Logic Programming (ILP) [18], they can only learn formulas that are either true or false, but cannot learn their weights, which hinders them from handling uncertainty.

On the other hand, graphical models have become increasingly popular for modeling complex activities because of their ability to manage uncertainty [44]. Unfortunately, most of them can handle only three temporal relations, i.e. equals, follows and precedes. Both the hidden Markov model (HMM) and the conditional random field (CRF) are commonly used for recognizing sequential activities, but are limited in managing overlapping activities [12], [13], [14], [24]. Many variants with more complex structures have been proposed to capture additional temporal relations among activities, such as the interleaved hidden Markov model (IHMM) [31], the skip-chain CRF [23], and so on. However, these models are time point-based, and hence become highly computationally intensive as the number of concurrent activities grows [34]. The dynamic Bayesian network (DBN) can learn more temporal dependencies than HMM and CRF by adding activity duration states, but imposes a greater computational burden [32]. Moreover, the structures of these graphical models are usually specified manually instead of being learned from data. The interval temporal Bayesian network (ITBN) [44] differs significantly from the previous methods, as it is the first graphical model to integrate an interval-based Bayesian network with the 13 Allen relations. Nonetheless, ITBN has several significant drawbacks: First, its directed acyclic structure forces ITBN to ignore some temporal relations in order to ensure a temporally consistent network, which may result in a loss of internal relations. Second, it is computationally expensive to evaluate all possible consistent network structures, especially when the network size is large. Third, ITBN can neither manage multiple occurrences of the same atomic action nor handle networks of arbitrary size, since its network size is fixed to the number of atomic-action types. Fig. 1 illustrates the graph structures of the three commonly-used graphical models.

It is worth noting that we focus on complex activity recognition in this paper; interested readers may consult the excellent reviews [1], [5], [15], [16], [17], [25] for further details on atomic-level action recognition. Atomic actions are primitive events that can be inferred from sensor data and images and cannot be further decomposed under application semantics [39]. The interval of a primitive event can also be obtained as the period of time over which the corresponding status remains unchanged. Many excellent approaches have been proposed in the literature for recognizing atomic actions (also called events in some papers), which can be inferred from various sources. Chatzis and Kosmopoulos [11] presented a variational Bayesian treatment of multistream fused hidden Markov models and applied it to active learning-based visual workflow recognition for human behavior understanding in video sequences. Simonyan and Zisserman [33], [40] built deep convolutional neural networks to estimate the upper-body pose of humans in gesture videos, and also presented a two-stream architecture of deep convolutional networks incorporating spatial and temporal networks for action recognition in video. Chang et al. [10], [19] presented a semantic pooling approach for event detection, recognition, and recounting in videos by defining a semantic saliency that assesses the relevance of each shot to the event of interest.

Our model focuses on representing a complex activity as diverse combinations of atomic actions and their temporal relations under uncertainty, assuming that atomic actions are already recognized and labeled in advance. In the field of human activity recognition, it is increasingly important to understand how those representations work and what they are capturing [21]. Unlike complex activity recognition methods that operate directly on raw values such as sensor data and video clips, the atomic action-based methodology provides an intermediate space between low-level raw data and high-level complex activities, thereby freeing it from dependence on a particular source modality. An atomic action-based recognition system may have several benefits: the ability to operate on a range of data sources, allowing the knowledge of activities to be shared across a wide range of possible sensor modalities and camera types [3]; the ability to handle a range of activities efficiently, reducing the complexity of recognizing complex activities directly from raw source data; and the ability to be reused without special configuration or retraining, regardless of what and how many sources are involved. In our experiments, we adopt existing approaches to detect atomic actions from different sources of raw data, such as motion sensors [5] and videos [7], [44].

To address the problems in the existing models, we presented a generative probabilistic model with Allen’s interval-based relations (GPA in short) to explicitly model complex activities, which is achieved by constructing probabilistic interval-based networks with temporal dependencies. In other words, our model considers a probabilistic generative process of constructing interval-based networks to characterize the complex activities of interest. Briefly, a set of latent variables called tables, generated from the Chinese restaurant process (CRP) [35], is introduced to construct the interval-based network structure of a complex activity. Each latent variable characterizes a unique style of this complex activity through its distinct set of atomic actions and their temporal dependencies based on Allen’s interval relations. Using the CRP has two advantages: it allows our model to describe a complex activity with an arbitrary number of intervals, and to take into account multiple occurrences of the same atomic action. We further introduce interval relation constraints that guarantee the whole network generation process is globally temporally consistent without loss of internal relations. Based on these ideas in our previous work [28], [30], we presented three variants: GPA-C, where only the links between two neighboring nodes are constructed in chain-based networks; GPA-F, where all pairwise links with a fixed direction from past to future are constructed in fully-connected networks; and GPA-T, where only the links between two nodes assigned to the same table are considered.
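
For readers unfamiliar with the CRP, the standard table-assignment prior is recalled below. The notation (zi for the table of the ith interval, α for the concentration parameter, nt for the current size of table t) is purely illustrative and does not reproduce the exact formulation used later in the paper.

```latex
% Standard Chinese restaurant process prior (illustrative notation):
% the i-th interval either joins an existing table t or opens a new one.
P(z_i = t \mid z_{1:i-1}) =
\begin{cases}
  \dfrac{n_t}{i - 1 + \alpha}, & \text{existing table } t \text{ occupied by } n_t \text{ earlier intervals},\\[4pt]
  \dfrac{\alpha}{i - 1 + \alpha}, & \text{new table}.
\end{cases}
```

The "rich get richer" property of this prior encourages intervals to cluster onto a small number of tables, each of which the model interprets as a distinct style of the complex activity.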

A major limitation of these models is their fixed network structures, which may lead to an overtrained or undertrained model owing to unnecessary or missing links. To further improve our model, instead of manually fixing the network structure, in this work the structure is learned from training data. By learning network structures, our improved model, named GPA-S, is more effective than existing graphical models at characterizing the inherent structural variability in complex activities. A further comparison study is summarized in Table 1, which also highlights our main contributions.

It is worth mentioning that, despite the increasing need from diverse applications in the area of complex activity recognition, there are only a few publicly-available complex activity recognition datasets [4], [26], [37]. In particular, the number of instances is on the order of hundreds at most. This motivates us to propose a dedicated large-scale dataset for depth camera-based complex hand activity recognition, which contains about an order of magnitude more instances than the existing datasets. We have made the dataset and related tools publicly available on a dedicated project website in support of the open-source research activities in this emerging research community.

Section snippets

Definitions and problem formulation

Although we have given the definitions and problem formulation in our previous papers [28], [30], we explain them again here, especially for readers following our approach for the first time, with more examples and comprehensive illustrations. Assume we have at hand a dataset D of N instances from a set of L types of complex activities involving a set of M types of atomic actions A = {A1, A2, …, AM}. An atomic-action interval (interval for short) is written as Ii = ai@[ti−, ti+), where ti− and
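
For concreteness, the 13 Allen relations between two such intervals can be determined directly from their endpoints. The following is a minimal illustrative sketch (our own code and naming, not the implementation used in our experiments), assuming intervals with numeric start and end times:

```python
def allen_relation(s1, e1, s2, e2):
    """Return the Allen relation of interval 1 = [s1, e1) with respect to
    interval 2 = [s2, e2), covering all 13 relations."""
    if e1 < s2:
        return "before"
    if e2 < s1:
        return "after"
    if e1 == s2:
        return "meets"
    if e2 == s1:
        return "met-by"
    if s1 == s2 and e1 == e2:
        return "equals"
    if s1 == s2:
        return "starts" if e1 < e2 else "started-by"
    if e1 == e2:
        return "finishes" if s1 > s2 else "finished-by"
    if s2 < s1 and e1 < e2:
        return "during"
    if s1 < s2 and e2 < e1:
        return "contains"
    return "overlaps" if s1 < s2 else "overlapped-by"


# Example: an interval [2, 5) overlaps an interval [4, 8).
print(allen_relation(2.0, 5.0, 4.0, 8.0))  # -> "overlaps"
```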

Our GPA models

For any complex activity type l (1 ≤ l ≤ L), denote by Dl ⊆ D the corresponding subset of Nl instances, where each element d ∈ Dl is an instance of the lth type of complex activity. In the GPA family, the generative process of constructing an interval-based network Gd = (Vd, Ed) for describing the observed instance d consists of two parts, node generation and link generation, which are described below.
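
To make the two-part process more concrete, the sketch below outlines node generation (a CRP table draw followed by an atomic-action draw) and link generation (an Allen-relation draw for each admissible node pair). All names, the pair-selection rule, and the way the distributions θ and φ are supplied are illustrative assumptions; the exact distributions, constraints, and variant-specific link rules (GPA-C/F/T/S) are specified in the remainder of this section.

```python
import random

def generate_instance(alpha, draw_theta, relation_weights, n_intervals, seed=0):
    """Illustrative sketch of a CRP-based generative process for one
    interval network (not the paper's exact algorithm).

    alpha:            CRP concentration parameter
    draw_theta:       callable returning a new table's weights over atomic-action types
    relation_weights: callable (node_i, node_j) -> weights over the 13 Allen relations
    """
    rng = random.Random(seed)
    table_sizes, table_thetas = [], []
    nodes, links = [], {}

    # Node generation: seat each interval at a table, then draw its atomic action.
    for i in range(n_intervals):
        weights = table_sizes + [alpha]          # existing tables, or open a new one
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(table_sizes):
            table_sizes.append(0)
            table_thetas.append(draw_theta())    # per-table action distribution (theta_k)
        table_sizes[k] += 1
        theta_k = table_thetas[k]
        action = rng.choices(range(len(theta_k)), weights=theta_k)[0]
        nodes.append((k, action))

    # Link generation: draw an Allen relation for each admissible node pair.
    # (Which pairs are linked depends on the GPA variant; all pairs are used here.)
    for i in range(n_intervals):
        for j in range(i + 1, n_intervals):
            rel = rng.choices(range(13), weights=relation_weights(nodes[i], nodes[j]))[0]
            links[(i, j)] = rel
    return nodes, links

# Example usage with uniform illustrative distributions over 5 atomic-action types.
nodes, links = generate_instance(
    alpha=1.0,
    draw_theta=lambda: [1.0] * 5,
    relation_weights=lambda u, v: [1.0] * 13,
    n_intervals=4,
)
```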

Structure learning

In what follows we focus on how to learn the network structure and the parameter vectors θ and φ from the training data Dl for a particular complex activity type l.
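
The scoring function and search procedure actually used by GPA-S are developed in the remainder of this section and are not reproduced in the snippet below. Purely as a generic illustration of score-based Bayesian network structure learning over discrete variables, a greedy edge-addition search under a BIC-style score (with our own simplifications, e.g. counting only parent configurations observed in the data) might look as follows:

```python
import math
from collections import Counter

def bic_family_score(data, child, parents):
    """BIC contribution of one node given a parent set, for categorical data.
    `data` is a list of dicts mapping variable name -> discrete value."""
    n = len(data)
    child_vals = {row[child] for row in data}
    parent_counts = Counter(tuple(row[p] for p in sorted(parents)) for row in data)
    joint_counts = Counter((tuple(row[p] for p in sorted(parents)), row[child]) for row in data)
    loglik = sum(c * math.log(c / parent_counts[pa]) for (pa, _), c in joint_counts.items())
    n_params = max(len(parent_counts), 1) * (len(child_vals) - 1)
    return loglik - 0.5 * n_params * math.log(n)

def creates_cycle(parents, child, new_parent):
    """True if adding new_parent -> child would close a directed cycle,
    i.e. if child is already an ancestor of new_parent."""
    stack, seen = [new_parent], set()
    while stack:
        v = stack.pop()
        if v == child:
            return True
        if v not in seen:
            seen.add(v)
            stack.extend(parents[v])
    return False

def greedy_structure_search(variables, data):
    """Greedy hill-climbing by single edge additions under the decomposable score."""
    parents = {v: set() for v in variables}
    scores = {v: bic_family_score(data, v, parents[v]) for v in variables}
    while True:
        best = None
        for child in variables:
            for cand in variables:
                if cand == child or cand in parents[child] or creates_cycle(parents, child, cand):
                    continue
                gain = bic_family_score(data, child, parents[child] | {cand}) - scores[child]
                if gain > 1e-9 and (best is None or gain > best[0]):
                    best = (gain, child, cand)
        if best is None:
            return parents
        gain, child, cand = best
        parents[child].add(cand)
        scores[child] += gain
```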

Experiments

Experiments are carried out on three benchmark datasets as well as our in-house dataset on recognizing complex hand activities. In addition to the improved model GPA-S, three variants with fixed network structures are also considered: GPA-C for chain-based structures; GPA-F and GPA-T for fully-connected structures. Several well-established models are employed as the comparison methods, which include IHMM [31], dynamic Bayesian network (DBN) [23] and ITBN [44], where IHMM and DBN are implemented

Conclusion

We present an interval-based Bayesian generative network approach that accounts for the latent structures of complex activities by constructing probabilistic interval-based networks with temporal dependencies for complex activity recognition. In particular, the Bayesian framework and the novel application of the Chinese restaurant process (CRP) in our improved GPA-S model enable us to explicitly capture the inherent structural variability in each of the complex activities. In addition, we make a new complex

Acknowledgments

We would like to sincerely thank Yongzhong Yang, Lakshmi N. Govindarajan and Prof. Li Cheng from the Bioinformatics Institute, A*STAR, Singapore, for their help with the ASL dataset preparation and atomic hand action detection.

This work was supported by grants from the Fundamental Research Funds for the Key Research Program of Chongqing Science & Technology Commission (grant nos. cstc2017rgzn-zdyf0064, cstc2017jcyjBX0025), the Chongqing Provincial Human Resource and Social Security Department (grant no.


References (44)

  • X. Chang et al., Feature interaction augmented sparse learning for fast kinect motion detection, IEEE Trans. Image Process. (2017)
  • X. Chang et al., Bi-level semantic representation analysis for multimedia event detection, IEEE Trans. Cybern. (2017)
  • X. Chang et al., Semantic concept discovery for large-scale zero-shot event detection, International Conference on Artificial Intelligence (2015)
  • X. Chang et al., Semantic pooling for complex event analysis in untrimmed videos, IEEE Trans. Pattern Anal. Mach. Intell. (2017)
  • S.P. Chatzis et al., Visual workflow recognition using a variational Bayesian treatment of multistream fused hidden Markov models, IEEE Trans. Circuits Syst. Video Technol. (2012)
  • S.P. Chatzis et al., A variational Bayesian methodology for hidden Markov models utilizing student’s-t mixtures, Pattern Recognit. (2011)
  • S.P. Chatzis et al., A conditional random field-based model for joint sequence segmentation and classification, Pattern Recognit. (2013)
  • D.J. Cook et al., Activity discovery and activity recognition: a new partnership, IEEE Trans. Cybern. (2013)
  • M. Devanne et al., 3-D human action recognition by shape analysis of motion trajectories on Riemannian manifold, IEEE Trans. Cybern. (2014)
  • K.S.R. Dubba et al., Learning relational event models from video, J. Artif. Intell. Res. (2015)
  • H. Fan et al., Complex event detection by identifying reliable shots from untrimmed videos, IEEE International Conference on Computer Vision (2017)
  • X. Fan et al., An improved lower bound for Bayesian network structure learning, AAAI (2015)

    Li Liu is currently an associate professor at Chongqing University. He received his Ph.D. in Computer Science from the Université Paris-sud XI in 2008. He has served as an associate professor at Lanzhou University in China and as a Senior Research Fellow in the School of Computing at the National University of Singapore. His research interests are in pattern recognition, data analysis, and their applications to human behavior. He aims to contribute to interdisciplinary research between computer science and human-related disciplines. Li has published widely in conferences and journals, with more than 80 peer-reviewed publications. Li has been the Principal Investigator of several funded projects from government and industry.

    Shu Wang is currently a lecturer at Southwest University. She received her Ph.D. from Lanzhou University. She previously served as an associate professor at Lanzhou University of Technology. Her research interests include sensor development techniques and electronic materials for sensors.

    Bin Hu is a professor and dean at Lanzhou University. He is an IET Fellow, Co-chair of the IEEE SMC TC on Cognitive Computing, and a Chair Professor of the National Recruitment Program of Global Experts. His research interests include computational psychophysiology, pervasive computing, and mental health care. He has served as an associate editor for peer-reviewed journals such as IEEE Trans. Affective Computing and IET Communications.

    Qingyu Xiong is a professor and dean at Chongqing University. He received the B.S. and M.S. degrees from the School of Automation, Chongqing University in 1986 and 1991, respectively, and the Ph.D. degree from Kyushu University of Japan in 2002. His research interests include neural networks and their applications. He has published more than 100 journal and conference papers in these areas. Moreover, he has more than 20 research and applied grants.

    Junhao Wen is a professor and vice dean at Chongqing University. He received the Ph.D. degree from Chongqing University in 2008. His research interests include service computing, cloud computing, and dependable software engineering. He has published more than 80 refereed journal and conference papers in these areas. He has undertaken more than 30 research and industrial projects and developed many commercial systems and software tools.

    David S. Rosenblum received the Ph.D. degree from Stanford University. He is a professor and dean of the School of Computing at the National University of Singapore. He was an associate professor at the University of California, Irvine, and a professor at University College London. His research interests include probabilistic modeling and analysis, and the design and validation of mobile, ubiquitous computing systems. He is a Fellow of the ACM and the IEEE, and the Editor-in-Chief of the ACM Transactions on Software Engineering and Methodology.
