1 Introduction

Societies around the globe have become accustomed to the ubiquitous presence of camera sensors in private, public and corporate spaces for purposes of security, monitoring and safety, and even as a natural user interface for human-machine interaction. Cameras may be employed to facilitate data collection, to serve as a data source for controlling actuators, or to monitor the status of a process, which includes tracking. There is thus an increasing need for video analytic systems that operate across different domains to recognize interesting events for the purpose of behavior analysis and activity recognition.

In order to recognize interesting events across different domains, we propose in this study a cross-domain framework supported by relevant theory, leading to an Open Surveillance concept - a systemic organization of components that will streamline future system development. The framework utilizes Markov logic networks, a combination of probabilistic Markov network models with first-order logic. The proposed framework paves the way for the establishment of a software library similar to the widely used OpenCV library for computer vision (see Fig. 1). Open Surveillance may be conceptualized as the middleware that connects the computer vision (i.e., domain-specific) components with the application that responds to the interpretation of the streaming video.

Fig. 1. Structure of processing: the computer vision functions and the definition of the logic description are unique to each domain

Existing approaches to behavior analysis and activity recognition are domain specific; little effort has been made toward a framework that functions across domains. A framework is proposed in [1] to recognize behavior in one-on-one basketball on top of arbitrary trajectories obtained from tracking the ball, hands and feet. This framework uses video analysis and mixed probabilistic and logical inference to annotate events, given a semantic description of what generally happens in a scenario. However, in order to extend the framework to other domains, new low-level event (LLE) predicates must be redefined according to the characteristics of each domain. A more general and systematic framework is therefore needed, and it is essential for building components that function across surveillance domains.

Interesting events may range from simple to complex activities and from single-agent to multi-agent activities. As noted in [1], multi-agent activities are challenging because interactions lead to large state spaces and further complicate the already uncertain low-level processing. A spatio-temporal structure is commonly leveraged to distinguish among complex activities, and trajectories are one representation used to capture motion and its spatio-temporal structure. The issues addressed in this study include: (a) How may patterns of motion with spatio-temporal structure be represented, and how can they be decomposed into atomic elements? (b) What reasoning mechanisms must be established to infer high-level events (HLEs) from LLEs? (c) Given the noisy nature of extracting LLEs from videos, how may the uncertainty be managed during inference? (d) How may the framework be generalized to function across multiple domains?

A complete, general set of context-free LLEs is developed, represented as spatial, temporal and spatio-temporal predicates, such that any activity can be described using this set of LLEs structured in first-order logic, with time represented using an approach based on Allen's interval logic. Trajectories are used to represent the motion patterns of objects of interest. Because this motion representation is uncertain, Markov logic networks are used to handle the noisy input trajectories, and the set of LLEs is integrated with a Markov logic network. We do not use manual annotations of video to indicate the presence of objects; rather, we assume that detection of the objects of interest can be handled by existing methods. Moreover, the capturing sensor is assumed to be stationary.

We tested our approach in two different domains: human gestures and human interactional activities. For human gestures two datasets are used: the Microsoft human action dataset and the Weizmann dataset. For human interactional activities we used a synthetic dataset.

Every action is distinguished from others by its own specific combination of spatio-temporal patterns of body parts or agent centroids. Events are normally defined by their interaction with properties of the world and observations of the world; observations of the world are incorporated into a knowledge base using a set of soft rules. For simplicity we focus only on the definition of new LLEs and on how to design them to function in the cross-domain framework. Moreover, we treat both properties and observations as LLEs, since both are generated from preprocessing results.

This work is a major step toward assisting, in many sectors, the development of new video stream monitoring systems; such systems must rely less on the constant attention of human operators. In the following section related work is discussed, and the event reasoning framework is explained in more detail in the third section. Experiments and conclusions are discussed in the fourth and last sections, respectively.

1.1 Related Work

A survey on human action recognition categorizes learning approaches into direct classification, temporal state-space models, and action detection. Our approach falls into the category of temporal state-space models.

More specifically, it belongs to the class of generative temporal state-space models, which learn a joint distribution over both observations and action labels. The three principal generative approaches in the literature are based on HMMs, grammars, and Markov logic networks.

The hidden states of hidden Markov models (HMMs) correspond to different phases (key gestures or sub-actions) of a complete action; HMMs model state transition probabilities and observation probabilities. Oliver et al. [2] used a coupled HMM to model the interaction between two tracks. When a single HMM is trained for each action, action recognition is formulated as finding the action HMM that yields the highest probability for the observed sequence. HMMs are widely used in human action recognition [3]. However, HMMs do not handle missing observations well and require large training sets to learn structures that humans can describe easily.

Grammars are also categorized as generative models; they specify explicitly the order in which action sequences can be observed. Context-free grammars (CFGs) [5] can define high-level interactions hierarchically using spatial and temporal logic predicates. Context-free grammar parsing does not handle the uncertainty of low-level processing and is therefore sensitive to low-level failures. While CFGs have been extended to incorporate probabilistic inference [6], experience indicates that they do not scale well to large datasets and cannot deal with multi-agent cases.

Markov logic networks (MLNs) [7] are a probabilistic logic that combines the probabilistic Markov network model with first-order logic. Morariu and Davis [1] presented an MLN-based framework for automatic recognition of complex multi-agent events. They automatically detect and track players, their hands and feet, and the ball, generating a set of trajectories that is used in conjunction with spatio-temporal relations to generate event observations. Domain knowledge plays an important role in defining rules, observations, properties and actions of interest. Perse et al. [8] transform trajectories into a sequence of semantically meaningful symbols and compare them with templates provided by domain experts in order to analyze team activities in basketball games. However, there has been no research toward unifying low-level observations, properties and actions into an intermediate layer on top of which high-level events of interest can be developed for different applications in image/video-analysis-based surveillance. To obtain a proof of concept (POC) for the proposed cross-domain logic framework, the rules expressing knowledge in our experiments are provided manually, though they could also be learned through training.

2 Logic Event Reasoning Framework

The proposed logic event reasoning framework uses intuitive knowledge about human actions and interactions, expressed as rules over spatial, temporal and spatio-temporal semantic-free LLEs defined over trajectories, to infer HLEs in a descriptive first-order logic (in Horn clause form). For video analytics applications, popular top-down approaches extract activities of interest from videos by providing spatio-temporal information through trajectories. Activities are thus captured and represented as trajectories, which form the input to the proposed logic reasoning framework as shown in Fig. 1. The trajectories are first preprocessed to acquire the universe of discourse, which consists of a set of time intervals (T), a set of objects (O), a set of atomic motion trajectory segments (S) and a set of locations of interest (L). Afterwards, the semantic-free low-level events are grounded from the information acquired from the preprocessed trajectories.

The grounded databases are then populated with grounded predicates for further logic reasoning to recognize high-level events. Some predicates are grounded directly from the sensory data (described below). The spatial, spatio-temporal and temporal predicates are discussed in detail in the section on low-level event grounding. Rules for HLEs must be predefined for each application domain, although learning them from labelled exemplars is an alternative. The grounded predicates populate the inference graph of a Markov logic network [9]. Trajectories are decomposed into segments, and every segment indicates a possible event interval, from which a larger set of all possible action intervals can be derived. Probabilistic inference is then used to determine which event interval candidate best indicates an interesting high-level event.
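To make the data flow concrete, the following sketch shows one possible way to represent the universe of discourse and the grounded database in Python; all class and field names are illustrative assumptions rather than part of the original framework.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Interval:          # an element of T: a time interval in logical units
    start: float
    end: float

@dataclass
class Segment:           # an element of S: an atomic motion trajectory segment
    points: List[Tuple[float, float]]   # (x, y) samples
    interval: Interval                  # event interval associated with the segment

@dataclass
class Location:          # an element of L: a location of interest (e.g., a stop)
    x: float
    y: float

@dataclass
class Obj:               # an element of O: a labelled object (label is domain dependent)
    label: str

@dataclass
class GroundedDatabase:
    """Holds the grounded LLE predicates produced from preprocessed trajectories."""
    facts: List[Tuple] = field(default_factory=list)   # e.g., ("move", obj, seg, interval)

    def add(self, predicate: str, *args) -> None:
        self.facts.append((predicate, *args))
```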

2.1 Preprocessing

In order to extract the elements of the universe of discourse from raw trajectories, a preprocessing step is needed. Raw trajectories are decomposed into atomic motion segments, the smallest meaningful sub-trajectories, within which there are no changes of speed or movement direction. Every atomic motion segment indicates either a movement or a stay at one location, and every atomic motion segment is associated with an event interval. Moreover, the action intervals of high-level events need to be derived from the event intervals extracted from observations; how action intervals are generated is discussed in more detail in Sect. 3.1.
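As a minimal sketch of this preprocessing step (assuming 2D point trajectories with timestamps; the turn and speed-ratio thresholds are our own illustrative choices), the decomposition into atomic motion segments could look as follows:

```python
import math
from typing import List, Tuple

Point = Tuple[float, float, float]   # (x, y, t)

def heading(p: Point, q: Point) -> float:
    """Movement direction (radians) from p to q."""
    return math.atan2(q[1] - p[1], q[0] - p[0])

def speed(p: Point, q: Point) -> float:
    dt = q[2] - p[2]
    return math.hypot(q[0] - p[0], q[1] - p[1]) / dt if dt > 0 else 0.0

def atomic_segments(traj: List[Point],
                    max_turn: float = math.radians(30),
                    max_speed_ratio: float = 1.5) -> List[List[Point]]:
    """Split a raw trajectory into atomic motion segments: sub-trajectories with
    no abrupt change of movement direction or speed (thresholds are assumptions)."""
    segments, current = [], [traj[0]]
    for i in range(1, len(traj) - 1):
        turn = abs(heading(traj[i], traj[i + 1]) - heading(traj[i - 1], traj[i]))
        turn = min(turn, 2 * math.pi - turn)          # wrap the angle difference
        v_prev, v_next = speed(traj[i - 1], traj[i]), speed(traj[i], traj[i + 1])
        ratio = max(v_prev, v_next) / max(min(v_prev, v_next), 1e-6)
        current.append(traj[i])
        if turn > max_turn or ratio > max_speed_ratio:
            segments.append(current)                  # break at the abrupt change
            current = [traj[i]]
    current.append(traj[-1])
    segments.append(current)
    return segments
```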

2.2 Low-Level Event Grounding

The domain of discourse over which the low-level motion predicates are defined is as follows:

\( T = t_{1} ,t_{2} , \ldots \), a set of time intervals defined in terms of logical units

\( O = o_{1} ,o_{2} , \ldots \), a set of labelled objects; the labels are application domain dependent

\( S = \sigma_{1} ,\sigma_{2} , \ldots \), a set of atomic motion trajectory segments

\( L = \delta_{1} ,\delta_{2} , \ldots \), a set of locations

The spatial, temporal and spatio-temporal predicates over this domain of discourse are defined to be the semantic-free LLEs. The spatial and spatio-temporal predicates are described below; doing so also illustrates the basic simplicity of the computer vision processing that must be revisited each time a new application domain is introduced. Fewer than a dozen predicates are needed, and they may be categorized as spatial and spatio-temporal.

Spatial-predicates:

\( near\left( {\delta_{i} ,\delta_{j} } \right) \), locations \( \delta_{i} \) and \( \delta_{j} \) are near each other

\( far\left( {\delta_{i} ,\delta_{j} } \right) \), locations \( \delta_{i} \) and \( \delta_{j} \) are far from each other

\( parellel\left( {\sigma_{i} ,\sigma_{j} } \right) \), motion trajectory segments \( \sigma_{i} \) and \( \sigma_{j} \) are parallel

\( cross\left( {\sigma_{i} ,\sigma_{j} } \right) \), motion trajectory segments \( \sigma_{i} \) and \( \sigma_{j} \) cross each other without extension

\( nonparellel\left( {\sigma_{i} ,\sigma_{j} } \right) \), motion trajectory segments \( \sigma_{i} \) and \( \sigma_{j} \) cross each other only when extended

\( on\left( {\delta_{i} ,\sigma_{j} } \right) \), location \( \delta_{i} \) lies on the motion trajectory segment \( \sigma_{j} \)

Spatial-temporal predicates:

\( move\left( {o_{i} ,\sigma_{i} ,t_{i} } \right) \), object \( o_{i} \) moves in motion trajectory segment \( \sigma_{i} \) within time interval \( t_{i} \)

\( stopAt\left( {o_{i} ,\delta_{i} ,t_{i} } \right) \), object \( o_{i} \) stops at location \( \delta_{i} \) within time interval \( t_{i} \)

Functions over the domain of discourse are as follows:

\( D_{i} \left( {\sigma_{j} } \right) \), the direction of motion segment \( \sigma_{j} \) is the \( i \)th quantized direction

\( startLoc\left( {\sigma_{i} } \right) \), the start location of trajectory segment \( \sigma_{i} \)

\( endLoc\left( {\sigma_{i} } \right) \), the end location of trajectory segment \( \sigma_{i} \)

\( len\left( {\sigma_{i} } \right) \), the length of trajectory segment \( \sigma_{i} \)

\( v\left( {\sigma_{i} } \right) \), the average speed of trajectory segment \( \sigma_{i} \)

\( a\left( {\sigma_{i} } \right) \), the acceleration of trajectory segment \( \sigma_{i} \)

Temporal predicates:

The 13 temporal relationships between these predicates, which are defined over time intervals, are expressed using the following base binary relations and their inverses: before, meets, overlaps, starts, during, finishes, and equals; these are known as Allen's temporal interval relations [10].
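As an illustration of how such LLEs might be grounded from preprocessed data (the distance and angle thresholds, and the helper names, are our own assumptions, not the paper's implementation), consider the following sketch:

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]

def near(a: Point, b: Point, thresh: float = 1.0) -> bool:
    """Spatial predicate near(a, b): locations closer than an assumed threshold."""
    return math.dist(a, b) < thresh

def direction(seg: List[Point]) -> float:
    """Overall heading of a segment, from its first to its last point (radians)."""
    (x0, y0), (x1, y1) = seg[0], seg[-1]
    return math.atan2(y1 - y0, x1 - x0)

def parellel(s1: List[Point], s2: List[Point], tol: float = math.radians(10)) -> bool:
    """Spatial predicate parellel(s1, s2): headings agree up to an assumed tolerance."""
    d = abs(direction(s1) - direction(s2)) % math.pi
    return min(d, math.pi - d) < tol

def allen_relation(t1: Tuple[float, float], t2: Tuple[float, float]) -> str:
    """A few of Allen's interval relations between intervals t1 and t2 (simplified)."""
    if t1[1] < t2[0]:
        return "before"
    if t1[1] == t2[0]:
        return "meets"
    if t1 == t2:
        return "equals"
    if t1[0] == t2[0] and t1[1] < t2[1]:
        return "starts"
    if t1[1] == t2[1] and t1[0] > t2[0]:
        return "finishes"
    if t2[0] < t1[0] and t1[1] < t2[1]:
        return "during"
    return "overlaps"   # remaining overlapping cases, collapsed for brevity
```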

As mentioned previously, the trajectories of the objects of interest are the first-hand information describing the motion of the objects. From the raw trajectories we need to derive knowledge in terms of the domain of discourse: time intervals (T), labelled objects (O), atomic motion trajectory segments (S) and locations of interest (L). As shown in Fig. 2, how this knowledge is derived from trajectories is discussed for two domains: human action and human interaction. The left part of Fig. 2 illustrates a person executing a sidekick; a human interaction "meet" is shown on the right of the figure.

Fig. 2. Computer vision methods are used to extract objects, motion (segments with direction), and temporal data from video and create corresponding predicates

For the sidekick at the left, the moving object is the left foot o1. Trajectories of the left foot are obtained by tracking o1 and are abstracted into segments \( \sigma_{1} \) and \( \sigma_{2} \), also shown in the two graphs at the left; proper segmentation of a complex motion is a research problem in its own right. These two extracted motion trajectory segments are associated with time intervals \( t_{1} \) and \( t_{2} \), respectively; in the language of Allen's temporal relations, \( t_{1} \) meets \( t_{2} \). In aggregate, these result in the grounded predicates shown in the database of Fig. 2. For the human action sidekick, the spatio-temporal predicates \( move\left( {o_{1} ,\sigma_{1} ,t_{1} } \right) \) and \( move\left( {o_{1} ,\sigma_{2} ,t_{2} } \right) \) describe the movement. The quantized movement directions of the two segments \( \sigma_{1} \) and \( \sigma_{2} \) are expressed as the functions over segments \( D_{2} \left( {\sigma_{1} } \right) \) and \( D_{4} \left( {\sigma_{2} } \right) \). Together these predicates give detailed information about the movement of the body part footLeft: \( move\left( {footLeft,\sigma_{1} ,t_{1} } \right) \) and \( D_{2} \left( {\sigma_{1} } \right) \) state that footLeft moves along segment \( \sigma_{1} \) in the second quantized direction within interval \( t_{1} \); \( move\left( {footLeft,\sigma_{2} ,t_{2} } \right) \) and \( D_{4} \left( {\sigma_{2} } \right) \) state that footLeft moves along segment \( \sigma_{2} \) in the fourth quantized direction within interval \( t_{2} \). Relations among the time intervals \( t_{1} \), \( t_{2} \) and \( t_{3} \) are described by the temporal predicates \( meets\left( {t_{1} ,t_{2} } \right) \), \( starts\left( {t_{1} ,t_{3} } \right) \) and \( finishes\left( {t_{2} ,t_{3} } \right) \). The nonparallel spatial relation of the two segments is grounded as the spatial predicate \( nonparellel\left( {\sigma_{1} ,\sigma_{2} } \right) \).

Similarly, for the human interaction "meet" at the right of Fig. 2, \( o_{1} \) and \( o_{2} \) are two agents moving along straight paths. The agents are tracked through time and their trajectories are extracted into either motion segments or locations of interest: the trajectory of agent1 yields segment \( \sigma_{1} \) and location of interest \( \delta_{1} \), and the trajectory of agent2 yields segment \( \sigma_{2} \) and location of interest \( \delta_{2} \). The movements of agent1 and agent2 are described by the spatio-temporal predicates \( move\left( {o_{1} ,\sigma_{1} ,t_{1} } \right) \), \( stopAt\left( {o_{1} ,\delta_{1} ,t_{3} } \right) \), \( move\left( {o_{2} ,\sigma_{2} ,t_{2} } \right) \) and \( stopAt\left( {o_{2} ,\delta_{2} ,t_{4} } \right) \): agent1 moves along motion segment \( \sigma_{1} \) within time interval \( t_{1} \) and stops at location \( \delta_{1} \) within time interval \( t_{3} \); agent2 moves along motion segment \( \sigma_{2} \) within time interval \( t_{2} \) and stops at location \( \delta_{2} \) within time interval \( t_{4} \). The spatial relation between the two locations of interest is grounded as the predicate \( near\left( {\delta_{1} ,\delta_{2} } \right) \). The two motion segments \( \sigma_{1} \) and \( \sigma_{2} \) are parallel, which is described as \( parellel\left( {\sigma_{1} ,\sigma_{2} } \right) \). The temporal relations between the four time intervals are \( before\left( {t_{1} ,t_{3} } \right) \), \( before\left( {t_{1} ,t_{4} } \right) \), \( before\left( {t_{2} ,t_{3} } \right) \), \( before\left( {t_{2} ,t_{4} } \right) \), \( equal\left( {t_{1} ,t_{2} } \right) \) and \( equal\left( {t_{3} ,t_{4} } \right) \).

All action rules consist of the three types of predicates (temporal, spatial and spatio-temporal), which are described in detail above. (In reality the predicates are not grounded at this point; the grounding is shown here to assist the illustration.)

The keys to the later stages of processing are twofold: (1) segments that correctly decompose a motion (such as a human gesture); and (2) a choice of predicates that enables the proper interpretation of LLEs. We provide models and examples. For instance, our preliminary testing focused on human gestures; heuristically, trajectories were segmented at sharp inflection points or at points where the speed changes abruptly, such as stops, and each segment was abstracted as a straight line.

Excepting the temporal predicates, only a single motion predicate was needed, appropriately instantiated from the domain of discourse described earlier.

2.3 High-Level Event Representation

Given the grounded database, rules describing high-level activities are needed for the MLN inference engine to calculate the probability of each state of the world; in other words, the probability of high-level events given the observations stored in the grounded database must be inferred by the MLN. The rules for a high-level event can be written as simple Horn clauses reasoning over the semantic-free low-level events.

Due to space limitations, a subset of rule examples for human actions and interactions is selected and illustrated as follows:
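The full rule set is given in Appendix A; as an illustration of the general Horn-clause form only (the exact predicates and interval bindings used in the paper's rules may differ), a rule for a left sidekick built from the LLEs of Fig. 2 could be sketched as

$$ move\left( {footLeft,\sigma_{1} ,t_{1} } \right) \wedge D_{2} \left( {\sigma_{1} } \right) \wedge move\left( {footLeft,\sigma_{2} ,t_{2} } \right) \wedge D_{4} \left( {\sigma_{2} } \right) \wedge meets\left( {t_{1} ,t_{2} } \right) \wedge nonparellel\left( {\sigma_{1} ,\sigma_{2} } \right) \wedge starts\left( {t_{1} ,t_{3} } \right) \wedge finishes\left( {t_{2} ,t_{3} } \right) \Rightarrow LeftSideKick\left( {t_{3} } \right) $$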

2.4 MLNs

Let \( x \) be the state of the world, i.e., the truth values of all LLEs and HLEs. In general, we wish to know the probability of each state \( x \in X \) of the system, which can be expressed in a Markov network as

$$ P\left( {X = x} \right) = \frac{1}{Z}\prod\limits_{j} {\phi_{j} \left( x \right)} = \frac{1}{Z}\exp \left( {\sum\limits_{i} {w_{i} } \;f_{i} \left( x \right)} \right) $$

where \( Z \) is the partition function, and \( \phi_{j} \left( x \right) \) and \( f_{i} \left( x \right) \) are real-valued potential and feature functions of the state. The basic inference task can be stated as finding the most probable state \( x \) given some evidence \( y\; \subseteq \;x \), which is formally defined as

$$ \mathop {\arg \hbox{max} }\limits_{x} P\left( {x\left| y \right.} \right) = \mathop {\arg \hbox{max} }\limits_{x} \sum\limits_{i} {w_{i} } \;f_{i} \left( x \right) $$

To solve this arg max we choose Markov logic networks. MLNs are a language that combines first-order logic and Markov networks. On the logic side, formulas become soft constraints: a world that violates a formula is less probable than one that satisfies it, rather than impossible. On the statistical side, complex models can be represented compactly, and MLNs can express any probability distribution over the set of possible worlds \( X \). Compared to previous developments in knowledge-based model construction and statistical relational learning, MLNs are less restricted and are supported by a comprehensive set of learning and inference algorithms. Due to their generality, MLNs can integrate logical and statistical approaches from different fields within one framework.

Refer again to Fig. 2. The grounding database is shown populated with predicates; however, this is for illustration only. As presently formulated, we populate it with tuples from the relation \( R\left( {T,O,S,L} \right) \). We refer to the grounding database together with the logic rules \( F_{i} \) as the knowledge base (KB). Prior to inference, the Markov network must be instantiated from the KB and the weights. First, a set of constants representing objects in the domain of interest is extracted from the grounding database. Second, using the MLN's formulas, a set of vertices (ground predicates) for the logic network is generated by replacing the variables in each formula with the extracted constants. If two vertices (ground predicates) contribute to the grounding of the same formula, an edge is established between them. Therefore each fully or partially grounded formula \( F_{i} \) is represented graphically as a clique in the Markov network. As shown in Fig. 2, computer vision methods are used to extract objects, motion (segments with direction), and temporal data from video and to create the corresponding predicates.
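A minimal sketch of this instantiation procedure is given below, assuming formulas are represented as lists of predicate templates whose arguments are all variables, and using brute-force substitution; the representation and names are illustrative, not the paper's implementation.

```python
from itertools import product
from typing import Dict, List, Tuple

# A formula is a list of predicate templates, e.g.
#   [("move", "o", "s", "t"), ("D2", "s")]
# where each argument is a variable whose leading letter names its type (o, s, t, ...).
Formula = List[Tuple[str, ...]]

def ground_formula(formula: Formula, constants: Dict[str, List[str]]):
    """Yield all groundings of a formula by substituting constants for variables."""
    variables = sorted({a for pred in formula for a in pred[1:]})
    domains = [constants[v[0]] for v in variables]   # variable's leading letter picks its type
    for values in product(*domains):
        binding = dict(zip(variables, values))
        yield [(pred[0],) + tuple(binding[a] for a in pred[1:]) for pred in formula]

def build_ground_network(formulas: List[Formula], constants: Dict[str, List[str]]):
    """Vertices are ground atoms; each grounded formula becomes a clique (an edge
    between every pair of atoms appearing in the same grounding)."""
    vertices, edges = set(), set()
    for formula in formulas:
        for grounding in ground_formula(formula, constants):
            vertices.update(grounding)
            for a, b in product(grounding, grounding):
                if a < b:
                    edges.add((a, b))
    return vertices, edges
```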

Stated more declaratively, an MLN \( L \) consists of \( \left( {F_{i} ,w_{i} } \right) \) pairs, where \( F_{i} \) is a formula in first-order logic and \( w_{i} \) is a real number interpreted as the weight of \( F_{i} \). While not a precise relation, a weight \( w_{i} \) may be interpreted as indicating that a world in which \( F_{i} \) has a true grounding is \( e^{{w_{i} }} \) times more likely than one in which it does not, all else being equal. The number of true groundings of \( F_{i} \) in state \( x \), \( n_{i} \left( {x,y} \right) \), is used as the feature function \( f_{i} (x) \) in the formulae above. Any MLN defines a Markov network \( M_{L,C} \), and thus finding the most probable state in the MLN can be expressed in Markov network terms as

$$ \mathop {\arg \hbox{max} }\limits_{x} P\left( {x\left| y \right.} \right) = \mathop {\arg \hbox{max} }\limits_{x} \sum\limits_{i} {w_{i} } n_{i} \left( {x,y} \right) $$
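To make the MAP formulation concrete, the following brute-force sketch enumerates truth assignments over the unknown ground atoms and picks the one maximizing the weighted count of true groundings. It is feasible only for tiny networks; in practice a package such as Alchemy performs this inference, and the representation of formulas as counting callables is our own simplification.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

GroundAtom = Tuple[str, ...]          # e.g. ("LeftSideKick", "T3")
World = Dict[GroundAtom, bool]        # truth value of every ground atom

def map_inference(unknown: List[GroundAtom],
                  evidence: World,
                  weighted_formulas: List[Tuple[float, Callable[[World], int]]]) -> World:
    """argmax_x sum_i w_i * n_i(x, y): each weighted formula returns its number of
    true groundings in a candidate world (brute force over 2^|unknown| worlds)."""
    best_world, best_score = None, float("-inf")
    for values in product([False, True], repeat=len(unknown)):
        world = dict(evidence)
        world.update(zip(unknown, values))
        score = sum(w * n_true(world) for w, n_true in weighted_formulas)
        if score > best_score:
            best_world, best_score = world, score
    return best_world
```

Here each counting callable plays the role of \( n_{i} \left( {x,y} \right) \) in the formula above.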

3 Experiments and Results

We conducted experiments in two domains (human action recognition and human interaction recognition) to demonstrate the principle of the proposed architecture. Naturally, we could not develop an entire system tailored to a domain such as security in a public place, so we took shortcuts: (1) for human action recognition, to simulate the computer vision aspect we used a Kinect® sensor, which identified the objects and left us to extract the tracks and intervals and to segment the motion; for the Weizmann dataset we semi-automatically extracted trajectories of human body parts by first manually marking the body parts (head, hands and feet) to track and then running the TLD tracker to generate their trajectories; (2) for human interaction recognition a synthetic dataset was generated as in [2]; (3) to accomplish the inference we used the general Alchemy [11] package.

3.1 Motion Segmentation and Action Interval Generation

The MSR (Microsoft Research) action dataset provides 3-dimensional trajectories capturing motion in real-world settings. We simplified the data by using only the \( (x,y) \) coordinates of the trajectories and four quantized directions in 2D. Trajectories are down-sampled so that motion direction changes can be calculated from only two neighboring points along the motion trajectory. Trajectories are then decomposed into segments such that within every segment the motion direction does not change abruptly. We fit every segment to a circle: when the radius is large enough the segment can be fitted by a line; when the radius is small the segment is actually a curve or even a full circle, and it is segmented further into several sub-segments. The angle subtended at the center of the circle by the start and end points of the segment determines the number of sub-segments (the maximal subtended angle per sub-segment is 45°). Finally, we fit a line to each segment and calculate the motion direction associated with it.
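A sketch of this circle-fit heuristic is shown below, using a simple algebraic least-squares (Kasa) circle fit; the straight-line radius threshold is an illustrative assumption, not a value from the paper.

```python
import math
import numpy as np
from typing import List, Tuple

Point = Tuple[float, float]

def fit_circle(points: List[Point]) -> Tuple[float, float, float]:
    """Algebraic (Kasa) least-squares circle fit; returns (cx, cy, radius)."""
    pts = np.asarray(points, dtype=float)
    A = np.column_stack([2 * pts[:, 0], 2 * pts[:, 1], np.ones(len(pts))])
    b = (pts ** 2).sum(axis=1)
    (cx, cy, c), *_ = np.linalg.lstsq(A, b, rcond=None)
    return cx, cy, math.sqrt(c + cx ** 2 + cy ** 2)

def split_segment(points: List[Point],
                  straight_radius: float = 1e3,
                  max_angle: float = math.radians(45)) -> List[List[Point]]:
    """Treat a segment with a large fitted radius as a straight line; otherwise
    split the arc so that each sub-segment subtends at most max_angle at the center."""
    cx, cy, r = fit_circle(points)
    if r > straight_radius:
        return [points]                              # effectively a line
    a0 = math.atan2(points[0][1] - cy, points[0][0] - cx)
    a1 = math.atan2(points[-1][1] - cy, points[-1][0] - cx)
    n_sub = max(1, math.ceil(abs(a1 - a0) / max_angle))
    step = math.ceil(len(points) / n_sub)
    return [points[i:i + step + 1] for i in range(0, len(points) - 1, step)]
```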

Every point of the trajectory has a fourth dimension, t, indicating the time stamp at which it was captured, which makes it possible to generate action interval candidates. The start and end time stamps of segments are considered as moments: start moments and end moments. An action interval is defined as [\( moment_{start} \), \( moment_{end} \)], where \( moment_{start} \) is one of the start moments and \( moment_{end} \) is one of the end moments. For each action video clip, a set of action interval candidates is generated by choosing \( moment_{start} \) and \( moment_{end} \) among the moments of the segments. A subset of action interval candidates is then obtained by constraining the interval length to lie within a specific range \( \left[ {duration_{\hbox{min} } ,duration_{\hbox{max} } } \right] \) (in units of \( 1/fps \)), as in Fig. 3. Similarly, for human interaction, the trajectories of synthetic agents mimicking human behavior in a virtual environment are decomposed into segments at abrupt changes of motion direction and speed; when agents stop at one location, the corresponding segments are treated as locations of interest instead of movement segments.
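A sketch of the candidate generation follows; the duration bounds are placeholders to be set per application.

```python
from typing import List, Tuple

Interval = Tuple[float, float]   # (start moment, end moment)

def action_interval_candidates(segment_intervals: List[Interval],
                               duration_min: float,
                               duration_max: float) -> List[Interval]:
    """Combine every segment start moment with every later segment end moment,
    keeping only candidates whose duration falls within the allowed range."""
    start_moments = [s for s, _ in segment_intervals]
    end_moments = [e for _, e in segment_intervals]
    candidates = []
    for s in start_moments:
        for e in end_moments:
            if s < e and duration_min <= e - s <= duration_max:
                candidates.append((s, e))
    return sorted(set(candidates))
```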

Fig. 3. Action interval generation

As shown in Fig. 3, there are five motion segments \( \sigma_{1} \; - \;\sigma_{5} \), whose start and end moments constitute the moments \( m_{1} \; - \;m_{8} \). Among them the start moments are \( m_{1} \), \( m_{2} \), \( m_{3} \), \( m_{5} \) and \( m_{6} \); the rest are end moments. With the interval length constrained, the final action interval candidates are \( \tau_{1} \; - \;\tau_{9} \).

3.2 Results and Analysis

For the Microsoft human action dataset the motions targeted were sidekick, drawing an X with hand motions, and clapping; 76 videos were used. We chose a subset of the Weizmann dataset with two people performing 7 natural actions: bend, jack, jump, walk, wave one hand, wave two hands and skip. For human interaction recognition the interactions targeted were Meet, ChangeDirection, ChangeDirectionMeet, MeetWalkTogether and MeetWalkSeparate; ten synthetic events involving four agents were used. Rules were provided manually for each motion, even though they could be learned from training examples; please refer to Appendix A for the rules defined for the actions LeftSideKick, DrawX, HandClapping, Meet, ChangeDirectionMeet, WalkTogether and WalkSeparate. The results for the Microsoft action dataset are shown in Table 1. Some videos contained only a single action, while others contained more than one instance of the same action.

Table 1. Experiments based on a skeletal version of the narrow-angle view architecture for Human Action Recognition

The results for the Weizmann dataset and for human interactional activities are shown in Tables 2 and 3, respectively.

Table 2. Experiments for Weizmann dataset
Table 3. Experiments based on a skeletal version of the narrow-angle view architecture for Human Interaction Recognition

4 Conclusion

A reasoning framework that combines first-order logic with Markov logic networks is presented in order to recognize both simple and complex activities. Semantic-free predicates are defined so that low-level events (LLEs) and high-level events (HLEs) of interest across different domains can be described by encoding those LLEs and HLEs, together with temporal logic (Allen's interval logic), in a first-order-logic representation.

The main contribution is the logic reasoning framework together with a new set of context-free LLEs that can be utilized across different domains. Currently, human action datasets from MSR and Weizmann and a synthetic human interaction dataset are used for experiments, and the results demonstrate the effectiveness of our approach. In the future we will apply the proposed framework to more domains, such as intelligent traffic surveillance, and design a real-time mechanism for human activity recognition across different domains.