Recognizing Interactive Group Activities Using Temporal Interaction Matrices and Their Riemannian Statistics

International Journal of Computer Vision

Abstract

While video-based activity analysis and recognition has received much attention, a large body of existing work deals with activities of a single subject. The main objective of this paper is the modeling and recognition of coordinated multi-subject activities, or group activities, which arise in a variety of applications such as surveillance, sports, and biological monitoring records. Unlike earlier attempts which model the complex spatio-temporal constraints among multiple subjects with a parametric Bayesian network, we propose a compact and discriminative descriptor referred to as the Temporal Interaction Matrix for representing a coordinated group motion pattern. Moreover, we characterize the space of the Temporal Interaction Matrices using the Discriminative Temporal Interaction Manifold (DTIM), and use it as a framework within which we develop a data-driven strategy to characterize the group motion pattern without employing specific domain knowledge. In particular, we establish probability densities on the DTIM for compactly describing the statistical properties of the coordinations and interactions among multiple subjects in a group activity. For each class of group activity, we learn a multi-modal density function on the DTIM. A Maximum a Posteriori (MAP) classifier on the manifold is then designed for recognizing new activities. In addition, we extend this model so that it can explicitly distinguish participants from non-participants. We demonstrate how the framework can be applied to motions represented by point trajectories as well as articulated human actions represented by images. Experiments on both cases show the effectiveness of the proposed approach.

References

  • Aggarwal, J. K., & Ryoo, M. S. (2011). Human activity analysis: a review. ACM Computing Surveys, 43(3).

  • Amari, S., & Nagaoka, H. (2000). Methods of information geometry. London: Oxford University Press.

  • Amer, M., & Todorovic, S. (2011). A chains model for localizing group activities in videos. In IEEE international conference on computer vision, Barcelona, Spain.

  • Choi, W., Shahid, K., & Savarese, S. (2009). What are they doing?: Collective activity classification using spatio-temporal relationship among people. In 9th international workshop on visual surveillance, Kyoto, Japan.

  • Choi, W., Shahid, K., & Savarese, S. (2011). Learning context for collective activity recognition. In IEEE conference on computer vision and pattern recognition, Colorado Springs, CO.

  • Cutler, R., & Davis, L. (2000). Robust real-time periodic motion detection, analysis, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 781–796.

  • Dollar, P., Rabaud, V., Cottrell, G., & Belongie, S. (2005). Behavior recognition via sparse spatio-temporal features. In Joint IEEE international workshop on visual surveillance and performance evaluation of tracking and surveillance, Beijing, China.

  • Dryden, I. L., & Mardia, K. V. (1998). Statistical shape analysis. New York: Wiley.

  • Felzenszwalb, P., Girshick, R., McAllester, D., & Ramanan, D. (2010). Object detection with discriminatively trained part based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9), 1627–1645.

  • Gong, S., & Xiang, T. (2003). Recognition of group activities using dynamic probabilistic networks. In IEEE international conference on computer vision, Nice, France.

  • Grant, M., & Boyd, S. (2011). CVX: Matlab software for disciplined convex programming, version 1.21. http://cvxr.com/cvx.

  • Hakeem, A., & Shah, M. (2007). Learning, detection and representation of multi-agent events in videos. Artificial Intelligence, 171, 586–605.

  • Hongeng, S., & Nevatia, R. (2001). Multi-agent event recognition. In IEEE international conference on computer vision, Vancouver, BC.

  • Hoogs, A., Bush, S., Brooksby, G., Perera, A., Dausch, M., & Krahnstoever, N. (2008). Detecting semantic group activities using relational clustering. In IEEE workshop on motion and video computing, Copper Mountain, CO.

  • Huang, C., Shih, H., & Chao, C. (2006). Semantic analysis of soccer video using dynamic bayesian network. IEEE Transactions on Multimedia, 8(4), 749–760.

  • Intille, S., & Bobick, A. (2001). Recognizing planned, multiperson action. Computer Vision and Image Understanding, 81, 414–445.

  • Ivanov, Y., & Bobick, A. (2000). Recognition of visual activities and interactions by stochastic parsing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 852–872.

  • Joo, S., & Chellappa, R. (2007). A multiple-hypothesis approach for multiobject visual tracking. IEEE Transactions on Image Processing, 16(11), 2849–2854.

  • Junejo, I. N., Dexter, E., Laptev, I., & Perez, P. (2011). View independent action recognition from temporal self-similarities. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33, 172–185.

  • Kass, R., & Vos, P. (1997). Geometric foundations of asymptotic inference. New York: Wiley.

  • Khan, S. M., & Shah, M. (2005). Detecting group activities using rigidity of formation. In ACM multimedia, Singapore.

  • Kim, K., Lee, D., & Essa, I. (2012). Detecting regions of interest in dynamic scenes with camera motions. In IEEE conference on computer vision and pattern recognition, Providence, RI.

  • Kim, M., & Pavlovic, V. (2006). Discriminative learning of mixture of bayesian network classifiers for sequence classification. In IEEE conference on computer vision and pattern recognition, New York, NY.

  • Klaser, A., Marszalek, M., & Schmid, C. (2008). A spatio-temporal descriptor based on 3d gradients. In British machine vision conference, Leeds, UK.

  • Kuhn, H. W. (1955). The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2, 83–97.

  • Lan, T., Wang, Y., Yang, W., & Mori, G. (2010). Beyond actions: discriminative models for contextual group activities. In Neural information processing systems, Vancouver, BC.

  • Laptev, I. (2005). On space-time interest points. International Journal of Computer Vision, 64, 107–123.

  • Lazarescu, M., & Venkatesh, S. (2003). Using camera motion to identify different types of American football plays. In IEEE international conference on multimedia and expo, Baltimore, MD (pp. 181–184).

  • Li, R., Chellappa, R., & Zhou, S. (2009). Learning multi-modal densities on discriminative temporal interaction manifold for group activity recognition. In IEEE conference on computer vision and pattern recognition, Miami, FL.

  • libSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/ (2012).

  • Liu, T., Ma, W., & Zhang, H. (2005). Effective feature extraction for play detection in American football video. In Multimedia modeling, Melbourne, Australia.

  • Liu, X., & Chua, C. (2006). Multi-agent activity recognition using observation decomposed hidden Markov models. Image and Vision Computing, 24(2), 166–175.

  • Ma, X., Bashir, F., Khokhar, A., & Schonfeld, D. (2009). Event analysis based on multiple interactive motion trajectories. IEEE Transactions on Circuits and Systems for Video Technology, 19, 397–406.

  • Moeslund, T. B., Hilton, A., & Kruger, V. (2006). A survey of advances in vision-based human motion capture and analysis. Computer Vision and Image Understanding, 104, 90–126.

  • Morariu, V., & Davis, L. (2011). Multi-agent event recognition in structured scenarios. In IEEE conference on computer vision and pattern recognition, Colorado Springs, CO.

  • Ni, B., Yan, S., & Kassim, A. (2009). Recognizing human group activities by localized causalities. In IEEE conference on computer vision and pattern recognition, Miami, FL.

  • Pennec, X. (2006). Intrinsic statistics on riemannian manifolds: basic tools for geometric measurements. Journal of Mathematical Imaging and Vision, 25(1), 127–154.

  • Perse, M., Kristan, M., Kovacic, S., Vuckovic, G., & Pers, J. (2009). A trajectory-based analysis of coordinated team activity in a basketball game. Computer Vision and Image Understanding, 113(5), 612–621.

  • Poppe, R. (2010). A survey on vision-based human action recognition. Image and Vision Computing, 28(6), 976–990.

  • Rosset, S., & Segal, E. (2002). Boosting density estimation. In Neural information processing systems, Vancouver, BC.

  • Ryoo, M. S., & Aggarwal, J. K. (2009). Spatio-temporal relationship match: video structure comparison for recognition of complex human activities. In IEEE international conference on computer vision, Kyoto, Japan.

  • Ryoo, M. S., & Aggarwal, J. K. (2011). Stochastic representation and recognition of high-level group activities. International Journal of Computer Vision, 93, 183–200.

  • Scovanner, P., Ali, S., & Shah, M. (2007). A 3-dimensional sift descriptor and its application to action recognition. In ACM multimedia, Augsburg, Germany.

  • Srivastava, A., Jermyn, I., & Joshi, S. (2007). Riemannian analysis of probability density functions with applications in vision. In IEEE conference on computer vision and pattern recognition, Minneapolis, MN.

  • Swears, E., & Hoogs, A. (2009). Learning and recognizing American football plays. In Snowbird learning workshop, Snowbird, UT.

  • Vaswani, N., Roy-Chowdhury, A., & Chellappa, R. (2005). Shape activity: a continuous-state HMM for moving/deforming shapes with application to abnormal activity detection. IEEE Transactions on Image Processing, 14, 1603–1616.

  • Veeraraghavan, A., Chellappa, R., & Srinivasan, M. (2008). Shape and behavior encoded tracking of bee dances. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(3), 463–476.

  • Yilmaz, A., Javed, O., & Shah, M. (2006). Object tracking: a survey. ACM Computing Surveys, 38(4), 1–45.

  • Zhang, D., Gatica-Perez, D., Bengio, S., & McCowan, I. (2006). Modeling individual and group actions in meetings with layered HMMs. IEEE Transactions on Multimedia, 8, 509–520.

  • Zhou, Y., Yan, S., & Huang, T. S. (2008). Pair-activity classification by bi-trajectories analysis. In IEEE conference on computer vision and pattern recognition, Anchorage, AK.

Acknowledgements

Li and Chellappa were supported by the DARPA VIRAT program and a MURI program from the Office of Naval Research under the grant N00014-10-1-0934.

Author information

Corresponding author

Correspondence to Ruonan Li.

Appendix

7.1 Derivation of (37)

By elementary calculus it is obvious that

(44)

Since \(\epsilon f\) is a local deviation from \(P_{J}\) in Taylor’s expansion, here we may assume \(\epsilon f\ll P_{J}\). Hence we ignore the terms of \(\epsilon f\) and have the approximation

(45)

Note that the best ϵ is determined after f is learned.

7.2 Derivation of the Expansion of (40) in Sect. 4

Consider the exemplar set \(\{z_{1},z_{2},\ldots,z_{M}\}\), where the tuple \(z_{i}\triangleq(X_{i}, Y_{i}, \bar{W}_{i}, \bar{X}_{i}, c_{i})\). With these exemplars and assuming uniform prior probabilities for the classes, we have

(46)

where the last equality comes from the fact that, given an exemplar z (and hence Y), the matching W between it and the testing interaction \(\bar{Y}\) no longer depends on the class label c. Furthermore, note that by the definition of z, its component \(\bar{W}\) already encodes the matching between \(\bar{Y}\) and Y. Therefore, we assume that the optimal matching W depends only on \(\bar{W}\) given z, and consequently that W and \(\bar{Y}\) are conditionally independent given z. As a result, we have

$$ P(\bar{Y},W|z)=P(W|z)P(\bar{Y}|z)=P(W|\bar{W})P(\bar{Y}|z). $$
(47)

At the same time, note that given matching \(\bar{W}\) in z, P subjects can be selected from \(\bar{Y}\) which are then reduced to a P×P Temporal Interaction Matrix \(\bar{X}\). As a result, we have

$$ P(\bar{Y}|z)=P(\bar{X}|z)=P(\bar{X}|X). $$
(48)

Now, the first integrand on the rightmost side of (46) can be written as

$$ P(\bar{Y},W|z)=P(W|\bar{W})P(\bar{X}|X), $$
(49)

and the second integrand can also be straightforwardly simplified as

$$ P(z|c)=P(X|c). $$
(50)

Eventually, (46) can be expressed as

(51)

and the objective (40) simply becomes

(52)

7.3 Solution to (42) or (43)

A unified representation for the optimization problems (42) and (43) can be written as

(53)

where w is the PQ×1 vectorization of the matrix W obtained by stacking the columns of W. The constant vectors and matrices c, H, E, e, F, f encode the remaining coefficients of the objective and constraints in the original optimization problem (42) or (43).
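The column-stacking convention can be illustrated with a minimal NumPy sketch. The sizes P and Q, the example matrix W, and the row-sum and column-sum operators are hypothetical choices made only for illustration; the actual constraints of (42) and (43) may differ.

```python
import numpy as np

# Hypothetical sizes and matching matrix, for illustration only.
P, Q = 2, 3
W = np.array([[1, 0, 0],
              [0, 0, 1]])  # P x Q binary matrix

# Column stacking: order="F" reads W column by column.
w = W.reshape(-1, order="F")  # length PQ vector

# Entry W[i, j] lands at position j * P + i of w.
index_ok = all(w[j * P + i] == W[i, j] for i in range(P) for j in range(Q))

# Linear functions of W become matrix-vector products with w.
# These two operators are assumed examples of what a constraint matrix could encode.
row_sum_op = np.kron(np.ones((1, Q)), np.eye(P))  # P x PQ, gives row sums of W
col_sum_op = np.kron(np.eye(Q), np.ones((1, P)))  # Q x PQ, gives column sums of W

row_sums = row_sum_op @ w
col_sums = col_sum_op @ w

# The matrix is recovered by undoing the column stacking.
W_back = w.reshape(P, Q, order="F")
```

Under this convention, any linear equality or inequality on the entries of W can be written as a matrix acting on w, which is the role the text assigns to the constant matrices and vectors.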

Note that although the objective is quadratic in w, it is not necessarily convex or concave, and the elements of w may only take the values 0 or 1. Instead of tackling it directly, we solve the following optimization problem:

(54)

where \(\hat{\mathbf{c}}=[\sigma_{1},\sigma_{2},\ldots,\sigma_{PQ}]^{T}\), \(\hat{H}=\operatorname{diag}\{-\sigma_{1},-\sigma_{2},\ldots,-\sigma_{PQ}\}\), and \(\sigma_{i}\) is a sufficiently large number satisfying \(\sigma_{i}>\sum^{PQ}_{j=1,j\neq i}|H_{ij}|+H_{ii}\).

Note that \(\hat{H}\) defined in this way gives \(\hat{H}+H\) a negative, strictly dominant diagonal, so the quadratic term \(\hat{H}+H\) is strictly negative definite. Therefore, (54) is a concave programming problem over the convex unit hypercube \([0,1]^{P\times Q}\) and attains its minimum at one of the feasible vertices satisfying the linear equality and inequality constraints. These feasible vertices are exactly the feasible solutions of (53), and at these vertices the objective values of (54) equal those of (53), because the terms contributed by \(\hat{\mathbf{c}}\) and \(\hat{H}\) cancel whenever every element of w is 0 or 1. It follows that by solving (54), which can be handled much more efficiently, we obtain the exact solution of the original problem (53). To solve (54), we simply employ the optimization software CVX (Grant and Boyd 2011).
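The two properties used in this argument can be checked numerically. The sketch below assumes, as a working hypothesis, that the objective of (53) has the form c^T w + w^T H w with symmetric H and that (54) replaces c by c + ĉ and H by H + Ĥ. The problem size, the random data, and the margin used for σ are arbitrary choices of this sketch, and the linear constraints are omitted.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 4  # hypothetical value of PQ, kept small so all vertices can be enumerated

# Random symmetric H (generally indefinite) and linear term c; assumed data.
A = rng.normal(size=(n, n))
H = (A + A.T) / 2
c = rng.normal(size=n)

# sigma_i > sum_{j != i} |H_ij| + H_ii; the margin 1.0 is an arbitrary choice.
off_diag_abs = np.abs(H).sum(axis=1) - np.abs(np.diag(H))
sigma = off_diag_abs + np.diag(H) + 1.0

c_hat = sigma
H_hat = np.diag(-sigma)

def original_objective(w):
    # Assumed form of the objective in (53).
    return c @ w + w @ H @ w

def modified_objective(w):
    # Assumed form of the objective in (54).
    return (c + c_hat) @ w + w @ (H + H_hat) @ w

# Negative definiteness: the largest eigenvalue of H + H_hat should be negative.
max_eig = np.linalg.eigvalsh(H + H_hat).max()

# Cancellation: at every 0/1 vertex, w_i**2 == w_i, so the sigma terms vanish.
vertices = [np.array(v, dtype=float) for v in itertools.product([0, 1], repeat=n)]
max_gap = max(abs(original_objective(v) - modified_objective(v)) for v in vertices)
```

Here `max_eig` being negative reflects the strictly dominant negative diagonal, and `max_gap` being zero up to rounding reflects that the added terms sum to σ_i (w_i − w_i²) = 0 at binary points.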

About this article

Cite this article

Li, R., Chellappa, R. & Zhou, S.K. Recognizing Interactive Group Activities Using Temporal Interaction Matrices and Their Riemannian Statistics. Int J Comput Vis 101, 305–328 (2013). https://doi.org/10.1007/s11263-012-0573-0

  • DOI: https://doi.org/10.1007/s11263-012-0573-0
