Abstract
Human activity recognition (HAR) deals with the recognition of activities or interactions involving humans in a video. Entities occurring in a video frame can be abstracted in a variety of ways, ranging from the detailed silhouette of the entity to the very basic axis-aligned minimum bounding rectangle (MBR). At one end of the spectrum, a detailed silhouette is not only demanding in terms of storage and computational resources but is also highly susceptible to noise. At the other end of the spectrum, MBRs require less storage and computation, and abstract away noise and video-specific details. However, for the abstraction of human bodies in a video, an MBR alone is inadequate: in addition to abstracting away noise, it also abstracts out important details such as the posture of the human body. For a more precise description that offers a reasonable trade-off between efficiency and noise elimination, a human body can be abstracted as a set of MBRs corresponding to different body parts. However, when activities are represented as relations between interacting objects, the simplistic approximation of treating each MBR as an independent entity leads to the computation of redundant relations. In this paper, we explore a representation schema for interactions between entities regarded as sets of rectangles, also referred to as extended objects. We further show that, given this representation schema, a simple recursive algorithm can be used to opportunistically extract topological, directional and distance information in O(n log n) time. We evaluate our representation schema for HAR on the Mind’s Eye dataset (http://www.visint.org), the UT-Interaction dataset (Ryoo and Aggarwal 2010) and the SBU Kinect Interaction dataset (Yun et al. 2012).
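To make the idea concrete, the following minimal sketch (not the paper's algorithm) represents a human body as an extended object, i.e. a set of axis-aligned rectangles, one per body part, and derives a coarse topological relation between two such objects. The function names, the `Rect` encoding and the three-way relation labels are illustrative assumptions; the paper's actual schema extracts richer topological, directional and distance relations.

```python
# Illustrative sketch (assumed names, not the authors' implementation):
# an "extended object" is a list of axis-aligned rectangles (body parts).
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)

def mbr(rects: List[Rect]) -> Rect:
    """Axis-aligned minimum bounding rectangle enclosing a set of rectangles."""
    xs_min, ys_min, xs_max, ys_max = zip(*rects)
    return (min(xs_min), min(ys_min), max(xs_max), max(ys_max))

def overlaps(a: Rect, b: Rect) -> bool:
    """True iff the interiors of two axis-aligned rectangles intersect."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def relation(obj_a: List[Rect], obj_b: List[Rect]) -> str:
    """Coarse topological relation between two extended objects.

    Testing the enclosing MBRs first lets us skip the part-by-part
    comparison whenever the two objects are clearly apart, which is the
    kind of redundant-relation pruning the abstraction is meant to enable.
    """
    if not overlaps(mbr(obj_a), mbr(obj_b)):
        return "disjoint"
    if any(overlaps(ra, rb) for ra in obj_a for rb in obj_b):
        return "overlapping"
    return "close"  # enclosing MBRs overlap, but no individual parts do

person_a = [(0, 0, 2, 6), (0, 6, 2, 8)]   # e.g. torso+legs, head
person_b = [(5, 0, 7, 6), (5, 6, 7, 8)]
print(relation(person_a, person_b))       # -> disjoint
```

Treating each body part independently would require comparing every pair of rectangles across the two objects; the enclosing-MBR test above shows how the extended-object view avoids that work in the common case.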
Notes
The predicate sphere(x) defines a region x to be a sphere.
A depiction of the states and their correspondence with the RA relations can be found at www.comp.leeds.ac.uk/qsr/cores.
x, y, z, w are the indices of the cores, such that a_i ∈ σ_xy and b_j ∈ σ_zw.
We use I-frames obtained using the tool ffmpeg as keyframes, www.ffmpeg.org.
References
Aggarwal J, Ryoo M (2011) Human activity analysis: a review. ACM Comput Surv 43(3):16:1–16:43
Bittner T, Donnelly M (2007) A formal theory of qualitative size and distance relations between regions. In: Proceedings of the 21st annual workshop on qualitative reasoning (QR07)
Clementini E, Felice PD, Califano G (1995) Composite regions in topological queries. Inf Syst 20(7):579–594
Cohn AG, Hazarika SM (2001) Qualitative spatial representation and reasoning: an overview. Fundam Inform 46(1–2):1–29
Cohn AG, Magee DR, Galata A, Hogg D, Hazarika SM (2003) Towards an architecture for cognitive vision using qualitative spatio-temporal representations and abduction Spatial cognition, pp 232–248
Cohn AG, Renz J, Sridhar M (2012) Thinking inside the box: a comprehensive spatial representation for video analysis. In: Proceedings of the 13th international conference on principles of knowledge representation and reasoning (KR2012). AAAI Press, pp 588–592
Dubba KSR, Bhatt M, Dylla F, Hogg DC, Cohn AG (2012) Interleaved inductive-abductive reasoning for learning complex event models. In: ILP. Lecture notes in computer science. Springer, vol 7207, pp 113–129
Egenhofer MJ, Clementini E, Felice PD (1994) Topological relations between regions with holes. Int J Geogr Inf Syst 8:129–142
Falomir Z, Jiménez-Ruiz E, Museros L, Escrig MT (2009) An ontology for qualitative description of images. In: Spatial and temporal reasoning for ambient intelligence systems. Springer
Kalita S, Karmakar A, Hazarika SM (2016) Comprehensive representation and efficient extraction of spatial information for human activity recognition from video data. In: Proceedings of international conference on computer vision and image processing (CVIP2016). Springer
Kusumam K (2012) Relational learning using body parts for human activity recognition in videos. Master’s thesis, University of Leeds
Park S, Park J, Al-masni M, Al-antari M, Uddin M, Kim TS (2016) A depth camera-based human activity recognition via deep learning recurrent neural network for health and social care services. Procedia Comput Sci 100:78–84
Randell DA, Cui Z, Cohn A (1992) A spatial logic based on regions and connection. In: Nebel B, Rich C, Swartout W (eds) Proceedings of the 3rd international conference on principles of knowledge representation and reasoning. KR’92. Morgan Kaufmann, pp 165–176
Ryoo MS, Aggarwal JK (2010) UT-interaction dataset, ICPR contest on semantic description of human activities (SDHA). http://cvrc.ece.utexas.edu/SDHA2010/Human_Interaction.html
Schneider M, Behr T (2006) Topological relationships between complex spatial objects. ACM Trans Database Syst 31(1):39–81
Skiadopoulos S, Koubarakis M (2005) On the consistency of cardinal directions constraints. Artif Intell 163:91–135
Sokeh HS, Gould S, Renz J (2013) Efficient extraction and representation of spatial information from video data. In: Proceedings of the 23rd international joint conference on artificial intelligence (IJCAI’13), pp 1076–1082. AAAI press/IJCAI
Sridhar M, Cohn AG, Hogg DC (2011) Benchmarking qualitative spatial calculi for video activity analysis. In: Proceedings of the IJCAI workshop benchmarks and applications of spatial reasoning, pp 15–20
Yun K, Honorio J, Chattopadhyay D, Berg TL, Samaras D (2012) Two-person interaction detection using body-pose features and multiple instance learning. In: 2012 IEEE computer society conference on computer vision and pattern recognition workshops (CVPRW). IEEE
Zelnik-Manor L, Irani M (2001) Event-based analysis of video. In: 2001 IEEE conference on computer vision and pattern recognition (CVPR 2001). IEEE Computer Society, pp 123–130
Zhang Y, Liu X, Chang MC, Ge W, Chen T (2012) Spatio-temporal phrases for activity recognition. Springer, Berlin Heidelberg, pp 707–721
Zhao Y, Holtzen S, Gao T, Zhu SC (2015) Represent and infer human theory of mind for human-robot interaction. In: 2015 AAAI fall symposium series
Zhu W, Lan C, Xing J, Zeng W, Li Y, Shen L, Xie X (2016) Co-occurrence feature learning for skeleton based action recognition using regularized deep LSTM networks. In: Proceedings of the 30th AAAI conference on artificial intelligence, 2016, pp 3697–3704
Cite this article
Kalita, S., Karmakar, A. & Hazarika, S.M. Efficient extraction of spatial relations for extended objects vis-à-vis human activity recognition in video. Appl Intell 48, 204–219 (2018). https://doi.org/10.1007/s10489-017-0970-8