Abstract
Person-person mutual action recognition (also referred to as interaction recognition) is an important research branch of human activity analysis. Early solutions relied on carefully designed local points and hand-crafted features, and later work progressed to deep learning architectures such as CNNs and LSTMs. These solutions often rely on complicated architectures and mechanisms that embed the relationships between the two persons in the architecture itself, to ensure the interaction patterns can be properly learned. Our contribution is a simpler yet powerful architecture, named Interaction Relational Network, which uses minimal prior knowledge about the structure of the data. We drive the network to learn how to relate the body parts of the interacting persons, in order to better discriminate among the possible interactions. By breaking down the body parts across the frames into sets of independent joints, and with a few augmentations to our architecture that explicitly extract meaningful extra information from each pair of joints, our solution achieves state-of-the-art performance on the traditional interaction recognition dataset SBU, and also on the mutual actions from the large-scale dataset NTU RGB+D.
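To make the idea of relating pairs of joints concrete, the following is a minimal sketch of a pairwise relational forward pass. It is not the architecture from the paper: the function names, the layer sizes, the choice to pair only joints of one person with joints of the other, the sum pooling, and the use of random untrained weights are all assumptions made for illustration. Each joint is assumed to be represented by its 3D coordinates concatenated over a fixed number of frames.

```python
import numpy as np

def init_mlp(rng, sizes):
    """Random (untrained) weights for layer sizes like [in, hidden, out]."""
    return [(rng.standard_normal((m, n)) * 0.1, np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(x, weights, final_relu=True):
    """Apply a small MLP with ReLU; optionally leave the last layer linear."""
    for i, (W, b) in enumerate(weights):
        x = x @ W + b
        if final_relu or i < len(weights) - 1:
            x = np.maximum(x, 0.0)
    return x

def relational_scores(person_a, person_b, g_weights, f_weights):
    """person_a, person_b: (num_joints, feat_dim) arrays, one row per joint.

    Hypothetical helper: every joint of person A is paired with every joint
    of person B, a shared MLP `g` embeds each pair, the embeddings are summed
    (an order-invariant pooling), and a second MLP `f` maps the pooled vector
    to one score per interaction class.
    """
    n_a, n_b = len(person_a), len(person_b)
    # All (joint_a, joint_b) pairs: shape (n_a * n_b, 2 * feat_dim).
    pairs = np.concatenate(
        [np.repeat(person_a, n_b, axis=0), np.tile(person_b, (n_a, 1))],
        axis=1,
    )
    pooled = mlp(pairs, g_weights).sum(axis=0)
    return mlp(pooled, f_weights, final_relu=False)

# Illustrative sizes only (assumptions, not values taken from the paper).
rng = np.random.default_rng(0)
num_joints, num_frames, num_classes = 15, 10, 8
feat_dim = 3 * num_frames  # x, y, z per frame for a single joint

person_a = rng.standard_normal((num_joints, feat_dim))
person_b = rng.standard_normal((num_joints, feat_dim))
g_weights = init_mlp(rng, [2 * feat_dim, 64, 64])
f_weights = init_mlp(rng, [64, 32, num_classes])

scores = relational_scores(person_a, person_b, g_weights, f_weights)
print(scores.shape)  # (8,)
```

Because the pair embeddings are summed, the output does not depend on the order in which joints are listed, which is one way a model can treat joints as an unordered set rather than relying on a fixed skeleton structure.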
A. C. Kot—This research was carried out at the Rapid-Rich Object Search (ROSE) Lab at the Nanyang Technological University, Singapore. The ROSE Lab is supported by the National Research Foundation, Singapore, and the Infocomm Media Development Authority, Singapore.
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Perez, M., Liu, J., Kot, A.C. (2020). Interaction Recognition Through Body Parts Relation Reasoning. In: Palaiahnakote, S., Sanniti di Baja, G., Wang, L., Yan, W. (eds) Pattern Recognition. ACPR 2019. Lecture Notes in Computer Science, vol 12046. Springer, Cham. https://doi.org/10.1007/978-3-030-41404-7_19
DOI: https://doi.org/10.1007/978-3-030-41404-7_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-41403-0
Online ISBN: 978-3-030-41404-7
eBook Packages: Computer Science, Computer Science (R0)