
Knowledge-driven Egocentric Multimodal Activity Recognition

Published: 17 December 2020

Abstract

Recognizing activities from egocentric multimodal data collected by wearable cameras and sensors is gaining interest, as multimodal methods benefit from the complementarity of different modalities. However, high-dimensional videos contain rich high-level semantic information, while low-dimensional sensor signals describe simple motion patterns of the wearer; this large modality gap between the videos and the sensor signals makes fusing the raw data challenging. Moreover, the lack of large-scale egocentric multimodal datasets, due to the cost of data collection and annotation, poses another challenge for employing complex deep learning models. To jointly address these two challenges, we propose a knowledge-driven multimodal activity recognition framework that exploits external knowledge to fuse multimodal data and reduce the dependence on large-scale training samples. Specifically, we design a dual-GCLSTM (Graph Convolutional LSTM) and a multi-layer GCN (Graph Convolutional Network) to collectively model the relations among activities and intermediate objects. The dual-GCLSTM fuses temporal multimodal features under top-down relation-aware guidance, and a co-attention mechanism adaptively attends to the features of different modalities at each timestep. The multi-layer GCN learns relation-aware classifiers for the activity categories. Experimental results on three publicly available egocentric multimodal datasets demonstrate the effectiveness of the proposed model.
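To make the architectural terms in the abstract concrete, the following is a minimal, illustrative sketch rather than the authors' implementation: it shows a standard graph-convolution propagation step over an activity/object relation graph and a simple per-timestep co-attention that reweights video and sensor features before fusion. All function names, tensor shapes, and the plain-NumPy phrasing are assumptions made for illustration only.

```python
# Illustrative sketch only -- not the paper's model. Shows (1) a generic GCN
# propagation step and (2) a toy two-modality co-attention fusion.
import numpy as np

def gcn_layer(adj, feats, weight):
    """One graph-convolution step: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    a_hat = adj + np.eye(adj.shape[0])           # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(a_hat.sum(axis=1)))
    norm_adj = d_inv_sqrt @ a_hat @ d_inv_sqrt   # symmetric normalization
    return np.maximum(norm_adj @ feats @ weight, 0.0)

def co_attention_fuse(video_feat, sensor_feat, w_video, w_sensor):
    """Score each modality at one timestep, then return the attention-weighted sum."""
    scores = np.array([video_feat @ w_video, sensor_feat @ w_sensor])
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over the two modalities
    return weights[0] * video_feat + weights[1] * sensor_feat

# Toy usage: 5 relation-graph nodes with 16-d embeddings, 64-d modality features.
rng = np.random.default_rng(0)
adj = (rng.random((5, 5)) > 0.5).astype(float)
node_emb = gcn_layer(adj, rng.standard_normal((5, 16)), rng.standard_normal((16, 32)))
fused = co_attention_fuse(rng.standard_normal(64), rng.standard_normal(64),
                          rng.standard_normal(64), rng.standard_normal(64))
```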



Published In

ACM Transactions on Multimedia Computing, Communications, and Applications  Volume 16, Issue 4
November 2020
372 pages
ISSN:1551-6857
EISSN:1551-6865
DOI:10.1145/3444749

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 December 2020
Accepted: 01 July 2020
Revised: 01 June 2020
Received: 01 January 2020
Published in TOMM Volume 16, Issue 4


Author Tags

  1. Egocentric videos
  2. graph neural networks
  3. wearable sensors

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • Key Research Program of Frontier Sciences of CAS
  • Research Program of National Laboratory of Pattern Recognition
  • National Natural Science Foundation of China
  • National Key Research and Development Program of China


Article Metrics

  • Downloads (Last 12 months): 74
  • Downloads (Last 6 weeks): 6
Reflects downloads up to 17 Jan 2025


Cited By

  • (2024) From CNNs to Transformers in Multimodal Human Action Recognition: A Survey. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 8 (2024), 1--24. https://doi.org/10.1145/3664815. Online publication date: 13-May-2024.
  • (2024) Cross-Modal Federated Human Activity Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 46, 8 (2024), 5345--5361. https://doi.org/10.1109/TPAMI.2024.3367412. Online publication date: Aug-2024.
  • (2024) A survey of multimodal federated learning: background, applications, and perspectives. Multimedia Systems 30, 4 (2024). https://doi.org/10.1007/s00530-024-01422-9. Online publication date: 29-Jul-2024.
  • (2023) Toward Egocentric Compositional Action Anticipation with Adaptive Semantic Debiasing. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 5 (2023), 1--21. https://doi.org/10.1145/3633333. Online publication date: 4-Dec-2023.
  • (2023) 3D Object Watermarking from Data Hiding in the Homomorphic Encrypted Domain. ACM Transactions on Multimedia Computing, Communications, and Applications 19, 5s (2023), 1--20. https://doi.org/10.1145/3588573. Online publication date: 7-Jun-2023.
  • (2023) Less Is More: Learning from Synthetic Data with Fine-Grained Attributes for Person Re-Identification. ACM Transactions on Multimedia Computing, Communications, and Applications 19, 5s (2023), 1--20. https://doi.org/10.1145/3588441. Online publication date: 7-Jun-2023.
  • (2023) Neural Network Assisted Depth Map Packing for Compression Using Standard Hardware Video Codecs. ACM Transactions on Multimedia Computing, Communications, and Applications 19, 5s (2023), 1--20. https://doi.org/10.1145/3588440. Online publication date: 7-Jun-2023.
  • (2023) A Geometrical Approach to Evaluate the Adversarial Robustness of Deep Neural Networks. ACM Transactions on Multimedia Computing, Communications, and Applications 19, 5s (2023), 1--17. https://doi.org/10.1145/3587936. Online publication date: 7-Jun-2023.
  • (2023) Local Bidirection Recurrent Network for Efficient Video Deblurring with the Fused Temporal Merge Module. ACM Transactions on Multimedia Computing, Communications, and Applications 19, 5s (2023), 1--18. https://doi.org/10.1145/3587468. Online publication date: 7-Jun-2023.
  • (2023) Video Captioning by Learning from Global Sentence and Looking Ahead. ACM Transactions on Multimedia Computing, Communications, and Applications 19, 5s (2023), 1--20. https://doi.org/10.1145/3587252. Online publication date: 7-Jun-2023.
