DOI: 10.1145/3293353.3293363

Multimodal Egocentric Activity Recognition Using Multi-stream CNN

Published: 03 May 2020

Abstract

Egocentric activity recognition (EAR) is an emerging area in the field of computer vision research. Motivated by the recent success of Convolutional Neural Networks (CNNs), we propose a multi-stream CNN for multimodal egocentric activity recognition using visual data (RGB videos) and sensor streams (accelerometer, gyroscope, etc.). To effectively capture the spatio-temporal information contained in RGB videos, two modalities are extracted from the visual data: Approximate Dynamic Image (ADI) and Stacked Difference Image (SDI). These image-based representations are generated at both the clip level and the entire-video level, and are then used to fine-tune a pretrained 2D CNN, MobileNet, which is specifically designed for mobile vision applications. Similarly, for the sensor data, each training sample is divided into three segments, and a deep 1D CNN is trained from scratch for each type of sensor stream. During testing, the softmax scores of all streams (visual + sensor) are combined by late fusion. Experiments performed on a multimodal egocentric activity dataset demonstrate that our proposed approach achieves state-of-the-art results, outperforming the best existing handcrafted and deep-learning-based techniques.
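
To make the visual representations concrete, here is a minimal NumPy sketch. The abstract does not give the exact formulations, so the ADI below uses the standard approximate rank-pooling coefficients alpha_t = 2t - T - 1 from Bilen et al. (2016), and the SDI is one plausible reading (absolute differences of consecutive frames stacked along the channel axis); the function names and the SDI layout are illustrative assumptions, not the authors' published code.

```python
import numpy as np

def approximate_dynamic_image(frames):
    """Collapse a clip into a single image by approximate rank pooling.

    `frames` is a (T, H, W, 3) float array. The linear coefficients
    alpha_t = 2t - T - 1 follow Bilen et al. (2016); whether this paper
    uses exactly these weights is an assumption.
    """
    T = frames.shape[0]
    alphas = 2.0 * np.arange(1, T + 1) - T - 1      # one weight per frame
    adi = np.tensordot(alphas, frames, axes=1)      # weighted sum -> (H, W, 3)
    adi -= adi.min()                                # rescale to 0..255 for display
    adi = 255.0 * adi / max(adi.max(), 1e-8)
    return adi.astype(np.uint8)

def stacked_difference_image(frames):
    """One plausible SDI: absolute differences of consecutive grayscale
    frames, stacked along the channel axis -> (H, W, T-1)."""
    gray = frames.mean(axis=-1)                     # (T, H, W)
    diffs = np.abs(np.diff(gray, axis=0))           # (T-1, H, W)
    return np.transpose(diffs, (1, 2, 0))
```

Either representation can then be resized to MobileNet's 224x224 input; an SDI with more than three channels would additionally require adapting the network's first convolution, a detail the abstract leaves open.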
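The sensor branch can be sketched in the same spirit. The abstract states only that each sample is split into three segments and that a deep 1D CNN is trained from scratch per sensor stream; the layer widths, kernel sizes, segment length, and class count below are therefore assumptions. Keras is a natural choice since the authors cite it.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_sensor_cnn(segment_len=100, channels=3, num_classes=20):
    """A small 1D CNN for one sensor stream (e.g. a tri-axial
    accelerometer segment). All hyperparameters are illustrative."""
    model = keras.Sequential([
        keras.Input(shape=(segment_len, channels)),
        layers.Conv1D(64, 5, activation="relu"),
        layers.MaxPooling1D(2),
        layers.Conv1D(128, 5, activation="relu"),
        layers.MaxPooling1D(2),
        layers.GlobalAveragePooling1D(),
        layers.Dropout(0.5),                        # guard against overfitting
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

One such network would be trained per sensor stream (accelerometer, gyroscope, etc.); whether the three segments of a sample are fed as separate training instances or score-averaged at test time is not pinned down by the abstract.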
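Finally, the late-fusion step amounts to a weighted average of the per-stream softmax scores followed by an argmax. Equal weights are assumed in this sketch, since the abstract does not specify how the visual and sensor streams are weighted.

```python
import numpy as np

def late_fusion(stream_scores, weights=None):
    """Fuse softmax scores from S streams.

    `stream_scores` is a list of (N, C) arrays (N samples, C classes),
    one per visual or sensor stream. Returns predicted class indices.
    """
    scores = np.stack(stream_scores)                # (S, N, C)
    if weights is None:
        weights = np.ones(scores.shape[0])          # equal weights assumed
    weights = np.asarray(weights, dtype=float)
    fused = np.tensordot(weights / weights.sum(), scores, axes=1)  # (N, C)
    return fused.argmax(axis=1)
```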



Published In

ICVGIP '18: Proceedings of the 11th Indian Conference on Computer Vision, Graphics and Image Processing
December 2018
659 pages
ISBN: 9781450366151
DOI: 10.1145/3293353
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Convolutional Neural Network
  2. Dynamic Image
  3. Egocentric Activity Recognition
  4. Multimodal Fusion
  5. Stacked Difference Image

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICVGIP 2018

Acceptance Rates

Overall Acceptance Rate: 95 of 286 submissions (33%)


Cited By

  • (2024) WEAR: An Outdoor Sports Dataset for Wearable and Egocentric Activity Recognition. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8(4), 1-21. DOI: 10.1145/3699776
  • (2024) Egocentric Human Activities Recognition With Multimodal Interaction Sensing. IEEE Sensors Journal 24(5), 7085-7096. DOI: 10.1109/JSEN.2023.3349191
  • (2024) Egocentric Vision Action Recognition: Performance Analysis on the Coer_Egovision Dataset. 2024 International Conference on Automation and Computation (AUTOCOM), 341-346. DOI: 10.1109/AUTOCOM60220.2024.10486146
  • (2023) EvIs-Kitchen: Egocentric Human Activities Recognition with Video and Inertial Sensor Data. MultiMedia Modeling, 373-384. DOI: 10.1007/978-3-031-27077-2_29
  • (2020) Extending Egocentric Vision into Vehicles: Malaysian Dash-Cam Dataset. Intelligent Robotics and Applications, 270-281. DOI: 10.1007/978-3-030-66645-3_23
