DOI: 10.1145/3293353.3293363

Multimodal Egocentric Activity Recognition Using Multi-stream CNN

Published: 03 May 2020

Abstract

Egocentric activity recognition (EAR) is an emerging area in the field of computer vision research. Motivated by the recent success of Convolutional Neural Networks (CNNs), we propose a multi-stream CNN for multimodal egocentric activity recognition using visual data (RGB videos) and sensor streams (accelerometer, gyroscope, etc.). To effectively capture the spatio-temporal information contained in RGB videos, two modalities are extracted from the visual data: Approximate Dynamic Image (ADI) and Stacked Difference Image (SDI). These image-based representations are generated at both the clip level and the entire-video level, and are then used to fine-tune a pretrained 2D CNN, MobileNet, which is specifically designed for mobile vision applications. Similarly, for the sensor data, each training sample is divided into three segments, and a deep 1D CNN is trained from scratch for each type of sensor stream. During testing, the softmax scores of all streams (visual + sensor) are combined by late fusion. Experiments performed on a multimodal egocentric activity dataset demonstrate that our proposed approach achieves state-of-the-art results, outperforming the best existing handcrafted and deep-learning-based techniques.
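
To make the visual representations concrete, here is a minimal NumPy sketch. The abstract does not give the exact formulations, so the ADI below uses the standard approximate rank-pooling coefficients alpha_t = 2t - T - 1 from Bilen et al. (2016), and the SDI is one plausible reading (absolute differences of consecutive frames stacked along the channel axis); the function names and the SDI layout are illustrative assumptions, not the authors' published code.

```python
import numpy as np

def approximate_dynamic_image(frames):
    """Collapse a clip into a single image by approximate rank pooling.

    `frames` is a (T, H, W, 3) float array. The linear coefficients
    alpha_t = 2t - T - 1 follow Bilen et al. (2016); whether this paper
    uses exactly these weights is an assumption.
    """
    T = frames.shape[0]
    alphas = 2.0 * np.arange(1, T + 1) - T - 1      # one weight per frame
    adi = np.tensordot(alphas, frames, axes=1)      # weighted sum -> (H, W, 3)
    adi -= adi.min()                                # rescale to 0..255 for display
    adi = 255.0 * adi / max(adi.max(), 1e-8)
    return adi.astype(np.uint8)

def stacked_difference_image(frames):
    """One plausible SDI: absolute differences of consecutive grayscale
    frames, stacked along the channel axis -> (H, W, T-1)."""
    gray = frames.mean(axis=-1)                     # (T, H, W)
    diffs = np.abs(np.diff(gray, axis=0))           # (T-1, H, W)
    return np.transpose(diffs, (1, 2, 0))
```

Either representation can then be resized to MobileNet's 224x224 input; an SDI with more than three channels would additionally require adapting the network's first convolution, a detail the abstract leaves open.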
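The sensor branch can be sketched in the same spirit. The abstract states only that each sample is split into three segments and that a deep 1D CNN is trained from scratch per sensor stream; the layer widths, kernel sizes, segment length, and class count below are therefore assumptions. Keras is a natural choice since the authors cite it.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_sensor_cnn(segment_len=100, channels=3, num_classes=20):
    """A small 1D CNN for one sensor stream (e.g. a tri-axial
    accelerometer segment). All hyperparameters are illustrative."""
    model = keras.Sequential([
        keras.Input(shape=(segment_len, channels)),
        layers.Conv1D(64, 5, activation="relu"),
        layers.MaxPooling1D(2),
        layers.Conv1D(128, 5, activation="relu"),
        layers.MaxPooling1D(2),
        layers.GlobalAveragePooling1D(),
        layers.Dropout(0.5),                        # guard against overfitting
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

One such network would be trained per sensor stream (accelerometer, gyroscope, etc.); whether the three segments of a sample are fed as separate training instances or score-averaged at test time is not pinned down by the abstract.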
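Finally, the late-fusion step amounts to a weighted average of the per-stream softmax scores followed by an argmax. Equal weights are assumed in this sketch, since the abstract does not specify how the visual and sensor streams are weighted.

```python
import numpy as np

def late_fusion(stream_scores, weights=None):
    """Fuse softmax scores from S streams.

    `stream_scores` is a list of (N, C) arrays (N samples, C classes),
    one per visual or sensor stream. Returns predicted class indices.
    """
    scores = np.stack(stream_scores)                # (S, N, C)
    if weights is None:
        weights = np.ones(scores.shape[0])          # equal weights assumed
    weights = np.asarray(weights, dtype=float)
    fused = np.tensordot(weights / weights.sum(), scores, axes=1)  # (N, C)
    return fused.argmax(axis=1)
```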



Published In

ICVGIP '18: Proceedings of the 11th Indian Conference on Computer Vision, Graphics and Image Processing
December 2018
659 pages
ISBN: 9781450366151
DOI: 10.1145/3293353
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Convolutional Neural Network
  2. Dynamic Image
  3. Egocentric Activity Recognition
  4. Multimodal Fusion
  5. Stacked Difference Image

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICVGIP 2018

Acceptance Rates

Overall Acceptance Rate: 95 of 286 submissions (33%)


Cited By

  • (2024) WEAR: An Outdoor Sports Dataset for Wearable and Egocentric Activity Recognition. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8(4), 1-21. DOI: 10.1145/3699776
  • (2024) Egocentric Human Activities Recognition With Multimodal Interaction Sensing. IEEE Sensors Journal 24(5), 7085-7096. DOI: 10.1109/JSEN.2023.3349191
  • (2024) Egocentric Vision Action Recognition: Performance Analysis on the Coer_Egovision Dataset. 2024 International Conference on Automation and Computation (AUTOCOM), 341-346. DOI: 10.1109/AUTOCOM60220.2024.10486146
  • (2023) EvIs-Kitchen: Egocentric Human Activities Recognition with Video and Inertial Sensor Data. MultiMedia Modeling, 373-384. DOI: 10.1007/978-3-031-27077-2_29
  • (2020) Extending Egocentric Vision into Vehicles: Malaysian Dash-Cam Dataset. Intelligent Robotics and Applications, 270-281. DOI: 10.1007/978-3-030-66645-3_23
