DOI: 10.1145/3462244.3479908
Research Article

Cross-modal Assisted Training for Abnormal Event Recognition in Elevators

Published: 18 October 2021

Abstract

Given that very few action recognition datasets collected in elevators contain multimodal data, we collect and present a multimodal dataset covering passenger safety and inappropriate elevator usage. Moreover, we present a novel framework (RGBP) that uses multimodal data during training to enhance unimodal test performance for abnormal event recognition in elevators. Experimental results show that the best network architecture trained with the RGBP framework improves unimodal inference performance on the Elevator RGBD dataset by 4.71% in accuracy and 4.95% in F1 score over the pure RGB model. In addition, our RGBP framework outperforms two other methods for "multimodal training and unimodal inference": MTUT [1] and a two-stage method based on depth estimation.
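
The abstract refers to the "multimodal training, unimodal inference" setting, in which depth frames are available only at training time and inference runs on RGB alone. As a rough illustration of that general idea (a minimal hypothetical PyTorch sketch, not the authors' RGBP implementation; every class name, loss weight, and tensor shape below is invented for illustration), an RGB branch can be trained jointly with a depth branch under a feature-alignment term, and the depth branch is simply dropped at test time:

import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    # Tiny 3D-CNN encoder for video clips shaped (batch, channels, time, H, W).
    def __init__(self, in_channels, feat_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(in_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),
        )
        self.fc = nn.Linear(32, feat_dim)

    def forward(self, x):
        return self.fc(self.conv(x).flatten(1))

class CrossModalAssistedNet(nn.Module):
    # Hypothetical two-branch model: the RGB branch is kept for deployment,
    # while the depth branch exists only to guide it during training.
    def __init__(self, num_classes=2, feat_dim=128):
        super().__init__()
        self.rgb_enc = Encoder(3, feat_dim)      # used at train and test time
        self.depth_enc = Encoder(1, feat_dim)    # used at train time only
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, rgb, depth=None):
        f_rgb = self.rgb_enc(rgb)
        logits = self.classifier(f_rgb)
        if depth is None:                        # unimodal inference path
            return logits, None
        f_depth = self.depth_enc(depth)
        align = F.mse_loss(f_rgb, f_depth)       # cross-modal assistance term
        return logits, align

# One illustrative training step: classification loss on the RGB logits plus a
# weighted alignment loss that pushes depth information into the RGB branch.
model = CrossModalAssistedNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
rgb = torch.randn(4, 3, 8, 112, 112)             # dummy RGB clips
depth = torch.randn(4, 1, 8, 112, 112)           # dummy depth clips
labels = torch.randint(0, 2, (4,))

logits, align = model(rgb, depth)
loss = F.cross_entropy(logits, labels) + 0.1 * align
opt.zero_grad(); loss.backward(); opt.step()

# Inference uses RGB only, so deployment needs no depth sensor.
with torch.no_grad():
    pred = model(rgb)[0].argmax(dim=1)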

Supplementary Material

MP4 File (1282_video.mp4)

References

[1]
Mahdi Abavisani, Hamid Reza Vaezi Joze, and Vishal M Patel. 2019. Improving the Performance of Unimodal Dynamic Hand-Gesture Recognition With Multimodal Training. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society, 1165–1174.
[2]
Roberto Arroyo, J Javier Yebes, Luis M Bergasa, Iván G Daza, and Javier Almazán. 2015. Expert video-surveillance system for real-time detection of suspicious behaviors in shopping malls. Expert Systems with Applications 42, 21 (2015), 7991–8005.
[3]
Cem Yusuf Aydogdu, Souvik Hazra, Avik Santra, and Robert Weigel. 2020. Multi-modal cross learning for improved people counting using short-range FMCW radar. In 2020 IEEE International Radar Conference (RADAR). IEEE, 250–255.
[4]
Aaron F. Bobick and James W. Davis. 2001. The recognition of human movement using temporal templates. IEEE Transactions on Pattern Analysis and Machine Intelligence 23, 3 (2001), 257–267.
[5]
Oren Boiman and Michal Irani. 2007. Detecting irregularities in images and in video. International Journal of Computer Vision 74, 1 (2007), 17–31.
[6]
Vincent Casser, Soeren Pirk, Reza Mahjourian, and Anelia Angelova. 2019. Depth prediction without the sensors: Leveraging structure for unsupervised learning from monocular videos. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 8001–8008.
[7]
Nicola Conci and Leonardo Lizzi. 2009. Camera placement using particle swarm optimization in visual surveillance applications. In 2009 16th IEEE international conference on image processing (ICIP). IEEE, 3485–3488.
[8]
Blanca Delgado, Khalid Tahboub, and Edward J Delp. 2014. Automatic detection of abnormal human events on train platforms. In NAECON 2014-IEEE National Aerospace and Electronics Conference. IEEE, 169–173.
[9]
Oscar Deniz, Ismael Serrano, Gloria Bueno, and Tae-Kyun Kim. 2014. Fast violence detection in video. In 2014 international conference on computer vision theory and applications (VISAPP), Vol. 2. IEEE, 478–485.
[10]
Yuan Gao, Hong Liu, Xiaohu Sun, Can Wang, and Yi Liu. 2016. Violence detection using oriented violent flows. Image and Vision Computing 48 (2016), 37–41.
[11]
Leon A Gatys, Alexander S Ecker, and Matthias Bethge. 2016. Image style transfer using convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition. 2414–2423.
[12]
VK Gnanavel and A Srinivasan. 2015. Abnormal event detection in crowded video scenes. In Proceedings of the 3rd International Conference on Frontiers of Intelligent Computing: Theory and Applications (Ficta) 2014. Springer, 441–448.
[13]
Clément Godard, Oisin Mac Aodha, Michael Firman, and Gabriel J Brostow. 2019. Digging into self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 3828–3838.
[14]
Anders Grunnet-Jepsen, John N Sweetser, Paul Winer, Akihiro Takagi, and John Woodfill. 2018. Projectors for Intel® RealSense™ Depth Cameras D4xx. Intel Support, Intel Corporation: Santa Clara, CA, USA (2018).
[15]
Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh. 2018. Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6546–6555.
[16]
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015).
[17]
Di Hu, Chengze Wang, Feiping Nie, and Xuelong Li. 2019. Dense multimodal fusion for hierarchically joint representation. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 3941–3945.
[18]
Ahmad Jalal, Yeon-Ho Kim, Yong-Joong Kim, Shaharyar Kamal, and Daijin Kim. 2017. Robust human activity recognition from depth video using spatiotemporal multi-fused features. Pattern Recognition 61 (2017), 295–308.
[19]
Chunhua Jia, Wenhai Yi, Yu Wu, Hui Huang, Lei Zhang, and Leilei Wu. 2020. Abnormal activity capture from passenger flow of elevator based on unsupervised learning and fine-grained multi-label recognition. arXiv preprint arXiv:2006.15873 (2020).
[20]
Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. 2014. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. 1725–1732.
[21]
Roberto Leyva, Victor Sanchez, and Chang-Tsun Li. 2014. Video anomaly detection based on wake motion descriptors and perspective grids. In 2014 IEEE International Workshop on Information Forensics and Security (WIFS). IEEE, 209–214.
[22]
Li Liu and Ling Shao. 2013. Learning discriminative representations from RGB-D video data. In Twenty-Third International Joint Conference on Artificial Intelligence.
[23]
Amira Ben Mabrouk and Ezzeddine Zagrouba. 2017. Spatio-temporal feature using optical flow based distribution for violence detection. Pattern Recognition Letters 92 (2017), 62–67.
[24]
Amira Ben Mabrouk and Ezzeddine Zagrouba. 2018. Abnormal behavior recognition for intelligent video surveillance systems: A review. Expert Systems with Applications 91 (2018), 480–491.
[25]
Vijay Mahadevan, Weixin Li, Viral Bhalodia, and Nuno Vasconcelos. 2010. Anomaly detection in crowded scenes. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE, 1975–1981.
[26]
Hajananth Nallaivarothayan, Clinton Fookes, Simon Denman, and Sridha Sridharan. 2014. An MRF based abnormal event detection approach using motion and appearance features. In 2014 11th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS). IEEE, 343–348.
[27]
Bingbing Ni, Yong Pei, Pierre Moulin, and Shuicheng Yan. 2013. Multilevel depth and image fusion for human activity detection. IEEE Transactions on Cybernetics 43, 5 (2013), 1383–1394.
[28]
Enrique Bermejo Nievas, Oscar Deniz Suarez, Gloria Bueno García, and Rahul Sukthankar. 2011. Violence detection in video using computer vision techniques. In International conference on Computer analysis of images and patterns. Springer, 332–339.
[29]
Simon Niklaus, Long Mai, Jimei Yang, and Feng Liu. 2019. 3D Ken Burns effect from a single image. ACM Transactions on Graphics (TOG) 38, 6 (2019), 1–15.
[30]
Dhanesh Ramachandram and Graham W Taylor. 2017. Deep multimodal learning: A survey on recent advances and trends. IEEE Signal Processing Magazine 34, 6 (2017), 96–108.
[31]
Aravinda S Rao, Jayavardhana Gubbi, Sutharshan Rajasegarar, Slaven Marusic, and Marimuthu Palaniswami. 2014. Detection of anomalous crowd behaviour using hyperspherical clustering. In 2014 International Conference on Digital Image Computing: Techniques and Applications (DICTA). IEEE, 1–8.
[32]
Nida Rasheed, Shoab A Khan, and Adnan Khalid. 2014. Tracking and abnormal behavior detection in video surveillance using optical flow and neural networks. In 2014 28th International Conference on Advanced Information Networking and Applications Workshops. IEEE, 61–66.
[33]
Guang Shu, Gaojing Fu, Peng Li, and Haiyu Geng. 2014. Violent behavior detection based on SVM in the elevator. International Journal of Security and Its Applications 8, 5 (2014), 31–40.
[34]
Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. 2012. Indoor segmentation and support inference from rgbd images. In European conference on computer vision. Springer, 746–760.
[35]
Karen Simonyan and Andrew Zisserman. 2014. Two-stream convolutional networks for action recognition in videos. arXiv preprint arXiv:1406.2199 (2014).
[36]
Cees GM Snoek, Marcel Worring, and Arnold WM Smeulders. 2005. Early versus late fusion in semantic video analysis. In Proceedings of the 13th annual ACM international conference on Multimedia. 399–402.
[37]
Jing Wang and Zhijie Xu. 2015. Crowd anomaly detection for automated video surveillance. (2015).
[38]
Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. 2016. Temporal segment networks: Towards good practices for deep action recognition. In European conference on computer vision. Springer, 20–36.
[39]
Tian Wang and Hichem Snoussi. 2014. Detection of abnormal visual events via global optical flow orientation histogram. IEEE Transactions on Information Forensics and Security 9, 6 (2014), 988–998.
[40]
Ping Xiao, Maylor KH Leung, and Kok Cheong Wong. 1996. Eleview: An Active Elevator Monitoring Vision System. In MVA. 253–256.
[41]
Dan Xu, Rui Song, Xinyu Wu, Nannan Li, Wei Feng, and Huihuan Qian. 2014. Video anomaly detection based on a hierarchical activity discovery within spatio-temporal contexts. Neurocomputing 143 (2014), 144–152.
[42]
B Yogameena and K Sindhu Priya. 2015. Synoptic video based human crowd behavior analysis for forensic video surveillance. In 2015 Eighth International Conference on Advances in Pattern Recognition (ICAPR). IEEE, 1–6.
[43]
Matthew D Zeiler. 2012. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701 (2012).
[44]
Hao Zhu, Man-Di Luo, Rui Wang, Ai-Hua Zheng, and Ran He. 2021. Deep audio-visual learning: A survey. International Journal of Automation and Computing (2021), 1–26.
[45]
Songhao Zhu, Juanjuan Hu, and Zhe Shi. 2016. Local abnormal behavior detection based on optical flow and spatio-temporal gradient. Multimedia Tools and Applications 75, 15 (2016), 9445–9459.
[46]
Yujie Zhu and Zengfu Wang. 2016. Real-time abnormal behavior detection in elevator. In Chinese Conference on Intelligent Visual Surveillance. Springer, 154–161.


Published In

cover image ACM Conferences
ICMI '21: Proceedings of the 2021 International Conference on Multimodal Interaction
October 2021
876 pages
ISBN:9781450384810
DOI:10.1145/3462244
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 October 2021

Author Tags

  1. Multimodal learning
  2. abnormal event recognition in elevators

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • Kunshan Government Research (KGR) Funding in AY 2020/2021.

Conference

ICMI '21
Sponsor:
ICMI '21: INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION
October 18 - 22, 2021
Montréal, QC, Canada

Acceptance Rates

Overall Acceptance Rate 453 of 1,080 submissions, 42%
