DOI: 10.1145/3462244.3479908
Research Article

Cross-modal Assisted Training for Abnormal Event Recognition in Elevators

Published: 18 October 2021

Abstract

Given that very few action recognition datasets collected in elevators contain multimodal data, we collect and present a multimodal dataset covering passenger safety and inappropriate elevator usage. Moreover, we present a novel framework (RGBP) that uses multimodal data during training to enhance unimodal test performance for abnormal event recognition in elevators. Experimental results show that the best network architecture trained with the RGBP framework improves unimodal inference performance on the Elevator RGBD dataset by 4.71% in accuracy and 4.95% in F1 score over the pure RGB model. In addition, our RGBP framework outperforms two other methods for "multimodal training and unimodal inference": MTUT [1] and a two-stage method based on depth estimation.
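
The abstract refers to the "multimodal training, unimodal inference" setting, in which depth frames are available only at training time and inference runs on RGB alone. As a rough illustration of that general idea (a minimal hypothetical PyTorch sketch, not the authors' RGBP implementation; every class name, loss weight, and tensor shape below is invented for illustration), an RGB branch can be trained jointly with a depth branch under a feature-alignment term, and the depth branch is simply dropped at test time:

import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    # Tiny 3D-CNN encoder for video clips shaped (batch, channels, time, H, W).
    def __init__(self, in_channels, feat_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(in_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),
        )
        self.fc = nn.Linear(32, feat_dim)

    def forward(self, x):
        return self.fc(self.conv(x).flatten(1))

class CrossModalAssistedNet(nn.Module):
    # Hypothetical two-branch model: the RGB branch is kept for deployment,
    # while the depth branch exists only to guide it during training.
    def __init__(self, num_classes=2, feat_dim=128):
        super().__init__()
        self.rgb_enc = Encoder(3, feat_dim)      # used at train and test time
        self.depth_enc = Encoder(1, feat_dim)    # used at train time only
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, rgb, depth=None):
        f_rgb = self.rgb_enc(rgb)
        logits = self.classifier(f_rgb)
        if depth is None:                        # unimodal inference path
            return logits, None
        f_depth = self.depth_enc(depth)
        align = F.mse_loss(f_rgb, f_depth)       # cross-modal assistance term
        return logits, align

# One illustrative training step: classification loss on the RGB logits plus a
# weighted alignment loss that pushes depth information into the RGB branch.
model = CrossModalAssistedNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
rgb = torch.randn(4, 3, 8, 112, 112)             # dummy RGB clips
depth = torch.randn(4, 1, 8, 112, 112)           # dummy depth clips
labels = torch.randint(0, 2, (4,))

logits, align = model(rgb, depth)
loss = F.cross_entropy(logits, labels) + 0.1 * align
opt.zero_grad(); loss.backward(); opt.step()

# Inference uses RGB only, so deployment needs no depth sensor.
with torch.no_grad():
    pred = model(rgb)[0].argmax(dim=1)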

Supplementary Material

MP4 File (1282_video.mp4)

References

[1]
Mahdi Abavisani, Hamid Reza Vaezi Joze, and Vishal M Patel. 2019. Improving the Performance of Unimodal Dynamic Hand-Gesture Recognition With Multimodal Training. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society, 1165–1174.
[2]
Roberto Arroyo, J Javier Yebes, Luis M Bergasa, Iván G Daza, and Javier Almazán. 2015. Expert video-surveillance system for real-time detection of suspicious behaviors in shopping malls. Expert Systems with Applications 42, 21 (2015), 7991–8005.
[3]
Cem Yusuf Aydogdu, Souvik Hazra, Avik Santra, and Robert Weigel. 2020. Multi-modal cross learning for improved people counting using short-range FMCW radar. In 2020 IEEE International Radar Conference (RADAR). IEEE, 250–255.
[4]
Aaron F. Bobick and James W. Davis. 2001. The recognition of human movement using temporal templates. IEEE Transactions on Pattern Analysis and Machine Intelligence 23, 3 (2001), 257–267.
[5]
Oren Boiman and Michal Irani. 2007. Detecting irregularities in images and in video. International Journal of Computer Vision 74, 1 (2007), 17–31.
[6]
Vincent Casser, Soeren Pirk, Reza Mahjourian, and Anelia Angelova. 2019. Depth prediction without the sensors: Leveraging structure for unsupervised learning from monocular videos. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 8001–8008.
[7]
Nicola Conci and Leonardo Lizzi. 2009. Camera placement using particle swarm optimization in visual surveillance applications. In 2009 16th IEEE international conference on image processing (ICIP). IEEE, 3485–3488.
[8]
Blanca Delgado, Khalid Tahboub, and Edward J Delp. 2014. Automatic detection of abnormal human events on train platforms. In NAECON 2014-IEEE National Aerospace and Electronics Conference. IEEE, 169–173.
[9]
Oscar Deniz, Ismael Serrano, Gloria Bueno, and Tae-Kyun Kim. 2014. Fast violence detection in video. In 2014 international conference on computer vision theory and applications (VISAPP), Vol. 2. IEEE, 478–485.
[10]
Yuan Gao, Hong Liu, Xiaohu Sun, Can Wang, and Yi Liu. 2016. Violence detection using oriented violent flows. Image and Vision Computing 48 (2016), 37–41.
[11]
Leon A Gatys, Alexander S Ecker, and Matthias Bethge. 2016. Image style transfer using convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition. 2414–2423.
[12]
VK Gnanavel and A Srinivasan. 2015. Abnormal event detection in crowded video scenes. In Proceedings of the 3rd International Conference on Frontiers of Intelligent Computing: Theory and Applications (Ficta) 2014. Springer, 441–448.
[13]
Clément Godard, Oisin Mac Aodha, Michael Firman, and Gabriel J Brostow. 2019. Digging into self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 3828–3838.
[14]
Anders Grunnet-Jepsen, John N Sweetser, Paul Winer, Akihiro Takagi, and John Woodfill. 2018. Projectors for Intel® RealSense™ Depth Cameras D4xx. Intel Support, Intel Corporation: Santa Clara, CA, USA (2018).
[15]
Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh. 2018. Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6546–6555.
[16]
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015).
[17]
Di Hu, Chengze Wang, Feiping Nie, and Xuelong Li. 2019. Dense multimodal fusion for hierarchically joint representation. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 3941–3945.
[18]
Ahmad Jalal, Yeon-Ho Kim, Yong-Joong Kim, Shaharyar Kamal, and Daijin Kim. 2017. Robust human activity recognition from depth video using spatiotemporal multi-fused features. Pattern Recognition 61 (2017), 295–308.
[19]
Chunhua Jia, Wenhai Yi, Yu Wu, Hui Huang, Lei Zhang, and Leilei Wu. 2020. Abnormal activity capture from passenger flow of elevator based on unsupervised learning and fine-grained multi-label recognition. arXiv preprint arXiv:2006.15873 (2020).
[20]
Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. 2014. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. 1725–1732.
[21]
Roberto Leyva, Victor Sanchez, and Chang-Tsun Li. 2014. Video anomaly detection based on wake motion descriptors and perspective grids. In 2014 IEEE International Workshop on Information Forensics and Security (WIFS). IEEE, 209–214.
[22]
Li Liu and Ling Shao. 2013. Learning discriminative representations from RGB-D video data. In Twenty-Third International Joint Conference on Artificial Intelligence.
[23]
Amira Ben Mabrouk and Ezzeddine Zagrouba. 2017. Spatio-temporal feature using optical flow based distribution for violence detection. Pattern Recognition Letters 92 (2017), 62–67.
[24]
Amira Ben Mabrouk and Ezzeddine Zagrouba. 2018. Abnormal behavior recognition for intelligent video surveillance systems: A review. Expert Systems with Applications 91 (2018), 480–491.
[25]
Vijay Mahadevan, Weixin Li, Viral Bhalodia, and Nuno Vasconcelos. 2010. Anomaly detection in crowded scenes. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE, 1975–1981.
[26]
Hajananth Nallaivarothayan, Clinton Fookes, Simon Denman, and Sridha Sridharan. 2014. An MRF based abnormal event detection approach using motion and appearance features. In 2014 11th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS). IEEE, 343–348.
[27]
Bingbing Ni, Yong Pei, Pierre Moulin, and Shuicheng Yan. 2013. Multilevel depth and image fusion for human activity detection. IEEE Transactions on Cybernetics 43, 5 (2013), 1383–1394.
[28]
Enrique Bermejo Nievas, Oscar Deniz Suarez, Gloria Bueno García, and Rahul Sukthankar. 2011. Violence detection in video using computer vision techniques. In International conference on Computer analysis of images and patterns. Springer, 332–339.
[29]
Simon Niklaus, Long Mai, Jimei Yang, and Feng Liu. 2019. 3D Ken Burns effect from a single image. ACM Transactions on Graphics (TOG) 38, 6 (2019), 1–15.
[30]
Dhanesh Ramachandram and Graham W Taylor. 2017. Deep multimodal learning: A survey on recent advances and trends. IEEE Signal Processing Magazine 34, 6 (2017), 96–108.
[31]
Aravinda S Rao, Jayavardhana Gubbi, Sutharshan Rajasegarar, Slaven Marusic, and Marimuthu Palaniswami. 2014. Detection of anomalous crowd behaviour using hyperspherical clustering. In 2014 International Conference on Digital Image Computing: Techniques and Applications (DICTA). IEEE, 1–8.
[32]
Nida Rasheed, Shoab A Khan, and Adnan Khalid. 2014. Tracking and abnormal behavior detection in video surveillance using optical flow and neural networks. In 2014 28th International Conference on Advanced Information Networking and Applications Workshops. IEEE, 61–66.
[33]
Guang Shu, Gaojing Fu, Peng Li, and Haiyu Geng. 2014. Violent behavior detection based on SVM in the elevator. International Journal of Security and Its Applications 8, 5 (2014), 31–40.
[34]
Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. 2012. Indoor segmentation and support inference from rgbd images. In European conference on computer vision. Springer, 746–760.
[35]
Karen Simonyan and Andrew Zisserman. 2014. Two-stream convolutional networks for action recognition in videos. arXiv preprint arXiv:1406.2199 (2014).
[36]
Cees GM Snoek, Marcel Worring, and Arnold WM Smeulders. 2005. Early versus late fusion in semantic video analysis. In Proceedings of the 13th annual ACM international conference on Multimedia. 399–402.
[37]
Jing Wang and Zhijie Xu. 2015. Crowd anomaly detection for automated video surveillance. (2015).
[38]
Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. 2016. Temporal segment networks: Towards good practices for deep action recognition. In European conference on computer vision. Springer, 20–36.
[39]
Tian Wang and Hichem Snoussi. 2014. Detection of abnormal visual events via global optical flow orientation histogram. IEEE Transactions on Information Forensics and Security 9, 6 (2014), 988–998.
[40]
Ping Xiao, Maylor KH Leung, and Kok Cheong Wong. 1996. Eleview: An Active Elevator Monitoring Vision System. In MVA. 253–256.
[41]
Dan Xu, Rui Song, Xinyu Wu, Nannan Li, Wei Feng, and Huihuan Qian. 2014. Video anomaly detection based on a hierarchical activity discovery within spatio-temporal contexts. Neurocomputing 143 (2014), 144–152.
[42]
B Yogameena and K Sindhu Priya. 2015. Synoptic video based human crowd behavior analysis for forensic video surveillance. In 2015 Eighth International Conference on Advances in Pattern Recognition (ICAPR). IEEE, 1–6.
[43]
Matthew D Zeiler. 2012. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701 (2012).
[44]
Hao Zhu, Man-Di Luo, Rui Wang, Ai-Hua Zheng, and Ran He. 2021. Deep audio-visual learning: A survey. International Journal of Automation and Computing (2021), 1–26.
[45]
Songhao Zhu, Juanjuan Hu, and Zhe Shi. 2016. Local abnormal behavior detection based on optical flow and spatio-temporal gradient. Multimedia Tools and Applications 75, 15 (2016), 9445–9459.
[46]
Yujie Zhu and Zengfu Wang. 2016. Real-time abnormal behavior detection in elevator. In Chinese Conference on Intelligent Visual Surveillance. Springer, 154–161.


Published In

cover image ACM Conferences
ICMI '21: Proceedings of the 2021 International Conference on Multimodal Interaction
October 2021
876 pages
ISBN:9781450384810
DOI:10.1145/3462244
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 October 2021

Author Tags

  1. Multimodal learning
  2. abnormal event recognition in elevators

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • Kunshan Government Research (KGR) Funding in AY 2020/2021.

Conference

ICMI '21
Sponsor:
ICMI '21: INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION
October 18 - 22, 2021
Montréal, QC, Canada

Acceptance Rates

Overall Acceptance Rate 453 of 1,080 submissions, 42%
