A Multimodal Dynamic Neural Network for Call for Help Recognition in Elevators

ABSTRACT
Elevator accidents cause serious harm to life and property, so responding immediately to emergency calls for help is essential. In most emergencies, passengers must press the "SOS" button to contact a remote safety guard. However, this method fails when passengers lose the ability to move. To address this problem, we define a novel task of distinguishing real from fake calls for help in elevator scenes. Because few existing call-for-help datasets collected in elevators contain multimodal data covering both real and fake categories, we collect and construct an audiovisual dataset dedicated to the proposed task. Moreover, we present a novel instance-modality-wise dynamic framework that efficiently exploits the information from each modality when making inferences. Experimental results show that our multimodal network improves performance on the call-for-help multimodal dataset by 2.66% in accuracy and 1.25% in F1 score over the audio-only model, and that our method outperforms other methods on our dataset.
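The abstract names an instance-modality-wise dynamic framework without detailing its architecture. As a minimal, hypothetical sketch of how per-instance dynamic modality selection can work, the PyTorch snippet below gates a cheap audio-only head against an audio-visual fusion head; all module names, embedding sizes, and the gating threshold are illustrative assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of an instance-wise dynamic two-stream classifier.
# A lightweight gate inspects the audio embedding and decides, per
# instance, whether the (more expensive) visual stream is needed.
# Every module name and dimension here is an illustrative assumption.

class DynamicAVClassifier(nn.Module):
    def __init__(self, audio_dim=128, video_dim=512, num_classes=2):
        super().__init__()
        self.audio_head = nn.Linear(audio_dim, num_classes)
        self.fusion_head = nn.Linear(audio_dim + video_dim, num_classes)
        # Gate outputs the probability that audio alone suffices.
        self.gate = nn.Sequential(nn.Linear(audio_dim, 1), nn.Sigmoid())

    def forward(self, audio_emb, video_emb, threshold=0.5):
        p_audio_only = self.gate(audio_emb)        # (B, 1)
        audio_logits = self.audio_head(audio_emb)  # (B, num_classes)
        fused_logits = self.fusion_head(
            torch.cat([audio_emb, video_emb], dim=-1))
        if self.training:
            # Soft mixture keeps the gate differentiable during training.
            return p_audio_only * audio_logits + (1 - p_audio_only) * fused_logits
        # At inference, hard-select per instance.
        use_audio = (p_audio_only > threshold).float()
        return use_audio * audio_logits + (1 - use_audio) * fused_logits

# Usage on dummy precomputed embeddings:
model = DynamicAVClassifier().eval()
audio, video = torch.randn(4, 128), torch.randn(4, 512)
with torch.no_grad():
    logits = model(audio, video)
print(logits.shape)  # torch.Size([4, 2])
```

In a deployed system, the hard gate would let the model skip the visual backbone entirely for instances the audio stream already resolves; here both embeddings are passed in precomputed for simplicity.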