DOI: 10.1145/3461615.3491112
Research Article

A Multimodal Dynamic Neural Network for Call for Help Recognition in Elevators

Published: 17 December 2021

ABSTRACT

Elevator accidents cause serious harm to people's lives and property, so emergency calls for help must be answered immediately. In most emergencies, passengers must press the "SOS" button to contact a remote safety guard; however, this method fails when a passenger has lost the ability to move. To address this problem, we define a novel task of distinguishing real from fake calls for help in elevator scenes. Because data on calls for help collected in elevators are scarce, and the task requires multimodal data covering both real and fake categories, we collected and constructed an audiovisual dataset dedicated to the proposed task. Moreover, we present a novel instance-modality-wise dynamic framework that uses the information from each modality efficiently when making inferences. Experimental results show that our multimodal network improves performance on the call-for-help dataset by 2.66% in accuracy and 1.25% in F1 score over the pure audio model, and our method outperforms other approaches on our dataset.
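The abstract does not spell out the dynamic mechanism, so the following is only a minimal PyTorch-style sketch of what instance-modality-wise dynamic inference could look like. All names here (DynamicCallForHelpNet, audio_net, video_net, fusion) and the confidence-threshold gate are assumptions for illustration, not the authors' architecture: a lightweight audio branch classifies every clip, and the heavier visual branch is run only for the instances whose audio prediction is uncertain.

import torch
import torch.nn as nn

class DynamicCallForHelpNet(nn.Module):
    # Hypothetical sketch: per-instance, per-modality dynamic inference.
    # A cheap audio branch runs on every clip; the expensive video branch
    # is invoked only when the audio prediction is not confident enough.
    def __init__(self, audio_net: nn.Module, video_net: nn.Module,
                 fusion: nn.Module, threshold: float = 0.9):
        super().__init__()
        self.audio_net = audio_net    # e.g. a CNN over log-mel spectrograms -> (B, 2) logits
        self.video_net = video_net    # e.g. a 3D CNN over frame stacks -> (B, D) features
        self.fusion = fusion          # classifier over concatenated evidence -> (B, 2) logits
        self.threshold = threshold    # audio confidence needed to skip the video branch

    def forward(self, audio: torch.Tensor, video: torch.Tensor) -> torch.Tensor:
        audio_logits = self.audio_net(audio)                     # (B, 2): real vs. fake call
        conf = torch.softmax(audio_logits, dim=-1).amax(dim=-1)  # (B,): max class probability
        need_video = conf < self.threshold                       # (B,): per-instance gate

        out = audio_logits.clone()
        if need_video.any():
            # Spend the extra compute only on the uncertain instances.
            video_feat = self.video_net(video[need_video])
            fused = torch.cat([audio_logits[need_video], video_feat], dim=-1)
            out[need_video] = self.fusion(fused)
        return out

Under this reading, the gain over the pure audio model would come from the fused predictions on hard instances, while easy instances still exit on audio alone; in practice the threshold would be tuned on validation data to trade accuracy against the cost of running the video branch.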


Published in

ICMI '21 Companion: Companion Publication of the 2021 International Conference on Multimodal Interaction
October 2021, 418 pages
ISBN: 9781450384711
DOI: 10.1145/3461615

Copyright © 2021 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery, New York, NY, United States




Acceptance Rates

Overall acceptance rate: 453 of 1,080 submissions, 42%
