DOI: 10.1145/3404835.3462964

Multimodal Activation: Awakening Dialog Robots without Wake Words

Published: 11 July 2021

ABSTRACT

When talking to dialog robots, users must first activate the robot from standby mode with special wake words, such as "Hey Siri", which is not user-friendly. The latest generation of dialog robots is equipped with advanced sensors, such as cameras, enabling multimodal activation. In this work, we work towards awakening the robot without wake words. To accomplish this task, we present a Multimodal Activation Scheme (MAS), consisting of two key components: audio-visual consistency detection and semantic talking intention inference. The first component measures the consistency between the audio and visual modalities in order to determine whether the heard speech comes from the user detected in front of the camera. Towards this end, two heterogeneous CNN-based networks are introduced to convolutionalize the fine-grained facial landmark features and the MFCC audio features, respectively. The second component infers the semantic talking intention of the recorded speech: the transcript of the speech is recognized, and matrix factorization is utilized to uncover the latent human-robot talking topics. We ultimately devise different fusion strategies to unify these two components. To evaluate MAS, we construct a dataset containing 12,741 short videos recorded by 194 invited volunteers. Extensive experiments demonstrate the effectiveness of our scheme.
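The abstract only sketches the two components, so the following is a minimal PyTorch sketch of the audio-visual consistency idea: two CNN branches convolutionalize the facial-landmark sequence and the MFCC sequence into embeddings whose similarity serves as a consistency score, which is then late-fused with a talking-intention score. All layer sizes, feature dimensions, the cosine-similarity scoring, and the averaging fusion rule are illustrative assumptions, not the authors' reported configuration.

    # Minimal sketch of two-branch audio-visual consistency detection with late fusion.
    # Dimensions, architectures, and the fusion rule are assumptions for illustration.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    class BranchCNN(nn.Module):
        """1-D CNN that convolutionalizes a per-frame feature sequence
        (facial landmarks or MFCCs) into a fixed-length embedding."""

        def __init__(self, in_dim: int, emb_dim: int = 128):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv1d(in_dim, 256, kernel_size=5, padding=2),
                nn.ReLU(),
                nn.Conv1d(256, emb_dim, kernel_size=5, padding=2),
                nn.ReLU(),
                nn.AdaptiveAvgPool1d(1),  # pool over time -> (batch, emb_dim, 1)
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, time, in_dim) -> (batch, in_dim, time) for Conv1d
            return self.conv(x.transpose(1, 2)).squeeze(-1)


    class AVConsistency(nn.Module):
        """Scores whether the heard speech matches the observed face."""

        def __init__(self, landmark_dim: int = 136, mfcc_dim: int = 39):
            super().__init__()
            self.visual = BranchCNN(landmark_dim)  # facial-landmark branch
            self.audio = BranchCNN(mfcc_dim)       # MFCC branch

        def forward(self, landmarks: torch.Tensor, mfcc: torch.Tensor) -> torch.Tensor:
            v = F.normalize(self.visual(landmarks), dim=-1)
            a = F.normalize(self.audio(mfcc), dim=-1)
            # Cosine similarity mapped to [0, 1] as a consistency score.
            return (F.cosine_similarity(v, a, dim=-1) + 1.0) / 2.0


    if __name__ == "__main__":
        model = AVConsistency()
        landmarks = torch.randn(2, 75, 136)   # 75 video frames, 68 (x, y) landmarks
        mfcc = torch.randn(2, 300, 39)        # 300 audio frames, 39-dim MFCCs
        consistency = model(landmarks, mfcc)
        intention = torch.tensor([0.8, 0.1])  # placeholder talking-intention scores
        # One simple late-fusion strategy: average the two component scores.
        activation = (consistency + intention) / 2.0
        print(activation > 0.5)  # wake the robot where True

In the full scheme, the intention score would come from the matrix-factorization topic model applied to the recognized transcript, and the fusion strategy is one of several variants the paper evaluates; the threshold here is purely illustrative.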


Supplemental Material

SIGIR21-fp0979.mp4 (mp4, 18.6 MB)

Published in
      SIGIR '21: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval
      July 2021
      2998 pages
ISBN: 9781450380379
DOI: 10.1145/3404835

      Copyright © 2021 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 11 July 2021


      Qualifiers

      • research-article

      Acceptance Rates

Overall Acceptance Rate: 792 of 3,983 submissions, 20%
