ABSTRACT
When talking to dialog robots, users must first activate the robot from standby mode with special wake words, such as "Hey Siri", which is apparently not user-friendly. The latest generation of dialog robots is equipped with advanced sensors, such as cameras, enabling multimodal activation. In this work, we work towards waking the robot without wake words. To accomplish this task, we present a Multimodal Activation Scheme (MAS), consisting of two key components: audio-visual consistency detection and semantic talking intention inference. The first is devised to measure the consistency between the audio and visual modalities, in order to determine whether the heard speech comes from the user detected in front of the camera. Towards this end, two heterogeneous CNN-based networks are introduced to convolutionalize the fine-grained facial landmark features and the MFCC audio features, respectively. The second is to infer the semantic talking intention from the recorded speech, where the transcript of the speech is recognized and matrix factorization is utilized to uncover the latent human-robot talking topics. We ultimately devise different fusion strategies to unify these two components. To evaluate MAS, we construct a dataset containing 12,741 short videos recorded by 194 invited volunteers. Extensive experiments demonstrate the effectiveness of our scheme.
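To make the activation decision concrete, the two components can be viewed as each producing a scalar confidence that is then combined by a late-fusion rule. Below is a minimal NumPy sketch of this idea: cosine similarity between the two CNN branches' embeddings stands in for audio-visual consistency detection, and projecting a transcript vector onto latent topics stands in for talking intention inference. All function names, weights, and thresholds here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def av_consistency_score(audio_emb, visual_emb):
    """Cosine similarity between the audio and visual embeddings
    (stand-ins for the outputs of the two CNN branches)."""
    a = audio_emb / np.linalg.norm(audio_emb)
    v = visual_emb / np.linalg.norm(visual_emb)
    return float(a @ v)  # value in [-1, 1]

def intention_score(transcript_vec, topics):
    """Project a transcript's bag-of-words vector onto latent
    human-robot talking topics (columns of `topics`, e.g. from
    matrix factorization); a peaked topic activation is read as
    a confident talking intention."""
    h, *_ = np.linalg.lstsq(topics, transcript_vec, rcond=None)
    h = np.clip(h, 0.0, None)          # keep non-negative activations
    return float(h.max() / (h.sum() + 1e-8))

def should_activate(audio_emb, visual_emb, transcript_vec, topics,
                    alpha=0.6, threshold=0.5):
    """Late fusion: weighted average of the two component scores."""
    s_av = (av_consistency_score(audio_emb, visual_emb) + 1) / 2  # rescale to [0, 1]
    s_in = intention_score(transcript_vec, topics)
    return alpha * s_av + (1 - alpha) * s_in >= threshold
```

A weighted average is only the simplest of the fusion strategies the abstract alludes to; learned fusion (e.g. feeding both scores into a small classifier) would follow the same interface.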