ABSTRACT
When talking to dialog robots, users must first activate the robot from standby mode with special wake words, such as "Hey Siri", which is apparently not user-friendly. The latest generation of dialog robots is equipped with advanced sensors, such as cameras, enabling multimodal activation. In this work, we work towards waking the robot without wake words. To accomplish this task, we present a Multimodal Activation Scheme (MAS), consisting of two key components: audio-visual consistency detection and semantic talking intention inference. The first is devised to measure the consistency between the audio and visual modalities, in order to determine whether the heard speech comes from the user detected in front of the camera. Towards this end, two heterogeneous CNN-based networks are introduced to convolutionalize the fine-grained facial landmark features and the MFCC audio features, respectively. The second is to infer the semantic talking intention from the recorded speech, where the transcript of the speech is recognized and matrix factorization is utilized to uncover the latent human-robot talking topics. We ultimately devise different fusion strategies to unify these two components. To evaluate MAS, we construct a dataset containing 12,741 short videos recorded by 194 invited volunteers. Extensive experiments demonstrate the effectiveness of our scheme.
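To make the activation decision concrete, the two components can be viewed as each producing a scalar confidence that is then combined by a late-fusion rule. Below is a minimal NumPy sketch of this idea: cosine similarity between the two CNN branches' embeddings stands in for audio-visual consistency detection, and projecting a transcript vector onto latent topics stands in for talking intention inference. All function names, weights, and thresholds here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def av_consistency_score(audio_emb, visual_emb):
    """Cosine similarity between the audio and visual embeddings
    (stand-ins for the outputs of the two CNN branches)."""
    a = audio_emb / np.linalg.norm(audio_emb)
    v = visual_emb / np.linalg.norm(visual_emb)
    return float(a @ v)  # value in [-1, 1]

def intention_score(transcript_vec, topics):
    """Project a transcript's bag-of-words vector onto latent
    human-robot talking topics (columns of `topics`, e.g. from
    matrix factorization); a peaked topic activation is read as
    a confident talking intention."""
    h, *_ = np.linalg.lstsq(topics, transcript_vec, rcond=None)
    h = np.clip(h, 0.0, None)          # keep non-negative activations
    return float(h.max() / (h.sum() + 1e-8))

def should_activate(audio_emb, visual_emb, transcript_vec, topics,
                    alpha=0.6, threshold=0.5):
    """Late fusion: weighted average of the two component scores."""
    s_av = (av_consistency_score(audio_emb, visual_emb) + 1) / 2  # rescale to [0, 1]
    s_in = intention_score(transcript_vec, topics)
    return alpha * s_av + (1 - alpha) * s_in >= threshold
```

A weighted average is only the simplest of the fusion strategies the abstract alludes to; learned fusion (e.g. feeding both scores into a small classifier) would follow the same interface.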