Abstract
Human communication is inherently multimodal. Even with many technologies mediating remote communication, face-to-face contact remains our main and most natural way to exchange information. Despite continuous advances in interaction modalities such as speech interaction, much can still be done to improve their naturalness and efficiency, particularly by conveying the visual cues carried by facial expressions through audiovisual speech synthesis (AVS). To this end, several approaches have been proposed in the literature, mostly based on data-driven methods. While these achieve very good results, they rely on models that act as black boxes, with no direct relation to the actual process of producing speech, and therefore contribute little to our understanding of the synergies between the audio and visual outputs. In this context, the authors previously proposed a proof of concept for an articulatory-based approach to AVS, grounded in the articulatory phonology framework, and argued that this research needs to be challenged and informed by fast methods to translate it into interactive applications. In this article, we describe further evolutions of the pronunciation module of the AVS core system, along with a set of interaction modalities that enable its integration into applications and, hence, a faster translation into real scenarios. The proposed modalities follow the W3C recommendations for multimodal interaction architectures, making them easy to integrate into any application that adopts them.
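As a concrete illustration of how an application could drive such a modality, the minimal sketch below wraps a synthesis request in a W3C MMI life-cycle event (a startRequest) and posts it to the modality over HTTP. The MMI envelope and its attributes follow the W3C Multimodal Architecture and Interfaces recommendation, but the endpoint URL, the component names, and the <synthesize> payload element are assumptions made for this example, not details taken from the paper.

```python
# Sketch only: sending a W3C MMI startRequest life-cycle event to an AVS
# modality over HTTP. The endpoint URL and the <synthesize> payload element
# are illustrative assumptions; the paper does not specify them.
import urllib.request
import xml.etree.ElementTree as ET

MMI_NS = "http://www.w3.org/2008/04/mmi-arch"


def build_start_request(text: str, context: str, request_id: str) -> bytes:
    """Wrap a text-to-synthesize payload in an MMI startRequest event."""
    ET.register_namespace("mmi", MMI_NS)
    root = ET.Element(f"{{{MMI_NS}}}mmi", {"version": "1.0"})
    start = ET.SubElement(root, f"{{{MMI_NS}}}startRequest", {
        "source": "application",    # the requesting application / IM side
        "target": "avs-modality",   # the AVS modality component (assumed name)
        "context": context,
        "requestID": request_id,
    })
    data = ET.SubElement(start, f"{{{MMI_NS}}}data")
    # Application-specific payload (assumed format): text to be rendered
    # as audiovisual speech in European Portuguese.
    synth = ET.SubElement(data, "synthesize", {"lang": "pt-PT"})
    synth.text = text
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


def send_event(event: bytes, endpoint: str = "http://localhost:8080/mmi") -> str:
    """POST the life-cycle event to the modality endpoint (assumed URL)."""
    req = urllib.request.Request(
        endpoint, data=event, headers={"Content-Type": "application/xml"}
    )
    with urllib.request.urlopen(req) as resp:
        # Per the MMI recommendation, the modality answers with a
        # startResponse and later reports completion via doneNotification.
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    event = build_start_request("bom dia", context="ctx-1", request_id="req-1")
    print(event.decode("utf-8"))
```

Because only the MMI envelope is standardized, an interaction manager can route the same event to any compliant modality; swapping the AVS component for another renderer would not change the application side of this exchange.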
Acknowledgements
This work is partially funded by IEETA Research Unit funding (UIDB/00127/2020), by Portugal 2020 under the Competitiveness and Internationalization Operational Program, and by the European Regional Development Fund through project MEMNON (POCI-01-0145-FEDER-028976).
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Almeida, N., Cunha, D., Silva, S., Teixeira, A. (2021). Designing and Deploying an Interaction Modality for Articulatory-Based Audiovisual Speech Synthesis. In: Karpov, A., Potapova, R. (eds) Speech and Computer. SPECOM 2021. Lecture Notes in Computer Science, vol. 12997. Springer, Cham. https://doi.org/10.1007/978-3-030-87802-3_4
DOI: https://doi.org/10.1007/978-3-030-87802-3_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-87801-6
Online ISBN: 978-3-030-87802-3
eBook Packages: Computer Science, Computer Science (R0)