Abstract
Speech translation is a technology that helps people communicate across different languages. The most commonly used speech translation system is composed of automatic speech recognition, machine translation, and text-to-speech synthesis components, which share information only at the text level. However, spoken communication differs from written communication in that it uses rich acoustic cues such as prosody to transmit additional information through non-verbal channels. This paper is concerned with speech-to-speech translation that is sensitive to this paralinguistic information. Our long-term goal is to build a system that allows users to speak a foreign language with the same expressiveness as if they were speaking in their own language. Our method works by reconstructing input acoustic features in the target language. Of the many paralinguistic features that could be handled, in this paper we choose duration and power as a first step, proposing a method that translates these features from the input speech to the output speech in continuous space. This is done in a simple and language-independent fashion by training an end-to-end model that maps source-language duration and power information into the target language. Two approaches are investigated: linear regression and neural network models. We evaluate the proposed methods and show that paralinguistic information in the input speech of the source language can be reflected in the output speech of the target language.
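The linear-regression variant of the continuous-space mapping described in the abstract can be sketched as an ordinary least-squares fit from source-language duration/power vectors to target-language ones. This is a minimal illustration under assumed toy data: the feature values, the word-level alignment, and the helper name `translate_features` are all hypothetical and do not reproduce the paper's actual experimental setup.

```python
import numpy as np

# Toy paralinguistic vectors: each row is [duration_ms, power_dB] for a
# pair of aligned source/target words (illustrative values only).
X = np.array([[200.0, 60.0],   # source-word features
              [350.0, 70.0],
              [120.0, 55.0],
              [500.0, 75.0]])
Y = np.array([[220.0, 58.0],   # corresponding target-word features
              [380.0, 69.0],
              [130.0, 54.0],
              [540.0, 74.0]])

# Append a bias column and solve Y ≈ [X | 1] W in the least-squares sense.
Xb = np.hstack([X, np.ones((X.shape[0], 1))])
W, *_ = np.linalg.lstsq(Xb, Y, rcond=None)

def translate_features(x):
    """Map one source [duration, power] vector into the target language."""
    return np.append(x, 1.0) @ W

pred = translate_features(np.array([300.0, 65.0]))
```

The neural-network variant the abstract mentions would replace the single linear map `W` with a small nonlinear regressor trained on the same paired feature vectors.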
Acknowledgements
The funding was provided by the Japan Society for the Promotion of Science (Grant Nos. 24240032 and 26870371).
Cite this article
Kano, T., Takamichi, S., Sakti, S. et al. An end-to-end model for cross-lingual transformation of paralinguistic information. Machine Translation 32, 353–368 (2018). https://doi.org/10.1007/s10590-018-9217-7