Abstract
Speech translation is a technology that helps people communicate across different languages. The most commonly used speech translation system is composed of automatic speech recognition, machine translation, and text-to-speech synthesis components, which share information only at the text level. However, spoken communication differs from written communication in that it uses rich acoustic cues such as prosody to transmit additional information through non-verbal channels. This paper is concerned with speech-to-speech translation that is sensitive to this paralinguistic information. Our long-term goal is to build a system that allows users to speak a foreign language with the same expressiveness as if they were speaking in their own language. Our method works by reconstructing input acoustic features in the target language. Of the many paralinguistic features that could be handled, in this paper we choose duration and power as a first step, proposing a method that translates these features from the input speech to the output speech in continuous space. This is done in a simple and language-independent fashion by training an end-to-end model that maps source-language duration and power information into the target language. Two approaches are investigated: linear regression and neural network models. We evaluate the proposed methods and show that paralinguistic information in the input speech of the source language can be reflected in the output speech of the target language.
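The linear-regression variant of the continuous-space mapping described in the abstract can be sketched as an ordinary least-squares fit from source-language duration/power vectors to target-language ones. This is a minimal illustration under assumed toy data: the feature values, the word-level alignment, and the helper name `translate_features` are all hypothetical and do not reproduce the paper's actual experimental setup.

```python
import numpy as np

# Toy paralinguistic vectors: each row is [duration_ms, power_dB] for a
# pair of aligned source/target words (illustrative values only).
X = np.array([[200.0, 60.0],   # source-word features
              [350.0, 70.0],
              [120.0, 55.0],
              [500.0, 75.0]])
Y = np.array([[220.0, 58.0],   # corresponding target-word features
              [380.0, 69.0],
              [130.0, 54.0],
              [540.0, 74.0]])

# Append a bias column and solve Y ≈ [X | 1] W in the least-squares sense.
Xb = np.hstack([X, np.ones((X.shape[0], 1))])
W, *_ = np.linalg.lstsq(Xb, Y, rcond=None)

def translate_features(x):
    """Map one source [duration, power] vector into the target language."""
    return np.append(x, 1.0) @ W

pred = translate_features(np.array([300.0, 65.0]))
```

The neural-network variant the abstract mentions would replace the single linear map `W` with a small nonlinear regressor trained on the same paired feature vectors.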
Acknowledgements
The funding was provided by the Japan Society for the Promotion of Science (Grant Nos. 24240032 and 26870371).
Cite this article
Kano, T., Takamichi, S., Sakti, S. et al. An end-to-end model for cross-lingual transformation of paralinguistic information. Machine Translation 32, 353–368 (2018). https://doi.org/10.1007/s10590-018-9217-7