research-article

Distributed speech translation technologies for multiparty multilingual communication

Authors:

Sakriani Sakti,

Noriyuki Kimura,

Shigeki Matsuda,

Yutaka Ashikari,

Hideki Kashioka,

Eiichiro Sumita,

Satoshi Nakamura

ACM Transactions on Speech and Language Processing (TSLP), Volume 9, Issue 2

Article No.: 4, Pages 1 - 27

https://doi.org/10.1145/2287710.2287712

Published: 02 August 2012

Abstract

Developing a multilingual speech translation system requires constructing automatic speech recognition (ASR), machine translation (MT), and text-to-speech synthesis (TTS) components for all possible source and target languages. If the numerous ASR, MT, and TTS systems for different language pairs that have been developed independently in different parts of the world could be connected, multilingual speech translation for a multitude of language pairs could be achieved. Yet there is currently no common, flexible framework that can carry out the entire speech translation process by bringing together heterogeneous speech translation components. In this article we therefore propose a distributed architecture framework for multilingual speech translation in which all speech translation components are provided on distributed servers and cooperate over a network. This framework can facilitate the connection of different components and functions. To show the overall mechanism, we first present our state-of-the-art technologies for multilingual ASR, MT, and TTS components, and then describe how these systems are combined into the proposed network-based framework. The client applications are implemented on handheld mobile terminal devices, and all data exchanges between client users and spoken language technology servers are managed through a Web protocol. To support multiparty communication, an additional communication server simultaneously distributes the speech translation results from one user to multiple users. Field testing shows that the system is capable of realizing multiparty multilingual speech translation for real-time, location-independent communication.
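The abstract describes the architecture only at the level of components and servers. The Python code below is a minimal, offline sketch of that data flow under stated assumptions, not the authors' implementation: it assumes each ASR, MT, and TTS server can be modeled as a local callable looked up by language (ASR, TTS) or by language pair (MT), and it omits the Web protocol entirely. The names `ComponentRegistry`, `CommunicationServer`, `broadcast`, and the toy components are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-ins for remote ASR, MT, and TTS servers. In the framework
# described in the abstract each would be reached over a Web protocol; here
# they are plain callables so the sketch runs offline.
Recognizer = Callable[[bytes], str]    # speech -> source-language text
Translator = Callable[[str], str]      # source text -> target-language text
Synthesizer = Callable[[str], bytes]   # target text -> speech


@dataclass
class ComponentRegistry:
    """Hypothetical directory of distributed components, keyed by language."""
    asr: Dict[str, Recognizer] = field(default_factory=dict)
    mt: Dict[Tuple[str, str], Translator] = field(default_factory=dict)
    tts: Dict[str, Synthesizer] = field(default_factory=dict)

    def recognize(self, audio: bytes, src: str) -> str:
        return self.asr[src](audio)

    def translate_and_speak(self, text: str, src: str, tgt: str) -> Tuple[str, bytes]:
        """MT (skipped for same-language listeners) followed by TTS."""
        translated = text if src == tgt else self.mt[(src, tgt)](text)
        return translated, self.tts[tgt](translated)


@dataclass
class CommunicationServer:
    """Hypothetical fan-out server: one speaker's utterance goes to every other user."""
    registry: ComponentRegistry
    languages: Dict[str, str] = field(default_factory=dict)  # user -> language
    inboxes: Dict[str, List[Tuple[str, str, bytes]]] = field(default_factory=dict)

    def join(self, user: str, lang: str) -> None:
        self.languages[user] = lang
        self.inboxes[user] = []

    def broadcast(self, speaker: str, audio: bytes) -> str:
        src = self.languages[speaker]
        text = self.registry.recognize(audio, src)  # ASR runs once per utterance
        for user, tgt in self.languages.items():
            if user != speaker:
                translated, speech = self.registry.translate_and_speak(text, src, tgt)
                self.inboxes[user].append((speaker, translated, speech))
        return text


# Toy components: "audio" is just UTF-8 text and "speech" is encoded text.
registry = ComponentRegistry(
    asr={"ja": lambda audio: audio.decode("utf-8")},
    mt={("ja", "en"): lambda t: {"konnichiwa": "hello"}.get(t, t),
        ("ja", "id"): lambda t: {"konnichiwa": "halo"}.get(t, t)},
    tts={lang: (lambda t: t.encode("utf-8")) for lang in ("ja", "en", "id")},
)
server = CommunicationServer(registry)
server.join("alice", "ja")
server.join("bob", "en")
server.join("citra", "id")
server.join("daichi", "ja")
server.broadcast("alice", b"konnichiwa")
print(server.inboxes["bob"])    # [('alice', 'hello', b'hello')]
print(server.inboxes["citra"])  # [('alice', 'halo', b'halo')]
```

Running recognition once per utterance and then translation and synthesis once per listener is one plausible way to realize the one-to-many distribution role that the abstract assigns to the communication server; a networked version would replace each callable with a request to the corresponding server.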

Cited By

Ping, L. 2021. Mobile platform speech intelligent recognition and translation APP. In 2021 13th International Conference on Measuring Technology and Mechatronics Automation (ICMTMA). 595--598. Online publication date: Jan. 2021. https://doi.org/10.1109/ICMTMA52658.2021.00137

Tjandra, A., Sakti, S., and Nakamura, S. 2017. Listening while speaking: Speech chain by deep learning. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). 301--308. Online publication date: Dec. 2017. https://doi.org/10.1109/ASRU.2017.8268950

Information

Published In

ACM Transactions on Speech and Language Processing Volume 9, Issue 2

July 2012

58 pages

ISSN:1550-4875

EISSN:1550-4883

DOI:10.1145/2287710

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 02 August 2012

Accepted: 01 March 2012

Revised: 01 February 2012

Received: 01 August 2011

Published in TSLP Volume 9, Issue 2

Qualifiers

Research-article
Research
Refereed

Bibliometrics

Total Citations: 2

Total Downloads: 395

Downloads (last 12 months): 5

Downloads (last 6 weeks): 0

Reflects downloads up to 01 Mar 2025.
