Computer Speech & Language

Volume 27, Issue 2, February 2013, Pages 509-527

A-STAR: Toward translating Asian spoken languages

https://doi.org/10.1016/j.csl.2011.07.001

Abstract

This paper outlines the first Asian network-based speech-to-speech translation system, developed by the Asian Speech Translation Advanced Research (A-STAR) consortium. Eight research groups comprising the A-STAR members participated in the experiments, covering nine languages: eight Asian languages (Hindi, Indonesian, Japanese, Korean, Malay, Thai, Vietnamese, and Chinese) and English. Each A-STAR member contributed one or more of the following spoken language technologies through Web servers: automatic speech recognition, machine translation, and text-to-speech. The system was designed to translate common spoken utterances of travel conversations from a given source language into multiple target languages in order to facilitate multiparty travel conversations between people speaking different Asian languages. It covers travel expressions, including proper nouns that are names of famous places or attractions in Asian countries. In this paper, we describe the issues involved in developing spoken language technologies for Asian languages and discuss the difficulties of connecting heterogeneous spoken language translation systems through Web servers. The paper also presents speech-translation results, including a subjective evaluation, from the first A-STAR field testing, which was carried out in July 2009.

Highlights

► The first Asian network-based speech-to-speech translation system, developed by the A-STAR consortium.
► A-STAR field-testing experiments were carried out in July 2009, covering eight Asian languages and English.
► All the speech-to-speech translation engines have been successfully deployed on Web servers that can be accessed via client applications worldwide.

Introduction

With an area of 17 million square miles, Asia is the world's largest and most populous continent, home to approximately four billion people, or 60% of the world's current human population. A wide variety of societies, religions, and ethnicities shape the culture of Asia. There is also great linguistic diversity: the majority of Asian countries have more than one natively spoken language, and of the more than 6800 living languages in the world, one-third are found in Asia (Lewis, 2009).

As we enter the 21st century, information exchange among people traveling to and from Asia is increasing. Meanwhile, the destinations of international travelers – whether for tourism, emigration, or foreign study – are becoming increasingly diverse. The Asian region in particular has witnessed a strengthening of its social and economic relationships, and enhancing mutual understanding and economic relations has thus become a key challenge; socio-economic ties in the region are more vital today than ever before. These changes are increasing the need for means that facilitate interaction among people in Asia who speak different languages. The language barriers between Asian countries are a critical problem to overcome.

Spoken language translation is one of the innovative technologies that enable people to communicate with each other by speaking in their own languages. However, translating a spoken language is an extremely complex task. The technology involves research in automatic speech recognition (ASR), machine translation (MT), and text-to-speech generation (TTS), and doing this for many languages imposes an even greater challenge. Ideally, to create effective systems, many different countries would conduct joint research, and a number of research groups are currently making progress in a bid to break down the language barrier. As one of the first examples of international cooperation, the Consortium for Speech TrAnslation Research (C-STAR) was formed as a voluntary group of institutions committed to building speech translation systems. Institutions in European countries have also formed the TC-STAR (Technology and Corpora for Speech-to-speech Translation) consortium, with the objective of translating public speeches and discussions in international meetings. However, such activities are still rare in the Asian community. Therefore, the Asian Speech Translation Advanced Research (A-STAR) consortium was established in June 2006 as a speech translation consortium for creating the basic infrastructure for spoken language communication technologies and overcoming language barriers in Asia (Nakamura et al., 2007).

The A-STAR consortium was founded by the National Institute of Information and Communications Technology/Advanced Telecommunications Research (NICT/ATR), Japan, in collaboration with other research institutes in Asia. Currently, the A-STAR consortium consists of the following members: the Electronics and Telecommunication Research Institute (ETRI) in Korea, the National Electronics and Computer Technology Center (NECTEC) in Thailand, the Institute of Automation, Chinese Academy of Sciences (CASIA) in China, the Agency for Assessment and Application Technology (BPPT) in Indonesia, the Center for Development of Advanced Computing (CDAC) in India, the Institute of Information Technology (IOIT) in Vietnam, and the Institute for Infocomm Research (I2R) in Singapore. The consortium is working collaboratively to collect Asian language corpora, create common speech recognition and translation dictionaries, develop Web service speech translation modules for various Asian languages, and standardize interfaces and data formats that facilitate international connections between the different speech translation modules from different countries. The main objective is to create a network-based speech-to-speech translation system in the Asian region, as illustrated in Fig. 1.

In this paper, we outline the development of the Asian network-based speech-to-speech translation system. The system was designed to translate common spoken utterances of travel conversations from a given source language into multiple target languages in order to facilitate multiparty travel conversations between people speaking different Asian languages. Each A-STAR member contributed one or more of the following spoken language technologies: ASR, MT, and TTS through Web servers. Currently, the system successfully covers nine languages: Hindi (Hi), Indonesian (Id), Japanese (Ja), Korean (Ko), Malay (Ms), Thai (Th), Vietnamese (Vi), Chinese (Zh) and English (En). The system covers travel expressions including proper nouns such as the names of famous places or attractions in Asian countries.

The rest of this paper is organized as follows. Section 2 gives an overview of Asian language processing. Section 3 describes the development of the spoken language translation engines for ASR, MT, and TTS. Issues related to handling the additional proper nouns are discussed in Section 4. Next, the overall architecture of the A-STAR speech-to-speech translation system is presented in Section 5, where we also describe the standardized data format and client application. Details of the first A-STAR demo experiments and the speech-translation results are reported and discussed in Section 6. Some highlights of the challenges we faced in multiparty multilingual speech-to-speech translation are summarized in Section 7. Finally, our conclusion is presented in Section 8.


Asian language processing

One of the first stages of developing a spoken language technology is the design, collection, and annotation of data corpora. However, processing natural Asian languages into a suitably structured data format poses a great challenge.

Table 1 summarizes the characteristics of the eight Asian languages (LNG) used in this experiment; English is also included as a reference. It shows that the characteristics of the Asian languages are highly diverse in many respects. In contrast to Western languages that delimit words with spaces, several of these languages (e.g., Thai, Chinese, and Japanese) are written without explicit word boundaries.
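Such scripts must therefore be word-segmented before ASR or MT training; the references cite, for example, the CRF-based TLex analyser for Thai and statistical word-boundary identification. As a purely illustrative sketch of the underlying problem, not of the paper's methods, the classic dictionary-based longest-match baseline looks as follows (the toy lexicon and the Python rendering are our assumptions):

# Minimal sketch: greedy longest-match (maximum-matching) segmentation,
# the classic baseline for scripts written without spaces. The A-STAR
# engines use statistical segmenters instead (e.g., the CRF-based TLex
# analyser cited in the references); the toy lexicon is illustrative only.
def segment(text: str, lexicon: set, max_word_len: int = 4) -> list:
    """Greedily match the longest lexicon entry at each position."""
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                words.append(candidate)  # single characters are the fallback
                i += length
                break
    return words

# Toy Chinese example ("I like Peking duck"):
print(segment("我喜欢北京烤鸭", {"我", "喜欢", "北京", "烤鸭"}))
# -> ['我', '喜欢', '北京', '烤鸭']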

Spoken language technologies

The development of the A-STAR spoken language technologies is described in the following sections. Each A-STAR member contributes its ASR, MT, and TTS systems. The systems may be trained on any available corpora; there is no restriction on the type of resources used.

Multiparty dialog scenario and proper nouns

We have also collected additional multiparty dialog scenarios, including well-known proper nouns, as follows:

  • Proper nouns

The proper nouns are collected from ten countries: India, Indonesia, Japan, Korea, Malaysia, Singapore, Thailand, Vietnam, China, and the United States. They mainly comprise city names (e.g., Kyoto in Japan, Beijing in China), tourist areas (e.g., Bulkuksa in Korea, Wat Pra Kaew in Thailand), attractions (e.g., Wayang kulit in Indonesia, Kathak in India), and similar nouns.
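Keeping such entries usable across all nine languages suggests a shared dictionary in which each proper noun carries one surface form per language, so that every ASR, MT, and TTS engine renders the same named entity consistently. The following is a minimal, hypothetical data-structure sketch, not the paper's actual format; the class, fields, and example forms are our assumptions:

# Hypothetical sketch of a shared proper-noun entry (not the paper's
# actual dictionary format): one language-independent key plus one
# surface form per language, keyed by ISO 639-1 code.
from dataclasses import dataclass, field

@dataclass
class ProperNoun:
    entry_id: str                                 # language-independent key
    category: str                                 # e.g., "city", "tourist_area"
    surface: dict = field(default_factory=dict)   # ISO 639-1 code -> form

kyoto = ProperNoun(
    entry_id="kyoto_jp",
    category="city",
    surface={"en": "Kyoto", "ja": "京都", "zh": "京都"},
)
# Fall back to the English form when a language-specific one is missing.
print(kyoto.surface.get("th", kyoto.surface["en"]))  # -> Kyoto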

Integration of network-based speech-to-speech translation

The spoken language technologies, i.e., the ASR, MT, and TTS engines described in Section 3, are provided by A-STAR members through Web servers and form a unified multilingual network-based speech-to-speech translation system. The overall structure is illustrated in Fig. 2, and the component systems are described in the following sections.
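As a rough illustration of how such a chained pipeline can be driven from a client, the sketch below posts one utterance through hypothetical ASR, MT, and TTS endpoints. The URLs, payload fields, and plain-JSON encoding are assumptions made for illustration; the actual system exchanges standardized messages such as STML (Kimura, 2008), described in Section 5:

# Hypothetical client-side sketch of the ASR -> MT -> TTS chain. The
# endpoint URLs, payload fields, and JSON encoding are assumptions; the
# real servers exchange a standardized message format (STML).
import requests  # third-party HTTP client

ASR_URL = "http://asr.example.org/recognize"    # hypothetical endpoints
MT_URL  = "http://mt.example.org/translate"
TTS_URL = "http://tts.example.org/synthesize"

def speech_to_speech(audio: bytes, src: str, tgt: str) -> bytes:
    """Source-language speech in, target-language speech out."""
    # 1. ASR: source-language audio -> source-language text
    text = requests.post(ASR_URL, params={"lang": src}, data=audio).json()["text"]
    # 2. MT: source-language text -> target-language text
    translated = requests.post(
        MT_URL, json={"src": src, "tgt": tgt, "text": text}
    ).json()["text"]
    # 3. TTS: target-language text -> target-language audio
    return requests.post(TTS_URL, json={"lang": tgt, "text": translated}).content

# In a multiparty dialog, one utterance fans out to several targets, e.g.:
# for tgt in ("ja", "ko", "zh"):
#     play(speech_to_speech(thai_audio, src="th", tgt=tgt))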

Evaluation of translation server components

The first A-STAR demo experiments connecting spoken language technologies across the Asian region were carried out in July 2009. Two evaluations, one based on basic travel expression (BTEC) sentences and one on dialog utterances, were conducted as described in the following sections.
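For the subjective part of the evaluation, the references cite the CCITT (1984) Absolute Category Rating method, a five-point opinion scale whose average over listeners yields a mean opinion score (MOS). A minimal sketch of that computation, using invented placeholder ratings rather than any result from the paper:

# Minimal sketch: mean opinion score (MOS) over five-point Absolute
# Category Rating judgments (CCITT, 1984). The ratings are invented
# placeholders, not results reported in the paper.
from statistics import mean

ACR_SCALE = {1: "bad", 2: "poor", 3: "fair", 4: "good", 5: "excellent"}

ratings = [4, 3, 5, 4, 4, 3]                  # hypothetical listener judgments
mos = mean(ratings)
print(f"MOS = {mos:.2f} ({ACR_SCALE[round(mos)]})")  # -> MOS = 3.83 (good)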

Challenges in multiparty multilingual speech-to-speech translation

In addition to the questionnaire described above, users were able to provide feedback on problems that occurred during the A-STAR demo experiments. Here are some highlights of the challenges we faced in multiparty multilingual speech-to-speech translation.

Conclusion

The first Asian network-based speech-to-speech translation experiments were performed in July 2009. Eight research groups comprising the A-STAR consortium members participated in the experiments, covering eight Asian languages and English. All the speech-to-speech translation engines have been successfully deployed on Web servers that can be accessed via client applications worldwide. This implementation has achieved the desired objective of real-time, location-free

References (54)

  • Kwon, O., et al., 2003. Korean large vocabulary continuous speech recognition with morpheme-based recognition units. Speech Communication.
  • Zhang, J., et al., 2008. An introduction to the Chinese speech recognition front-end of the NICT/ATR multi-lingual speech translation system. Tsinghua Science and Technology.
  • Arora, E.S.K., et al. Handling of out-of-vocabulary words in phrase-based statistical machine translation for Hindi–Japanese.
  • Arora, S., et al. Development of HMM-based Hindi speech synthesis system.
  • Asahara, M., et al. Extended models and tools for high performance part-of-speech tagger.
  • Aw, A., et al. Piramid: Bahasa Indonesia and Bahasa Malaysia translation system enhanced through comparable corpora.
  • Bangcharoensap, P., et al. A statistical-machine-translation approach to word boundary identification: a projective analogy of bilingual translation.
  • Budiono, H.R., et al. Resource report: building parallel text corpora for multi-domain translation system.
  • CCITT, 1984. Absolute Category Rating (ACR) Method for Subjective Testing of Digital Processors.
  • Chen, B., et al. I2R multi-pass machine translation system for IWSLT 2008.
  • Finch, A., et al. A Bayesian model of bilingual segmentation for transliteration.
  • Finch, A., et al. The NICT/ATR speech translation system for IWSLT 2007.
  • Haruechaiyasak, C., et al. TLex: Thai lexeme analyser based on the conditional random fields.
  • Hong, M., et al. Customizing a Korean–English MT system for patent translation.
  • Hu, X., et al. Construction of Chinese segmented and POS-tagged conversational corpora and their evaluations on spontaneous speech recognitions.
  • Huang, X., et al., 2001. Spoken Language Processing.
  • Jitsuhiro, T., et al., 2004. Automatic generation of non-uniform HMM topologies based on the MDL criterion. IEICE Trans. Inf. & Syst.
  • Kasuriya, S., et al. Thai speech database for speech recognition.
  • Kawai, H., et al. XIMERA: a new TTS from ATR based on corpus-based technologies.
  • Kikui, G., et al., 2006. Comparative study on corpora for speech translation. IEEE Transactions on Audio, Speech, and Language Processing.
  • Kimura, N., 2008. Speech Translation Markup Language (STML) Version 1.00. Tech. Rep., NICT, Kyoto, ...
  • Koehn, P., et al. Statistical phrase-based translation.
  • Koehn, P., et al. Moses: open source toolkit for statistical machine translation.
  • Lee, K., et al., 2006. A method for English–Korean target word selection using multiple knowledge sources. IEICE Trans. Fundamentals.
  • Lee, I., et al. An overview of Korean–English speech-to-speech translation system.
  • Lewis, M.P., 2009. Ethnologue: Languages of the World. SIL International, Dallas, TX, USA. ...
  • Li, H., et al. Query-by-example spoken document retrieval – the Star Challenge 2008.

This paper has been recommended for acceptance by the Guest Editors of 'Speech–Speech Translation'.

1. Currently with the Nara Institute of Science and Technology, Japan.
