Abstract
With increasing globalization, communication across language and cultural boundaries is becoming an essential requirement of doing business, delivering education, and providing public services. Because human translation services are costly, only a small fraction of text documents and an even smaller percentage of spoken encounters, such as international meetings and conferences, are translated; most either resort to a common language (e.g. English) or do not take place at all. Technology may provide a potentially revolutionary way out if real-time, domain-independent, simultaneous speech translation can be realized. In this paper, we present a simultaneous speech translation system based on statistical recognition and translation technology. We discuss the technology and various system improvements, and propose mechanisms for user-friendly delivery of the result. Based on extensive component and end-to-end system evaluations and comparisons with human translation performance, we conclude that machines can already deliver comprehensible simultaneous translation output. Moreover, while machine performance is affected by recognition errors (and thus can be improved), human performance is limited by the cognitive challenge of performing the task in real time.
Abbreviations
- AMI: Meeting transcription data from Augmented Multi-party Interaction (see Table 3)
- ASR: Automatic speech recognition
- BN: Data from broadcast news corpora (see Table 3)
- CHIL: Computers in the human interaction loop
- cMLLR: Constrained maximum likelihood linear regression
- DG: Directorate general
- EC: European Commission
- EM: Expectation maximization
- EPPS: European Parliament plenary sessions
- GALE: Global Autonomous Language Exploitation
- GWRD: Data from Gigaword corpus (see Table 3)
- HNSRD: Data from UK Parliament debates (see Table 3)
- ICSI: (Data recorded at) International Computer Science Institute
- JRTk: Janus Recognition Toolkit
- MLLR: Maximum likelihood linear regression
- MT: Machine translation
- MTG: Meeting transcription data (see Table 3)
- NIST: National Institute of Standards and Technology; MT evaluation measure (Doddington 2002); data recorded at NIST
- OOV: Out of vocabulary
- PROC: Data from conference proceedings (see Table 3)
- RTF: Real-time factor
- RT-06S: NIST 2006 Rich Transcription evaluation
- RWTH: Rheinisch-Westfälische Technische Hochschule (Aachen)
- SMNR: Lectures and seminars from the CHIL project
- SMT: Statistical machine translation
- SRI: Stanford Research Institute
- SST: Speech-to-speech translation
- STR-DUST: Speech TRanslation: Domain-Unlimited, Spontaneous and Trainable
- SWB: Data from switchboard transcriptions (see Table 3)
- TC-STAR: Technologies and Corpora for Speech-to-Speech-Translation
- TED: Translanguage English database
- TH: Technische Hochschule
- TTS: Text to speech
- UKA-. . .: See Table 3
- UW-M: See Table 3
- VAD: Voice activity detection
- VTLN: Vocal tract length normalization
- WER: Word error rate
References
Accipio Consulting (2006) Sprachtechnologien für Europa [Language technologies for Europe]. ITC IRST, Trento, Italy. Available at http://www.tc-star.org/pubblicazioni/D17_HLT_DE.pdf. Accessed 29 Oct 2008
Al-Khanji R, El-Shiyab S, Hussein R (2000) On the use of compensatory strategies in simultaneous interpretation. Meta J Traduc 45: 544–557
Atal B (1974) Effectiveness of linear prediction characteristics of speech wave for automatic speaker identification and verification. J Acoust Soc Am 55: 1304–1312
Bahl L, Brown P, de Souza P, Mercer R (1986) Maximum mutual information estimation of hidden Markov model parameters for speech recognition. In: ICASSP ’86, IEEE international conference on acoustics, speech, and signal processing, Tokyo, Japan, pp 49–52
Bain K, Basson S, Faisman A, Kanevsky D (2005) Accessibility, transcription, and access everywhere. IBM Syst J 44: 589–603
Barik HC (1969) A study of simultaneous interpretation. PhD thesis, University of North Carolina at Chapel Hill
Bellegarda JR (2004) Statistical language model adaptation: review and perspectives. Speech Commun 42: 93–108
Black AW, Taylor PA (1997) The Festival speech synthesis system: system documentation. Technical Report HCRC/TR-83, Human Communication Research Centre, University of Edinburgh, Edinburgh, Scotland
Brown PF, Della Pietra SA, Della Pietra VJ, Mercer RL (1994) The mathematics of statistical machine translation: parameter estimation. Comput Linguist 19: 263–311
Bulyko I, Ostendorf M, Stolcke A (2003) Getting more mileage from Web text sources for conversational speech language modeling using class-dependent mixtures. In: HLT-NAACL 2003 Human language technology conference of the North American chapter of the Association for Computational Linguistics, Companion volume: short papers, student research workshop, demonstrations, tutorial abstracts, Edmonton, Alberta, Canada, pp 7–9
Burger S, MacLaren V, Waibel A (2004) ISL meeting speech part 1, catalog nbr LDC2004S05, Linguistic Data Consortium, Philadelphia, PA
Carletta J, Ashby S, Bourban S, Flynn M, Guillemot M, Hain T, Kadlec J, Karaiskos V, Kraaij W, Kronenthal M, Lathoud G, Lincoln M, Lisowska A, McCowan I, Post W, Reidsma D, Wellner P (2005) The AMI meeting corpus: A pre-announcement. In: 2nd joint workshop on multimodal interaction and related machine learning algorithms MLMI 05, Edinburgh, Scotland, pp 28–39
Cettolo M, Falavigna D (1998) Automatic detection of semantic boundaries based on acoustic and lexical knowledge. In: Fifth international conference on spoken language processing, ICSLP’98, Sydney, Australia, pp 1551–1554
Cettolo M, Federico M (2006) Text segmentation criteria for statistical machine translation. In: Salakoski T, Ginter F, Pyysalo S, Pahikkala T (eds) Advances in natural language processing, 5th international conference, FinTAL 2006, Turku, Finland, LNCS 4139. Springer Verlag, Berlin, pp 664–673
Cettolo M, Brugnara F, Federico M (2004) Advances in the automatic transcription of lectures. In: ICASSP 2004, IEEE international conference on acoustics, speech, and signal processing, Montreal, Canada, pp 769–772
Chen CJ (1999) Speech recognition with automatic punctuation. In: Sixth European conference on speech communication and technology (Eurospeech’99), Budapest, Hungary, pp 447–450
de Mori R, Federico M (1999) Language model adaptation. In: Ponting K (ed) Computational models of speech pattern processing. Springer Verlag, Berlin, pp 280–303
Doddington G (2002) Automatic evaluation of MT quality using n-gram co-occurrence statistics. In: Proceedings of human language technology conference 2002, San Diego, CA, pp 138–145
Eide E, Gish H (1996) A parametric approach to vocal tract length normalization. In: 1996 IEEE international conference on acoustics, speech, and signal processing, Atlanta, Georgia, pp 346–348
Finke M, Geutner P, Hild H, Kemp T, Ries K, Westphal M (1997) The Karlsruhe-verbmobil speech recognition engine. In: 1997 IEEE international conference on acoustics, speech, and signal processing (ICASSP’97), Munich, Germany, pp 83–86
Fiscus J (1997) A post-processing system to yield reduced word error rates: recogniser output voting error reduction (ROVER). In: Proceedings of the 1997 IEEE workshop on automatic speech recognition and understanding, Santa Barbara, CA, pp 347–352
Fiscus J, Garofolo J, Przybocki M, Fisher W, Pallett D (1998) 1997 English broadcast news speech (HUB4), catalog nbr LDC98S71, Linguistic Data Consortium, Philadelphia, PA
Foster G, Kuhn R, Johnson H (2006) Phrasetable smoothing for statistical machine translation. In: EMNLP 2006 conference on empirical methods in natural language processing, Sydney, Australia, pp 53–61
Fritsch J, Rogina I (1996) The bucket box intersection (BBI) algorithm for fast approximative evaluation of diagonal mixture Gaussians. In: 1996 IEEE international conference on acoustics, speech, and signal processing, Atlanta, Georgia, pp 837–840
Fügen C, Kolss M (2007) The influence of utterance chunking on machine translation performance. In: Interspeech 2007, 8th annual conference of the International Speech Communication Association, Antwerp, Belgium, pp 2837–2840
Fügen C, Westphal M, Schneider M, Schultz T, Waibel A (2001) LingWear: a mobile tourist information system. In: Proceedings of the first international conference on human language technology research, San Diego, California, 5 pp
Fügen C, Ikbal S, Kraft F, Kumatani K, Laskowski K, McDonough JW, Ostendorf M, Stüker S, Wölfel M (2006a) The ISL RT-06S speech-to-text system. In: Renals et al (2006), pp 407–418
Fügen C, Kolss M, Paulik M, Waibel A (2006b) Open domain speech translation: from seminars and speeches to lectures. In: TC-STAR workshop on speech to speech translation, Barcelona, Spain, pp 81–86
Furui S (1986) Cepstral analysis technique for automatic speaker verification. IEEE T Acoust Speech Signal Proc 34:52–59
Furui S (2005) Recent progress in corpus-based spontaneous speech recognition. IEICE T Inform Syst E88-D:366–375
Furui S (2007) Recent advances in automatic speech summarization. In: Symposium on large-scale knowledge resources (LKR 2007), Tokyo, Japan, pp 49–54
Gales MJF (1998) Maximum likelihood linear transformations for HMM-based speech recognition. Comput Speech Lang 12: 75–98
Garofolo JS, Michel M, Stanford VM, Tabassi E, Fiscus J, Laprun CD, Pratz N, Lard J (2004) NIST meeting pilot corpus speech, catalog nbr LDC2004S09, Linguistic Data Consortium, Philadelphia, PA
Geutner P, Finke M, Scheytt P (1998) Adaptive vocabularies for transcribing multilingual broadcast news. In: Proceedings of the 1998 IEEE international conference on acoustics, speech, and signal processing (ICASSP ’98), Seattle, Washington, pp 925–928
Glass J, Hazen TJ, Cyphers S, Malioutov I, Huynh D, Barzilay R (2007) Recent progress in the MIT spoken lecture processing project. In: Interspeech 2007, 8th annual conference of the International Speech Communication Association, Antwerp, Belgium, pp 2553–2556
Godfrey JJ, Holliman E (1993) Switchboard-1 transcripts, catalog nbr LDC93T4, Linguistic Data Consortium, Philadelphia, PA
Gollan C, Bisani M, Kanthak S, Schlüter R, Ney H (2005) Cross domain automatic transcription on the TC-STAR EPPS corpus. In: ICASSP, 2005 IEEE conference on acoustics, speech, and signal processing, Philadelphia, PA, pp 825–828
Gollan C, Hahn S, Schlüter R, Ney H (2007) An improved method for unsupervised training of LVCSR systems. In: Interspeech 2007, 8th annual conference of the International Speech Communication Association, Antwerp, Belgium, pp 2101–2104
Graff D (1994) UN parallel text (complete), catalog nbr LDC94T4A, Linguistic Data Consortium, Philadelphia, PA
Graff D (2003) English gigaword, catalog nbr LDC2003T05, Linguistic Data Consortium, Philadelphia, PA
Graff D, Garofolo J, Fiscus J, Fisher W, Pallett D (1997) 1996 English broadcast news speech (HUB4), catalog nbr LDC97S44, Linguistic Data Consortium, Philadelphia, PA
Hamon O, Mostefa D, Choukri K (2007) End-to-end evaluation of a speech-to-speech translation system in TC-STAR. In: Machine translation summit XI, Copenhagen, Denmark, pp 223–230
Henderson JA (1982) Some psychological aspects of simultaneous interpreting. Incorp Ling 21(4): 149–150
Hendricks PV (1971) Simultaneous interpreting: a practical book. Longman, London
Huang J, Zweig G (2002) Maximum entropy model for punctuation annotation from speech. In: 7th international conference on spoken language processing (ICSLP 2002, Interspeech 2002), Denver, Colorado, pp 917–920
Huang J, Westphal M, Chen SF, Siohan O, Povey D, Libal V, Soneiro A, Schulz H, Ross T, Potamianos G (2006) The IBM rich transcription spring 2006 speech-to-text system for lecture meetings. In: Renals et al (2006), pp 432–443
Janin A, Edwards J, Ellis D, Gelbart D, Morgan N, Peskin B, Pfau T, Shriberg E, Stolcke A, Wooters C (2004) ICSI meeting speech, catalog nbr LDC2004S02, Linguistic Data Consortium, Philadelphia, PA
Jones R (1998) Conference interpreting explained. St. Jerome Publishing, Manchester
Kim J-H, Woodland PC (2001) The use of prosody in a combined system for punctuation generation and speech recognition. In: Eurospeech 2001 Scandinavia, 7th European conference on speech communication and technology, 2nd Interspeech event, Aalborg, Denmark, pp 2757–2760
Klakow D, Peters J (2002) Testing the correlation of word error rate and perplexity. Speech Commun 38: 19–28
Koehn P, Axelrod A, Mayne AB, Callison-Burch C, Osborne M, Talbot D (2005) Edinburgh system description for the 2005 IWSLT speech translation evaluation. In: Proceedings of international workshop on spoken language translation, Pittsburgh, PA
Kolss M, Zhao B, Vogel S, Hildebrand AS, Niehues J, Venugopal A, Zhang Y (2006) The ISL statistical machine translation system for the TC-STAR spring 2006 evaluation. In: TC-STAR workshop on speech to speech translation, Barcelona, Spain
Kopczynski A (1994) Bridging the gap: empirical research in simultaneous interpretation. John Benjamins, Amsterdam/Philadelphia
Lamel LF, Schiel F, Fourcin A, Mariani J, Tillmann HG (1994) The translanguage English database TED. In: Third international conference on spoken language processing (ICSLP 94), Yokohama, Japan, pp 1795–1798
Lamel L, Bilinski E, Adda G, Gauvain J-L, Schwenk H (2006) The LIMSI RT06s lecture transcription system. In: Renals et al. (2006), pp 457–468
Lamel L, Gauvain J-L, Adda G, Barras C, Bilinski E, Galibert O, Pujol A, Schwenk H, Zhu X (2007) The LIMSI 2006 TC-STAR EPPS transcription systems. In: ICASSP 2007, international conference on acoustics, speech, and signal processing, Honolulu, Hawaii, pp 997–1000
Lederer M (1978) Simultaneous interpretation: units of meaning and other features. In: Gerver D, Sinaiko HW (eds) Language interpretation and communication. Plenum Press, New York, pp 323–332
Leggetter CJ, Woodland PC (1995) Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Comput Speech Lang 9: 171–185
Liu Y (2004) Structural event detection for rich transcription of speech. PhD thesis, Purdue University, West Lafayette, IN
Lööf J, Bisani M, Gollan C, Heigold G, Hoffmeister B, Plahl C, Schlüter R, Ney H (2006) The 2006 RWTH parliamentary speeches transcription system. In: Interspeech 2006 – ICSLP, ninth international conference on spoken language processing, Pittsburgh, PA, pp 105–108
Mani I (2001) Automatic summarization. John Benjamins, Amsterdam
Matusov E, Leusch G, Bender O, Ney H (2005) Evaluating machine translation output with automatic sentence segmentation. In: Proceedings of international workshop on spoken language translation, Pittsburgh, PA
Matusov E, Mauser A, Ney H (2006) Automatic sentence segmentation and punctuation prediction for spoken language translation. In: International workshop on spoken language translation, Kyoto, Japan, pp 158–165
Matusov E, Leusch G, Banchs RE, Bertoldi N, Déchelotte D, Federico M, Kolss M, Lee Y-S, Mariño JB, Paulik M, Roukos S, Schwenk H, Ney H (2008) System combination for machine translation of spoken and written language. IEEE T Audio Speech Lang Proc 16: 1222–1237
Morimoto T, Takezawa T, Yato F, Sagayama S, Tashiro T, Nagata M, Kurematsu A (1993) ATR’s speech translation system: ASURA. In: European conference on speech communication and technology 1993, Eurospeech 1993, Berlin, Germany, pp 1291–1294
Moser-Mercer B, Kunzli A, Korac M (1998) Prolonged turns in interpreting: effects on quality, physiological and psychological stress (Pilot Study). Interpreting: Int J Res Prac Interpreting 3:47–64
Normandin Y (1991) Hidden Markov models, maximum mutual information estimation and the speech recognition problem. PhD thesis, McGill University, Montreal, Quebec, Canada
Och FJ (2003) Minimum error rate training in statistical machine translation. In: 41st annual meeting of the Association for Computational Linguistics, Sapporo, Japan, pp 160–167
Och FJ, Ney H (2003) A systematic comparison of various statistical alignment models. Comput Linguist 29:19–51
Olszewski D, Prasetyo F, Linhard K (2005) Steerable highly directional audio beam loudspeaker. In: Interspeech’2005 – Eurospeech, Lisboa, Portugal, pp 137–140
Papineni K, Roukos S, Ward T, Zhu W (2002) Bleu: a method for automatic evaluation of machine translation. In: 40th annual meeting of the Association for Computational Linguistics, Philadelphia, Pennsylvania, pp 311–318
Paulik M, Waibel A (2008) Extracting clues from human interpreter speech for spoken language translation. In: ICASSP 2008 IEEE international conference on acoustics, speech, and signal processing, Las Vegas, Nevada, pp 5097–5100
Ramabhadran B, Huang J, Picheny M (2003) Towards automatic transcription of large spoken archives – English ASR for the Malach project. In: Proceedings of the 2003 IEEE conference on acoustics, speech, and signal processing (ICASSP 2003), Hong Kong, China, pp 216–219
Ramabhadran B, Siohan O, Mangu L, Zweig G, Westphal M, Schulz H, Soneiro A (2006) The IBM 2006 speech transcription system for European Parliamentary speeches. In: Interspeech 2006 – ICSLP, ninth international conference on spoken language processing, Pittsburgh, PA, pp 1225–1228
Rao S, Lane I, Schultz T (2007) Optimizing sentence segmentation for spoken language translation. In: Interspeech 2007, 8th annual conference of the International Speech Communication Association, Antwerp, Belgium, pp 2845–2848
Renals S, Bengio S, Fiscus J (eds) (2006) Machine learning for multimodal interaction: third international workshop, MLMI 2006, Bethesda. Revised selected papers, LNCS 4299, Springer Verlag, Berlin
Rogina I, Schaaf T (2002) Lecture and presentation tracking in an intelligent meeting room. In: 4th IEEE international conference on multimodal interfaces (ICMI 2002), Pittsburgh, PA, pp 47–52
Roukos S, Graff D, Melamed D (1995) Hansard French/English, catalog nbr LDC95T20, Linguistic Data Consortium, Philadelphia, PA
Seleskovitch D (1978) Interpreting for international conferences: problems of language and communication. Pen & Booth, Washington DC
Snover M, Dorr B, Schwartz R, Micciulla L, Makhoul J (2006) A study of translation edit rate with targeted human annotation. In: Proceedings of the 7th conference of Association for Machine Translation in the Americas: visions for the future of machine translation, Cambridge, Massachusetts, pp 223–231
Soltau H, Metze F, Fügen C, Waibel A (2001) A one-pass decoder based on polymorphic linguistic context assignment. In: ASRU 2001, automatic speech recognition and understanding workshop, Madonna di Campiglio, Trento, Italy, pp 214–217
Soltau H, Yu H, Metze F, Fügen C, Jin Q, Jou S-C (2004) The 2003 ISL rich transcription system for conversational telephony speech. In: ICASSP 2004, IEEE international conference on acoustics, speech, and signal processing, Montreal, Canada, pp 773–776
Stolcke A (2002) SRILM – an extensible language modeling toolkit. In: 7th international conference on spoken language processing (ICSLP 2002, Interspeech 2002), Denver, Colorado, pp 901–904
Stüker S, Fügen C, Hsiao R, Ikbal S, Jin Q, Kraft F, Paulik M, Raab M, Tam Y-C, Wölfel M (2006) The ISL TC-STAR spring 2006 ASR evaluation systems. In: TC-STAR workshop on speech to speech translation, Barcelona, Spain, pp 139–144
Stüker S, Paulik M, Kolss M, Fügen C, Waibel A (2007) Speech translation enhanced ASR for European Parliament speeches – on the influence of ASR performance on speech translation. In: ICASSP 2007, international conference on acoustics, speech, and signal processing, Honolulu, Hawaii, pp 1293–1296
Trancoso I, Nunes R, Neves L (2006) Classroom lecture recognition. In: Vieira R, Quaresma P, Nunes MdGV, Mamede NJ, Oliveira C, Dias MC (eds) Computational processing of the Portuguese language, 7th international workshop, PROPOR 2006, Itatiaia, Brazil, LNCS 3960, Springer Verlag, Berlin, pp 190–199
Vidal M (1997) New study on fatigue confirms need for working in teams. Proteus Newsl NAJIT 6.1
Vogel S (2003) SMT decoder dissected: word reordering. In: International conference on natural language processing and knowledge engineering, Beijing, China, pp 561–566
Vogel S (2005) PESA: phrase pair extraction as sentence splitting. In: MT summit X, the tenth machine translation summit, Phuket, Thailand, pp 251–258
Vogel S, Ney H, Tillmann C (1996) HMM-based word alignment in statistical translation. In: COLING-96, the 16th international conference on computational linguistics, Copenhagen, Denmark, pp 836–841
Waibel A, Fügen C (2008) Spoken language translation. IEEE Signal Proc Mag 25(3): 70–79
Waibel A, Stiefelhagen R (eds) (2009) Computers in the human interaction loop. Springer Verlag, Berlin
Waibel A, Jain AN, McNair AE, Saito H, Hauptmann AG, Tebelskis J (1991) JANUS: a speech-to-speech translation system using connectionist and symbolic processing strategies. In: ICASSP-91, proceedings of the international conference on acoustics, speech, and signal processing, Toronto, Canada, pp 793–796
Waibel A, Steusloff H, Stiefelhagen R, the CHIL Project Consortium (2004) CHIL – computers in the human interaction loop. In: WIAMIS 2004, 5th international workshop on image analysis for multimedia interactive services, Lisbon, Portugal, 4 pp
Yagi SM (2000) Studying style in simultaneous interpretation. Meta J Traduc 45: 520–547
Yuan J, Liberman M, Cieri C (2006) Towards an integrated understanding of speaking rate in conversation. In: Interspeech 2006 – ICSLP, ninth international conference on spoken language processing, Pittsburgh, Pennsylvania, paper Mon3A3O-1
Zechner K (2002) Summarization of spoken language – challenges, methods, and prospects. Speech Technol Expert eZine, 6
Zhan P, Westphal M (1997) Speaker normalization based on frequency warping. In: 1997 IEEE international conference on acoustics, speech, and signal processing (ICASSP’97), Munich, Germany, p 1039
Cite this article
Fügen, C., Waibel, A. & Kolss, M. Simultaneous translation of lectures and speeches. Machine Translation 21, 209–252 (2007). https://doi.org/10.1007/s10590-008-9047-0