Gaps to Bridge in Speech Technology

Németh, Géza

doi:10.1007/978-3-319-11581-8_2

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8773)

Included in the following conference series:

International Conference on Speech and Computer

Abstract

Although there has recently been significant progress in the general usage and acceptance of speech technology in several developed countries, there are still major gaps that prevent the majority of possible users from using speech technology-based solutions daily. In this paper, some of these gaps are listed and some directions for bridging them are proposed. Perhaps the most important gap is the "black box" thinking of software developers. They assume that inputting text into a text-to-speech (TTS) system will result in voice output that is relevant to the given context of the application. In the case of automatic speech recognition (ASR), they expect an accurate text transcription (even punctuation). What is ignored is that even humans are strongly influenced by a priori knowledge of the context, the communication partners, etc. For example, when ASR, machine translation and TTS are combined serially in a speech-to-speech translation system, a male speaker with a slow speaking rate might be represented by a fast female voice at the other end. The science of semantic modelling is still in its infancy. In order to produce successful applications, researchers of speech technology should find ways to build a priori knowledge into the application environment and adapt their technologies and interfaces to the given scenario. This leads us to the gap between generic and domain-specific solutions. For example, intelligibility and speaking rate variability are the most important TTS evaluation factors for visually impaired users, while human-like announcements at a standard rate and speaking style are required for railway station information systems. A growing gap is opening between "large" languages/markets and "small" ones. Another gap is the one between closed and open application environments. For example, there is hardly any mobile operating system that allows TTS output to be redirected into a live telephone conversation. That is a basic need for rehabilitation applications for speech-impaired people. Creating an open platform where "smaller" and "bigger" players of the field could equally plug in their engines/solutions with proper quality assurance and a fair share of income could help the situation. The paper gives some examples of how our teams at BME TMIT try to bridge the gaps listed.
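The speech-to-speech translation example in the abstract can be illustrated with a minimal sketch. It is not the author's system: every name below (`SpeakerContext`, `recognize`, `translate`, `choose_tts_settings`, the 150 words-per-minute default) is a hypothetical stand-in, and the "ASR" and "MT" stages are toy functions. The only point shown is that a cascade which passes speaker information along with the text lets the final TTS stage pick a matching voice and rate, whereas a text-only cascade discards that information.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SpeakerContext:
    """A priori knowledge about the source speaker (hypothetical fields)."""
    gender: str        # e.g. "male" or "female"
    rate_wpm: float    # measured speaking rate, words per minute


@dataclass(frozen=True)
class Utterance:
    text: str
    context: SpeakerContext


def recognize(words: list[str], duration_s: float, gender: str) -> Utterance:
    """Toy stand-in for ASR: keeps rate and gender instead of returning text only."""
    rate = 60.0 * len(words) / duration_s
    return Utterance(" ".join(words), SpeakerContext(gender, rate))


def translate(utt: Utterance, lexicon: dict[str, str]) -> Utterance:
    """Toy stand-in for MT: word-by-word lookup; the context passes through unchanged."""
    words = [lexicon.get(w, w) for w in utt.text.split()]
    return Utterance(" ".join(words), utt.context)


def choose_tts_settings(utt: Utterance, default_rate_wpm: float = 150.0) -> dict:
    """Toy stand-in for TTS configuration: derive voice and relative rate from context."""
    return {
        "text": utt.text,
        "voice": f"{utt.context.gender}_voice",  # hypothetical voice naming scheme
        "rate_scale": utt.context.rate_wpm / default_rate_wpm,
    }
```

For instance, `choose_tts_settings(translate(recognize(["jó", "napot"], 2.0, "male"), {"jó": "good", "napot": "day"}))` would request a male voice at a reduced rate, rather than whatever default voice the TTS engine happens to ship with.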


Author information

Authors and Affiliations

Department of Telecommunications and Media Informatics (TMIT), Budapest University of Technology and Economics (BME), Hungary
Géza Németh

Editor information

Editors and Affiliations

Speech and Multimodal Interfaces Laboratory, St. Petersburg Institute of Informatics and Automation of the Russian Academy of Sciences, 39, 14th line, 199178, St. Petersburg, Russia
Andrey Ronzhin
Institute of Applied and Mathematical Linguistics, Moscow State Linguistic University, 38, Ostozhenka, 119034, Moscow, Russia
Rodmonga Potapova
Faculty of Technical Sciences, University of Novi Sad, 6, Trg Dositeja Obradovića, 21000, Novi Sad, Serbia
Vlado Delic

About this paper

Cite this paper

Németh, G. (2014). Gaps to Bridge in Speech Technology. In: Ronzhin, A., Potapova, R., Delic, V. (eds) Speech and Computer. SPECOM 2014. Lecture Notes in Computer Science, vol 8773. Springer, Cham. https://doi.org/10.1007/978-3-319-11581-8_2

DOI: https://doi.org/10.1007/978-3-319-11581-8_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11580-1
Online ISBN: 978-3-319-11581-8
eBook Packages: Computer Science, Computer Science (R0)
