DOI: 10.1145/2857218.2857234

Robust deep-learning models for text-to-speech synthesis support on embedded devices

Published: 25 October 2015

Abstract

Smartphones and tablets are now firmly embedded in our daily lives. These devices have an entire ecosystem devoted to them, with applications and tools designed around their constraints: they use touch-enabled interfaces and give each app only a limited amount of memory and CPU time (a 16/32 MB limit on Android and iOS devices). A well-established research domain is the development of natural human-computer interfaces (HCI) based on voice and gestures. However, such interfaces are bound by the hardware resources available to them, and they typically rely on network/Internet access to send and receive data, delegating the decision-making process to dedicated servers. This paper focuses on the development of small, robust deep-learning models designed to provide high-quality text-to-speech (TTS) functionality (one of the three main components of HCI) on smart devices, without requiring network access. We obtain very good results on the text-processing sub-tasks of TTS using models significantly smaller than those used in state-of-the-art approaches.
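
To make the on-device constraint concrete, below is a minimal sketch (not the authors' code; the toy word list, window size, and network dimensions are all illustrative assumptions) of how one of the TTS text sub-tasks the paper targets, syllabification, can be cast as per-character classification with a deliberately tiny neural network, trained here with plain batch gradient descent:

```python
# Minimal sketch, NOT the authors' implementation: a tiny MLP that tags
# each character of a word with "a syllable boundary follows here" (1)
# or not (0), using a fixed window of surrounding characters as input.
# The word list, window size and layer sizes are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Toy training data: '-' marks syllable boundaries.
WORDS = ["ta-ble", "ro-bust", "mo-del", "de-vice", "syn-the-sis"]
WINDOW = 2   # characters of context on each side of the current one
PAD = "_"

def examples(word):
    """Yield (character window, label) pairs for one hyphenated word."""
    plain = word.replace("-", "")
    bounds, i = set(), 0
    for ch in word:
        if ch == "-":
            bounds.add(i - 1)   # boundary follows the previous character
        else:
            i += 1
    padded = PAD * WINDOW + plain + PAD * WINDOW
    for j in range(len(plain)):
        yield padded[j : j + 2 * WINDOW + 1], 1.0 if j in bounds else 0.0

# One-hot encode each character window.
data = [ex for w in WORDS for ex in examples(w)]
chars = sorted({c for ctx, _ in data for c in ctx})
idx = {c: k for k, c in enumerate(chars)}

def encode(ctx):
    v = np.zeros(len(chars) * len(ctx))
    for pos, c in enumerate(ctx):
        v[pos * len(chars) + idx[c]] = 1.0
    return v

X = np.array([encode(ctx) for ctx, _ in data])
y = np.array([lab for _, lab in data])

# Deliberately tiny network: one hidden layer of 8 tanh units,
# one sigmoid output unit.
H = 8
W1 = rng.normal(0.0, 0.1, (X.shape[1], H)); b1 = np.zeros(H)
W2 = rng.normal(0.0, 0.1, H);               b2 = 0.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.5
for _ in range(2000):                    # plain batch gradient descent
    h = np.tanh(X @ W1 + b1)             # hidden activations, (N, H)
    p = sigmoid(h @ W2 + b2)             # boundary probabilities, (N,)
    g = (p - y) / len(y)                 # dLoss/dlogit for cross-entropy
    gh = np.outer(g, W2) * (1 - h ** 2)  # backprop through tanh
    W2 -= lr * (h.T @ g);  b2 -= lr * g.sum()
    W1 -= lr * (X.T @ gh); b1 -= lr * gh.sum(axis=0)

# The entire model is W1, b1, W2, b2 -- well under a thousand floats.
print("parameters:", W1.size + b1.size + W2.size + 1)

# Quick check on a training word: P(boundary) after each character.
for ctx, _ in examples("ta-ble"):
    prob = sigmoid(np.tanh(encode(ctx) @ W1 + b1) @ W2 + b2)
    print(ctx, round(float(prob), 2))
```

A model of this shape holds only a few hundred parameters and fits in a few kilobytes, which is the kind of footprint an embedded TTS front-end can afford even under the 16/32 MB app limits mentioned above.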


Cited By

  • (2024) "Deep Learning’s Impact on Speech Synthesis for Mobile Devices". In Artificial Intelligence and Sustainable Computing, pp. 117-131. DOI: 10.1007/978-981-97-0327-2_9. Online publication date: 24-Apr-2024.
  • (2016) "Search based applications for speech processing". In 2016 8th International Conference on Electronics, Computers and Artificial Intelligence (ECAI), pp. 1-6. DOI: 10.1109/ECAI.2016.7861101. Online publication date: Jun-2016.


Published In

MEDES '15: Proceedings of the 7th International Conference on Management of computational and collective intElligence in Digital EcoSystems
October 2015
271 pages

Sponsors

  • The French Chapter of ACM Special Interest Group on Applied Computing
  • IFSP: Federal Institute of São Paulo


Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. deep learning
  2. lexical stress prediction
  3. part-of-speech tagging
  4. phonetic transcription
  5. syllabification
  6. text processing
  7. text-to-speech synthesis

Qualifiers

  • Research-article

Conference

MEDES '15
Sponsor: IFSP

Acceptance Rates

MEDES '15 Paper Acceptance Rate: 13 of 64 submissions (20%)
Overall Acceptance Rate: 267 of 682 submissions (39%)

Article Metrics

  • Downloads (last 12 months): 9
  • Downloads (last 6 weeks): 0

Reflects downloads up to 20 Feb 2025

