Speech recognition and synthesis technology development at NTT for telecommunications services

International Journal of Speech Technology

Abstract

This paper describes recent developments at NTT in the areas of speech recognition, speech synthesis, and interactive voice systems as they relate to telecommunications applications. Speaker-independent large-vocabulary speech recognition based on context-dependent phone models and an LR parser, and high-quality text-to-speech (TTS) conversion using the waveform concatenation method, both realized as software, have enabled interactive voice systems for fast and easy prototyping of telephone-based applications. Practical applications are discussed with examples.

Cite this article

Hakoda, K., Kitai, M. & Sagayama, S. Speech recognition and synthesis technology development at NTT for telecommunications services. Int J Speech Technol 2, 145–153 (1997). https://doi.org/10.1007/BF02208826

  • DOI: https://doi.org/10.1007/BF02208826
