Comparison of Speech Recognition Performance Between Kaldi and Google Cloud Speech API

  • Conference paper
Recent Advances in Intelligent Information Hiding and Multimedia Signal Processing (IIH-MSP 2018)

Abstract

In recent years, the number of systems with a speech interface has grown. Such interfaces include a spoken dialogue function, and high performance is required of spoken dialogue systems. A spoken dialogue system includes a speech recognition module. In this study, we focus on that module and aim to improve the spoken dialogue system by enhancing the performance of speech recognition. Among speech recognition systems, Kaldi is widely used in many kinds of research. In addition, several speech recognition services are provided as Web APIs, such as IBM Watson Speech to Text, Microsoft Bing Speech API, and Google Cloud Speech API, which is known for its high performance. This paper compares the speech recognition performance of Kaldi and Google Cloud Speech API in terms of word error rate (WER) and real-time factor (RTF) and confirms the recognition performance of each system.
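The two metrics named in the abstract can be illustrated with a minimal sketch. It uses the standard definitions: WER is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words, and RTF is processing time divided by audio duration. The function names `word_error_rate` and `real_time_factor` are hypothetical, the inputs are assumed to be already tokenized into words, and nothing here is taken from the authors' evaluation scripts.

```python
from typing import Sequence


def word_error_rate(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """Return (substitutions + deletions + insertions) / len(reference)."""
    if not reference:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix handled so far
    # and the first j words of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i] + [0] * len(hypothesis)
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(reference)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Illustrative inputs only (not data from the paper):
# one substitution (sat -> sit) and one deletion (the) over six reference words.
reference = "the cat sat on the mat".split()
hypothesis = "the cat sit on mat".split()
wer = word_error_rate(reference, hypothesis)
rtf = real_time_factor(processing_seconds=2.5, audio_seconds=10.0)
```

An RTF below 1 means recognition runs faster than real time. For a Web API, the measured processing time would normally include network latency, so how the time is measured affects a comparison with a locally run recognizer.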

Notes

  1. https://cloud.google.com/speech-to-text/

  2. https://www.ibm.com/watson/jp-ja/developercloud/speech-to-text.html

  3. https://azure.microsoft.com/ja-jp/services/cognitive-services/speech/

  4. http://chasen.naist.jp/snapshot/ipadic/ipadic/doc/ipadic-ja.pdf

Acknowledgment

Part of this work was supported by JSPS KAKENHI Grant Number JP17H00823.

Author information

Corresponding author

Correspondence to Akinori Ito.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Kimura, T., Nose, T., Hirooka, S., Chiba, Y., Ito, A. (2019). Comparison of Speech Recognition Performance Between Kaldi and Google Cloud Speech API. In: Pan, JS., Ito, A., Tsai, PW., Jain, L. (eds) Recent Advances in Intelligent Information Hiding and Multimedia Signal Processing. IIH-MSP 2018. Smart Innovation, Systems and Technologies, vol 110. Springer, Cham. https://doi.org/10.1007/978-3-030-03748-2_13
