DOI: 10.1145/3448823.3448869

An Efficient CTC Decoding Strategy with Lattice Compression

Published: 04 March 2021

Abstract

Connectionist Temporal Classification (CTC) has been widely used for the acoustic modeling task in state-of-the-art automatic speech recognition (ASR) systems. To complement the acoustic model, several lines of work have been proposed to decode CTC output and incorporate a language model: WFST decoders, Seq2Seq decoders, and RNN decoders. Among these candidates, the RNN decoder is the most suitable for CTC because it does not require an explicit lexicon for decoding, and it can therefore produce open-vocabulary output. However, state-of-the-art RNN decoders are slow at decoding time, which severely limits their usage for real-time speech recognition.
In this work, we propose an efficient decoding strategy for the RNN decoder that significantly improves decoding speed without losing performance. Previous RNN decoders typically process audio frames one by one, wasting substantial computation traversing the sparse CTC outputs. We exploit this sparsity of CTC by analyzing the distribution of characters and propose an efficient lattice structure in the decoder. Our experiments show that the decoder accelerates the decoding process by approximately 4X at the cost of only 0.2% word error rate (WER).
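The abstract does not spell out the lattice-compression algorithm, but the idea it builds on is well known: CTC posteriors are dominated by blank frames, so a decoder can skip non-informative frames instead of visiting all of them. The sketch below is a simplified illustration of that idea only, not the authors' method; the function names, the blank-probability threshold, and the greedy decoder are illustrative assumptions.

```python
import numpy as np

def compress_ctc_frames(log_probs, blank_id=0, blank_thresh=0.95):
    """Drop frames dominated by the CTC blank before decoding.

    log_probs: (T, V) array of per-frame log posteriors.
    Returns only the frames whose blank probability is below
    blank_thresh, so the decoder visits far fewer frames.
    """
    probs = np.exp(log_probs)
    keep = probs[:, blank_id] < blank_thresh  # boolean mask over frames
    return log_probs[keep]

def greedy_ctc_decode(log_probs, blank_id=0):
    """Standard CTC greedy decoding: collapse repeats, then drop blanks."""
    best = np.argmax(log_probs, axis=-1)
    out, prev = [], blank_id
    for t in best:
        if t != prev and t != blank_id:
            out.append(int(t))
        prev = t
    return out
```

Because blank-dominated frames carry almost no label information, decoding the compressed sequence typically yields the same hypothesis while touching only a fraction of the frames, which is the kind of saving a lattice-compressed RNN decoder exploits.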



    Published In

    ICVISP 2020: Proceedings of the 2020 4th International Conference on Vision, Image and Signal Processing
    December 2020
    366 pages
ISBN: 9781450389532
DOI: 10.1145/3448823

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. Acoustic Modeling
    2. CTC
    3. Lattice
    4. RNN decoding

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    ICVISP 2020

    Acceptance Rates

ICVISP 2020 paper acceptance rate: 60 of 147 submissions (41%).
Overall acceptance rate: 186 of 424 submissions (44%).
