DOI: 10.1145/3448823.3448869

An Efficient CTC Decoding Strategy with Lattice Compression

Published: 04 March 2021

Abstract

Connectionist Temporal Classification (CTC) has been widely used for the acoustic modeling task in state-of-the-art automatic speech recognition (ASR) systems. To complement the acoustic model, several lines of work have been proposed to decode CTC output and incorporate a language model: WFST decoders, Seq2Seq decoders, and RNN decoders. Among these candidates, the RNN decoder is the most suitable for CTC because it does not require an explicit lexicon for decoding, and it can therefore produce open-vocabulary output. However, state-of-the-art RNN decoders are slow at decoding time, which severely limits their usage for real-time speech recognition.
In this work, we propose an efficient decoding strategy for the RNN decoder that significantly improves decoding speed without losing performance. Previous RNN decoders typically process audio frames one by one, wasting substantial computation traversing the sparse CTC outputs. We exploit this sparsity of CTC by analyzing the distribution of characters and propose an efficient lattice structure in the decoder. Our experiments show that the decoder accelerates the decoding process by approximately 4X at the cost of only 0.2% word error rate (WER).
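The abstract does not spell out the lattice-compression algorithm, but the idea it builds on is well known: CTC posteriors are dominated by blank frames, so a decoder can skip non-informative frames instead of visiting all of them. The sketch below is a simplified illustration of that idea only, not the authors' method; the function names, the blank-probability threshold, and the greedy decoder are illustrative assumptions.

```python
import numpy as np

def compress_ctc_frames(log_probs, blank_id=0, blank_thresh=0.95):
    """Drop frames dominated by the CTC blank before decoding.

    log_probs: (T, V) array of per-frame log posteriors.
    Returns only the frames whose blank probability is below
    blank_thresh, so the decoder visits far fewer frames.
    """
    probs = np.exp(log_probs)
    keep = probs[:, blank_id] < blank_thresh  # boolean mask over frames
    return log_probs[keep]

def greedy_ctc_decode(log_probs, blank_id=0):
    """Standard CTC greedy decoding: collapse repeats, then drop blanks."""
    best = np.argmax(log_probs, axis=-1)
    out, prev = [], blank_id
    for t in best:
        if t != prev and t != blank_id:
            out.append(int(t))
        prev = t
    return out
```

Because blank-dominated frames carry almost no label information, decoding the compressed sequence typically yields the same hypothesis while touching only a fraction of the frames, which is the kind of saving a lattice-compressed RNN decoder exploits.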



    Published In

    ICVISP 2020: Proceedings of the 2020 4th International Conference on Vision, Image and Signal Processing
    December 2020
    366 pages
ISBN: 9781450389532
DOI: 10.1145/3448823

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. Acoustic Modeling
    2. CTC
    3. Lattice
    4. RNN decoding

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    ICVISP 2020

    Acceptance Rates

ICVISP 2020 paper acceptance rate: 60 of 147 submissions (41%).
Overall acceptance rate: 186 of 424 submissions (44%).
