
Reduced Memory Viterbi Decoding for Hardware-accelerated Speech Recognition

Published: 28 May 2022

Abstract

Large Vocabulary Continuous Speech Recognition (LVCSR) systems require a Viterbi search through a large state space to find the most probable sequence of phonemes that produced a given sound sample. This requires storing and updating a large Active State List (ASL) in on-chip memory (OCM) at regular intervals (called frames), which poses a major performance bottleneck for speech decoding. Most prior works store the ASL in hash tables in the OCM and apply beam-width pruning to restrict its size. Achieving good accuracy and performance with this approach incurs a large OCM, numerous acoustic probability computations, and many DRAM accesses.
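For concreteness, the sketch below illustrates the kind of frame-synchronous Viterbi beam search described above, with the ASL held in a hash table and states falling outside the beam pruned every frame. It is a minimal illustration under our own assumptions; the type and function names (Arc, viterbi_frame, acoustic_cost, and so on) are hypothetical and not taken from the paper.

// Sketch (assumed baseline): ASL as a hash map from WFST state id to best
// path cost, updated once per frame, with beam-width pruning at the end.
#include <unordered_map>
#include <vector>
#include <cstdint>
#include <algorithm>
#include <limits>

struct Arc { uint32_t next_state; float graph_cost; int pdf_id; };

using ASL = std::unordered_map<uint32_t, float>;  // state id -> best path cost

// One decoding frame: expand every active state, keep the best cost per
// successor, then drop states whose cost exceeds (best cost + beam).
ASL viterbi_frame(const ASL& prev,
                  const std::vector<std::vector<Arc>>& arcs,   // outgoing arcs per state
                  const std::vector<float>& acoustic_cost,     // per pdf_id, this frame
                  float beam) {
    ASL next;
    float best = std::numeric_limits<float>::max();
    for (const auto& [state, cost] : prev) {
        for (const Arc& a : arcs[state]) {
            float c = cost + a.graph_cost + acoustic_cost[a.pdf_id];
            auto it = next.find(a.next_state);
            if (it == next.end() || c < it->second) next[a.next_state] = c;
            best = std::min(best, c);
        }
    }
    for (auto it = next.begin(); it != next.end(); )            // beam-width pruning
        it = (it->second > best + beam) ? next.erase(it) : std::next(it);
    return next;
}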

We propose to use a binary search tree for ASL storage and a max-heap data structure to track the worst-cost state and efficiently replace it when a better state is found. With this approach, the ASL size can be reduced from over 32K to 512 states with minimal impact on recognition accuracy for a 7,000-word vocabulary model. Combined with a caching technique for acoustic scores, this reduces the DRAM data accessed by 31× and the acoustic probability computations by 26×.
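The following is a minimal sketch of such a bounded ASL, assuming a binary search tree (std::map) keyed by state id alongside a max heap ordered by cost. The lazy handling of stale heap entries and all names here are illustrative assumptions, not the paper's exact implementation.

// Sketch (assumed): capacity-bounded ASL combining a BST for state lookup
// with a max heap that identifies the current worst-cost state for eviction.
#include <map>
#include <queue>
#include <vector>
#include <cstdint>
#include <cstddef>
#include <utility>

class BoundedASL {
public:
    explicit BoundedASL(std::size_t capacity) : cap_(capacity) {}

    // Insert a (state, cost) hypothesis, keeping at most cap_ states and
    // always retaining the lowest-cost ones. Lower cost is better.
    void insert(uint32_t state, float cost) {
        auto it = states_.find(state);                 // O(log n) BST lookup
        if (it != states_.end()) {
            if (cost >= it->second) return;            // existing path already better
            it->second = cost;                         // improve cost; the old heap
            heap_.push({cost, state});                 // entry becomes stale
            return;
        }
        if (states_.size() == cap_) {
            drop_stale();
            if (heap_.top().first <= cost) return;     // not better than current worst
            states_.erase(heap_.top().second);         // evict the worst-cost state
            heap_.pop();
        }
        states_[state] = cost;
        heap_.push({cost, state});
    }

    const std::map<uint32_t, float>& states() const { return states_; }

private:
    // Discard heap entries whose cost no longer matches the BST (lazy deletion),
    // so the heap top is guaranteed to be the genuine worst-cost active state.
    void drop_stale() {
        while (!heap_.empty()) {
            auto [cost, state] = heap_.top();
            auto it = states_.find(state);
            if (it != states_.end() && it->second == cost) break;
            heap_.pop();
        }
    }

    std::size_t cap_;
    std::map<uint32_t, float> states_;                       // state id -> best cost (BST)
    std::priority_queue<std::pair<float, uint32_t>> heap_;   // max heap by cost
};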

The approach has also been implemented in hardware on a Xilinx Zynq FPGA running at 200 MHz using the Vivado SDS compiler. We study the tradeoffs among the amount of OCM used, word error rate, and decoding speed to show the effectiveness of the approach. The resulting implementation runs faster than real time while using 91% fewer block RAMs.



Published in

ACM Transactions on Embedded Computing Systems, Volume 21, Issue 3, May 2022, 365 pages
ISSN: 1539-9087
EISSN: 1558-3465
DOI: 10.1145/3530307
Editor: Tulika Mitra

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 28 May 2022
            • Online AM: 26 January 2022
            • Revised: 1 December 2021
            • Accepted: 1 December 2021
            • Received: 1 October 2021
Published in TECS, Volume 21, Issue 3
