Reduced Memory Viterbi Decoding for Hardware-accelerated Speech Recognition

Authors:
Pani Prithvi Raj, Indian Institute of Technology Madras, Madras, India (ORCID 0000-0001-8529-7920)
Pakala Akhil Reddy, Indian Institute of Technology Madras, Terraces on Brompton, Houston, TX, India (ORCID 0000-0002-0570-4993)
Nitin Chandrachoodan, Indian Institute of Technology Madras, IIT Madras, India (ORCID 0000-0002-9258-7317)

ACM Transactions on Embedded Computing Systems, Volume 21, Issue 3, Article No. 31, pp. 1–18. https://doi.org/10.1145/3510028

Published: 28 May 2022

Abstract

Large Vocabulary Continuous Speech Recognition (LVCSR) systems perform a Viterbi search through a large state space to find the most probable sequence of phonemes that produced a given sound sample. The search must store and update a large Active State List (ASL) in on-chip memory (OCM) at regular intervals (called frames), which is a major performance bottleneck for speech decoding. Most prior works store the ASL in hash tables in OCM and use beam-width pruning to restrict its size. Achieving acceptable accuracy and performance with this approach requires a large OCM and incurs numerous acoustic probability computations and DRAM accesses.
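
As a rough illustration of the baseline described above, the following sketch shows beam-width pruning over a hash table of active states. It is not taken from the article: the function name `beam_prune`, the use of a Python dict as the hash table, and the convention that a lower cost is better are all assumptions made for illustration.

```python
def beam_prune(active, beam_width):
    """Keep states whose cost is within `beam_width` of the best cost.

    `active` maps state_id -> path cost (assumed: lower cost is better).
    The dict plays the role of the hash-table ASL; this is a hypothetical
    sketch, not the decoder from the article.
    """
    if not active:
        return {}
    best = min(active.values())
    # Survivors are every state inside the beam, however many there are.
    return {s: c for s, c in active.items() if c <= best + beam_width}


frame_states = {1: 2.0, 2: 2.5, 3: 7.0}
survivors = beam_prune(frame_states, beam_width=1.0)
```

Note that the number of survivors depends on the data in each frame, not on a fixed bound, so the table must be sized for the worst case.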

We propose using a binary search tree for ASL storage together with a max-heap that tracks the worst-cost state, so that it can be replaced efficiently when a better state is found. With this approach, the ASL size can be reduced from over 32K to 512 with minimal impact on recognition accuracy for a 7,000-word vocabulary model. Combined with a caching technique for acoustic scores, this reduces the DRAM data accessed by 31× and the acoustic probability computations by 26×.
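
The idea of a fixed-capacity ASL with worst-state replacement can be sketched in software as follows. This is a minimal illustration under stated assumptions, not the authors' hardware design: the class name `BoundedActiveStateList` and method `offer` are hypothetical, a Python dict stands in for the binary search tree lookup, lower cost is assumed to be better, and the max-heap is emulated with `heapq` on negated costs with lazy removal of stale entries.

```python
import heapq


class BoundedActiveStateList:
    """Keeps at most `capacity` lowest-cost states (assumed: lower is better)."""

    def __init__(self, capacity):
        assert capacity >= 1
        self.capacity = capacity
        self.cost = {}   # state_id -> best cost; stand-in for the BST lookup
        self._heap = []  # (-cost, state_id); heapq is a min-heap, so negate

    def _worst(self):
        # Drop stale heap entries until the top agrees with the live table.
        while self._heap:
            neg_cost, state = self._heap[0]
            if self.cost.get(state) == -neg_cost:
                return state, -neg_cost
            heapq.heappop(self._heap)
        return None

    def offer(self, state, cost):
        """Insert or improve a state; return True if the list changed."""
        if state in self.cost:
            if cost >= self.cost[state]:
                return False  # existing path to this state is at least as good
        elif len(self.cost) >= self.capacity:
            worst_state, worst_cost = self._worst()
            if cost >= worst_cost:
                return False  # no better than the worst kept state: drop it
            heapq.heappop(self._heap)  # evict the worst-cost state
            del self.cost[worst_state]
        self.cost[state] = cost
        heapq.heappush(self._heap, (-cost, state))
        return True


asl = BoundedActiveStateList(capacity=2)
results = [
    asl.offer(10, 5.0),  # inserted
    asl.offer(20, 3.0),  # inserted, list is now full
    asl.offer(30, 9.0),  # rejected: worse than the worst kept state
    asl.offer(30, 4.0),  # replaces state 10
    asl.offer(30, 1.0),  # improves state 30 in place
    asl.offer(40, 2.0),  # replaces state 20
]
```

Unlike beam-width pruning, the number of stored states never exceeds `capacity`, which is what allows the memory to be sized in advance. The lazy-removal heap can accumulate stale entries; a hardware design would more likely update heap entries in place.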

The approach has also been implemented in hardware on a Xilinx Zynq FPGA at 200 MHz using the Vivado SDS compiler. We study the tradeoffs among the amount of OCM used, word error rate, and decoding speed to show the effectiveness of the approach. The resulting implementation runs faster than real time while using 91% fewer block RAMs.

Published in
ACM Transactions on Embedded Computing Systems, Volume 21, Issue 3
May 2022
365 pages
ISSN: 1539-9087
EISSN: 1558-3465
DOI: 10.1145/3530307
Editor: Tulika Mitra, National University of Singapore, Singapore
Publisher
Association for Computing Machinery
New York, NY, United States

Journal Family
ACM Journals for the Design of Smart and Connected Systems
Publication History
- Published: 28 May 2022
- Online AM: 26 January 2022
- Revised: 1 December 2021
- Accepted: 1 December 2021
- Received: 1 October 2021
Author Tags
Viterbi decoding
binary search trees
On-chip memory
Qualifiers
- research-article
- Refereed