Abstract
Large Vocabulary Continuous Speech Recognition (LVCSR) systems perform a Viterbi search through a large state space to find the most probable sequence of phonemes that produced a given sound sample. This requires storing and updating a large Active State List (ASL) in on-chip memory (OCM) at regular intervals (called frames), which is a major performance bottleneck for speech decoding. Most prior works store the ASL in hash tables in the OCM and use beam-width pruning to restrict its size. Achieving acceptable accuracy and performance with this approach requires a large OCM and incurs numerous acoustic probability computations and DRAM accesses.
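As a rough illustration of the baseline described above, the following sketch shows beam-width pruning over a hash-table ASL. It assumes costs are negative log probabilities (lower is better); the function name `beam_prune` and the dictionary layout are hypothetical and are not taken from any particular decoder.

```python
def beam_prune(active, beam_width):
    """Keep only states whose cost is within beam_width of the best cost.

    active: dict mapping state id -> path cost (assumed: lower is better).
    Returns a new dict; the input is not modified.
    """
    if not active:
        return {}
    best = min(active.values())
    # The surviving set has no fixed upper bound: it depends on how many
    # states fall inside the beam in this frame.
    return {state: cost for state, cost in active.items()
            if cost <= best + beam_width}


frame_states = {"a": 1.0, "b": 3.5, "c": 9.0}
survivors = beam_prune(frame_states, beam_width=3.0)
```

Because the number of survivors varies from frame to frame, the memory holding them has to be sized for the worst case, which is one way to read the large-OCM cost mentioned above.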
We propose to use a binary search tree for ASL storage and a max-heap data structure to track the worst-cost state and efficiently replace it when a better state is found. With this approach, the ASL size can be reduced from over 32K to 512 with minimal impact on recognition accuracy for a 7,000-word vocabulary model. Combined with a caching technique for acoustic scores, this reduces the DRAM data accessed by 31\( \times \) and the acoustic probability computations by 26\( \times \).
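The idea of a fixed-capacity ASL whose worst-cost state is tracked by a max heap can be sketched in software as follows. This is a minimal illustration under stated assumptions, not the hardware design: a Python `dict` stands in for the binary search tree, costs are assumed to be "lower is better", stale heap entries are discarded lazily, and the class name `BoundedActiveStateList` and its methods are hypothetical.

```python
import heapq


class BoundedActiveStateList:
    """Fixed-capacity active state list with worst-cost replacement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.costs = {}   # state id -> current cost (dict stands in for the BST)
        self._heap = []   # (-cost, state): heapq is a min heap, so negate costs

    def _worst(self):
        """Return (cost, state) of the worst live state, dropping stale entries."""
        while self._heap:
            neg_cost, state = self._heap[0]
            if self.costs.get(state) == -neg_cost:
                return -neg_cost, state
            heapq.heappop(self._heap)  # entry superseded by a cheaper update
        raise IndexError("active state list is empty")

    def insert(self, state, cost):
        """Try to add or improve a state; return True if the list changed."""
        if state in self.costs:
            if cost >= self.costs[state]:
                return False
            # Better path to an existing state: the old heap entry becomes stale.
        elif len(self.costs) >= self.capacity:
            worst_cost, worst_state = self._worst()
            if cost >= worst_cost:
                return False  # new state is no better than the current worst
            heapq.heappop(self._heap)
            del self.costs[worst_state]
        self.costs[state] = cost
        heapq.heappush(self._heap, (-cost, state))
        return True


asl = BoundedActiveStateList(capacity=2)
results = [
    asl.insert(1, 5.0),   # added
    asl.insert(2, 3.0),   # added, list is now full
    asl.insert(3, 4.0),   # replaces state 1 (cost 5.0)
    asl.insert(4, 10.0),  # rejected, worse than the current worst (4.0)
    asl.insert(2, 1.0),   # improves state 2
]
```

The capacity plays the role of the reduced ASL size (512 in the abstract): once the list is full, a new state is admitted only by evicting the current worst one, so memory use stays constant regardless of how many candidate states a frame produces.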
The approach has also been implemented in hardware on a Xilinx Zynq FPGA at 200 MHz using the Vivado SDS compiler. We study the tradeoffs among the amount of OCM used, word error rate, and decoding speed to show the effectiveness of the approach. The resulting implementation runs faster than real time while using 91% fewer block RAMs.
Index Terms
- Reduced Memory Viterbi Decoding for Hardware-accelerated Speech Recognition