Tuning Large Language Model for Speech Recognition With Mixed-Scale Re-Tokenization


Abstract:

Large Language Models (LLMs) have proven successful across a spectrum of speech-related tasks, such as speech recognition, text-to-speech, and spoken language understanding. Recently, discretized speech features have gained attention as an efficient and compatible alternative to continuous features for LLMs, mainly due to their reduced storage requirements and better alignment with the LLM's input space. However, the typical practice of freezing the speech encoder during training poses challenges in bridging the modality gap between speech and text. To address this, we propose a mixed-scale re-tokenization layer that integrates multiple granularities of discretized speech features directly within the LLM's input module. Our experimental results demonstrate that the proposed method effectively enhances ASR performance in a continual-learning setting for an LLM, highlighting the importance of a carefully designed input module for integrating discretized speech features with an LLM.
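
As a rough illustration only (the letter's actual architecture is not reproduced on this page), the sketch below shows one way a mixed-scale input layer could combine fine-grained and coarse-grained discretized speech tokens before they enter the LLM. All names, tensor shapes, and the gating scheme are assumptions for illustration, not the authors' implementation.

import torch
import torch.nn as nn

class MixedScaleReTokenizer(nn.Module):
    # Hypothetical sketch: merges discretized speech tokens at two
    # granularities (fine frame-level units and coarser merged units)
    # into one embedding sequence for the LLM input module.
    def __init__(self, fine_vocab: int, coarse_vocab: int, d_model: int):
        super().__init__()
        self.fine_emb = nn.Embedding(fine_vocab, d_model)
        self.coarse_emb = nn.Embedding(coarse_vocab, d_model)
        # Learned gate that mixes the two scales (an assumption).
        self.mix = nn.Linear(2 * d_model, d_model)

    def forward(self, fine_ids, coarse_ids, coarse_index):
        # fine_ids:     (B, T) fine-grained discrete speech units
        # coarse_ids:   (B, S) coarse-grained units, S <= T
        # coarse_index: (B, T) maps each fine step to its coarse unit
        fine = self.fine_emb(fine_ids)                      # (B, T, D)
        coarse = self.coarse_emb(coarse_ids)                # (B, S, D)
        # Broadcast each coarse embedding onto the fine time axis.
        idx = coarse_index.unsqueeze(-1).expand(-1, -1, coarse.size(-1))
        coarse_up = torch.gather(coarse, 1, idx)            # (B, T, D)
        return self.mix(torch.cat([fine, coarse_up], dim=-1))  # LLM input

In such a setup, the resulting mixed-scale embeddings would stand in for (or accompany) text-token embeddings when fine-tuning the LLM for ASR.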
Published in: IEEE Signal Processing Letters ( Volume: 31)
Page(s): 1740 - 1744
Date of Publication: 27 June 2024
