Abstract:
Large Language Models (LLMs) have proven successful across a spectrum of speech-related tasks, such as speech recognition, text-to-speech, and spoken language understanding. Recently, discretized speech features have gained attention as an efficient and compatible alternative to continuous features for LLMs, mainly because of their reduced storage requirements and their better alignment with an LLM's input space. However, the typical practice of freezing the speech encoder during training poses challenges in bridging the modality gap between speech and text. To address this, we propose a mixed-scale re-tokenization layer that integrates multiple granularities of discretized speech features directly within the LLM's input module. Our experimental results demonstrate that the proposed method effectively enhances ASR performance in a continual-learning setting for an LLM, highlighting the importance of a carefully designed input module when integrating discretized speech features with an LLM.
Published in: IEEE Signal Processing Letters (Volume: 31)
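The abstract does not specify how the mixed-scale re-tokenization layer is implemented; the sketch below is only one plausible reading, in which a coarse view of a discretized token sequence is obtained by average-pooling fine-grained token embeddings over fixed windows and adding it back to each frame. The function name, the pooling-based coarse view, and the window size are all assumptions for illustration, not the paper's actual layer.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB, DIM = 16, 8
# Hypothetical codebook embeddings for fine-grained discretized speech tokens.
fine_emb = rng.standard_normal((VOCAB, DIM))


def mixed_scale_retokenize(fine_tokens, window=2):
    """Sketch of a mixed-scale input: combine per-frame (fine) token
    embeddings with a pooled (coarse) view of the same sequence.

    This is an illustrative assumption, not the layer from the paper.
    """
    f = fine_emb[np.asarray(fine_tokens)]              # (T, DIM) fine embeddings
    T = f.shape[0]
    pad = (-T) % window                                # pad so T divides evenly
    fp = np.concatenate([f, np.zeros((pad, DIM))]) if pad else f
    coarse = fp.reshape(-1, window, DIM).mean(axis=1)  # pooled coarse view
    up = np.repeat(coarse, window, axis=0)[:T]         # upsample back to T frames
    return f + up                                      # mix both granularities
```

Under this reading, frames belonging to the same window share one coarse component, so the LLM sees both frame-level detail and a smoothed, lower-rate summary at every input position.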