
VioLA: Conditional Language Models for Speech Recognition, Synthesis, and Translation


Abstract:

Recent research shows a big convergence in model architecture, training objectives, and inference methods across various tasks for different modalities. In this paper, we propose VioLA, a single auto-regressive Transformer decoder-only network that unifies various cross-modal tasks involving speech and text, such as speech-to-text, text-to-text, text-to-speech, and speech-to-speech tasks, as a conditional language model task via a multi-task learning framework. To accomplish this, we first convert the speech utterances to discrete tokens (similar to the textual data) using an offline neural codec encoder. In this way, all these tasks are converted to token-based sequence prediction problems, which can be naturally handled with one conditional language model. We further integrate task IDs (TID), language IDs (LID), and an LSTM-based acoustic embedding into the proposed model to enhance its capability to handle different languages and tasks. Experimental results demonstrate that the proposed VioLA model supports both single-modal and cross-modal tasks well, and the decoder-only model achieves comparable or even better performance than the strong baselines.
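To make the unification concrete, here is a minimal sketch of how cross-modal tasks could be framed as a single token-sequence prediction problem with TID and LID tokens prepended, as the abstract describes. All names, token IDs, and vocabulary sizes below are hypothetical illustrations, not the paper's actual implementation.

```python
# Hypothetical shared vocabulary: text tokens first, then codec tokens
# offset to avoid overlap, then special task/language ID tokens.
TEXT_VOCAB = 10_000            # assumed text token range [0, 10k)
CODEC_VOCAB = 1_024            # assumed codec token range, offset after text
SPECIAL = {                    # task IDs (TID) and language IDs (LID)
    "<asr>": 11_024, "<tts>": 11_025, "<mt>": 11_026, "<s2st>": 11_027,
    "<en>": 11_028, "<zh>": 11_029, "<bos>": 11_030, "<eos>": 11_031,
}

def codec_to_ids(codec_tokens):
    """Offset discrete codec tokens so they share one vocabulary with text."""
    return [TEXT_VOCAB + t for t in codec_tokens]

def build_sequence(task, src_lang, tgt_lang, source_ids, target_ids):
    """Concatenate TID, LIDs, condition, and target into one LM sequence.

    A decoder-only model is then trained to predict the target portion
    autoregressively, conditioned on everything before it.
    """
    return ([SPECIAL[task], SPECIAL[src_lang], SPECIAL[tgt_lang]]
            + source_ids + [SPECIAL["<bos>"]] + target_ids + [SPECIAL["<eos>"]])

# Example: ASR becomes "speech codec tokens -> text tokens" in one sequence.
speech = codec_to_ids([3, 17, 256, 9])   # discretized speech utterance
text = [42, 7, 99]                       # target transcription tokens
seq = build_sequence("<asr>", "<en>", "<en>", speech, text)
```

Because every task (ASR, TTS, MT, S2ST) reduces to the same flat-sequence format, one conditional language model with a single softmax over the shared vocabulary can serve them all; only the TID/LID prefix tells the model which task to perform.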
Page(s): 3709 - 3716
Date of Publication: 29 July 2024
