VioLA: Conditional Language Models for Speech Recognition, Synthesis, and Translation

Abstract:

Recent research shows a strong convergence in model architectures, training objectives, and inference methods across tasks and modalities. In this paper, we propose VioLA, a single auto-regressive Transformer decoder-only network that unifies various cross-modal tasks involving speech and text, such as speech-to-text, text-to-text, text-to-speech, and speech-to-speech tasks, as a conditional language modeling task via a multi-task learning framework. To accomplish this, we first convert speech utterances to discrete tokens (similar to textual data) using an offline neural codec encoder. In this way, all these tasks are converted to token-based sequence prediction problems, which can be naturally handled with one conditional language model. We further integrate task IDs (TID), language IDs (LID), and an LSTM-based acoustic embedding into the proposed model to enhance its capability to handle different languages and tasks. Experimental results demonstrate that the proposed VioLA model supports both single-modal and cross-modal tasks well, and that the decoder-only model achieves performance comparable to, and in some cases better than, strong baselines.
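
To make the "token-based sequence prediction" framing concrete, the following is a minimal sketch of how a task ID, language IDs, text tokens, and codec tokens could be packed into one sequence for a decoder-only model. It is an illustration under stated assumptions, not the paper's implementation: the vocabulary sizes, the offset layout, the position of the ID tokens, and all names (`build_sequence`, `codec_ids`, `text_ids`, and so on) are hypothetical, and the LSTM-based acoustic embedding is omitted.

```python
# Hypothetical shared-vocabulary layout. Sizes and ordering are assumptions
# chosen for illustration; they are not taken from the paper.
TASKS = ["asr", "mt", "tts", "s2st"]   # assumed task inventory
LANGS = ["en", "zh"]                   # assumed language inventory
TEXT_VOCAB_SIZE = 1000                 # assumed text (subword) vocabulary size
CODEC_VOCAB_SIZE = 1024                # assumed codec codebook size

TASK_OFFSET = 0
LANG_OFFSET = TASK_OFFSET + len(TASKS)
TEXT_OFFSET = LANG_OFFSET + len(LANGS)
CODEC_OFFSET = TEXT_OFFSET + TEXT_VOCAB_SIZE
VOCAB_SIZE = CODEC_OFFSET + CODEC_VOCAB_SIZE


def task_id(task: str) -> int:
    """Map a task name to its token ID in the shared vocabulary."""
    return TASK_OFFSET + TASKS.index(task)


def lang_id(lang: str) -> int:
    """Map a language code to its token ID in the shared vocabulary."""
    return LANG_OFFSET + LANGS.index(lang)


def text_ids(tokens: list[int]) -> list[int]:
    """Shift raw text-token indices into the text range of the vocabulary."""
    if any(not 0 <= t < TEXT_VOCAB_SIZE for t in tokens):
        raise ValueError("text token out of range")
    return [TEXT_OFFSET + t for t in tokens]


def codec_ids(codes: list[int]) -> list[int]:
    """Shift raw codec indices into the codec range of the vocabulary."""
    if any(not 0 <= c < CODEC_VOCAB_SIZE for c in codes):
        raise ValueError("codec token out of range")
    return [CODEC_OFFSET + c for c in codes]


def build_sequence(task, src_lang, tgt_lang, source, target):
    """Concatenate [task, src_lang, source..., tgt_lang, target...].

    Returns (ids, loss_mask). The mask is 1 only on target positions, so a
    language-model loss would be computed on the target conditioned on the
    prefix.
    """
    prefix = [task_id(task), lang_id(src_lang)] + source + [lang_id(tgt_lang)]
    ids = prefix + target
    loss_mask = [0] * len(prefix) + [1] * len(target)
    return ids, loss_mask


# Speech recognition: codec tokens in, text tokens out.
ids, mask = build_sequence("asr", "en", "en", codec_ids([5, 7]), text_ids([3]))
print(ids, mask)
```

Under this layout, switching tasks only changes which token ranges appear as source and target (for example, text in and codec tokens out for synthesis), while the model and the training objective stay the same.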

Published in: IEEE/ACM Transactions on Audio, Speech, and Language Processing (Volume: 32)

Page(s): 3709 - 3716

Date of Publication: 29 July 2024

DOI: 10.1109/TASLP.2024.3434425
