Abstract:
End-to-end (E2E) spoken language understanding (SLU) systems map speech inputs directly to semantic outputs, eliminating the need to process the speech-to-text and text-to-semantics sub-tasks with separate models. However, such systems are currently limited to speech inputs and cannot flexibly handle plain text. In this paper, we propose an E2E spoken and natural language understanding (SNLU) system that handles both speech and text within a unified architecture. The system follows the Mask-CTC non-autoregressive approach, and input flexibility is achieved by partially sharing the decoder between the SLU and NLU tasks. Experiments on the SLURP dataset show that the proposed architecture achieves performance similar to using separate E2E SLU and NLU modules, with a relative reduction of 43.7% in model parameters. We also explore incorporating pre-trained speech and language models into the SNLU system and show that they further improve performance.
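The core idea of a unified SNLU architecture can be illustrated with a minimal sketch: two modality-specific front-ends feed a decoder whose layers are shared between the SLU and NLU paths. All names and the toy feature computations below are illustrative stand-ins, not the paper's actual implementation.

```python
# Toy sketch of a unified SNLU model: speech and text front-ends
# feed one shared decoder. Names and math are purely illustrative.

def speech_encoder(audio_frames):
    # Stand-in acoustic encoder: one pooled "feature" per frame.
    return [sum(frame) / len(frame) for frame in audio_frames]

def text_embedder(tokens):
    # Stand-in token embedder: one pseudo-feature per token.
    return [len(token) / 10.0 for token in tokens]

def shared_decoder(features):
    # Decoder shared between SLU and NLU paths; a toy transform
    # standing in for Mask-CTC-style non-autoregressive decoding.
    return [round(x * 2, 4) for x in features]

def snlu(inputs, modality):
    # Route either input type through its front-end, then decode
    # with the single shared decoder.
    if modality == "speech":
        feats = speech_encoder(inputs)
    else:
        feats = text_embedder(inputs)
    return shared_decoder(feats)
```

Because the decoder parameters are reused for both modalities, only the lightweight front-ends differ per input type, which is the source of the parameter savings the abstract reports.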
Date of Conference: 16-20 December 2023
Date Added to IEEE Xplore: 19 January 2024