Slot filling is an essential component of Spoken Language Understanding (SLU). In contrast to conventional pipeline approaches, which extract slots from automatic speech recognition (ASR) output, end-to-end approaches predict slots directly from speech within a classification or generation framework. However, classification relies on predefined categories and therefore does not scale, while generative models decode in an open-domain space and suffer from blurred slot boundaries in speech. To address the shortcomings of both formulations, we propose a new encoder-decoder framework for slot filling, named Speech2Slot, which leverages a limited generation method with boundary detection. We also release a large-scale Chinese spoken slot filling dataset named Voice Navigation Dataset in Chinese (VNDC). Experiments on VNDC show that our model is markedly superior to other approaches, outperforming the state-of-the-art slot filling approach with a 6.65% accuracy improvement. We make our code (https://github.com/eehover/speech2slot) publicly available for researchers to replicate and build on our work.
Cite as: Wang, P., Su, Y., Zhou, X., Ye, X., Wei, L., Liu, M., You, Y., Jiang, F. (2022) Speech2Slot: A Limited Generation Framework with Boundary Detection for Slot Filling from Speech. Proc. Interspeech 2022, 2748-2752, doi: 10.21437/Interspeech.2022-11347
@inproceedings{wang22fa_interspeech,
  author={Pengwei Wang and Yinpei Su and Xiaohuan Zhou and Xin Ye and Liangchen Wei and Ming Liu and Yuan You and Feijun Jiang},
  title={{Speech2Slot: A Limited Generation Framework with Boundary Detection for Slot Filling from Speech}},
  year=2022,
  booktitle={Proc. Interspeech 2022},
  pages={2748--2752},
  doi={10.21437/Interspeech.2022-11347}
}