DOI: 10.1145/3665348.3665377

On The Performance of EMA-Synchronized Speech and Stand-alone Speech in Speech Recognition and Acoustic-to-Articulatory Inversion

Published: 03 July 2024

Abstract

Synchronized acoustic-articulatory data underpins various applications, such as exploring the fundamental mechanisms of speech production, acoustic-to-articulatory inversion (AAI), and articulatory-to-acoustic mapping. Most studies in these fields train their models directly on EMA-synchronized speech, whereas the inputs or outputs in real applications are stand-alone speech. However, the recording conditions of EMA-synchronized speech and stand-alone speech differ, which may make EMA-synchronized speech differ from stand-alone speech and degrade the performance of downstream tasks. It is therefore necessary to clarify whether EMA-synchronized and stand-alone speech signals differ and, if so, how this affects the performance of models trained with synchronized acoustic-articulatory data. In this study, we examine the differences between EMA-synchronized speech and stand-alone speech from the perspective of speech recognition, and their influence on AAI performance. The results indicate that the phone error rate increases from 7.8% for stand-alone speech to 37.4% for EMA-synchronized speech, and that for an AAI model trained with EMA-synchronized speech, the RMSE increases from 0.71 mm to 3.07 mm when the input is switched from EMA-synchronized speech to stand-alone speech.
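The comparison in the abstract rests on two standard metrics: phone error rate (PER) for speech recognition and root-mean-square error (RMSE, in mm) for acoustic-to-articulatory inversion. The sketch below is a minimal illustration of how such metrics are commonly computed; the function names, the toy data, and the assumption that predicted and measured EMA trajectories are frame-aligned arrays in millimetres are illustrative and are not taken from the paper.

# Minimal sketch of the two evaluation metrics named in the abstract:
# phone error rate (PER) for recognition and RMSE (in mm) for
# acoustic-to-articulatory inversion. Names and data are illustrative,
# not the paper's own code or results.

import numpy as np


def rmse_mm(predicted: np.ndarray, reference: np.ndarray) -> float:
    """RMSE between predicted and measured EMA trajectories.

    Both arrays are assumed to have shape (frames, channels), with
    positions already in millimetres and time-aligned frame by frame.
    """
    return float(np.sqrt(np.mean((predicted - reference) ** 2)))


def phone_error_rate(hyp: list, ref: list) -> float:
    """PER = (substitutions + deletions + insertions) / len(ref),
    via standard Levenshtein edit distance over phone sequences."""
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)
    d[0, :] = np.arange(len(hyp) + 1)
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,          # deletion
                          d[i, j - 1] + 1,          # insertion
                          d[i - 1, j - 1] + cost)   # substitution / match
    return d[len(ref), len(hyp)] / max(len(ref), 1)


if __name__ == "__main__":
    # Toy example only.
    ref_phones = ["sil", "h", "ax", "l", "ow", "sil"]
    hyp_phones = ["sil", "h", "ah", "l", "ow"]
    print(f"PER: {phone_error_rate(hyp_phones, ref_phones):.3f}")

    rng = np.random.default_rng(0)
    ema_ref = rng.normal(size=(200, 12))           # 200 frames, 12 EMA channels
    ema_pred = ema_ref + rng.normal(scale=0.5, size=ema_ref.shape)
    print(f"RMSE: {rmse_mm(ema_pred, ema_ref):.2f} mm")

With metrics defined this way, the study's comparison amounts to scoring the same trained models on two test conditions that differ only in how the speech was recorded (with versus without EMA sensors attached).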


Published In

GAIIS '24: Proceedings of the 2024 International Conference on Generative Artificial Intelligence and Information Security
May 2024, 439 pages
ISBN: 9798400709562
DOI: 10.1145/3665348

Publisher

Association for Computing Machinery

New York, NY, United States
