Conversion of Airborne to Bone-Conducted Speech with Deep Neural Networks

Pucher, Michael; Woltron, Thomas

doi:10.21437/Interspeech.2021-473

Most speakers find that a recording of their own voice sounds strange. This can be attributed mainly to the bone-conducted speech signal, which the speaker hears while talking but which is absent from the playback. It has also been shown that for some phonemes bone-conducted sound transmission is high relative to air-conducted transmission, which means that the bone-conduction filter is phone-dependent. To achieve such phone-dependent modeling, we train different speaker-dependent and speaker-adaptive speech conversion systems on airborne and bone-conducted speech data from 8 speakers (5 male, 3 female); these systems convert airborne speech to bone-conducted speech. The systems are based on Long Short-Term Memory (LSTM) deep neural networks, and the speaker-adaptive versions, which use speaker embeddings, can be applied without bone-conduction signals from the target speaker. We also use models that apply global filtering. The models are evaluated with an objective error metric and a subjective listening experiment, which show that the LSTM-based models outperform the global filters.
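To make the global-filter baseline concrete, the following is a minimal sketch of one possible formulation: a single least-squares gain per frequency bin, shared across all frames, estimated from time-aligned airborne and bone-conducted magnitude spectra. The abstract does not specify how the global filter or the objective error metric are defined, so the function names, the least-squares estimate, the RMSE metric, and the synthetic data are all assumptions made for illustration only.

```python
import numpy as np


def estimate_global_filter(air_mag, bone_mag, eps=1e-12):
    """Estimate one gain per frequency bin, shared by every frame.

    air_mag, bone_mag: arrays of shape (frames, bins) holding magnitude
    spectra of time-aligned airborne and bone-conducted recordings.
    Assumption: a per-bin least-squares fit, not the paper's stated method.
    """
    num = np.sum(air_mag * bone_mag, axis=0)
    den = np.sum(air_mag ** 2, axis=0) + eps  # eps avoids division by zero
    return num / den


def apply_global_filter(air_mag, gain):
    """Apply the same per-bin gain to every frame (phone-independent)."""
    return air_mag * gain


def rmse(pred, target):
    """Root-mean-square error, used here as a stand-in objective metric."""
    return float(np.sqrt(np.mean((pred - target) ** 2)))


# Synthetic example: the "bone-conducted" spectra are the airborne spectra
# scaled by a hypothetical low-pass-like gain curve.
rng = np.random.default_rng(0)
air = rng.uniform(0.1, 1.0, size=(200, 8))
true_gain = np.linspace(1.0, 0.1, 8)
bone = air * true_gain

gain = estimate_global_filter(air, bone)
pred = apply_global_filter(air, gain)
error = rmse(pred, bone)
```

Because such a filter applies the same gain to every frame, it cannot represent the phone-dependent transmission described above, which is the motivation for the frame-wise LSTM models.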

Cite as: Pucher, M., Woltron, T. (2021) Conversion of Airborne to Bone-Conducted Speech with Deep Neural Networks. Proc. Interspeech 2021, 1-5, doi: 10.21437/Interspeech.2021-473

@inproceedings{pucher21_interspeech,
  author={Michael Pucher and Thomas Woltron},
  title={{Conversion of Airborne to Bone-Conducted Speech with Deep Neural Networks}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={1--5},
  doi={10.21437/Interspeech.2021-473}
}