Accurate voiced/unvoiced information is crucial for estimating the pitch of a target speech signal in severe nonstationary noise environments. Nevertheless, state-of-the-art pitch estimators based on deep neural networks (DNNs) lack a dedicated mechanism for robustly detecting voiced and unvoiced segments of the target speech in noisy conditions. In this work, we propose an end-to-end deep learning-based pitch estimation framework that jointly detects voiced/unvoiced segments and predicts pitch values for the voiced regions of the ground-truth speech. We empirically show that our proposed framework is significantly more robust than state-of-the-art DNN-based pitch detectors in nonstationary noise settings. Our results suggest that joint training of voiced/unvoiced detection and voiced pitch prediction can significantly improve pitch estimation performance.
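The abstract does not give the exact training objective, but the joint detection/regression idea can be sketched as a multi-task loss: a binary cross-entropy term for frame-level voicing classification, plus a pitch regression term masked to the ground-truth voiced frames. The weighting `alpha`, the L1 regression penalty, and the function names below are illustrative assumptions, not the paper's published formulation.

```python
import numpy as np

def joint_pitch_loss(vuv_logits, pitch_pred, vuv_target, pitch_target, alpha=0.5):
    """Hypothetical joint objective: V/UV classification + voiced-only pitch regression.

    vuv_logits:   per-frame voicing logits, shape (T,)
    pitch_pred:   per-frame pitch predictions in Hz, shape (T,)
    vuv_target:   0/1 ground-truth voicing labels, shape (T,)
    pitch_target: ground-truth pitch in Hz (values on unvoiced frames are ignored)
    alpha:        assumed trade-off weight between the two terms
    """
    # Numerically stable binary cross-entropy on the voicing logits.
    bce = np.mean(np.maximum(vuv_logits, 0.0)
                  - vuv_logits * vuv_target
                  + np.log1p(np.exp(-np.abs(vuv_logits))))
    # L1 pitch error, computed only on ground-truth voiced frames,
    # so unvoiced frames cannot corrupt the regression gradient.
    mask = vuv_target.astype(bool)
    reg = np.mean(np.abs(pitch_pred[mask] - pitch_target[mask])) if mask.any() else 0.0
    return alpha * bce + (1.0 - alpha) * reg
```

Masking the regression term is what lets the two tasks help each other: the classifier learns where pitch is defined, and the regressor is never penalized on unvoiced frames where no pitch target exists.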
Cite as: Tran, D.N., Batricevic, U., Koishida, K. (2020) Robust Pitch Regression with Voiced/Unvoiced Classification in Nonstationary Noise Environments. Proc. Interspeech 2020, 175-179, doi: 10.21437/Interspeech.2020-3019
@inproceedings{tran20_interspeech,
  author={Dung N. Tran and Uros Batricevic and Kazuhito Koishida},
  title={{Robust Pitch Regression with Voiced/Unvoiced Classification in Nonstationary Noise Environments}},
  year={2020},
  booktitle={Proc. Interspeech 2020},
  pages={175--179},
  doi={10.21437/Interspeech.2020-3019}
}