ISCA Archive Interspeech 2016

Speaker Identity and Voice Quality: Modeling Human Responses and Automatic Speaker Recognition

Soo Jin Park, Caroline Sigouin, Jody Kreiman, Patricia Keating, Jinxi Guo, Gary Yeung, Fang-Yu Kuo, Abeer Alwan

Despite recent breakthroughs in automatic speaker recognition (ASpR), system performance still degrades when utterances are short and/or when within-speaker variability is large. This study used short test utterances (2–3 s) to investigate the effect of within-speaker variability on the performance of a state-of-the-art ASpR system. A subset of a newly developed UCLA database, which contains multiple speech tasks per speaker, was used. The short utterances, combined with a speaking-style mismatch between read sentences and spontaneous affective speech, degraded system performance by 36% for 25 female speakers. Because humans are more robust to utterance length or within-speaker variability, understanding human perception might benefit ASpR systems. Perception experiments were conducted with recorded read sentences from 3 female speakers, and a model is proposed to predict the perceptual dissimilarity between tokens. Results showed that a set of voice quality features including F0, F1, F2, F3, H1*-H2*, H2*-H4*, H4*-H2k*, H2k*-H5k, and CPP provides information that complements MFCCs. When this feature set was fused with MFCCs, the RMS error of human response prediction was 0.12, a 12% relative error reduction compared to using MFCCs alone. In ASpR experiments with short utterances from 50 speakers, the voice quality feature set decreased the error rate by 11% when fused with MFCCs.
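
The two metrics quoted above, RMS prediction error and relative error reduction, can be illustrated with a minimal Python sketch. The function names and the sample numbers are assumptions made for illustration only. They are not the authors' code or data, and the abstract does not describe how the features were fused.

```python
import math


def rms_error(predicted, observed):
    """Root-mean-square error between two equal-length sequences of scores."""
    if len(predicted) != len(observed) or len(predicted) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


def relative_error_reduction(baseline_error, new_error):
    """Fraction by which new_error improves on baseline_error (0.2 means 20%)."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


if __name__ == "__main__":
    # Hypothetical dissimilarity scores, NOT taken from the paper.
    observed = [0.2, 0.5, 0.9]
    baseline_prediction = [0.4, 0.3, 0.7]   # stand-in for an MFCC-only model
    fused_prediction = [0.3, 0.4, 0.8]      # stand-in for a fused model

    base = rms_error(baseline_prediction, observed)
    fused = rms_error(fused_prediction, observed)
    print(f"baseline RMS error: {base:.3f}")
    print(f"fused RMS error:    {fused:.3f}")
    print(f"relative reduction: {relative_error_reduction(base, fused):.1%}")
```

A relative reduction compares the improvement to the baseline error, so the same absolute gain counts for more when the baseline error is already small.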


doi: 10.21437/Interspeech.2016-523

Cite as: Park, S.J., Sigouin, C., Kreiman, J., Keating, P., Guo, J., Yeung, G., Kuo, F.-Y., Alwan, A. (2016) Speaker Identity and Voice Quality: Modeling Human Responses and Automatic Speaker Recognition. Proc. Interspeech 2016, 1044-1048, doi: 10.21437/Interspeech.2016-523

@inproceedings{park16_interspeech,
  author={Soo Jin Park and Caroline Sigouin and Jody Kreiman and Patricia Keating and Jinxi Guo and Gary Yeung and Fang-Yu Kuo and Abeer Alwan},
  title={{Speaker Identity and Voice Quality: Modeling Human Responses and Automatic Speaker Recognition}},
  year=2016,
  booktitle={Proc. Interspeech 2016},
  pages={1044--1048},
  doi={10.21437/Interspeech.2016-523}
}