Since the perceived audio quality of synthesized speech can determine a system's market success, quality evaluations are critical. Audio quality is usually evaluated either subjectively or objectively. Because subjective approaches are costly and time-consuming, they have largely been replaced by faster, more cost-efficient objective approaches. The primary downside of objective approaches is that they lack the human factors that are crucial to the subjective perception of quality. That perception, however, cannot be observed directly; it manifests in individual brain activity. We therefore combined predictions derived from single-subject electroencephalography (EEG) signals and audio features to improve predictions of the overall quality of synthesized speech. Our results show that by combining the outputs of the audio and EEG models, a very simple neural network can surpass the performance of either single-modal approach.
Cite as: Parmonangan, I.H., Tanaka, H., Sakti, S., Nakamura, S. (2020) Combining Audio and Brain Activity for Predicting Speech Quality. Proc. Interspeech 2020, 2762-2766, doi: 10.21437/Interspeech.2020-1559
@inproceedings{parmonangan20_interspeech,
  author={Ivan Halim Parmonangan and Hiroki Tanaka and Sakriani Sakti and Satoshi Nakamura},
  title={{Combining Audio and Brain Activity for Predicting Speech Quality}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={2762--2766},
  doi={10.21437/Interspeech.2020-1559}
}
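The abstract does not specify the fusion architecture beyond "a very simple neural network". The sketch below illustrates the general idea of late fusion under stated assumptions: per-utterance quality predictions from two single-modal models (stand-ins for the audio and EEG models) are concatenated and fed to a tiny one-hidden-layer network. All data here are synthetic, and every variable name, dimension, and hyperparameter is a hypothetical choice, not the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for per-utterance quality predictions (e.g., MOS on a
# 1-5 scale) from two separately trained single-modal models. In the paper
# these would come from an audio model and a single-subject EEG model; here
# they are simulated as the true score plus independent noise.
N = 200
true_mos = rng.uniform(1.0, 5.0, size=N)
audio_pred = true_mos + rng.normal(0.0, 0.6, size=N)  # less noisy modality
eeg_pred = true_mos + rng.normal(0.0, 0.8, size=N)    # noisier modality

X = np.stack([audio_pred, eeg_pred], axis=1)  # (N, 2) late-fusion input
Xn = (X - X.mean(axis=0)) / X.std(axis=0)     # standardize for stable training
y = true_mos

# A very simple one-hidden-layer network, trained by full-batch gradient
# descent on the squared error.
H = 8
W1 = rng.normal(0.0, 0.1, size=(2, H)); b1 = np.zeros(H)
W2 = rng.normal(0.0, 0.1, size=(H, 1)); b2 = np.zeros(1)

def forward(Xin):
    h = np.tanh(Xin @ W1 + b1)
    return h, (h @ W2 + b2).ravel()

lr = 0.1
for _ in range(5000):
    h, out = forward(Xn)
    err = out - y
    # Backpropagation for L = mean(err**2) / 2.
    gW2 = h.T @ err[:, None] / N
    gb2 = err.mean(keepdims=True)
    gh = (err[:, None] @ W2.T) * (1.0 - h**2)
    gW1 = Xn.T @ gh / N
    gb1 = gh.mean(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

_, fused = forward(Xn)

def mse(pred):
    return float(np.mean((pred - y) ** 2))

print(f"audio-only MSE: {mse(audio_pred):.3f}")
print(f"EEG-only MSE:   {mse(eeg_pred):.3f}")
print(f"fused MSE:      {mse(fused):.3f}")
```

Because the noise in the two simulated modalities is independent, even this minimal fused model can weight the two inputs and reduce the error below either single-modal prediction, which mirrors the qualitative claim of the abstract.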