Incorporating Speaker Normalizing Capabilities to an End-to-End Speech Recognition System

Krishna, Hari; Achanta, Sivanand; Kumar Vuppala, Anil

doi:10.21437/SLTU.2018-36

Speaker normalization is a crucial component of an automatic speech recognition (ASR) system. It is employed to reduce the drop in ASR performance caused by speaker variability. Traditional speaker normalization methods are mostly linear transforms over the input data, estimated per speaker; such transforms are effective when sufficient data is available. In practical scenarios, only a single utterance from the test speaker is accessible. The present study explores speaker normalization methods for end-to-end speech recognition systems that can be applied efficiently even when only a single utterance from an unseen speaker is available. It is hypothesized that by suitably providing information about the speaker's identity while training an end-to-end neural network, the capability to normalize speaker variability can be incorporated into an ASR system. The efficiency of these normalization methods depends on the representation used for unseen speakers. The identity of a training speaker is represented in two different ways: i) by a one-hot speaker code, and ii) by a weighted combination of all the training speakers' identities. Unseen speakers from the test set are represented using a weighted combination of the training speakers' representations. The two approaches reduce the word error rate (WER) by 0.6% and 1.3% on the WSJ corpus.
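
The two speaker representations named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of a softmax over hypothetical speaker-classifier scores to obtain the combination weights, and the frame-wise concatenation of the speaker code to the acoustic features are all assumptions made for illustration.

```python
import numpy as np


def one_hot_speaker_code(speaker_index, num_speakers):
    """One-hot code identifying a single training speaker."""
    code = np.zeros(num_speakers)
    code[speaker_index] = 1.0
    return code


def softmax_weights(scores):
    """Turn per-speaker scores into non-negative weights that sum to 1.

    Assumption: the scores come from some speaker classifier; the
    abstract does not say how the weights are obtained.
    """
    scores = np.asarray(scores, dtype=float)
    shifted = np.exp(scores - scores.max())  # subtract max for stability
    return shifted / shifted.sum()


def weighted_speaker_code(weights, speaker_codes):
    """Weighted combination of training-speaker representations.

    speaker_codes has one row per training speaker.
    """
    return np.asarray(weights) @ np.asarray(speaker_codes)


def append_speaker_code(features, code):
    """Append the same speaker code to every frame (assumed input scheme).

    features: array of shape (num_frames, feature_dim).
    """
    tiled = np.tile(code, (features.shape[0], 1))
    return np.concatenate([features, tiled], axis=1)


# Toy usage with 4 hypothetical training speakers.
num_speakers = 4
train_codes = np.eye(num_speakers)            # one-hot code per training speaker
seen_code = one_hot_speaker_code(2, num_speakers)

# An unseen test speaker is described by a mixture of training speakers.
weights = softmax_weights([0.1, 2.0, 0.3, 1.2])
unseen_code = weighted_speaker_code(weights, train_codes)

utterance = np.random.default_rng(0).normal(size=(5, 40))  # 5 frames, 40-dim
network_input = append_speaker_code(utterance, unseen_code)
```

With one-hot training codes, the weighted combination reduces to the weight vector itself, so a single utterance from an unseen speaker only needs to yield one set of weights over the training speakers.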

Cite as: Krishna, H., Achanta, S., Kumar Vuppala, A. (2018) Incorporating Speaker Normalizing Capabilities to an End-to-End Speech Recognition System. Proc. 6th Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU 2018), 172-176, doi: 10.21437/SLTU.2018-36

@inproceedings{krishna18_sltu,
  author={Hari Krishna and Sivanand Achanta and Anil {Kumar Vuppala}},
  title={{Incorporating Speaker Normalizing Capabilities to an End-to-End Speech Recognition System}},
  year=2018,
  booktitle={Proc. 6th Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU 2018)},
  pages={172--176},
  doi={10.21437/SLTU.2018-36}
}