End-to-End Phoneme Recognition using Models from Semantic Image Segmentation | IEEE Conference Publication | IEEE Xplore

End-to-End Phoneme Recognition using Models from Semantic Image Segmentation


Abstract:

We train fully convolutional neural networks with no recurrent layers for the end-to-end phoneme recognition task, using the Connectionist Temporal Classification (CTC) l...Show More

Abstract:

We train fully convolutional neural networks with no recurrent layers for the end-to-end phoneme recognition task, using the Connectionist Temporal Classification (CTC) loss function. The adopted network, U-Net, was introduced initially for semantic image segmentation tasks, and is often applied to segmenting features in medical imaging and remote sensing. The similarities between CTC-based automatic speech recognition and semantic segmentation problems are discussed. We extend the encoder-decoder architecture of U-Net and show it is capable of good performance in the acoustic modelling of a speech recognition system. We investigate the importance of the concatenation step in the design of U-net, and report results using the core test set of the TIMIT corpus.
Date of Conference: 19-24 July 2020
Date Added to IEEE Xplore: 28 September 2020
ISBN Information:

ISSN Information:

Conference Location: Glasgow, UK

Contact IEEE to Subscribe

References

References is not available for this document.