Abstract:
We propose a method of phonetic and prosodic feature estimation from speech that uses self-supervised-learning (SSL)-based acoustic modeling (AM). Because only a small amount of prosodic feature data is available, we use SSL for few-shot-learning-based speech recognition. Prosodic features allow the symbolization of accent information in pitch-accent languages, which is important for pronunciation. This method automatically generates labeled text-to-speech data for pitch-accent languages from speech alone. In contrast, conventional methods can recognize only pitch accents among phonetic and prosodic features and often have low character error rates. Our method combines wav2vec 2.0, an SSL-based AM method, with the Transformer architecture commonly used in natural language processing to correct phonetic-confusion errors. The experiments indicate that the proposed method achieves a 4.7% character error rate with an SSL-based acoustic model fine-tuned on 5.69 hours of data and a phoneme-error-correction Transformer.
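The abstract describes a two-stage pipeline: a wav2vec 2.0 acoustic model decodes symbol sequences from speech, and a Transformer then corrects phonetic-confusion errors. The sketch below is a minimal illustration of that structure, not the authors' released configuration: the checkpoint name, vocabulary size, and model dimensions are assumptions, and a public English checkpoint stands in for the paper's fine-tuned Japanese phoneme/prosody model.

import torch
import torch.nn as nn
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# --- Stage 1: SSL-based acoustic model (wav2vec 2.0 with CTC decoding) ------
# Public English checkpoint used as a placeholder; the paper's model would be
# fine-tuned to emit phoneme and prosodic symbols instead of characters.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
am = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

waveform = torch.randn(16000)  # 1 s of dummy 16 kHz audio for illustration
inputs = processor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = am(inputs.input_values).logits
hyp_ids = torch.argmax(logits, dim=-1)          # greedy CTC decoding
hyp_symbols = processor.batch_decode(hyp_ids)   # noisy symbol sequence

# --- Stage 2: Transformer for phoneme-error correction ----------------------
# A small encoder-decoder mapping noisy symbol sequences to corrected ones,
# analogous to seq2seq error correction in NLP (sizes are assumptions).
class ErrorCorrector(nn.Module):
    def __init__(self, vocab_size=128, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(d_model=d_model, nhead=4,
                                          num_encoder_layers=3,
                                          num_decoder_layers=3,
                                          batch_first=True)
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, noisy_ids, target_ids):
        src = self.embed(noisy_ids)
        tgt = self.embed(target_ids)
        # Causal mask so the decoder cannot look at future target symbols.
        mask = self.transformer.generate_square_subsequent_mask(target_ids.size(1))
        out = self.transformer(src, tgt, tgt_mask=mask)
        return self.proj(out)   # logits over the corrected symbol vocabulary

corrector = ErrorCorrector()
noisy = torch.randint(0, 128, (1, 20))    # placeholder noisy phoneme ids
target = torch.randint(0, 128, (1, 20))   # placeholder reference ids
corrected_logits = corrector(noisy, target)
print(hyp_symbols, corrected_logits.shape)

In this reading, the wav2vec 2.0 stage supplies the few-shot acoustic modeling described in the abstract, while the second-stage Transformer plays the role of the phoneme-error-correction model; how the two stages are trained and decoded in the actual system is specified in the paper, not here.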
Published in: 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW)
Date of Conference: 14-19 April 2024
Date Added to IEEE Xplore: 15 August 2024