Abstract:
In this paper, we explore the use of social media data to compensate for the lack of large prepared text corpora for LVCSR language modeling. Extensive normalization is required to handle the informal and noisy nature of social media text. We propose a similarity-based text normalization approach in which similarity in terms of spelling, pronunciation, and context is considered. Similarity between a source (nonstandard) word and a target (normalized) word is measured by edit distance and Kullback-Leibler distance. The proposed normalization method can handle homophones, spelling errors, and insertions (repeated characters), which occur quite often in Twitter text. We then train n-gram language models with the normalized texts and achieve up to 60% relative improvement in perplexity and 9% in WER on a mobile speech-to-speech translation task. Because it is unsupervised, the proposed approach is applicable to other types of social media text.
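The abstract describes scoring normalization candidates by combining spelling similarity (edit distance) with contextual similarity (Kullback-Leibler distance). The sketch below is a minimal illustration of that general idea under stated assumptions, not the paper's implementation: the linear weighting, the add-epsilon smoothing, the candidate list, and the toy context counts are all hypothetical choices made for the example.

```python
# Illustrative sketch (not the authors' implementation): score normalization
# candidates for a noisy token by mixing normalized edit distance with the
# KL divergence between context-word distributions. Weights and data are
# assumptions chosen only to make the example runnable.

import math
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance via dynamic programming (one row)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,          # deletion
                dp[j - 1] + 1,      # insertion
                prev + (ca != cb),  # substitution
            )
    return dp[-1]


def kl_divergence(p: Counter, q: Counter, eps: float = 1e-6) -> float:
    """KL(p || q) over context-word counts, with add-epsilon smoothing."""
    vocab = set(p) | set(q)
    p_total = sum(p.values()) + eps * len(vocab)
    q_total = sum(q.values()) + eps * len(vocab)
    kl = 0.0
    for w in vocab:
        pw = (p[w] + eps) / p_total
        qw = (q[w] + eps) / q_total
        kl += pw * math.log(pw / qw)
    return kl


def score_candidate(source: str, target: str,
                    source_ctx: Counter, target_ctx: Counter,
                    alpha: float = 0.5) -> float:
    """Lower is better: weighted mix of spelling and context distances."""
    spelling = edit_distance(source, target) / max(len(source), len(target))
    context = kl_divergence(source_ctx, target_ctx)
    return alpha * spelling + (1 - alpha) * context


# Toy example: normalize the repeated-character token "cooool".
candidates = ["cool", "call", "coal"]
src_ctx = Counter({"so": 3, "that": 2, "was": 2})
cand_ctx = {
    "cool": Counter({"so": 4, "that": 3, "was": 2}),
    "call": Counter({"phone": 5, "me": 4}),
    "coal": Counter({"mine": 3, "black": 2}),
}
best = min(candidates,
           key=lambda c: score_candidate("cooool", c, src_ctx, cand_ctx[c]))
print(best)  # expected: "cool"
```

In a full system along the lines the abstract sketches, candidate generation and a pronunciation-similarity term would sit on top of such a score, and the normalized text would then feed n-gram language model training.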
Date of Conference: 10-12 September 2014
Date Added to IEEE Xplore: 02 March 2015
Electronic ISBN: 978-1-4799-7094-0