Normalization of informal text

https://doi.org/10.1016/j.csl.2013.07.001

Highlights

  • Normalization of abbreviations in noisy, informal text.

  • Collection, filtering and annotation of Twitter status messages.

  • Comparison of statistical and machine translation approaches.

  • Effects of language model order on accuracy.

  • Combination of methods to achieve best results.

Abstract

This paper describes a noisy-channel approach to the normalization of informal text, such as that found in emails, chat rooms, and SMS messages. In particular, we introduce two character-level methods for the abbreviation-modeling component of the noisy-channel model: a statistical classifier using language-based features to decide whether a character is likely to be removed from a word, and a character-level machine translation model. A two-stage approach is used: in the first stage, possible candidates are generated using the selected abbreviation model, and in the second stage the best candidate is chosen by decoding with a language model. Overall, we find that this approach works well and is on par with current research in the field.

Introduction

Text messaging is a rapidly growing form of alternative communication for cell phones. This popularity has caused safety concerns, leading many US states to pass laws prohibiting texting while driving. The technology is also difficult to use for people with visual impairments or physical disabilities. We believe a text-to-speech (TTS) system for cell phones can mitigate these problems, promoting both safe travel and ease of use for all. Normalization is the usual first step for TTS.

SMS lingo is similar to the chatspeak that is prolific on forums, blogs, and chatrooms. Screen readers will thus benefit from such technology, enabling visually impaired users to take part in internet culture. In addition, normalizing informal text is important for tasks such as information retrieval, summarization, and keyword, topic, sentiment, and emotion detection, which are currently receiving considerable attention in informal domains.

Normalization of informal text is complicated by the large number of abbreviations used. Some previous work on this problem used phrase-based machine translation (MT) for abbreviation normalization; however, a large annotated corpus is required for such a method since the learning is performed at the word level. By definition, this method cannot make a hypothesis for an abbreviation it did not see in training. This is a serious limitation in a domain where new words are created frequently and irregularly.
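To make the contrast concrete, the sketch below (in Python; the helper function, file names, and example pairs are illustrative and not taken from the paper) shows the standard trick behind character-level MT for this task: parallel abbreviation/word pairs are rewritten as space-separated character sequences, so a phrase-based toolkit learns mappings between character strings rather than whole words and can therefore hypothesize expansions for abbreviations never seen in training.

    # Sketch: preparing character-level parallel data for a phrase-based MT toolkit.
    # Learning at the character level lets the model generalize to abbreviations
    # never seen in training, unlike a word-level phrase table.
    # The helper name, file names, and example pairs are illustrative only.

    def to_char_sequence(token):
        """Rewrite a token as a space-separated character sequence."""
        return " ".join(token.lower())

    # Parallel training pairs: (abbreviation, standard English word)
    pairs = [("tmrw", "tomorrow"), ("gr8", "great"), ("pls", "please")]

    with open("train.abbrev", "w") as src, open("train.english", "w") as tgt:
        for abbrev, word in pairs:
            src.write(to_char_sequence(abbrev) + "\n")
            tgt.write(to_char_sequence(word) + "\n")

    # A phrase-based toolkit such as Moses can then be trained on these files just
    # as it would be on word-level bitext, yielding character-level mappings
    # (e.g., "8" -> "e a t") that also apply to unseen abbreviations.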

This work is an extension of our work in Pennell and Liu (2010, 2011a, 2011b). In this paper, we establish two sets of baseline results for this problem on our data set. The first uses a language model for decoding without an abbreviation model, while the second uses Jazzy (Idzelis, 2005), a state-of-the-art spell-checking module. We then compare the use of our two abbreviation models for decoding informal-text sentences. We also determine the effect on decoding accuracy when more or less context is available. Finally, we combine the two systems in various ways and demonstrate that a combined model performs better than either system individually.
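As one concrete illustration of such a combination, the n-best candidate lists of two systems can be merged by interpolating their scores. The sketch below (with hypothetical weights, floor value, and scores) shows this generic strategy; it is not necessarily one of the combination schemes evaluated in the paper.

    # Illustrative sketch: merging the candidate lists of two abbreviation models
    # (e.g., a CRF-based and an MT-based system) by interpolating their log scores.
    # The weights, floor value, and example scores are hypothetical.

    def combine(scores_a, scores_b, weight=0.5, floor=-20.0):
        """scores_a, scores_b: dicts mapping candidate word -> log score for the
        same abbreviation. Returns candidates ranked by interpolated score."""
        merged = set(scores_a) | set(scores_b)
        combined = {
            w: weight * scores_a.get(w, floor) + (1 - weight) * scores_b.get(w, floor)
            for w in merged
        }
        return sorted(combined, key=combined.get, reverse=True)

    # Hypothetical log scores for the abbreviation "l8r" from two systems
    print(combine({"later": -0.2, "liter": -2.5}, {"later": -0.4, "l8r": -3.0}))
    # -> ['later', 'liter', 'l8r']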

Section snippets

Related work

This section briefly describes relevant work in fields directly related to our research, though not always directly applied to informal text. We describe the tasks of modeling and expanding abbreviations in text as well as research on normalization of text in both formal and informal domains.

Data

Three small SMS corpora are currently publicly available (How and Kan, 2005; Fairon and Paumier, 2006; Choudhury et al., 2007). In addition, there is the Edinburgh Twitter Corpus (Petrovic et al., 2010), which is quite large but lacks corresponding standard English transcriptions. The small Twitter corpus used in Han and Baldwin (2011) has also been released; this corpus has annotation and context but contains only 549 messages. Due to the lack of a large parallel

Method

For a given text message sentence, $A = a_1 a_2 \ldots a_n$, the problem of determining the sentence of standard English words, $W = w_1 w_2 \ldots w_n$, can be formally described as below, similar to speech recognition and machine translation problems:

$$\hat{W} = \arg\max_W P(W \mid A) = \arg\max_W P(W)\,P(A \mid W) \approx \arg\max_W \prod_i P(w_i \mid w_{i-n+1} \ldots w_{i-1})\,P(a_i \mid w_i) = \arg\max_W \sum_i \big[ \log P(w_i \mid w_{i-n+1} \ldots w_{i-1}) + \log P(a_i \mid w_i) \big]$$

where the approximation is based on the assumption that each abbreviation depends only on the corresponding word (note that we are not considering one-to-many
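For illustration, the following minimal sketch implements this objective directly: an abbreviation model supplies scored candidates $P(a \mid w)$ for each token, a bigram language model supplies $P(w_i \mid w_{i-1})$, and the decoder picks the candidate sequence maximizing the summed log scores. The toy probability tables are placeholders rather than the CRF, MT, or LM models used in the paper, and the exhaustive search stands in for the Viterbi-style decoding one would use in practice.

    import math
    from itertools import product

    # Toy abbreviation model P(a|w): for each observed token, a few scored
    # standard-word candidates. In the paper these scores come from the CRF or
    # character-level MT model; the values here are illustrative only.
    candidates = {
        "c":   {"see": 0.7, "sea": 0.2, "c": 0.1},
        "u":   {"you": 0.9, "u": 0.1},
        "l8r": {"later": 0.8, "liter": 0.2},
    }

    # Toy bigram LM P(w_i | w_{i-1}); unseen bigrams get a small floor probability.
    bigram_lm = {
        ("<s>", "see"): 0.3, ("see", "you"): 0.5, ("you", "later"): 0.6,
    }
    FLOOR = 1e-4

    def lm_prob(prev, word):
        return bigram_lm.get((prev, word), FLOOR)

    def decode(tokens):
        """Pick the candidate sequence maximizing
        sum_i [ log P(w_i | w_{i-1}) + log P(a_i | w_i) ]."""
        best_seq, best_score = None, float("-inf")
        for seq in product(*(candidates[t].items() for t in tokens)):
            score, prev = 0.0, "<s>"
            for word, p_a_given_w in seq:
                score += math.log(lm_prob(prev, word)) + math.log(p_a_given_w)
                prev = word
            if score > best_score:
                best_seq, best_score = [w for w, _ in seq], score
        return best_seq

    print(decode(["c", "u", "l8r"]))  # -> ['see', 'you', 'later']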

Experimental setup

With the exception of tests to establish baselines, we used a cross-validation setup for our experiments. The data from four annotators is used as training data, while the data from the fifth annotator is divided in half for development and testing. For each fold we perform two tests: first we use the first half for development and test on the second half, then the development and test portions are swapped. The results shown here are averaged over all ten tests.
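The fold structure just described can be sketched as follows (a minimal illustration in Python; the data layout and function name are placeholders, not the actual corpus interface):

    # Sketch of the cross-validation layout described above: with five annotators,
    # each fold holds one annotator out, splits that annotator's data in half, and
    # runs two tests (dev/test halves swapped), giving 5 x 2 = 10 tests in total.
    # The data structures are placeholders, not the actual corpus format.

    def folds(annotator_data):
        """annotator_data: dict mapping annotator id -> list of annotated messages."""
        ids = sorted(annotator_data)
        for held_out in ids:
            train = [msg for a in ids if a != held_out for msg in annotator_data[a]]
            held = annotator_data[held_out]
            first, second = held[: len(held) // 2], held[len(held) // 2:]
            # Two tests per fold: develop on one half, evaluate on the other, then swap.
            yield train, first, second
            yield train, second, first

    # results = [evaluate(train, dev, test) for train, dev, test in folds(data)]
    # final_score = sum(results) / len(results)   # averaged over all ten tests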

The language model (LM) we

Conclusions and future work

In this paper, we have provided an extensive comparison of two abbreviation models for normalizing abbreviations found in informal text. Both models yield improvements over two baselines (using a language model alone for decoding, and a state-of-the-art spell-checking algorithm), even when the models are used with no context. With context and an LM, we significantly outperform both baselines. Our MT model vastly outperforms our CRF model, even on the deletion-type abbreviations for which the

Acknowledgements

Thanks to Justin Schneider and Duc Le for their work in implementing the message selection procedure for annotations. Thanks also to Paul Cook for providing his abbreviation type labels for the SMS test set so that we could perform comparison experiments.

This work is partly supported by DARPA under Contract No. HR0011-12-C-0016. Any opinions expressed in this material are those of the authors and do not necessarily reflect the views of DARPA.

References (32)

  • R. Sproat et al.

    Normalization of non-standard words

    Computer Speech and Language

    (2001)
  • A. Aw et al.

    A phrase-based statistical model for SMS text normalization

  • S. Bangalore et al.

    Bootstrapping bilingual data using consensus translation for a multilingual instant messaging system

  • S. Bartlett et al.

    Automatic syllabification with structured SVMs for letter-to-phoneme conversion

  • R. Beaufort et al.

    A hybrid rule/model-based finite-state framework for normalizing SMS messages

  • M. Choudhury et al.

    Investigation and modeling of the structure of texting language

    International Journal of Document Analysis and Recognition

    (2007)
  • D. Contractor et al.

    Unsupervised cleansing of noisy text

  • P. Cook et al.

    An unsupervised model for text message normalization

  • C. Fairon et al.

    A translated corpus of 30,000 French SMS

  • B. Han et al.

    Lexical normalisation of short text messages: Makn sens a #twitter

  • B. Han et al.

    Automatically constructing a normalisation dictionary for microblogs

  • Q.C.A. Henríquez et al.

    A ngram-based statistical machine translation approach for text normalization on chat-speak style communications

  • How, Y., Kan, M.Y., 2005. Optimizing predictive text entry for short message service on mobile phones, in: Human...
  • Idzelis, M., 2005. Jazzy: The java open source spell checker....
  • C. Kobus et al.

    Normalizing SMS: are two metaphors better than one?

  • P. Koehn et al.

    Moses: open source toolkit for statistical machine translation

Cited by (32)

    • Graph-based Turkish text normalization and its impact on noisy text processing

      2022, Engineering Science and Technology, an International Journal
      Citation excerpt:

      Feature weights were trained by sequential Monte Carlo in a maximum-likelihood framework in order to overcome large label space, and the local context was handled via a language model. Previous research also handled normalization as a machine translation problem from non-standard to standard words [64]. The work presented in [2] used a phrase-based statistical machine translation model to normalize English SMS texts at the token level, whereas the work on normalizing Slovene tweets [52] used a character-level statistical translation system.

    • An ontology knowledge inspection methodology for quality assessment and continuous improvement

      2021, Data and Knowledge Engineering
      Citation excerpt:

      Despite great advances in this research field, the use of these methods may result in the generation of inconsistencies and low-quality ontologies. This anomalous behaviour is directly connected with the intrinsic difficulties of different natural language processing challenges such as the disambiguation of word meanings (often called Word Sense Disambiguation, WSD) [10–13], handling informal text [12] or adequately dealing with new words from specific domains [13]. A direct consequence of this issue is that the costs of creating ontologies by using learning methods are not significantly reduced but are instead simply moved to a debugging/fixing stage.

    • An empirical study on POS tagging for Vietnamese social media text

      2018, Computer Speech and Language
      Citation excerpt:

      Web 2.0 platforms such as blogs, forums, wikis, and social networks have facilitated the generation of a huge volume of user-generated text. These data have become an important source for both data mining and NLP communities, and at the same time require appropriate tools for text analysis (Pennell and Liu, 2014). Although available POS taggers can achieve high accuracy on conventional data, the performance usually degrades on noisy, unconventional text generated by social users.

    • Arabic Social Media Analysis and Translation

      2017, Procedia Computer Science
    • Studying the effect and treatment of misspelled queries in Cross-Language Information Retrieval

      2016, Information Processing and Management
      Citation excerpt:

      Regarding future work, we intend to work mainly on improving the translation process of character n-grams in order to increase its quality for retrieval applications. Moreover, from a pragmatic point of view, and following the example of the research community, we intend to study the application of our character n-gram based approach to our current research lines in microblog text processing for text normalization (Pennell & Liu, 2014), sentiment analysis (Aisopos, Papadakis, Tserpes, & Varvarigou, 2012) and language identification tasks (Lui & Baldwin, 2014). At this respect, it should be noted that Twitter and other microblogging services are very noisy multilingual environments, for which specialized linguistic resources are still very scarce, particularly for non-English languages.

    • On the feasibility of character n-grams pseudo-translation for Cross-Language Information Retrieval tasks

      2016, Computer Speech and Language
      Citation excerpt:

      Character n-grams have been successfully used for a long time in a wide variety of text processing problems and domains, including the following: approximate word matching (Zobel and Dart, 1995), language identification (Lui et al., 2014) spelling-error detection (Salton, 1989), author attribution and profiling (Stamatatos, 2009; Escalante et al., 2011; Sapkota et al., 2013), and bioinformatics (Tomović et al., 2006). More recently, character n-grams have been drawing increasing attention in the field of automatic processing of SMS and microblog (e.g. Twitter) texts – which tend to be noisy by nature – including tasks such as text normalization (Pennell and Liu, 2014), sentiment analysis (Aisopos et al., 2012) or language identification (Lui and Baldwin, 2014). In this way, n-gram based processing has become a standard state-of-the-art text processing approach, whose success comes from its positive features (Tomović et al., 2006):
