Normalization of informal text

https://doi.org/10.1016/j.csl.2013.07.001

Highlights

  • Normalization of abbreviations in noisy, informal text.

  • Collection, filtering and annotation of Twitter status messages.

  • Comparison of statistical and machine translation approaches.

  • Effects of language model order on accuracy.

  • Combination of methods to achieve best results.

Abstract

This paper describes a noisy-channel approach to the normalization of informal text, such as that found in emails, chat rooms, and SMS messages. In particular, we introduce two character-level methods for the abbreviation-modeling component of the noisy-channel model: a statistical classifier using language-based features to decide whether a character is likely to be removed from a word, and a character-level machine translation model. A two-stage approach is used: in the first stage, possible candidates are generated using the selected abbreviation model, and in the second stage the best candidate is chosen by decoding with a language model. Overall, we find that this approach works well and is on par with current research in the field.

Introduction

Text messaging is a rapidly growing form of alternative communication for cell phones. This popularity has caused safety concerns, leading many US states to pass laws prohibiting texting while driving. The technology is also difficult to use for people with visual impairments or physical disabilities. We believe a text-to-speech (TTS) system for cell phones can mitigate these problems, promoting both safe travel and ease of use for all. Normalization is the usual first step for TTS.

SMS lingo is similar to the chatspeak that is prolific on forums, blogs, and chatrooms. Screen readers will thus benefit from such technology, enabling visually impaired users to take part in internet culture. In addition, normalizing informal text is important for tasks such as information retrieval, summarization, and keyword, topic, sentiment, and emotion detection, which are currently receiving considerable attention in informal domains.

Normalization of informal text is complicated by the large number of abbreviations used. Some previous work on this problem used phrase-based machine translation (MT) for abbreviation normalization; however, a large annotated corpus is required for such a method since the learning is performed at the word level. By definition, this method cannot make a hypothesis for an abbreviation it did not see in training. This is a serious limitation in a domain where new words are created frequently and irregularly.
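To make the contrast concrete, the sketch below (in Python; the helper function, file names, and example pairs are illustrative and not taken from the paper) shows the standard trick behind character-level MT for this task: parallel abbreviation/word pairs are rewritten as space-separated character sequences, so a phrase-based toolkit learns mappings between character strings rather than whole words and can therefore hypothesize expansions for abbreviations never seen in training.

    # Sketch: preparing character-level parallel data for a phrase-based MT toolkit.
    # Learning at the character level lets the model generalize to abbreviations
    # never seen in training, unlike a word-level phrase table.
    # The helper name, file names, and example pairs are illustrative only.

    def to_char_sequence(token):
        """Rewrite a token as a space-separated character sequence."""
        return " ".join(token.lower())

    # Parallel training pairs: (abbreviation, standard English word)
    pairs = [("tmrw", "tomorrow"), ("gr8", "great"), ("pls", "please")]

    with open("train.abbrev", "w") as src, open("train.english", "w") as tgt:
        for abbrev, word in pairs:
            src.write(to_char_sequence(abbrev) + "\n")
            tgt.write(to_char_sequence(word) + "\n")

    # A phrase-based toolkit such as Moses can then be trained on these files just
    # as it would be on word-level bitext, yielding character-level mappings
    # (e.g., "8" -> "e a t") that also apply to unseen abbreviations.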

This work is an extension of our work in Pennell and Liu (2010, 2011a, 2011b). In this paper, we establish two sets of baseline results for this problem on our data set. The first uses a language model for decoding without an abbreviation model, while the second uses Jazzy (Idzelis, 2005), a state-of-the-art spell-checking module. We then compare the use of our two abbreviation models for decoding informal-text sentences. We also determine the effect on decoding accuracy when more or less context is available. Finally, we combine the two systems in various ways and demonstrate that a combined model performs better than either system individually.
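As one concrete illustration of such a combination, the n-best candidate lists of two systems can be merged by interpolating their scores. The sketch below (with hypothetical weights, floor value, and scores) shows this generic strategy; it is not necessarily one of the combination schemes evaluated in the paper.

    # Illustrative sketch: merging the candidate lists of two abbreviation models
    # (e.g., a CRF-based and an MT-based system) by interpolating their log scores.
    # The weights, floor value, and example scores are hypothetical.

    def combine(scores_a, scores_b, weight=0.5, floor=-20.0):
        """scores_a, scores_b: dicts mapping candidate word -> log score for the
        same abbreviation. Returns candidates ranked by interpolated score."""
        merged = set(scores_a) | set(scores_b)
        combined = {
            w: weight * scores_a.get(w, floor) + (1 - weight) * scores_b.get(w, floor)
            for w in merged
        }
        return sorted(combined, key=combined.get, reverse=True)

    # Hypothetical log scores for the abbreviation "l8r" from two systems
    print(combine({"later": -0.2, "liter": -2.5}, {"later": -0.4, "l8r": -3.0}))
    # -> ['later', 'liter', 'l8r']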

Section snippets

Related work

This section briefly describes relevant work in fields directly related to our research, though not always directly applied to informal text. We describe the tasks of modeling and expanding abbreviations in text as well as research on normalization of text in both formal and informal domains.

Data

Three small SMS corpora are currently publicly available (How and Kan, 2005; Fairon and Paumier, 2006; Choudhury et al., 2007). In addition, there is the Edinburgh Twitter Corpus (Petrovic et al., 2010), which is quite large but lacks corresponding standard English transcriptions. The small Twitter corpus used in Han and Baldwin (2011) has also been released; this corpus has annotation and context but contains only 549 messages. Due to the lack of a large parallel

Method

For a given text message sentence, $A = a_1 a_2 \ldots a_n$, the problem of determining the sentence of standard English words, $W = w_1 w_2 \ldots w_n$, can be formally described as below, similar to speech recognition and machine translation problems:

$$\hat{W} = \arg\max_W P(W \mid A) = \arg\max_W P(W)\,P(A \mid W) \approx \arg\max_W \prod_i P(w_i \mid w_{i-n+1} \ldots w_{i-1})\,P(a_i \mid w_i) = \arg\max_W \sum_i \big[ \log P(w_i \mid w_{i-n+1} \ldots w_{i-1}) + \log P(a_i \mid w_i) \big]$$

where the approximation is based on the assumption that each abbreviation depends only on the corresponding word (note that we are not considering one-to-many
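For illustration, the following minimal sketch implements this objective directly: an abbreviation model supplies scored candidates $P(a \mid w)$ for each token, a bigram language model supplies $P(w_i \mid w_{i-1})$, and the decoder picks the candidate sequence maximizing the summed log scores. The toy probability tables are placeholders rather than the CRF, MT, or LM models used in the paper, and the exhaustive search stands in for the Viterbi-style decoding one would use in practice.

    import math
    from itertools import product

    # Toy abbreviation model P(a|w): for each observed token, a few scored
    # standard-word candidates. In the paper these scores come from the CRF or
    # character-level MT model; the values here are illustrative only.
    candidates = {
        "c":   {"see": 0.7, "sea": 0.2, "c": 0.1},
        "u":   {"you": 0.9, "u": 0.1},
        "l8r": {"later": 0.8, "liter": 0.2},
    }

    # Toy bigram LM P(w_i | w_{i-1}); unseen bigrams get a small floor probability.
    bigram_lm = {
        ("<s>", "see"): 0.3, ("see", "you"): 0.5, ("you", "later"): 0.6,
    }
    FLOOR = 1e-4

    def lm_prob(prev, word):
        return bigram_lm.get((prev, word), FLOOR)

    def decode(tokens):
        """Pick the candidate sequence maximizing
        sum_i [ log P(w_i | w_{i-1}) + log P(a_i | w_i) ]."""
        best_seq, best_score = None, float("-inf")
        for seq in product(*(candidates[t].items() for t in tokens)):
            score, prev = 0.0, "<s>"
            for word, p_a_given_w in seq:
                score += math.log(lm_prob(prev, word)) + math.log(p_a_given_w)
                prev = word
            if score > best_score:
                best_seq, best_score = [w for w, _ in seq], score
        return best_seq

    print(decode(["c", "u", "l8r"]))  # -> ['see', 'you', 'later']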

Experimental setup

With the exception of tests to establish baselines, we used a cross-validation setup for our experiments. The data from four annotators is used as training data, while the data from the fifth annotator is divided in half for development and testing. For each fold we perform two tests: first we use the first half for development and test on the second half, then the development and test portions are swapped. The results shown here are averaged over all ten tests.
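The fold structure just described can be sketched as follows (a minimal illustration in Python; the data layout and function name are placeholders, not the actual corpus interface):

    # Sketch of the cross-validation layout described above: with five annotators,
    # each fold holds one annotator out, splits that annotator's data in half, and
    # runs two tests (dev/test halves swapped), giving 5 x 2 = 10 tests in total.
    # The data structures are placeholders, not the actual corpus format.

    def folds(annotator_data):
        """annotator_data: dict mapping annotator id -> list of annotated messages."""
        ids = sorted(annotator_data)
        for held_out in ids:
            train = [msg for a in ids if a != held_out for msg in annotator_data[a]]
            held = annotator_data[held_out]
            first, second = held[: len(held) // 2], held[len(held) // 2:]
            # Two tests per fold: develop on one half, evaluate on the other, then swap.
            yield train, first, second
            yield train, second, first

    # results = [evaluate(train, dev, test) for train, dev, test in folds(data)]
    # final_score = sum(results) / len(results)   # averaged over all ten tests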

The language model (LM) we

Conclusions and future work

In this paper, we have provided an extensive comparison of two abbreviation models for normalizing abbreviations found in informal text. Both models yield improvements over two baselines (using a language model alone for decoding, and a state-of-the-art spell-checking algorithm), even when the models are used with no context. With context and an LM, we significantly outperform both baselines. Our MT model vastly outperforms our CRF model, even on the deletion-type abbreviations for which the

Acknowledgements

Thanks to Justin Schneider and Duc Le for their work in implementing the message selection procedure for annotations. Thanks also to Paul Cook for providing his abbreviation type labels for the SMS test set so that we could perform comparison experiments.

This work is partly supported by DARPA under Contract No. HR0011-12-C-0016. Any opinions expressed in this material are those of the authors and do not necessarily reflect the views of DARPA.

References (32)

  • R. Sproat et al.

    Normalization of non-standard words

    Computer Speech and Language

    (2001)
  • A. Aw et al.

    A phrase-based statistical model for SMS text normalization

  • S. Bangalore et al.

    Bootstrapping bilingual data using consensus translation for a multilingual instant messaging system

  • S. Bartlett et al.

    Automatic syllabification with structured SVMs for letter-to-phoneme conversion

  • R. Beaufort et al.

    A hybrid rule/model-based finite-state framework for normalizing SMS messages

  • M. Choudhury et al.

    Investigation and modeling of the structure of texting language

    International Journal of Document Analysis and Recognition

    (2007)
  • D. Contractor et al.

    Unsupervised cleansing of noisy text

  • P. Cook et al.

    An unsupervised model for text message normalization

  • C. Fairon et al.

    A translated corpus of 30,000 French SMS

  • B. Han et al.

    Lexical normalisation of short text messages: Makn sens a #twitter

  • B. Han et al.

    Automatically constructing a normalisation dictionary for microblogs

  • Q.C.A. Henríquez et al.

    A ngram-based statistical machine translation approach for text normalization on chat-speak style communications

  • How, Y., Kan, M.Y., 2005. Optimizing predictive text entry for short message service on mobile phones, in: Human...
  • Idzelis, M., 2005. Jazzy: The java open source spell checker....
  • C. Kobus et al.

    Normalizing SMS: are two metaphors better than one?

  • P. Koehn et al.

    Moses: open source toolkit for statistical machine translation

Cited by (32)

    • Graph-based Turkish text normalization and its impact on noisy text processing

      2022, Engineering Science and Technology, an International Journal
      Citation excerpt:

      Feature weights were trained by sequential Monte Carlo in a maximum-likelihood framework in order to overcome large label space, and the local context was handled via a language model. Previous research also handled normalization as a machine translation problem from non-standard to standard words [64]. The work presented in [2] used a phrase-based statistical machine translation model to normalize English SMS texts at the token level, whereas the work on normalizing Slovene tweets [52] used a character-level statistical translation system.

    • An ontology knowledge inspection methodology for quality assessment and continuous improvement

      2021, Data and Knowledge Engineering
      Citation excerpt:

      Despite great advances in this research field, the use of these methods may result in the generation of inconsistencies and low-quality ontologies. This anomalous behaviour is directly connected with the intrinsic difficulties of different natural language processing challenges such as the disambiguation of word meanings (often called Word Sense Disambiguation, WSD) [10–13], handling informal text [12] or adequately dealing with new words from specific domains [13]. A direct consequence of this issue is that the costs of creating ontologies by using learning methods are not significantly reduced but are instead simply moved to a debugging/fixing stage.

    • An empirical study on POS tagging for Vietnamese social media text

      2018, Computer Speech and Language
      Citation excerpt:

      Web 2.0 platforms such as blogs, forums, wikis, and social networks have facilitated the generation of a huge volume of user-generated text. These data have become an important source for both data mining and NLP communities, and at the same time require appropriate tools for text analysis (Pennell and Liu, 2014). Although available POS taggers can achieve high accuracy on conventional data, the performance usually degrades on noisy, unconventional text generated by social users.

    • Arabic Social Media Analysis and Translation

      2017, Procedia Computer Science
    • Studying the effect and treatment of misspelled queries in Cross-Language Information Retrieval

      2016, Information Processing and Management
      Citation excerpt:

      Regarding future work, we intend to work mainly on improving the translation process of character n-grams in order to increase its quality for retrieval applications. Moreover, from a pragmatic point of view, and following the example of the research community, we intend to study the application of our character n-gram based approach to our current research lines in microblog text processing for text normalization (Pennell & Liu, 2014), sentiment analysis (Aisopos, Papadakis, Tserpes, & Varvarigou, 2012) and language identification tasks (Lui & Baldwin, 2014). At this respect, it should be noted that Twitter and other microblogging services are very noisy multilingual environments, for which specialized linguistic resources are still very scarce, particularly for non-English languages.

    • On the feasibility of character n-grams pseudo-translation for Cross-Language Information Retrieval tasks

      2016, Computer Speech and Language
      Citation excerpt:

      Character n-grams have been successfully used for a long time in a wide variety of text processing problems and domains, including the following: approximate word matching (Zobel and Dart, 1995), language identification (Lui et al., 2014) spelling-error detection (Salton, 1989), author attribution and profiling (Stamatatos, 2009; Escalante et al., 2011; Sapkota et al., 2013), and bioinformatics (Tomović et al., 2006). More recently, character n-grams have been drawing increasing attention in the field of automatic processing of SMS and microblog (e.g. Twitter) texts – which tend to be noisy by nature – including tasks such as text normalization (Pennell and Liu, 2014), sentiment analysis (Aisopos et al., 2012) or language identification (Lui and Baldwin, 2014). In this way, n-gram based processing has become a standard state-of-the-art text processing approach, whose success comes from its positive features (Tomović et al., 2006):
