DOI: 10.1145/1390749.1390761
research-article

Unsupervised learning of multilingual short message service (SMS) dialect from noisy examples

Published: 24 July 2008

Abstract

Noise in textual data, such as that introduced by multilinguality, misspellings, abbreviations, deletions, phonetic spellings, non-standard transliteration, etc., poses considerable problems for text mining. Such corruptions are very common in instant messenger (IM) and short message service (SMS) data and adversely affect off-the-shelf text-mining methods. Most techniques address this problem with supervised methods, but these require labels that are expensive and time-consuming to obtain. While we do not champion unsupervised methods over supervised ones when quality of results is the supreme and singular concern, we demonstrate that unsupervised methods can provide cost-effective results without the need for expensive human intervention to generate parallel labelled corpora. We present a generative-model-based unsupervised technique that maps non-standard words to their corresponding conventional frequent forms. A Hidden Markov Model (HMM) over a subsequencized representation of words is used, subject to a parameterization such that the training phase involves clustering over vectors rather than the customary dynamic programming over sequences. A principled transformation of the maximum-likelihood-based "central clustering" cost function into a "pairwise similarity" based clustering is proposed. This transformation makes it possible to apply "subsequence kernel" based methods, which model delete and insert edit operations well. The novelty of this approach lies in avoiding the expensive (Baum-Welch) iterations required for HMMs through a careful factorization of the HMM log-likelihood, and in establishing the connection between the information-theoretic cost function and the kernel approach of machine learning. Anecdotal evidence of efficacy is provided on public and proprietary data.
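
To make the "subsequence kernel" idea concrete, the following is a minimal sketch of a gap-weighted subsequence kernel between two words, in the style of the string kernels used for text classification. It is not the authors' implementation: the function names, the subsequence length `n`, and the gap penalty `decay` are assumptions chosen for illustration, and the brute-force enumeration is only practical for short strings such as SMS tokens. Each shared length-`n` subsequence contributes more when its characters sit close together, which is why character deletions and insertions lower the similarity gradually instead of destroying it.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def subsequence_features(word, n, decay):
    """Map each length-n subsequence of `word` to its gap-weighted score.

    Every occurrence contributes decay ** span, where span is the number of
    characters covered from the first to the last matched position.
    """
    features = defaultdict(float)
    for idx in combinations(range(len(word)), n):
        span = idx[-1] - idx[0] + 1
        features["".join(word[i] for i in idx)] += decay ** span
    return features


def subsequence_kernel(s, t, n=2, decay=0.5):
    """Inner product of the gap-weighted subsequence feature vectors."""
    fs = subsequence_features(s, n, decay)
    ft = subsequence_features(t, n, decay)
    return sum(weight * ft[u] for u, weight in fs.items() if u in ft)


def normalized_kernel(s, t, n=2, decay=0.5):
    """Kernel rescaled so that a word compared with itself scores 1."""
    denom = sqrt(subsequence_kernel(s, s, n, decay)
                 * subsequence_kernel(t, t, n, decay))
    return subsequence_kernel(s, t, n, decay) / denom if denom else 0.0


if __name__ == "__main__":
    # Hypothetical SMS variants of "tomorrow" next to an unrelated word.
    for variant in ["tomorrow", "tmrw", "tomoro", "later"]:
        print(variant, round(normalized_kernel("tomorrow", variant), 3))
```

A pairwise similarity matrix built from `normalized_kernel` could then be passed to any similarity-based clustering routine; which routine the paper uses is not stated in the abstract.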

    Published In

    AND '08: Proceedings of the second workshop on Analytics for noisy unstructured text data
    July 2008
    130 pages
    ISBN:9781605581965
    DOI:10.1145/1390749
    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Conference

    AND '08

    Acceptance Rates

    Overall Acceptance Rate 15 of 22 submissions, 68%

    Cited By

    • (2018) Relevance Feedback Mechanism for Resolving Transcription Ambiguity in SMS Based Literature Information System. Smart Intelligent Computing and Applications, pp. 527-542. DOI: 10.1007/978-981-13-1927-3_56. Online publication date: 5-Nov-2018
    • (2014) Text messaging and retrieval techniques for a mobile health information system. Journal of Information Science, 40:6, pp. 736-748. DOI: 10.1177/0165551514540400. Online publication date: 1-Dec-2014
    • (2013) Extraction of Spelling Variations from Language Structure for Noisy Text Correction. Proceedings of the 2013 12th International Conference on Document Analysis and Recognition, pp. 324-328. DOI: 10.1109/ICDAR.2013.72. Online publication date: 25-Aug-2013
    • (2010) Tokenizing micro-blogging messages using a text classification approach. Proceedings of the fourth workshop on Analytics for noisy unstructured text data, pp. 81-88. DOI: 10.1145/1871840.1871853. Online publication date: 26-Oct-2010
    • (2009) SMS based interface for FAQ retrieval. Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, pp. 852-860. DOI: 10.5555/1690219.1690266. Online publication date: 2-Aug-2009
