research-article

Unsupervised learning of multilingual short message service (SMS) dialect from noisy examples

Authors: Sreangsu Acharyya, L. V. Subramaniam, Shourya Roy

AND '08: Proceedings of the second workshop on Analytics for noisy unstructured text data

Pages 67 - 74

https://doi.org/10.1145/1390749.1390761

Published: 24 July 2008

Abstract

Noise in textual data, such as that introduced by multilinguality, misspellings, abbreviations, deletions, phonetic spellings, and non-standard transliteration, poses considerable problems for text mining. Such corruptions are very common in instant messenger (IM) and short message service (SMS) data and adversely affect off-the-shelf text-mining methods. Most techniques address this problem with supervised methods, but these require labels that are expensive and time-consuming to obtain. While we do not champion unsupervised methods over supervised ones when quality of results is the sole concern, we demonstrate that unsupervised methods can provide cost-effective results without the expensive human intervention needed to generate parallel labelled corpora. We present an unsupervised technique, based on a generative model, that maps non-standard words to their corresponding conventional frequent form. A Hidden Markov Model (HMM) over a subsequencized representation of words is used, subject to a parameterization such that the training phase involves clustering over vectors rather than the customary dynamic programming over sequences. A principled transformation of the maximum-likelihood-based "central clustering" cost function into a "pairwise similarity" based clustering is proposed. This transformation makes it possible to apply "subsequence kernel" based methods, which model delete and insert edit operations well. The novelty of this approach lies in avoiding the expensive (Baum-Welch) iterations required for an HMM through a careful factorization of the HMM log-likelihood, and in establishing the connection between the information-theoretic cost function and the kernel approach of machine learning. Anecdotal evidence of efficacy is provided on public and proprietary data.
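To illustrate why a subsequence kernel suits words that differ by deletions and insertions, the following is a minimal sketch of a gap-weighted subsequence kernel of the kind found in the string-kernel literature. It is not the authors' implementation: the function names, the subsequence length `n`, and the gap penalty `decay` are all assumptions chosen for illustration, and the brute-force enumeration is only practical for short strings such as single words.

```python
from collections import defaultdict
from itertools import combinations
import math


def subsequence_features(word, n, decay):
    """Map each length-n subsequence of `word` to its gap-weighted score.

    Every occurrence contributes decay ** span, where span is the number of
    characters between its first and last matched position (inclusive), so
    occurrences with many skipped characters count less.
    """
    feats = defaultdict(float)
    for idx in combinations(range(len(word)), n):
        span = idx[-1] - idx[0] + 1
        feats["".join(word[i] for i in idx)] += decay ** span
    return feats


def subsequence_kernel(s, t, n=2, decay=0.5):
    """Inner product of the two gap-weighted subsequence feature maps."""
    fs = subsequence_features(s, n, decay)
    ft = subsequence_features(t, n, decay)
    return sum(v * ft[u] for u, v in fs.items() if u in ft)


def normalized_kernel(s, t, n=2, decay=0.5):
    """Kernel value rescaled so that a word compared with itself scores 1."""
    kss = subsequence_kernel(s, s, n, decay)
    ktt = subsequence_kernel(t, t, n, decay)
    if kss == 0 or ktt == 0:
        return 0.0  # a word shorter than n has no subsequences of length n
    return subsequence_kernel(s, t, n, decay) / math.sqrt(kss * ktt)


if __name__ == "__main__":
    # Hypothetical SMS-style variants of "tomorrow" plus an unrelated word.
    for variant in ["tmrw", "tomoro", "2moro", "great"]:
        print(variant, round(normalized_kernel("tomorrow", variant), 3))
```

Because the similarity is defined on pairs of words, a matrix of such values could be passed to any pairwise-similarity clustering routine, which is the kind of setting the abstract describes. The specific clustering procedure used in the paper is not reproduced here.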

Index Terms

Unsupervised learning of multilingual short message service (SMS) dialect from noisy examples
1. Information systems
  1. Information retrieval
    1. Document representation

Published In

AND '08: Proceedings of the second workshop on Analytics for noisy unstructured text data

July 2008

130 pages

ISBN:9781605581965

DOI:10.1145/1390749

Conference Chairs:
Daniel Lopresti (Lehigh University),
Shourya Roy (IBM India Research Lab),
Klaus Schulz (University of Munich),
L. Venkata Subramaniam (India Research Lab)

Copyright © 2008 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

AND '08

AND '08: Second Workshop on Analytics for Noisy Unstructured Text Data

July 24, 2008

Singapore

Acceptance Rates

Overall Acceptance Rate 15 of 22 submissions, 68%

Bibliometrics

Total Citations: 5
Total Downloads: 400
Downloads (Last 12 months): 5
Downloads (Last 6 weeks): 2

Reflects downloads up to 05 Mar 2025

Citations

Cited By

Pathak V, Joshi M (2018). Relevance Feedback Mechanism for Resolving Transcription Ambiguity in SMS Based Literature Information System. Smart Intelligent Computing and Applications, pages 527-542. DOI: 10.1007/978-981-13-1927-3_56. Online publication date: 5-Nov-2018.
https://doi.org/10.1007/978-981-13-1927-3_56
Adesina A, Agbele K, Abidoye A, Nyongesa H (2014). Text messaging and retrieval techniques for a mobile health information system. Journal of Information Science, 40:6, pages 736-748. DOI: 10.1177/0165551514540400. Online publication date: 1-Dec-2014.
https://dl.acm.org/doi/10.1177/0165551514540400
Gerdjikov S, Mihov S, Nenchev V (2013). Extraction of Spelling Variations from Language Structure for Noisy Text Correction. Proceedings of the 2013 12th International Conference on Document Analysis and Recognition, pages 324-328. DOI: 10.1109/ICDAR.2013.72. Online publication date: 25-Aug-2013.
https://dl.acm.org/doi/10.1109/ICDAR.2013.72
Laboreiro G, Sarmento L, Teixeira J, Oliveira E, Basili R, Lopresti D, Ringlstetter C, Roy S, Schulz K, Subramaniam L (2010). Tokenizing micro-blogging messages using a text classification approach. Proceedings of the fourth workshop on Analytics for noisy unstructured text data, pages 81-88. DOI: 10.1145/1871840.1871853. Online publication date: 26-Oct-2010.
https://dl.acm.org/doi/10.1145/1871840.1871853
Kothari G, Negi S, Faruquie T, Chakaravarthy V, Subramaniam L, Su K (2009). SMS based interface for FAQ retrieval. Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2 - Volume 2, pages 852-860. DOI: 10.5555/1690219.1690266. Online publication date: 2-Aug-2009.
https://dl.acm.org/doi/10.5555/1690219.1690266
