Abstract
Noise in textual data, such as that introduced by multilinguality, misspellings, abbreviations, deletions, phonetic spellings, and non-standard transliteration, poses considerable problems for text mining. Such corruptions are very common in instant messenger and short message service (SMS) data, and they adversely affect off-the-shelf text mining methods. Most existing techniques address this problem with supervised methods that rely on hand-labeled corrections. However, the human-generated labels and corrections they require are expensive and time consuming to obtain because of the multilinguality and complexity of the corruptions. While we do not champion unsupervised methods over supervised ones when quality of results is the sole concern, we demonstrate that unsupervised methods can provide cost-effective results without the expensive human intervention needed to generate a parallel labeled corpus. We present a generative-model-based unsupervised technique that maps non-standard words to their corresponding conventional frequent forms. A hidden Markov model (HMM) over a "subsequencized" representation of words is used, in which a word is represented as a bag of weighted subsequences. The approximate maximum likelihood inference algorithm is designed so that the training phase involves clustering over vectors rather than the customary and expensive dynamic programming (the Baum–Welch algorithm) over sequences that HMMs normally require. We propose a principled transformation of the maximum-likelihood-based "central clustering" cost function of Baum–Welch into a "pairwise similarity" based clustering. This transformation makes it possible to apply "subsequence kernel" based methods, which model delete and insert corruptions well. The novelty of the approach is that the expensive Baum–Welch iterations required for HMM training can be avoided by approximating the log-likelihood function and establishing a connection between the log-likelihood and a pairwise distance. Anecdotal evidence of efficacy is provided on public and proprietary data.
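To make the representation concrete, the following is a minimal sketch in Python, not the authors' implementation, of the bag-of-weighted-subsequences idea and the pairwise similarity built on it. The decay factor lam, the subsequence length k, and the function names subsequence_bag and similarity are illustrative assumptions, in the spirit of the gap-weighted string kernel of Lodhi et al. (2002).

    # A minimal sketch (assumed, not the paper's code) of a "subsequencized"
    # word representation: each word becomes a bag of gap-weighted
    # subsequences, so that deletions and insertions in SMS spellings
    # ("tmrw" vs "tomorrow") still yield measurable overlap.

    from itertools import combinations
    from collections import defaultdict
    import math

    def subsequence_bag(word, k=2, lam=0.5):
        """Map a word to {subsequence: weight}. Each length-k subsequence is
        weighted by lam ** span, where span is the number of characters it
        stretches over, so contiguous subsequences count the most."""
        bag = defaultdict(float)
        for idx in combinations(range(len(word)), k):
            span = idx[-1] - idx[0] + 1
            bag[''.join(word[i] for i in idx)] += lam ** span
        return bag

    def similarity(u, v, k=2, lam=0.5):
        """Normalized inner product of two subsequence bags: a pairwise
        similarity that can drive clustering over vectors, in place of
        per-sequence Baum-Welch dynamic programming."""
        bu, bv = subsequence_bag(u, k, lam), subsequence_bag(v, k, lam)
        dot = sum(w * bv[s] for s, w in bu.items() if s in bv)
        nu = math.sqrt(sum(w * w for w in bu.values()))
        nv = math.sqrt(sum(w * w for w in bv.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    if __name__ == '__main__':
        # An SMS variant scores closer to its conventional form than to an
        # unrelated word.
        print(round(similarity('tmrw', 'tomorrow'), 3))  # relatively high
        print(round(similarity('tmrw', 'message'), 3))   # near zero

Because a gapped subsequence such as "tw" survives both the abbreviation "tmrw" and the full form "tomorrow", delete and insert corruptions degrade the similarity only gradually, which is what allows clustering over these vectors to stand in for expensive per-sequence HMM training.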
References
Ahmad, F., Kondrak, G.: Learning a spelling error model from search query logs. In: HLT ’05: Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, Morristown, NJ, USA. Association for Computational Linguistics. pp. 955–962 (2005)
Baum L.E., Petrie T., Soules G., Weiss N.: A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Stat. 41(1), 164–171 (1970)
Brill, E., Moore, R.C.: An improved error model for noisy channel spelling correction. In: Proceedings of the 38th Annual Meeting of the ACL, pp. 286–293 (2000)
Brown P.F., Della Pietra V.J., deSouza P.V., Lai J.C., Mercer R.L.: Class-based n-gram models of natural language. Comput. Linguist. 18(4), 467–479 (1992)
Choudhury M., Saraf R., Jain V., Mukherjee A., Sarkar S., Basu A.: Investigation and modeling of the structure of texting language. Int. J. Doc. Anal. Recognit. 10(3–4), 157–174 (2007)
Cover T.M., Thomas J.A.: Elements of Information Theory. Wiley, New York (1991)
Dempster A.P., Laird N.M., Rubin D.B.: Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B 39(1), 1–38 (1977)
Hofmann, T., Puzicha, J.: Unsupervised learning from dyadic data. Technical Report TR-98-042, International Computer Science Institute, Berkeley (1998)
How, Y., Kan, M.-Y.: Optimizing predictive text entry for short message service on mobile phones. In: Proceedings of Human Computer Interfaces International (2005)
Karypis, G.: CLUTO—a clustering toolkit. Technical Report #02-017. Department of Computer Science, University of Minnesota (2003)
Kothari, G., Negi, S., Faruquie, T.A., Chakaravarthy, V.T., Subramaniam, L.V.: SMS based interface for FAQ retrieval. In: Joint Conference of the Association for Computational Linguistics and International Joint Conference on Natural Language Processing (ACL-IJCNLP 2009), Singapore (2009)
Kukich K.: Techniques for automatically correcting words in text. ACM Comput. Surv. 24(4), 377–439 (1992)
Lodhi H., Saunders C., Shawe-Taylor J., Cristianini N., Watkins C.: Text classification using string kernels. J. Mach. Learn. Res. 2, 419–444 (2002)
Navarro G.: A guided tour to approximate string matching. ACM Comput. Surv. 33(1), 31–88 (2001)
Rabiner L.R.: A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77(2), 257–286 (1989)
Shi J., Malik J.: Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 22(8), 888–905 (2000)
Smyth P., Heckerman D., Jordan M.I.: Probabilistic independence networks for hidden Markov probability models. Neural Comput. 9(2), 227–269 (1997)
Sproat R., Black A., Chen S., Kumar S., Ostendorf M., Richards C.: Normalization of non-standard words. Comput. Speech Lang. 15(3), 287–333 (2001)
Toutanova, K., Moore, R.C.: Pronunciation modelling for improved spelling correction. In: Proceedings of 40th Annual Meeting of the ACL, pp. 144–151 (2002)
Additional information
This work was done while S. Acharyya was visiting IBM Research, India.
Cite this article
Acharyya, S., Negi, S., Subramaniam, L.V. et al. Language independent unsupervised learning of short message service dialect. IJDAR 12, 175–184 (2009). https://doi.org/10.1007/s10032-009-0093-9