A comparison of different calculations for N-gram similarities in a spelling corrector for mobile instant messaging language

Authors:
Laurie Butgereit, CSIR Meraka Institute, Pretoria, RSA and Nelson Mandela Metropolitan University, Port Elizabeth, RSA
Reinhardt A. Botha, Nelson Mandela Metropolitan University, Port Elizabeth, RSA

SAICSIT '13: Proceedings of the South African Institute for Computer Scientists and Information Technologists Conference, October 2013, Pages 1–7
https://doi.org/10.1145/2513456.2513458

Published: 07 October 2013

ABSTRACT

Mobile Instant Messaging (MIM) systems have produced a new writing convention in which vowels are often omitted, new suffixes have appeared, numerals and symbols often replace letters that have a similar shape or sound, and words are often spelled phonetically. A word such as mister may be spelled in numerous ways, including mista and mistr (with new suffixes). When both participants in a MIM conversation understand these new spelling conventions, there is no problem. In a situation such as automated topic spotting, however, it is advantageous to associate these new spellings (mista and mistr) with the original word (mister). This paper describes work on creating a spelling corrector for MIM conversations, for use after stop words have been removed from a conversation, words have been stemmed, and double letters have been collapsed to single letters. Four similarity calculations (Jaccard, Sørensen-Dice, Cosine, and Overlap) are investigated and tested with historical data from the Dr Math mobile tutoring environment. This research found that the Overlap similarity calculation was the least accurate of the four measured. In situations where the words being compared were the same length, the Sørensen-Dice and Cosine similarity calculations were identical. Jaccard and Sørensen-Dice worked equally well; however, they required different numerical cut-off values for misspelled words.
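The four similarity calculations named in the abstract can be sketched over sets of character N-grams. This is a minimal illustration, not the authors' implementation: the abstract does not state the N-gram size, whether padding was used, or whether Cosine was computed over sets or frequency vectors, so the bigram default, the set-based Cosine form, and all function names here are assumptions.

```python
import math
import re


def collapse_doubles(word: str) -> str:
    """Collapse runs of a repeated character to one (hypothetical helper)."""
    return re.sub(r"(.)\1+", r"\1", word)


def ngrams(word: str, n: int = 2) -> set:
    """Return the set of character n-grams of a word (assumed: bigrams, no padding)."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|"""
    return len(a & b) / len(a | b) if (a | b) else 0.0


def sorensen_dice(a: set, b: set) -> float:
    """2 |A ∩ B| / (|A| + |B|)"""
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0


def cosine(a: set, b: set) -> float:
    """|A ∩ B| / sqrt(|A| |B|), the set-based form of cosine similarity."""
    return len(a & b) / math.sqrt(len(a) * len(b)) if (a and b) else 0.0


def overlap(a: set, b: set) -> float:
    """|A ∩ B| / min(|A|, |B|)"""
    return len(a & b) / min(len(a), len(b)) if (a and b) else 0.0


if __name__ == "__main__":
    target = ngrams("mister")
    for variant in ("mista", "mistr"):
        grams = ngrams(collapse_doubles(variant))
        print(variant,
              jaccard(target, grams),
              sorensen_dice(target, grams),
              cosine(target, grams),
              overlap(target, grams))
```

Under these assumptions, two of the abstract's observations follow from the set-based definitions. When both N-gram sets have the same size n, Sørensen-Dice and Cosine both reduce to |A ∩ B| / n. Sørensen-Dice is also a monotonic function of Jaccard, D = 2J / (1 + J), so the two rank candidate words in the same order but place a given cut-off at different numerical values.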

Index Terms

Software and its engineering > Software notations and tools > General programming languages > Language types > Functional languages

Published in
SAICSIT '13: Proceedings of the South African Institute for Computer Scientists and Information Technologists Conference
October 2013
398 pages
ISBN:9781450321129
DOI:10.1145/2513456
Conference Chairs:
John McNeill,
Karen Bradshaw,
Editors:
Philip Machanick,
Mosiuoa Tsietsi
Copyright © 2013 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Dr Math
N-grams
spelling
Qualifiers
- research-article
Acceptance Rates
SAICSIT '13 paper acceptance rate: 48 of 89 submissions, 54%
Overall acceptance rate: 187 of 439 submissions, 43%