research-article

A comparison of different calculations for N-gram similarities in a spelling corrector for mobile instant messaging language

Published: 07 October 2013

ABSTRACT

Mobile Instant Messaging (MIM) systems have produced a new writing convention in which vowels are often omitted, new suffixes have appeared, numerals and symbols often replace letters of similar shape or sound, and words are often spelled phonetically. A word such as mister may be spelled in numerous ways, including mista and mistr (with new suffixes). When both participants in a MIM conversation understand these new spelling conventions, there is no problem. But in a situation such as automated topic spotting, it is advantageous to attempt to associate these new spellings (mista and mistr) back to the original word (mister). This paper describes work in creating a spelling corrector for MIM conversations for use after stop words have been removed from a conversation, after words have been stemmed, and after double letters have been collapsed to single letters. Four different similarity calculations (Jaccard, Sørensen-Dice, Cosine, and Overlap) are investigated and tested with historical data from the Dr Math mobile tutoring environment. This research found that the Overlap similarity calculation was the least accurate of the four measured. In situations where the words being compared were the same length, the Sørensen-Dice and Cosine similarity calculations were identical. Jaccard and Sørensen-Dice worked equally well; however, they required different numerical cut-off values for misspelled words.
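The four similarity calculations named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: it assumes the textbook set-based definitions of each coefficient, unpadded character bigrams (n = 2), and the hypothetical helper names `ngrams`, `jaccard`, `dice`, `cosine`, and `overlap`. The abstract does not state the n-gram size or any padding scheme used in the paper.

```python
from math import sqrt
from typing import Set


def ngrams(word: str, n: int = 2) -> Set[str]:
    """Return the set of character n-grams of `word` (assumption: no padding)."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """|A ∩ B| / |A ∪ B|"""
    return len(a & b) / len(a | b) if (a or b) else 0.0


def dice(a: Set[str], b: Set[str]) -> float:
    """2 |A ∩ B| / (|A| + |B|)"""
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0


def cosine(a: Set[str], b: Set[str]) -> float:
    """|A ∩ B| / sqrt(|A| |B|), the set-based form of cosine similarity."""
    return len(a & b) / sqrt(len(a) * len(b)) if (a and b) else 0.0


def overlap(a: Set[str], b: Set[str]) -> float:
    """|A ∩ B| / min(|A|, |B|)"""
    return len(a & b) / min(len(a), len(b)) if (a and b) else 0.0


# "mister" has 5 bigrams, "mistr" has 4, and they share 3 (mi, is, st).
original, variant = ngrams("mister"), ngrams("mistr")
print(jaccard(original, variant))            # 0.5
print(round(dice(original, variant), 3))     # 0.667
print(round(cosine(original, variant), 3))   # 0.671
print(overlap(original, variant))            # 0.75
```

Under these definitions, when the two n-gram sets have the same size the Sørensen-Dice and Cosine values coincide, which matches the abstract's observation about words of equal length (for example, `mista` and `mistr` both score 0.75). The Sørensen-Dice value D and the Jaccard value J are also related by D = 2J / (1 + J), so the two rank candidate words in the same order while producing different numbers, which is one plausible reading of why they would need different cut-off values.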


    • Published in

      SAICSIT '13: Proceedings of the South African Institute for Computer Scientists and Information Technologists Conference
      October 2013
      398 pages
      ISBN:9781450321129
      DOI:10.1145/2513456

      Copyright © 2013 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Acceptance Rates

      SAICSIT '13 Paper Acceptance Rate: 48 of 89 submissions, 54%
      Overall Acceptance Rate: 187 of 439 submissions, 43%
