
Contrastive study of minimum edit distance and cosine similarity measures in the context of word suggestions for misspelled Marathi words

Multimedia Tools and Applications

Abstract

Spelling errors are among the most basic errors in written text. The digital era has added another dimension to the problem: keyboard layout. Memorization, language orthography, and keyboard layout are all sources of spelling errors in electronic texts. Because English is the link language of the world, a substantial body of work on spelling error detection and plausible suggestion generation exists for English. The same is not true of languages with scarce digital resources, such as the Indian languages. Marathi, the official language of Maharashtra State in India and the world’s 10th most spoken language, is no exception. Various computational approaches to spelling error detection and correction have been advocated in the literature. Among these, similarity-based measures have proven to be the most prominent. This paper presents a detailed contrastive study of two popular similarity measures, minimum edit distance and cosine similarity, in the context of misspelled Marathi words. The philosophical and empirical aspects of these methods are also presented. For experimentation we chose a dataset of 9,29,663 unique Marathi words harvested from various sources. We obtained an accuracy of 85.88% for the minimum edit distance algorithm and 86.76% for the cosine similarity algorithm.
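
To make the two measures concrete, the following is a minimal Python sketch of how each could rank dictionary words against a misspelled input. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how words are vectorized for cosine similarity, so character bigram counts with `#` boundary padding are assumed here, and the names `edit_distance`, `char_bigrams`, `cosine_similarity`, and `suggest` are hypothetical.

```python
from collections import Counter
from math import sqrt


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_bigrams(word: str) -> Counter:
    """Bigram counts of the word, padded with '#' (an assumed representation)."""
    padded = f"#{word}#"
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between the two bigram count vectors."""
    va, vb = char_bigrams(a), char_bigrams(b)
    dot = sum(va[g] * vb[g] for g in va.keys() & vb.keys())
    norm = (sqrt(sum(c * c for c in va.values()))
            * sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def suggest(word: str, lexicon: list[str], k: int = 3):
    """Return the top-k candidates under each measure (ties broken by word)."""
    by_distance = sorted(lexicon, key=lambda w: (edit_distance(word, w), w))[:k]
    by_cosine = sorted(lexicon, key=lambda w: (-cosine_similarity(word, w), w))[:k]
    return by_distance, by_cosine
```

Both functions compare strings by Unicode code point, so in Devanagari a consonant followed by a dependent vowel sign counts as two symbols. Whether the paper segments Marathi words this way or by larger orthographic units is not stated in the abstract. The linear scan in `suggest` is only practical for small word lists; a lexicon of the size reported above would need an index or candidate filter first.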




Author information

Corresponding author

Correspondence to Kavita T. Patil.

Ethics declarations

Conflicts of interests/competing interests

The authors have not received any funding for this research work and have no conflicts of interest or competing interests with respect to this work with any organization or third party. The authors further state that they have no financial interests directly or indirectly related to the work submitted for publication.


About this article

Cite this article

Patil, K.T., Bhavsar, R.P. & Pawar, B.V. Contrastive study of minimum edit distance and cosine similarity measures in the context of word suggestions for misspelled Marathi words. Multimed Tools Appl 82, 15573–15591 (2023). https://doi.org/10.1007/s11042-022-13948-z

  • DOI: https://doi.org/10.1007/s11042-022-13948-z
