Abstract:
Each language has its own vocabulary, spoken by a corresponding group of speakers. Some languages are well resourced, and Natural Language Processing (NLP) methods typically perform better for them; for a large number of low-resource languages, on the other hand, there is not enough annotated data to apply unsupervised NLP methods efficiently. A spell checker is therefore a necessity for composing any documentation in a language, typically by distinguishing words that are typologically and grammatically correct from misspelled words. The aim of this paper is to present a spell-check dictionary for the Albanian language built by comparing word usage among various texts. The words to be entered in the dictionary are drawn from a large text collection taken from experiments and then subjected to a comparative review of word usage frequency. The corpora include 49k Albanian sentences from different fields such as computer science, economics, law, medicine, politics, tourism, art, and psychology. This spell-check dictionary would further contribute to the ease of use of the Albanian language in electronic media. Because Albanian is a low-resource language, a further aim of this paper and of related future research is to build a larger and better Albanian corpus on which the spell-checking dictionary can be continuously improved.
Published in: 2021 44th International Convention on Information, Communication and Electronic Technology (MIPRO)
Date of Conference: 27 September 2021 - 01 October 2021
Date Added to IEEE Xplore: 15 November 2021
ISBN Information:
Electronic ISSN: 2623-8764