Abstract:
With the widespread use of the Internet and rapid advances in technology, the number of digital texts in various languages is continuously growing. However, due to differences in keyboards and alphabets, many of these texts contain missing or incorrectly used diacritics, which can make them hard to read. This also poses a difficulty for natural language processing (NLP) applications, which must accurately interpret the meaning of words despite these errors. This study focuses on diacritic restoration (DR), a crucial component of many NLP applications across multiple languages. It proposes a syllable-based Bidirectional Transformer structure to account for Turkish’s high sensitivity to syllables in determining meaning. Additionally, incorporating a semantic marker into the training data enhances the model’s performance. With an optimized configuration, the proposed model performs significantly better than previous studies based on words or characters. It achieves an accuracy of 98.84% on accented characters within ambiguous words and 92.85% in correcting ambiguous words, indicating success in semantic learning. These results represent a significant advance in diacritic restoration and highlight the potential for improving NLP applications in various languages.
Date of Conference: 15-18 May 2024
Date Added to IEEE Xplore: 23 July 2024
ISBN Information:
Print on Demand (PoD) ISSN: 2165-0608