Abstract
This paper presents an unsupervised method for developing a character-based n-gram classifier that identifies loanwords, or transliterated foreign words, in Korean text. The classifier is trained on an unlabeled corpus using the Expectation Maximization algorithm, starting from seed words extracted from the corpus. Words with high token frequency serve as native seed words. Words that appear to contain traces of vowel insertion, which repairs consonant clusters, serve as foreign seed words. What counts as a trace of insertion is determined using phoneme co-occurrence statistics in conjunction with ideas and findings in phonology. Experiments show that the method can produce an unsupervised classifier that performs at a level comparable to that of a supervised classifier. In a cross-validation experiment using a corpus of about 9.2 million words and a lexicon of about 71,000 words, the mean F-scores of the best unsupervised classifier and the corresponding supervised classifier were 94.77% and 96.67%, respectively. Experiments also suggest that the method can be readily applied to other languages with similar phonotactics, such as Japanese.
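The training loop summarized above can be illustrated with a minimal sketch. It is not the model used in the paper: the function names (`bigrams`, `train_em`), the bag-of-bigrams likelihood, the additive smoothing constant `alpha`, and the romanized toy words are all assumptions made for illustration, and the seed words are supplied by hand rather than extracted from a corpus. The sketch keeps seed labels fixed and lets EM re-estimate the labels of the remaining words.

```python
import math
from collections import defaultdict


def bigrams(word):
    """Character bigrams of a word padded with boundary symbols."""
    padded = "^" + word + "$"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def train_em(native_seeds, foreign_seeds, unlabeled, iterations=10, alpha=0.5):
    """Return a function mapping a word to an estimate of P(foreign | word).

    Assumes both seed lists are non-empty, so the class prior stays in (0, 1).
    """
    # Seed words keep a fixed label; all other words start undecided.
    post = {w: 0.5 for w in unlabeled}
    post.update({w: 0.0 for w in native_seeds})
    post.update({w: 1.0 for w in foreign_seeds})
    free = [w for w in post if w not in native_seeds and w not in foreign_seeds]
    # +1 reserves some smoothing mass for bigrams never seen in training.
    vocab_size = len({bg for w in post for bg in bigrams(w)}) + 1

    def m_step():
        # Fractional bigram counts per class (0 = native, 1 = foreign).
        counts = {0: defaultdict(float), 1: defaultdict(float)}
        for w, p in post.items():
            for bg in bigrams(w):
                counts[1][bg] += p
                counts[0][bg] += 1.0 - p
        totals = {c: sum(counts[c].values()) for c in counts}
        prior = sum(post.values()) / len(post)
        return counts, totals, prior

    def p_foreign(word, params):
        counts, totals, prior = params
        score = {0: math.log(1.0 - prior), 1: math.log(prior)}
        for c in score:
            for bg in bigrams(word):
                num = counts[c].get(bg, 0.0) + alpha
                score[c] += math.log(num / (totals[c] + alpha * vocab_size))
        return 1.0 / (1.0 + math.exp(score[0] - score[1]))

    params = m_step()
    for _ in range(iterations):
        for w in free:                      # E-step: relabel non-seed words
            post[w] = p_foreign(w, params)
        params = m_step()                   # M-step: re-estimate the models
    return lambda word: p_foreign(word, params)


# Toy romanized strings, purely illustrative.
classify = train_em(
    native_seeds=["nara", "mara", "kana"],
    foreign_seeds=["sutu", "kuru", "putu"],
    unlabeled=["nama", "tusu"],
)
print(classify("tusu"), classify("nama"))
```

Holding the seed labels fixed is one simple way to keep the two classes from swapping roles during training; other designs, such as using the seeds only for initialization, are equally plausible readings of the description.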
Notes
In this paper, loanwords in Korean refer to all words of foreign origin that are transliterated in Hangul except Sino-Korean words, which are ancient borrowings from Chinese. Sino-Korean words are considered more native-like than other words of foreign origin due to their longer history and higher morphological productivity (Sohn 1999).
References
Baker, K., & Brew, C. (2008). Statistical identification of English loanwords in Korean using automatically generated training data. In Proceedings of the 6th language resources and evaluation conference (LREC’08) (pp. 1159–1163).
Bali, R.-M., Chong, C. C., & Pek, K. N. (2007). Identifying and classifying unknown words in Malay texts. In Proceedings of the 7th international symposium on natural language processing (pp. 493–498).
Bisani, M., & Ney, H. (2008). Joint-sequence models for grapheme-to-phoneme conversion. Speech Communication, 50(5), 434–451.
Breen, J. (2004). JMDict: A Japanese-multilingual dictionary. In Proceedings of the workshop on multilingual linguistic resources (pp. 71–79).
Clements, G. N. (1990). The role of the sonority cycle in core syllabification. In J. Kingston & M. Beckman (Eds.), Papers in laboratory phonology I: Between the grammar and physics of speech (pp. 283–333). Cambridge: Cambridge University Press.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1), 1–38.
Goldberg, Y., & Elhadad, M. (2008). Identification of transliterated foreign words in Hebrew script. In Computational linguistics and intelligent text processing (pp. 466–477). Berlin: Springer.
Hagiwara, M. (2013). JEITA public morphologically tagged corpus (in Chasen format). Retrieved from http://lilyx.net/nltk-japanese-corpus/#jeitac
Hall, N. (2011). Vowel epenthesis. In M. van Oostendorp, C. J. Ewen, E. Hume, & K. Rice (Eds.), The Blackwell companion to phonology (pp. 1576–1596). Malden: Wiley-Blackwell.
Haspelmath, M., & Tadmor, U. (2009). Loanwords in the world’s languages: A comparative handbook. Berlin: Walter de Gruyter.
Jeong, K. S., Myaeng, S. H., Lee, J. S., & Choi, K.-S. (1999). Automatic identification and back-transliteration of foreign words for information retrieval. Information Processing and Management, 35, 523–540.
Kang, Y. (2011). Loanword phonology. In M. van Oostendorp, C. J. Ewen, E. Hume, & K. Rice (Eds.), The Blackwell companion to phonology (pp. 2258–2281). Malden: Wiley-Blackwell.
Khaltar, B.-O., & Fujii, A. (2009). A lemmatization method for Mongolian and its application to indexing for information retrieval. Information Processing and Management, 45(4), 438–451.
Knight, K., & Graehl, J. (1998). Machine transliteration. Computational Linguistics, 24(4), 599–612.
Korea Advanced Institute of Science and Technology. (1997). Automatically analyzed large scale KAIST corpus [Data file]. Retrieved from http://semanticweb.kaist.ac.kr/home/index.php/Corpus3
Ladefoged, P. (2001). A course in phonetics (4th ed.). Orlando: Harcourt Brace.
Maddieson, I. (2013). Syllable structure. In M. S. Dryer & M. Haspelmath (Eds.), The world atlas of language structures online. Leipzig: Max Planck Institute for Evolutionary Anthropology. Retrieved from http://wals.info/chapter/12
Ministry of Culture, Sports, and Tourism of South Korea, and National Institute of the Korean Language. (2011). The 21st century Sejong project [Data file].
NIKL. (2000a). gukeo eohwiui bunryu mokrok yeongu. Resource document. Retrieved from http://www.korean.go.kr
NIKL. (2000b). pyojuneo geomtoyong jaryo. Resource document. Retrieved from http://www.korean.go.kr
NIKL. (2000c). pyojungukeodaesajeon pyeonchanyong eowon jeongbo jaryo. Resource document. Retrieved from http://www.korean.go.kr
NIKL. (2000d). yongeon hwalyongpyo. Resource document. Retrieved from http://www.korean.go.kr
NIKL. (2008). Survey of the state of loanword usage. [Data file]. Retrieved from http://www.korean.go.kr
NIKL. (2013). oeraeeo pyogi yongrye jaryo—romaja inmyeonggwa jimyeong. Resource document. Retrieved from http://www.korean.go.kr
Nwesri, A. F. A. (2008). Effective retrieval techniques for Arabic text (Unpublished doctoral dissertation). RMIT University, Melbourne, Australia.
Oh, J.-H., & Choi, K.-S. (2001). Automatic extraction of transliterated foreign words using hidden Markov model. In Proceedings of the international conference on computer processing of oriental languages (pp. 433–438).
Ravi, S., & Knight, K. (2009). Learning phoneme mappings for transliteration without parallel data. In Proceedings of human language technologies: The 2009 annual conference of the North American chapter of the association for computational linguistics (pp. 37–45).
Selkirk, E. (1984). On the major class features and syllable theory. In M. Aronoff & R. T. Oehrle (Eds.), Language sound structure: Studies in phonology presented to Morris Halle by his teachers and students (pp. 107–136). Cambridge: MIT Press.
Sohn, H.-M. (1999). The Korean language. Cambridge: Cambridge University Press.
Uffmann, C. (2006). Epenthetic vowel quality in loanwords: Empirical and formal issues. Lingua, 116(7), 1079–1111.
Witten, I. H., & Bell, T. (1991). The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression. IEEE Transactions on Information Theory, 37(4), 1085–1094.
Yoon, K., & Brew, C. (2006). A linguistically motivated approach to grapheme-to-phoneme conversion for Korean. Computer Speech & Language, 20(4), 357–381.
Yoon, S.-Y., Kim, K.-Y., & Sproat, R. (2007). Multilingual transliteration using feature based phonetic method. In Proceedings of the 45th annual meeting of the association of computational linguistics (pp. 112–119).
Appendix: Rewrite rules for grapheme-to-phoneme conversion
The table below shows letter-to-phoneme correspondences in Korean. The idea is to transcribe the pronunciation of a spelled word by first decomposing syllable-sized characters into letters and then mapping the letters to their matching phonemes one by one. For example, 한글 → ᄒ + ᅡ + ᄂ + ᄀ + ᅳ + ᄅ → [hankɨl].
Letter | Phoneme(s) | Letter | Phoneme(s) | Letter | Phoneme(s) |
---|---|---|---|---|---|
ㄱ | k | ㄲ | k* | ㄴ | n |
ㄷ | t | ㄸ | t* | ㄹ (onset) | ɾ |
ㄹ (coda) | l | ㅁ | m | ㅂ | p |
ㅃ | p* | ㅅ | s | ㅆ | s* |
ㅇ (onset) | Null | ㅇ (coda) | ŋ | ㅈ | tʃ |
ㅉ | tʃ* | ㅊ | tʃʰ | ㅋ | kʰ |
ㅌ | tʰ | ㅍ | pʰ | ㅎ | h |
ㅏ | a | ㅑ | j a | ㅐ | æ |
ㅒ | j æ | ㅓ | ʌ | ㅕ | j ʌ |
ㅔ | e | ㅖ | j e | ㅗ | o |
ㅛ | j o | ㅘ | w a | ㅙ | w æ |
ㅚ | ø | ㅜ | u | ㅠ | j u |
ㅝ | w ʌ | ㅞ | w e | ㅟ | w i |
ㅡ | ɨ | ㅣ | i | ㅢ | ɨ i |
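The decompose-then-map procedure described above can be sketched in Python as follows. This is an illustrative sketch, not the paper's implementation: the function names are hypothetical, the decomposition relies on the standard Unicode arithmetic for precomposed Hangul syllables, and it covers only the letter-by-letter mapping in the table. Splitting cluster codas such as ㄳ into their two component letters is an assumption, since the table lists only single letters.

```python
# Letter inventories in Unicode order for precomposed syllables (U+AC00..U+D7A3).
ONSETS = list("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")
VOWELS = list("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")
# Assumption: cluster codas are written out as their two component letters.
CODAS = ["", "ㄱ", "ㄲ", "ㄱㅅ", "ㄴ", "ㄴㅈ", "ㄴㅎ", "ㄷ", "ㄹ", "ㄹㄱ",
         "ㄹㅁ", "ㄹㅂ", "ㄹㅅ", "ㄹㅌ", "ㄹㅍ", "ㄹㅎ", "ㅁ", "ㅂ", "ㅂㅅ",
         "ㅅ", "ㅆ", "ㅇ", "ㅈ", "ㅊ", "ㅋ", "ㅌ", "ㅍ", "ㅎ"]

# Position-independent consonants from the table (aspiration written as ʰ).
CONSONANT_PHONEMES = {
    "ㄱ": "k", "ㄲ": "k*", "ㄴ": "n", "ㄷ": "t", "ㄸ": "t*", "ㅁ": "m",
    "ㅂ": "p", "ㅃ": "p*", "ㅅ": "s", "ㅆ": "s*", "ㅈ": "tʃ", "ㅉ": "tʃ*",
    "ㅊ": "tʃʰ", "ㅋ": "kʰ", "ㅌ": "tʰ", "ㅍ": "pʰ", "ㅎ": "h",
}
# Vowel letters; a space separates the phonemes of a glide-vowel sequence.
VOWEL_PHONEMES = {
    "ㅏ": "a", "ㅐ": "æ", "ㅑ": "j a", "ㅒ": "j æ", "ㅓ": "ʌ", "ㅔ": "e",
    "ㅕ": "j ʌ", "ㅖ": "j e", "ㅗ": "o", "ㅘ": "w a", "ㅙ": "w æ", "ㅚ": "ø",
    "ㅛ": "j o", "ㅜ": "u", "ㅝ": "w ʌ", "ㅞ": "w e", "ㅟ": "w i", "ㅠ": "j u",
    "ㅡ": "ɨ", "ㅢ": "ɨ i", "ㅣ": "i",
}


def consonant_phonemes(letter, in_coda):
    """Map one consonant letter to phonemes, honoring onset/coda differences."""
    if letter == "ㄹ":
        return ["l"] if in_coda else ["ɾ"]
    if letter == "ㅇ":
        return ["ŋ"] if in_coda else []  # silent placeholder in onset position
    return [CONSONANT_PHONEMES[letter]]


def to_phonemes(word):
    """Transcribe a word written in precomposed Hangul syllables."""
    phonemes = []
    for char in word:
        index = ord(char) - 0xAC00
        if not 0 <= index < 11172:
            raise ValueError(f"not a precomposed Hangul syllable: {char!r}")
        # 588 = 21 vowels * 28 codas; 28 = number of coda slots (incl. none).
        onset, rest = divmod(index, 588)
        vowel, coda = divmod(rest, 28)
        phonemes += consonant_phonemes(ONSETS[onset], in_coda=False)
        phonemes += VOWEL_PHONEMES[VOWELS[vowel]].split()
        for letter in CODAS[coda]:
            phonemes += consonant_phonemes(letter, in_coda=True)
    return phonemes


print("".join(to_phonemes("한글")))
```

Returning a list of phonemes rather than a joined string keeps multi-character symbols such as tʃʰ intact as single units, which is convenient when counting phoneme co-occurrences.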
Cite this article
Koo, H. An unsupervised method for identifying loanwords in Korean. Lang Resources & Evaluation 49, 355–373 (2015). https://doi.org/10.1007/s10579-015-9296-5