An unsupervised method for identifying loanwords in Korean

  • Original Paper
Language Resources and Evaluation

Abstract

This paper presents an unsupervised method for developing a character-based n-gram classifier that identifies loanwords or transliterated foreign words in Korean text. The classifier is trained on an unlabeled corpus using the Expectation Maximization algorithm, building on seed words extracted from the corpus. Words with high token frequency serve as native seed words. Words with seeming traces of vowel insertion to repair consonant clusters serve as foreign seed words. What counts as a trace of insertion is determined using phoneme co-occurrence statistics in conjunction with ideas and findings in phonology. Experiments show that the method can produce an unsupervised classifier that performs at a level comparable to that of a supervised classifier. In a cross-validation experiment using a corpus of about 9.2 million words and a lexicon of about 71,000 words, mean F-scores of the best unsupervised classifier and the corresponding supervised classifier were 94.77% and 96.67%, respectively. Experiments also suggest that the method can be readily applied to other languages with similar phonotactics such as Japanese.
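The training loop summarized in the abstract can be pictured with a toy sketch. This is not the paper's implementation: it assumes character bigrams with add-one smoothing, hand-picked seed words in place of the frequency-based and epenthesis-based seed extraction, seed labels that stay fixed during training, and a made-up six-word "corpus". All function and variable names are hypothetical.

```python
import math

def bigrams(word):
    """Character bigrams of a word padded with start (^) and end ($) markers."""
    padded = "^" + word + "$"
    return list(zip(padded, padded[1:]))

def fit_bigram_model(words, weights, vocab_size, alpha=1.0):
    """Weighted bigram model with add-alpha smoothing (an assumption of this sketch)."""
    pair_counts, context_counts = {}, {}
    for word, weight in zip(words, weights):
        for a, b in bigrams(word):
            pair_counts[(a, b)] = pair_counts.get((a, b), 0.0) + weight
            context_counts[a] = context_counts.get(a, 0.0) + weight

    def log_prob(word):
        return sum(
            math.log((pair_counts.get((a, b), 0.0) + alpha)
                     / (context_counts.get(a, 0.0) + alpha * vocab_size))
            for a, b in bigrams(word)
        )
    return log_prob

def m_step(words, post, vocab_size):
    """Re-estimate the class prior and both bigram models from soft labels."""
    prior = min(max(sum(post) / len(post), 1e-6), 1 - 1e-6)
    foreign = fit_bigram_model(words, post, vocab_size)
    native = fit_bigram_model(words, [1 - p for p in post], vocab_size)

    def p_foreign(word):
        log_odds = (math.log(prior) + foreign(word)) - (math.log(1 - prior) + native(word))
        log_odds = max(-700.0, min(700.0, log_odds))  # keep math.exp in range
        return 1.0 / (1.0 + math.exp(-log_odds))
    return p_foreign

def train_em(words, native_seeds, foreign_seeds, iterations=10):
    """Return a function mapping a word to an estimated P(foreign | word)."""
    vocab_size = len({ch for w in words for ch in w}) + 1  # +1 for the end marker
    # Seeds start with hard labels; every other word starts undecided.
    post = [1.0 if w in foreign_seeds else 0.0 if w in native_seeds else 0.5
            for w in words]
    p_foreign = m_step(words, post, vocab_size)
    for _ in range(iterations):
        # E-step: update soft labels of non-seed words; seed labels stay clamped.
        post = [p if (w in native_seeds or w in foreign_seeds) else p_foreign(w)
                for w, p in zip(words, post)]
        p_foreign = m_step(words, post, vocab_size)
    return p_foreign

# Made-up toy data: "native" words use a/b, "foreign" words use x/y.
words = ["abab", "baba", "abba", "xyxy", "yxyx", "xyyx"]
p_foreign = train_em(words, native_seeds={"abab"}, foreign_seeds={"xyxy"})
print(round(p_foreign("yxyx"), 3), round(p_foreign("baba"), 3))
```

On this toy data the unlabeled words drift toward the seed whose letter sequences they share, which is the intuition behind starting EM from a small set of confidently labeled words.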

Notes

  1. In this paper, loanwords in Korean refer to all words of foreign origin that are transliterated in Hangul except Sino-Korean words, which are ancient borrowings from Chinese. Sino-Korean words are considered more native-like than other words of foreign origin due to their longer history and higher morphological productivity (Sohn 1999).

References

  • Baker, K., & Brew, C. (2008). Statistical identification of English loanwords in Korean using automatically generated training data. In Proceedings of the 6th language resources and evaluation conference (LREC’08) (pp. 1159–1163).

  • Bali, R.-M., Chong, C. C., & Pek, K. N. (2007). Identifying and classifying unknown words in Malay texts. In Proceedings of the 7th international symposium on natural language processing (pp. 493–498).

  • Bisani, M., & Ney, H. (2008). Joint-sequence models for grapheme-to-phoneme conversion. Speech Communication, 50(5), 434–451.

  • Breen, J. (2004). JMDict: A Japanese-multilingual dictionary. In Proceedings of the workshop on multilingual linguistic resources (pp. 71–79).

  • Clements, G. N. (1990). The role of the sonority cycle in core syllabification. In J. Kingston & M. Beckman (Eds.), Papers in laboratory phonology I: Between the grammar and physics of speech (pp. 283–333). Cambridge: Cambridge University Press.

  • Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1), 1–38.

  • Goldberg, Y., & Elhadad, M. (2008). Identification of transliterated foreign words in Hebrew script. In Computational linguistics and intelligent text processing (pp. 466–477). Berlin: Springer.

  • Hagiwara, M. (2013). JEITA public morphologically tagged corpus (in Chasen format). Retrieved from http://lilyx.net/nltk-japanese-corpus/#jeitac

  • Hall, N. (2011). Vowel epenthesis. In M. van Oostendorp, C. J. Ewen, E. Hume, & K. Rice (Eds.), The Blackwell companion to phonology (pp. 1576–1596). Malden: Wiley-Blackwell.

  • Haspelmath, M., & Tadmor, U. (2009). Loanwords in the world’s languages: A comparative handbook. Walter de Gruyter.

  • Jeong, K. S., Myaeng, S. H., Lee, J. S., & Choi, K.-S. (1999). Automatic identification and back-transliteration of foreign words for information retrieval. Information Processing and Management, 35, 523–540.

  • Kang, Y. (2011). Loanword phonology. In M. van Oostendorp, C. J. Ewen, E. Hume, & K. Rice (Eds.), The Blackwell companion to phonology (pp. 2258–2281). Malden: Wiley-Blackwell.

  • Khaltar, B.-O., & Fujii, A. (2009). A lemmatization method for Mongolian and its application to indexing for information retrieval. Information Processing and Management, 45(4), 438–451.

  • Knight, K., & Graehl, J. (1998). Machine transliteration. Computational Linguistics, 24(4), 599–612.

  • Korea Advanced Institute of Science and Technology. (1997). Automatically analyzed large scale KAIST corpus [Data file]. Retrieved from http://semanticweb.kaist.ac.kr/home/index.php/Corpus3

  • Ladefoged, P. (2001). A course in phonetics (4th ed.). Orlando: Harcourt Brace.

  • Maddieson, I. (2013). Syllable structure. In M. S. Dryer & M. Haspelmath (Eds.), The world atlas of language structures online. Leipzig: Max Planck Institute for Evolutionary Anthropology. Retrieved from http://wals.info/chapter/12

  • Ministry of Culture, Sports, and Tourism of South Korea, and National Institute of the Korean Language. (2011). The 21st century Sejong project [Data file].

  • NIKL. (2000a). gukeo eohwiui bunryu mokrok yeongu. Resource document. Retrieved from http://www.korean.go.kr

  • NIKL. (2000b). pyojuneo geomtoyong jaryo. Resource document. Retrieved from http://www.korean.go.kr

  • NIKL. (2000c). pyojungukeodaesajeon pyeonchanyong eowon jeongbo jaryo. Resource document. Retrieved from http://www.korean.go.kr

  • NIKL. (2000d). yongeon hwalyongpyo. Resource document. Retrieved from http://www.korean.go.kr

  • NIKL. (2008). Survey of the state of loanword usage. [Data file]. Retrieved from http://www.korean.go.kr

  • NIKL. (2013). oeraeeo pyogi yongrye jaryoromaja inmyeonggwa jimyeong. Resource document. Retrieved from http://www.korean.go.kr

  • Nwesri, A. F. A. (2008). Effective retrieval techniques for Arabic text (Unpublished doctoral dissertation). RMIT University, Melbourne, Australia.

  • Oh, J.-H., & Choi, K.-S. (2001). Automatic extraction of transliterated foreign words using hidden Markov model. In Proceedings of the international conference on computer processing of oriental languages (pp. 433–438).

  • Ravi, S., & Knight, K. (2009). Learning phoneme mappings for transliteration without parallel data. In Proceedings of human language technologies: The 2009 annual conference of the North American Chapter of the Association for Computational Linguistics (pp. 37–45).

  • Selkirk, E. (1984). On the major class features and syllable theory. In M. Aronoff & R. T. Oehrle (Eds.), Language sound structure: Studies in phonology presented to Morris Halle by his teachers and students (pp. 107–136). Cambridge: MIT Press.

  • Sohn, H.-M. (1999). The Korean language. Cambridge: Cambridge University Press.

  • Uffmann, C. (2006). Epenthetic vowel quality in loanwords: Empirical and formal issues. Lingua, 116(7), 1079–1111.

  • Witten, I. H., & Bell, T. (1991). The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression. IEEE Transactions on Information Theory, 37(4), 1085–1094.

  • Yoon, K., & Brew, C. (2006). A linguistically motivated approach to grapheme-to-phoneme conversion for Korean. Computer Speech & Language, 20(4), 357–381.

  • Yoon, S.-Y., Kim, K.-Y., & Sproat, R. (2007). Multilingual transliteration using feature based phonetic method. In Proceedings of the 45th annual meeting of the Association of Computational Linguistics (pp. 112–119).

Author information

Corresponding author

Correspondence to Hahn Koo.

Appendix: Rewrite rules for grapheme-to-phoneme conversion

The table below shows letter-to-phoneme correspondences in Korean. The idea is to transcribe the pronunciation of a spelled word by first decomposing syllable-sized characters into letters and then mapping the letters to their matching phonemes one by one. For example, 한글 → ᄒ + ᅡ + ᄂ + ᄀ + ᅳ + ᄅ → [hankɨl].

Letter      Phoneme(s)    Letter      Phoneme(s)    Letter      Phoneme(s)
ㄱ          k             ㄲ          k*            ㄴ          n
ㄷ          t             ㄸ          t*            ㄹ (onset)  ɾ
ㄹ (coda)   l             ㅁ          m             ㅂ          p
ㅃ          p*            ㅅ          s             ㅆ          s*
ㅇ (onset)  Null          ㅇ (coda)   ŋ             ㅈ          tʃ
ㅉ          tʃ*           ㅊ          tʃʰ           ㅋ          kʰ
ㅌ          tʰ            ㅍ          pʰ            ㅎ          h
ㅏ          a             ㅑ          j a           ㅐ          æ
ㅒ          j æ           ㅓ          ʌ             ㅕ          j ʌ
ㅔ          e             ㅖ          j e           ㅗ          o
ㅛ          j o           ㅘ          w a           ㅙ          w æ
ㅚ          ø             ㅜ          u             ㅠ          j u
ㅝ          w ʌ           ㅞ          w e           ㅟ          w i
ㅡ          ɨ             ㅣ          i             ㅢ          ɨ i
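The decompose-then-map procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than the paper's rule set: the decomposition uses the standard Unicode arithmetic for precomposed Hangul syllables, the phoneme values follow the table (with ʰ marking aspiration and ㅈ, ㅊ taken to be tʃ, tʃʰ), and splitting coda clusters such as ㄳ into their component letters is an assumption of this sketch. No further phonological rewrite rules are applied, and the names `decompose` and `transcribe` are hypothetical.

```python
HANGUL_BASE = 0xAC00  # code point of the first precomposed syllable, 가
ONSETS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"          # 19 onset letters
VOWELS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"      # 21 vowel letters
# 28 coda slots in Unicode order; clusters are written as their component letters.
CODAS = ["", "ㄱ", "ㄲ", "ㄱㅅ", "ㄴ", "ㄴㅈ", "ㄴㅎ", "ㄷ", "ㄹ", "ㄹㄱ",
         "ㄹㅁ", "ㄹㅂ", "ㄹㅅ", "ㄹㅌ", "ㄹㅍ", "ㄹㅎ", "ㅁ", "ㅂ", "ㅂㅅ",
         "ㅅ", "ㅆ", "ㅇ", "ㅈ", "ㅊ", "ㅋ", "ㅌ", "ㅍ", "ㅎ"]

# Consonants whose phoneme does not depend on syllable position.
CONSONANTS = {
    "ㄱ": "k", "ㄲ": "k*", "ㄴ": "n", "ㄷ": "t", "ㄸ": "t*", "ㅁ": "m",
    "ㅂ": "p", "ㅃ": "p*", "ㅅ": "s", "ㅆ": "s*", "ㅈ": "tʃ", "ㅉ": "tʃ*",
    "ㅊ": "tʃʰ", "ㅋ": "kʰ", "ㅌ": "tʰ", "ㅍ": "pʰ", "ㅎ": "h",
}
ONSET_PHONEMES = {**CONSONANTS, "ㄹ": "ɾ", "ㅇ": ""}   # onset ㅇ is silent
CODA_PHONEMES = {**CONSONANTS, "ㄹ": "l", "ㅇ": "ŋ"}
VOWEL_PHONEMES = {
    "ㅏ": ["a"], "ㅑ": ["j", "a"], "ㅐ": ["æ"], "ㅒ": ["j", "æ"],
    "ㅓ": ["ʌ"], "ㅕ": ["j", "ʌ"], "ㅔ": ["e"], "ㅖ": ["j", "e"],
    "ㅗ": ["o"], "ㅛ": ["j", "o"], "ㅘ": ["w", "a"], "ㅙ": ["w", "æ"],
    "ㅚ": ["ø"], "ㅜ": ["u"], "ㅠ": ["j", "u"], "ㅝ": ["w", "ʌ"],
    "ㅞ": ["w", "e"], "ㅟ": ["w", "i"], "ㅡ": ["ɨ"], "ㅣ": ["i"],
    "ㅢ": ["ɨ", "i"],
}

def decompose(syllable):
    """Split one precomposed Hangul syllable into (onset, vowel, coda letters)."""
    index = ord(syllable) - HANGUL_BASE
    if not 0 <= index < 19 * 21 * 28:
        raise ValueError(f"not a precomposed Hangul syllable: {syllable!r}")
    onset, rest = divmod(index, 21 * 28)
    vowel, coda = divmod(rest, 28)
    return ONSETS[onset], VOWELS[vowel], CODAS[coda]

def transcribe(word):
    """Map each letter of a Hangul word to its phoneme(s), one by one."""
    phonemes = []
    for syllable in word:
        onset, vowel, coda = decompose(syllable)
        if ONSET_PHONEMES[onset]:
            phonemes.append(ONSET_PHONEMES[onset])
        phonemes.extend(VOWEL_PHONEMES[vowel])
        phonemes.extend(CODA_PHONEMES[letter] for letter in coda)
    return phonemes

print("".join(transcribe("한글")))
```

Keeping separate onset and coda dictionaries makes the two position-dependent letters, ㄹ and ㅇ, explicit without special-casing them inside the loop.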

About this article

Cite this article

Koo, H. An unsupervised method for identifying loanwords in Korean. Lang Resources & Evaluation 49, 355–373 (2015). https://doi.org/10.1007/s10579-015-9296-5

  • DOI: https://doi.org/10.1007/s10579-015-9296-5
