An unsupervised method for identifying loanwords in Korean

Koo, Hahn

doi:10.1007/s10579-015-9296-5

Original Paper
Published: 11 February 2015

Volume 49, pages 355–373, (2015)
Language Resources and Evaluation

Abstract

This paper presents an unsupervised method for developing a character-based n-gram classifier that identifies loanwords or transliterated foreign words in Korean text. The classifier is trained on an unlabeled corpus using the Expectation Maximization algorithm, building on seed words extracted from the corpus. Words with high token frequency serve as native seed words. Words with seeming traces of vowel insertion to repair consonant clusters serve as foreign seed words. What counts as a trace of insertion is determined using phoneme co-occurrence statistics in conjunction with ideas and findings in phonology. Experiments show that the method can produce an unsupervised classifier that performs at a level comparable to that of a supervised classifier. In a cross-validation experiment using a corpus of about 9.2 million words and a lexicon of about 71,000 words, mean F-scores of the best unsupervised classifier and the corresponding supervised classifier were 94.77% and 96.67%, respectively. Experiments also suggest that the method can be readily applied to other languages with similar phonotactics such as Japanese.
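
The training scheme summarized in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not specify the n-gram order, the smoothing method, or how seed words are treated during training, so the sketch assumes character bigrams, add-alpha smoothing, word types rather than tokens, and seed labels that stay fixed across iterations. The function names and the toy Latin-alphabet words are hypothetical.

```python
import math
from collections import defaultdict


def transitions(word):
    """Return (previous, next) character pairs, with ^ and $ as boundary marks."""
    padded = "^" + word + "$"
    return list(zip(padded, padded[1:]))


def train_em(words, native_seeds, foreign_seeds, iterations=10, alpha=1.0):
    """Estimate P(foreign | word) for every word with seed-initialized EM.

    Assumptions (not taken from the paper): bigram models, add-alpha
    smoothing, and seed words clamped to their initial labels.
    """
    native_seeds, foreign_seeds = set(native_seeds), set(foreign_seeds)
    words = set(words) | native_seeds | foreign_seeds
    symbols = {nxt for w in words for _, nxt in transitions(w)}

    # Initial posteriors: seeds are certain, everything else is undecided.
    post = {w: 0.5 for w in words}
    post.update({w: 0.0 for w in native_seeds})
    post.update({w: 1.0 for w in foreign_seeds})

    for _ in range(iterations):
        # M-step: soft bigram counts weighted by the current posteriors.
        pair = {"foreign": defaultdict(float), "native": defaultdict(float)}
        ctx = {"foreign": defaultdict(float), "native": defaultdict(float)}
        for w in words:
            weights = {"foreign": post[w], "native": 1.0 - post[w]}
            for prev, nxt in transitions(w):
                for label, weight in weights.items():
                    pair[label][(prev, nxt)] += weight
                    ctx[label][prev] += weight
        prior_foreign = sum(post.values()) / len(post)

        def log_likelihood(word, label):
            return sum(
                math.log((pair[label][(p, n)] + alpha)
                         / (ctx[label][p] + alpha * len(symbols)))
                for p, n in transitions(word)
            )

        # E-step: update posteriors of the non-seed words only.
        for w in words - native_seeds - foreign_seeds:
            log_f = math.log(prior_foreign) + log_likelihood(w, "foreign")
            log_n = math.log(1.0 - prior_foreign) + log_likelihood(w, "native")
            post[w] = 1.0 / (1.0 + math.exp(min(log_n - log_f, 700.0)))
    return post


# Toy example with made-up strings standing in for native and foreign words.
posteriors = train_em(
    words=["nama", "tutu"],
    native_seeds=["mana", "nami", "mina"],
    foreign_seeds=["sutu", "tusu"],
)
```

Both seed sets must be non-empty so that the class prior stays strictly between 0 and 1. In this toy run the unlabeled string that shares bigrams with the foreign seeds ends up with a posterior above 0.5, and the other one ends up below it.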

Notes

In this paper, loanwords in Korean refer to all words of foreign origin that are transliterated in Hangul except Sino-Korean words, which are ancient borrowings from Chinese. Sino-Korean words are considered more native-like than other words of foreign origin due to their longer history and higher morphological productivity (Sohn 1999).

References

Baker, K., & Brew, C. (2008). Statistical identification of English loanwords in Korean using automatically generated training data. In Proceedings of the 6th language resources and evaluation conference (LREC’08) (pp. 1159–1163).
Bali, R.-M., Chong, C. C., & Pek, K. N. (2007). Identifying and classifying unknown words in Malay texts. In Proceedings of the 7th international symposium on natural language processing (pp. 493–498).
Bisani, M., & Ney, H. (2008). Joint-sequence models for grapheme-to-phoneme conversion. Speech Communication, 50(5), 434–451.
Breen, J. (2004). JMDict: A Japanese-multilingual dictionary. In Proceedings of the workshop on multilingual linguistic resources (pp. 71–79).
Clements, G. N. (1990). The role of the sonority cycle in core syllabification. In J. Kingston & M. Beckman (Eds.), Papers in laboratory phonology I: Between the grammar and physics of speech (pp. 283–333). Cambridge: Cambridge University Press.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1), 1–38.
Goldberg, Y., & Elhadad, M. (2008). Identification of transliterated foreign words in Hebrew script. In Computational linguistics and intelligent text processing (pp. 466–477). Berlin: Springer.
Hagiwara, M. (2013). JEITA public morphologically tagged corpus (in Chasen format). Retrieved from http://lilyx.net/nltk-japanese-corpus/#jeitac
Hall, N. (2011). Vowel epenthesis. In M. van Oostendorp, C. J. Ewen, E. Hume, & K. Rice (Eds.), The Blackwell companion to phonology (pp. 1576–1596). Malden: Wiley-Blackwell.
Haspelmath, M., & Tadmor, U. (2009). Loanwords in the world’s languages: A comparative handbook. Walter de Gruyter.
Jeong, K. S., Myaeng, S. H., Lee, J. S., & Choi, K.-S. (1999). Automatic identification and back-transliteration of foreign words for information retrieval. Information Processing and Management, 35, 523–540.
Kang, Y. (2011). Loanword phonology. In M. van Oostendorp, C. J. Ewen, E. Hume, & K. Rice (Eds.), The Blackwell companion to phonology (pp. 2258–2281). Malden: Wiley-Blackwell.
Khaltar, B.-O., & Fujii, A. (2009). A lemmatization method for Mongolian and its application to indexing for information retrieval. Information Processing and Management, 45(4), 438–451.
Knight, K., & Graehl, J. (1998). Machine transliteration. Computational Linguistics, 24(4), 599–612.
Korea Advanced Institute of Science and Technology. (1997). Automatically analyzed large scale KAIST corpus [Data file]. Retrieved from http://semanticweb.kaist.ac.kr/home/index.php/Corpus3
Ladefoged, P. (2001). A course in phonetics (4th ed.). Orlando: Harcourt Brace.
Maddieson, I. (2013). Syllable structure. In M. S. Dryer & M. Haspelmath (Eds.), The world atlas of language structures online. Leipzig: Max Planck Institute for Evolutionary Anthropology. Retrieved from http://wals.info/chapter/12
Ministry of Culture, Sports, and Tourism of South Korea, and National Institute of the Korean Language. (2011). The 21st century Sejong project [Data file].
NIKL. (2000a). gukeo eohwiui bunryu mokrok yeongu. Resource document. Retrieved from http://www.korean.go.kr
NIKL. (2000b). pyojuneo geomtoyong jaryo. Resource document. Retrieved from http://www.korean.go.kr
NIKL. (2000c). pyojungukeodaesajeon pyeonchanyong eowon jeongbo jaryo. Resource document. Retrieved from http://www.korean.go.kr
NIKL. (2000d). yongeon hwalyongpyo. Resource document. Retrieved from http://www.korean.go.kr
NIKL. (2008). Survey of the state of loanword usage. [Data file]. Retrieved from http://www.korean.go.kr
NIKL. (2013). oeraeeo pyogi yongrye jaryo—romaja inmyeonggwa jimyeong. Resource document. Retrieved from http://www.korean.go.kr
Nwesri, A. F. A. (2008). Effective retrieval techniques for Arabic text (Unpublished doctoral dissertation). RMIT University, Melbourne, Australia.
Oh, J.-H., & Choi, K.-S. (2001). Automatic extraction of transliterated foreign words using hidden Markov model. In Proceedings of the international conference on computer processing of oriental languages (pp. 433–438).
Ravi, S., & Knight, K. (2009). Learning phoneme mappings for transliteration without parallel data. In Proceedings of human language technologies: The 2009 annual conference of the North American chapter of the Association for Computational Linguistics (pp. 37–45).
Selkirk, E. (1984). On the major class features and syllable theory. In M. Aronoff & R. T. Oehrle (Eds.), Language sound structure: Studies in phonology presented to Morris Halle by his teachers and students (pp. 107–136). Cambridge: MIT Press.
Sohn, H.-M. (1999). The Korean language. Cambridge: Cambridge University Press.
Uffmann, C. (2006). Epenthetic vowel quality in loanwords: Empirical and formal issues. Lingua, 116(7), 1079–1111.
Witten, I. H., & Bell, T. (1991). The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression. IEEE Transactions on Information Theory, 37(4), 1085–1094.
Yoon, K., & Brew, C. (2006). A linguistically motivated approach to grapheme-to-phoneme conversion for Korean. Computer Speech & Language, 20(4), 357–381.
Yoon, S.-Y., Kim, K.-Y., & Sproat, R. (2007). Multilingual transliteration using feature based phonetic method. In Proceedings of the 45th annual meeting of the association of computational linguistics (pp. 112–119).

Author information

Authors and Affiliations

San Jose State University, San Jose, CA, USA
Hahn Koo

Corresponding author

Correspondence to Hahn Koo.

Appendix: Rewrite rules for grapheme-to-phoneme conversion

The table below shows letter-to-phoneme correspondences in Korean. The idea is to transcribe the pronunciation of a spelled word by first decomposing syllable-sized characters into letters and then mapping the letters to their matching phonemes one by one. For example, 한글 → ᄒ + ᅡ + ᄂ + ᄀ + ᅳ + ᄅ → [hankɨl].

Letter	Phoneme(s)	Letter	Phoneme(s)	Letter	Phoneme(s)
ㄱ	k	ㄲ	k*	ㄴ	n
ㄷ	t	ㄸ	t*	ㄹ (onset)	ɾ
ㄹ (coda)	l	ㅁ	m	ㅂ	p
ㅃ	p*	ㅅ	s	ㅆ	s*
ㅇ (onset)	Null	ㅇ (coda)	ŋ	ㅈ	tʃ
ㅉ	tʃ*	ㅊ	tʃ^h	ㅋ	k^h
ㅌ	t^h	ㅍ	p^h	ㅎ	h
ㅏ	a	ㅑ	j a	ㅐ	æ
ㅒ	j æ	ㅓ	ʌ	ㅕ	j ʌ
ㅔ	e	ㅖ	j e	ㅗ	o
ㅛ	j o	ㅘ	w a	ㅙ	w æ
ㅚ	ø	ㅜ	u	ㅠ	j u
ㅝ	w ʌ	ㅞ	w e	ㅟ	w i
ㅡ	ɨ	ㅣ	i	ㅢ	ɨ i
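
The two steps described above, decomposition followed by letter-by-letter mapping, can be sketched as follows. This is a minimal illustration rather than the paper's code: it decomposes precomposed Hangul syllables using the standard Unicode arithmetic and applies only the correspondences in the table. The names `to_phonemes` and `consonant` are hypothetical, complex codas such as ㄳ are rejected because the table does not list them, and no further rewrite rules are applied.

```python
HANGUL_BASE = 0xAC00  # code point of the first precomposed syllable, 가
ONSETS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
VOWELS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
CODAS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")

# Correspondences copied from the table; ㄹ and ㅇ depend on syllable position.
CONSONANTS = {
    "ㄱ": "k", "ㄲ": "k*", "ㄴ": "n", "ㄷ": "t", "ㄸ": "t*", "ㅁ": "m",
    "ㅂ": "p", "ㅃ": "p*", "ㅅ": "s", "ㅆ": "s*", "ㅈ": "tʃ", "ㅉ": "tʃ*",
    "ㅊ": "tʃ^h", "ㅋ": "k^h", "ㅌ": "t^h", "ㅍ": "p^h", "ㅎ": "h",
}
POSITIONAL = {
    "onset": {"ㄹ": "ɾ", "ㅇ": None},  # onset ㅇ is silent
    "coda": {"ㄹ": "l", "ㅇ": "ŋ"},
}
VOWEL_PHONEMES = {
    "ㅏ": ["a"], "ㅑ": ["j", "a"], "ㅐ": ["æ"], "ㅒ": ["j", "æ"],
    "ㅓ": ["ʌ"], "ㅕ": ["j", "ʌ"], "ㅔ": ["e"], "ㅖ": ["j", "e"],
    "ㅗ": ["o"], "ㅛ": ["j", "o"], "ㅘ": ["w", "a"], "ㅙ": ["w", "æ"],
    "ㅚ": ["ø"], "ㅜ": ["u"], "ㅠ": ["j", "u"], "ㅝ": ["w", "ʌ"],
    "ㅞ": ["w", "e"], "ㅟ": ["w", "i"], "ㅡ": ["ɨ"], "ㅣ": ["i"],
    "ㅢ": ["ɨ", "i"],
}


def consonant(letter, position):
    """Map one consonant letter to its phoneme, or None if it is silent."""
    if letter in POSITIONAL[position]:
        return POSITIONAL[position][letter]
    if letter not in CONSONANTS:
        raise ValueError(f"letter {letter!r} is not covered by the table")
    return CONSONANTS[letter]


def to_phonemes(word):
    """Transcribe a Hangul word as a list of phonemes, one letter at a time."""
    phonemes = []
    for syllable in word:
        index = ord(syllable) - HANGUL_BASE
        if not 0 <= index < len(ONSETS) * len(VOWELS) * len(CODAS):
            raise ValueError(f"{syllable!r} is not a precomposed Hangul syllable")
        onset = ONSETS[index // (len(VOWELS) * len(CODAS))]
        vowel = VOWELS[(index // len(CODAS)) % len(VOWELS)]
        coda = CODAS[index % len(CODAS)]

        first = consonant(onset, "onset")
        if first is not None:
            phonemes.append(first)
        phonemes.extend(VOWEL_PHONEMES[vowel])
        if coda:
            phonemes.append(consonant(coda, "coda"))
    return phonemes


print("".join(to_phonemes("한글")))  # hankɨl
```

Returning a list rather than a string keeps multi-character phonemes such as tʃ^h as single units, which is convenient if the output feeds phoneme co-occurrence counts.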

About this article

Cite this article

Koo, H. An unsupervised method for identifying loanwords in Korean. Lang Resources & Evaluation 49, 355–373 (2015). https://doi.org/10.1007/s10579-015-9296-5

Issue Date: June 2015
