Extracting English-Korean Transliteration Pairs from Web Corpora

Oh, Jong-Hoon; Isahara, Hitoshi

doi:10.1007/11940098_24

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4285)

Abstract

Transliteration pair acquisition has received significant attention as a technique for constructing up-to-date transliteration lexicons and for supporting machine translation and cross-language information retrieval. Previous studies on transliteration pair acquisition focused only on the phonetic similarity model and seldom considered the real usage of transliterations in texts. Moreover, previous web-based validation models considered only one-way validation (validation from the viewpoint of a source term) rather than joint validation between a source term and a target term. To address these problems, we propose a novel transliteration pair acquisition model that extracts transliteration pairs from the Web and validates them by combining the phonetic similarity and joint web-validation models. Experiments demonstrated that our transliteration pair acquisition model was effective.
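
The contrast between one-way and joint validation can be illustrated with a minimal sketch. The abstract does not give the paper's formulas, so everything below is an assumption made for illustration: the `WebCounts` fields, the use of page counts, the product of the two one-way ratios as a joint score, and the linear interpolation with a phonetic similarity value are all hypothetical and are not the authors' model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WebCounts:
    """Hypothetical page counts for one candidate pair (not from the paper)."""
    source: int  # pages containing the English (source) term
    target: int  # pages containing the Korean (target) term
    both: int    # pages containing both terms


def one_way_validation(counts: WebCounts) -> float:
    """Validation from the source term's viewpoint only: both / source."""
    return counts.both / counts.source if counts.source else 0.0


def joint_validation(counts: WebCounts) -> float:
    """Assumed joint score: product of the source-side and target-side ratios."""
    if counts.source == 0 or counts.target == 0:
        return 0.0
    return (counts.both / counts.source) * (counts.both / counts.target)


def combined_score(phonetic_similarity: float, counts: WebCounts,
                   weight: float = 0.5) -> float:
    """Assumed linear interpolation of a phonetic score and the joint web score."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * phonetic_similarity + (1.0 - weight) * joint_validation(counts)


if __name__ == "__main__":
    # Made-up counts: the target term is frequent and mostly appears
    # without the source term, so the pair is weakly supported on that side.
    counts = WebCounts(source=100, target=10_000, both=90)
    print(f"one-way: {one_way_validation(counts):.4f}")
    print(f"joint:   {joint_validation(counts):.4f}")
    print(f"combined (phonetic=0.8): {combined_score(0.8, counts):.4f}")
```

With these made-up counts, the source-side ratio alone looks strong, while the joint score is pulled down because the target term rarely co-occurs with the source term. This is the kind of case that one-way validation cannot detect.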



Author information

Authors and Affiliations

Computational Linguistics Group, National Institute of Information and Communications Technology (NICT), 3-5 Hikaridai, Seika-cho, Soraku-gun, Kyoto, 619-0289, Japan
Jong-Hoon Oh & Hitoshi Isahara


Editor information

Editors and Affiliations

Graduate School of Information Science, Nara Institute of Science and Technology, 630-0192, Takayama, Ikoma, Nara, Japan
Yuji Matsumoto
Dept. of ECE, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
Richard W. Sproat
Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong
Kam-Fai Wong
State Key Lab of Intelligent Tech. & Sys., Tsinghua University,
Min Zhang

About this paper

Cite this paper

Oh, JH., Isahara, H. (2006). Extracting English-Korean Transliteration Pairs from Web Corpora. In: Matsumoto, Y., Sproat, R.W., Wong, KF., Zhang, M. (eds) Computer Processing of Oriental Languages. Beyond the Orient: The Research Challenges Ahead. ICCPOL 2006. Lecture Notes in Computer Science, vol 4285. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11940098_24

DOI: https://doi.org/10.1007/11940098_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-49667-0
Online ISBN: 978-3-540-49668-7
eBook Packages: Computer Science, Computer Science (R0)
