Abstract
Transliteration pair extraction, the identification of transliterations of foreign loanwords in literature, is a challenging task in research fields such as historical linguistics and digital humanities. In this paper, we focus on one important type of historical literature: classical Chinese Buddhist texts. We propose an approach that automatically identifies transliteration pairs in classical Chinese texts. Our approach comprises two stages: transliteration extraction and transliteration pair identification. To extract as many candidate transliterations as possible without introducing too many false positives, we adopt a hybrid method consisting of a suffix-array-based extraction step and a language-model-based filtering process. Using the ALINE algorithm, we then compare the extracted transliteration candidates for phonetic similarity based on their pronunciations in the Middle Chinese rime book Guangyun (廣韻). Pairs with similarity above a certain threshold are considered transliteration pairs. To evaluate our method, we constructed an evaluation set from several Buddhist texts, such as the Samyuktagama and the Mahavibhasa, which were translated into Chinese in different eras. Precision and recall are used to measure the effectiveness of our method.
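The two-stage pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation: the suffix array is built naively by sorting, the language-model filtering step is omitted, and a normalized edit distance over phoneme sequences stands in for ALINE, which scores alignments using phonetic features. All function names, the toy candidates, their phoneme lists, and the threshold value are hypothetical and chosen only for illustration.

```python
from itertools import combinations, groupby


def repeated_substrings(text, min_len=2, max_len=4, min_count=2):
    """Count substrings that occur at least min_count times via a naive suffix array."""
    # Naive suffix array: suffix start positions sorted lexicographically.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    counts = {}
    for n in range(min_len, max_len + 1):
        # Equal length-n prefixes are adjacent in suffix-array order.
        prefixes = [text[i:i + n] for i in sa if i + n <= len(text)]
        for key, group in groupby(prefixes):
            c = len(list(group))
            if c >= min_count:
                counts[key] = c
    return counts


def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Stand-in similarity in [0, 1]; ALINE would use phonetic features instead."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def find_pairs(pronunciations, threshold=0.6):
    """Return candidate pairs whose similarity is above the threshold."""
    items = sorted(pronunciations.items())
    return [(w1, w2) for (w1, p1), (w2, p2) in combinations(items, 2)
            if similarity(p1, p2) > threshold]


def precision_recall(predicted, gold):
    """Precision and recall of predicted pairs against a gold set."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    # Toy data only: placeholder candidate labels and made-up phoneme sequences.
    print(repeated_substrings("abcabcab", max_len=3))
    toy = {"A": list("sama"), "B": list("samo"), "C": list("kit")}
    pairs = find_pairs(toy)
    print(pairs, precision_recall(pairs, [("A", "B"), ("A", "C")]))
```

In a real setting, the candidates would come from the extraction stage, the phoneme sequences would be looked up in a Guangyun-based pronunciation table, and the threshold would be tuned on held-out data.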
Cite this article
Wang, YC., Wu, CK., Tsai, R.TH. et al. Transliteration Pair Extraction from Classical Chinese Buddhist Literature Using Phonetic Similarity Measurement. New Gener. Comput. 31, 265–283 (2013). https://doi.org/10.1007/s00354-013-0402-1
DOI: https://doi.org/10.1007/s00354-013-0402-1