Word Segmentation for Dialect Translation

Paul, Michael; Finch, Andrew; Sumita, Eiichiro

doi:10.1007/978-3-642-19437-5_5

Michael Paul¹⁷,
Andrew Finch¹⁷ &
Eiichiro Sumita¹⁷

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 6609))

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

1396 Accesses

Abstract

This paper proposes an unsupervised word segmentation algorithm that identifies word boundaries in continuous source language text in order to improve the translation quality of statistical machine translation (SMT) approaches for the translation of local dialects by exploiting linguistic information of the standard language. The method iteratively learns multiple segmentation schemes that are consistent with (1) the standard dialect segmentations and (2) the phrasal segmentations of an SMT system trained on the resegmented bitext of the local dialect. In a second step multiple segmentation schemes are integrated into a single SMT system by characterizing the source language side and merging identical translation pairs of differently segmented SMT models. Experimental results translating three Japanese local dialects (Kumamoto, Kyoto, Osaka) into three Indo-European languages (English, German, Russian) revealed that the proposed system outperforms SMT engines trained on character-based as well as standard dialect segmentation schemes for the majority of the investigated translation tasks and automatic evaluation metrics.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Semi-automatic Quasi-morphological Word Segmentation for Neural Machine Translation

Optimized Uyghur Segmentation for Statistical Machine Translation

An Improved Method of Applying a Machine Translation Model to a Chinese Word Segmentation Task

References

Nerbonne, J., Heeringa, W.: Measuring Dialect Distance Phonetically. In: Proc. of the ACL SIG in Computational Phonology, Madrid, Spain, pp. 11–18 (1997)
Google Scholar
Heeringa, W., Kleiweg, P., Gosskens, C., Nerbonne, J.: Evaluation of String Distance Algorithms for Dialectology. In: Proc. of the Workshop on Linguistic Distances, Sydney, Australia, pp. 51–62 (2006)
Google Scholar
Scherrer, Y.: Adaptive String Distance Measures for Bilingual Dialect Lexicon Induction. In: Proc. of the ACL Student Research Workshop, Prague, Czech Republic, pp. 55–60 (2007)
Google Scholar
Chitturi, R., Hansen, J.: Dialect Classification for online podcasts fusing Acoustic and Language-based Structural and Semantic Information. In: Proc. of the ACL-HLT (Companion Volume), Columbus, USA, pp. 21–24 (2008)
Google Scholar
Habash, N., Rambow, O., Kiraz, G.: Morphological Analysis and Generation for Arabic Dialects. In: Proc. of the ACL Workshop on Computational Approaches to Semitic Languages, Ann Arbor, USA, pp. 17–24 (2005)
Google Scholar
Chiang, D., Diab, M., Habash, N., Rainbow, O., Shareef, S.: Parsing Arabic Dialects. In: Proc. of the EACL, Trento, Italy, pp. 369–376 (2006)
Google Scholar
Biadsy, F., Hirschberg, J., Habash, N.: Spoken Arabic Dialect Identification Using Phonotactic Modeling. In: Proc. of the EACL, Athens, Greek, pp. 53–61 (2009)
Google Scholar
Weber, D., Mann, W.: Prospects for Computer-Assisted Dialect Adaption. American Journal of Computational Linguistics 7(3), 165–177 (1981)
Google Scholar
Zhang, X., Hom, K.H.: Dialect MT: A Case Study between Cantonese and Mandarin. In: Proc. of the ACL-COLING, Montreal, Canada, pp. 1460–1464 (1998)
Google Scholar
Sawaf, H.: Arabic Dialect Handling in Hybrid Machine Translation. In: Proc. of the AMTA, Denver, USA (2010)
Google Scholar
Cheng, K.S., Young, G., Wong, K.F.: A study on word-based and integrat-bit Chinese text compression algorithms. American Society of Information Science 50(3), 218–228 (1999)
Article Google Scholar
Venkataraman, A.: A statistical model for word discovery in transcribed speech. Computational Linguistics 27(3), 351–372 (2001)
Article Google Scholar
Goldwater, S., Griffith, T., Johnson, M.: Contextual Dependencies in Unsupervised Word Segmentation. In: Proc. of the ACL, Sydney, Australia, pp. 673–680 (2006)
Google Scholar
Chang, P.C., Galley, M., Manning, C.: Optimizing Chinese Word Segmentation for Machine Translation Performance. In: Proc. of the 3rd Workshop on SMT, Columbus, USA, pp. 224–232 (2008)
Google Scholar
Xu, J., Gao, J., Toutanova, K., Ney, H.: Bayesian Semi-Supervised Chinese Word Segmentation for SMT. In: Proc. of the COLING, Manchester, UK, pp. 1017–1024 (2008)
Google Scholar
Zhang, R., Yasuda, K., Sumita, E.: Improved Statistical Machine Translation by Multiple Chinese Word Segmentation. In: Proc. of the 3rd Workshop on SMT, Columbus, USA, pp. 216–223 (2008)
Google Scholar
Dyer, C.: Using a maximum entropy model to build segmentation lattices for MT. In: Proc. of HLT, Boulder, USA, pp. 406–414 (2009)
Google Scholar
Ma, Y., Way, A.: Bilingually Motivated Domain-Adapted Word Segmentation for Statistical Machine Translation. In: Proc. of the 12th EACL, Athens, Greece, pp. 549–557 (2009)
Google Scholar
Berger, A., Pietra, S.D., Pietra, V.D.: A maximum entropy approach to NLP. Computational Linguistics 22(1), 39–71 (1996)
Google Scholar
Pietra, S.D., Pietra, V.D., Lafferty, J.: Inducing Features of Random Fields. IEEE Transactions on Pattern Analysis and Machine Intelligence 19(4), 380–393 (1997)
Article Google Scholar
Ratnaparkhi, A.: A Maximum Entropy Model for Part-Of-Speech Tagging. In: Proc. of the EMNLP, Pennsylvania, USA, pp. 133–142 (1996)
Google Scholar
Kikui, G., Yamamoto, S., Takezawa, T., Sumita, E.: Comparative study on corpora for speech translation. IEEE Transactions on Audio, Speech and Language 14(5), 1674–1682 (2006)
Article Google Scholar
Och, F.J., Ney, H.: A Systematic Comparison of Statistical Alignment Models. Computational Linguistics 29(1), 19–51 (2003)
Article MATH Google Scholar
Stolcke, A.: SRILM an extensible language modeling toolkit. In: Proc. of ICSLP, Denver, USA, pp. 901–904 (2002)
Google Scholar
Finch, A., Denoual, E., Okuma, H., Paul, M., Yamamoto, H., Yasuda, K., Zhang, R., Sumita, E.: The NICT/ATR Speech Translation System. In: Proc. of the IWSLT, Trento, Italy, pp. 103–110 (2007)
Google Scholar
Papineni, K., Roukos, S., Ward, T., Zhu, W.-J.: BLEU: a Method for Automatic Evaluation of Machine Translation. In: Proc. of the 40th ACL, Philadelphia, USA, pp. 311–318 (2002)
Google Scholar
Snover, M., Dorr, B., Schwartz, R., Micciulla, L., Makhoul, J.: A study of translation edit rate with targeted human annotation. In: Proc. of the AMTA, Cambridge and USA, pp. 223–231 (2006)
Google Scholar

Download references

Author information

Authors and Affiliations

National Institute of Information and Communications Technology, MASTAR Project, Kyoto, Japan
Michael Paul, Andrew Finch & Eiichiro Sumita

Authors

Michael Paul
View author publications
You can also search for this author in PubMed Google Scholar
Andrew Finch
View author publications
You can also search for this author in PubMed Google Scholar
Eiichiro Sumita
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

Center for Computing Research, National Polytechnic Institute, Mexico
Alexander Gelbukh

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Paul, M., Finch, A., Sumita, E. (2011). Word Segmentation for Dialect Translation. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2011. Lecture Notes in Computer Science, vol 6609. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19437-5_5

Download citation

DOI: https://doi.org/10.1007/978-3-642-19437-5_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-19436-8
Online ISBN: 978-3-642-19437-5
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics