Abstract
Aligning parallel corpora is a crucial step before training statistical language models for machine translation. This paper investigates compression-based methods for aligning sentences in an English-Chinese parallel corpus. Four sentence-matching metrics for measuring alignment at the sentence level are compared: the standard sentence length ratio (SLR) and three new metrics, namely absolute sentence length difference (SLD), compression code length ratio (CR), and absolute compression code length difference (CD). Initial experiments with CR show that the Prediction by Partial Matching (PPM) compression scheme, which also performs well on many language modeling tasks, significantly outperforms the standard compression algorithms Gzip and Bzip2. The paper then shows that, for sentence alignment of a parallel corpus with ground truth judgments, the compression code length ratio using PPM always performs better than the sentence length ratio, and that the difference measurements also work better than the ratio measurements.
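As a rough illustration of the four metrics named in the abstract, the following minimal Python sketch computes SLR, SLD, CR, and CD for one candidate sentence pair. It rests on several assumptions that the abstract does not confirm: sentence length is measured in UTF-8 bytes, code length is approximated as the size in bits of the compressed sentence, and the ratios are taken as source over target. PPM is not available in the Python standard library, so `zlib` (the DEFLATE algorithm used by Gzip) and `bz2` stand in for it here. The function names `code_length_bits` and `alignment_metrics` are hypothetical.

```python
import bz2
import zlib


def code_length_bits(text, compress=zlib.compress):
    """Approximate code length as the number of bits in the compressed UTF-8 text.

    Assumption: a stand-in for the PPM code length used in the paper.
    """
    return 8 * len(compress(text.encode("utf-8")))


def alignment_metrics(src, tgt, compress=zlib.compress):
    """Return SLR, SLD, CR and CD for one candidate sentence pair.

    Assumptions: lengths are counted in UTF-8 bytes and ratios are
    source over target; the paper's exact definitions may differ.
    """
    if not src or not tgt:
        raise ValueError("both sentences must be non-empty")
    src_len = len(src.encode("utf-8"))
    tgt_len = len(tgt.encode("utf-8"))
    src_bits = code_length_bits(src, compress)
    tgt_bits = code_length_bits(tgt, compress)
    return {
        "SLR": src_len / tgt_len,          # sentence length ratio
        "SLD": abs(src_len - tgt_len),     # absolute sentence length difference
        "CR": src_bits / tgt_bits,         # compression code length ratio
        "CD": abs(src_bits - tgt_bits),    # absolute code length difference
    }


if __name__ == "__main__":
    english = "Hello world."
    chinese = "你好，世界。"
    print(alignment_metrics(english, chinese))
    print(alignment_metrics(english, chinese, compress=bz2.compress))
```

General-purpose compressors such as these add fixed header overhead that can dominate on sentence-length inputs, so the values produced by this sketch should not be compared with the results reported in the paper.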
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Liu, W., Chang, Z., Teahan, W.J. (2014). Experiments with a PPM Compression-Based Method for English-Chinese Bilingual Sentence Alignment. In: Besacier, L., Dediu, A.-H., Martín-Vide, C. (eds) Statistical Language and Speech Processing. SLSP 2014. Lecture Notes in Computer Science, vol 8791. Springer, Cham. https://doi.org/10.1007/978-3-319-11397-5_5
DOI: https://doi.org/10.1007/978-3-319-11397-5_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11396-8
Online ISBN: 978-3-319-11397-5
eBook Packages: Computer Science, Computer Science (R0)