Experiments with a PPM Compression-Based Method for English-Chinese Bilingual Sentence Alignment

Liu, Wei; Chang, Zhipeng; Teahan, William J.

doi:10.1007/978-3-319-11397-5_5

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8791)

Included in the following conference series:

International Conference on Statistical Language and Speech Processing

Abstract

Alignment of parallel corpora is a crucial step prior to training statistical language models for machine translation. This paper investigates compression-based methods for aligning sentences in an English-Chinese parallel corpus. Four metrics for matching sentences at the sentence level are compared: the standard sentence length ratio (SLR) and three new metrics, namely absolute sentence length difference (SLD), compression code length ratio (CR), and absolute compression code length difference (CD). Initial experiments with CR show that the Prediction by Partial Matching (PPM) compression scheme, a method that also performs well at many language modeling tasks, significantly outperforms the other standard compression algorithms, Gzip and Bzip2. The paper then shows that, for sentence alignment of a parallel corpus with ground truth judgments, the compression code length ratio using PPM always performs better than the sentence length ratio, and that the difference measurements also work better than the ratio measurements.
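
The four metrics named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes that sentence length is measured in UTF-8 bytes and that the compression code length can be approximated by the compressed size of a single sentence. Because PPM is not in the Python standard library, zlib (the DEFLATE algorithm used by Gzip) and bz2 serve as stand-in compressors. The function names `code_length` and `alignment_metrics` are hypothetical.

```python
import bz2
import zlib
from typing import Callable, Dict

Compressor = Callable[[bytes], bytes]


def code_length(text: str, compress: Compressor = zlib.compress) -> int:
    """Compressed size in bytes of the UTF-8 encoded text.

    Assumption: this stands in for a PPM code length; the paper's PPM
    models are not reproduced here.
    """
    return len(compress(text.encode("utf-8")))


def alignment_metrics(src: str, tgt: str,
                      compress: Compressor = zlib.compress) -> Dict[str, float]:
    """Return SLR, SLD, CR and CD for one candidate sentence pair."""
    len_s = len(src.encode("utf-8"))
    len_t = len(tgt.encode("utf-8"))
    if len_t == 0:
        raise ValueError("target sentence must not be empty")
    code_s = code_length(src, compress)
    code_t = code_length(tgt, compress)
    return {
        "SLR": len_s / len_t,        # sentence length ratio
        "SLD": abs(len_s - len_t),   # absolute sentence length difference
        "CR": code_s / code_t,       # compression code length ratio
        "CD": abs(code_s - code_t),  # absolute code length difference
    }


if __name__ == "__main__":
    pair = ("The weather is nice today.", "今天天气很好。")
    print(alignment_metrics(*pair))                         # zlib stand-in
    print(alignment_metrics(*pair, compress=bz2.compress))  # bz2 stand-in
```

General-purpose compressors add fixed header overhead that dominates on short inputs, so the values from this sketch are illustrative only and say nothing about the results reported in the paper.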

Author information

Authors and Affiliations

School of Computer Science, Bangor University, Dean Street, Bangor, Gwynedd, LL57 1UT, UK
Wei Liu, Zhipeng Chang & William J. Teahan

Corresponding author

Correspondence to Wei Liu.

Editor information

Editors and Affiliations

University Joseph Fourier, Grenoble, France
Laurent Besacier
Rovira i Virgili University, Tarragona, Spain
Adrian-Horia Dediu
Rovira i Virgili University, Tarragona, Spain
Carlos Martín-Vide

About this paper

Cite this paper

Liu, W., Chang, Z., Teahan, W.J. (2014). Experiments with a PPM Compression-Based Method for English-Chinese Bilingual Sentence Alignment. In: Besacier, L., Dediu, AH., Martín-Vide, C. (eds) Statistical Language and Speech Processing. SLSP 2014. Lecture Notes in Computer Science, vol 8791. Springer, Cham. https://doi.org/10.1007/978-3-319-11397-5_5

DOI: https://doi.org/10.1007/978-3-319-11397-5_5
Published: 03 September 2014
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11396-8
Online ISBN: 978-3-319-11397-5
eBook Packages: Computer Science, Computer Science (R0)