Experiments with a PPM Compression-Based Method for English-Chinese Bilingual Sentence Alignment

  • Conference paper
Statistical Language and Speech Processing (SLSP 2014)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8791)

Abstract

Alignment of parallel corpora is a crucial step prior to training statistical language models for machine translation. This paper investigates compression-based methods for aligning sentences in an English-Chinese parallel corpus. Four metrics for matching sentences, and hence for measuring alignment at the sentence level, are compared: the standard sentence length ratio (SLR) and three new metrics, namely absolute sentence length difference (SLD), compression code length ratio (CR), and absolute compression code length difference (CD). Initial experiments with CR show that the Prediction by Partial Matching (PPM) compression scheme, a method that also performs well at many language modeling tasks, significantly outperforms the standard compression algorithms Gzip and Bzip2. The paper then shows that, for sentence alignment of a parallel corpus with ground truth judgments, the compression code length ratio using PPM always performs better than the sentence length ratio, and that the difference measurements also work better than the ratio measurements.
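
The four metrics named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact definitions, so everything here is an assumption: sentence length is taken as the UTF-8 byte count, a ratio is English over Chinese, and a difference is the absolute value of the subtraction. PPM is not available in the Python standard library, so `zlib` (the DEFLATE algorithm used by Gzip) and `bz2` stand in as compressors; the function names are hypothetical and are not taken from the paper.

```python
import bz2
import zlib
from typing import Callable

Compressor = Callable[[bytes], bytes]


def sentence_length(text: str) -> int:
    """Sentence length in UTF-8 bytes (an assumed unit, not the paper's)."""
    return len(text.encode("utf-8"))


def code_length(text: str, compress: Compressor = zlib.compress) -> int:
    """Size in bytes of the compressed sentence.

    A stand-in for the PPM code length; general-purpose compressors add
    fixed header overhead, which dominates on very short inputs.
    """
    return len(compress(text.encode("utf-8")))


def slr(en: str, zh: str) -> float:
    """Sentence length ratio (assumed: English length / Chinese length)."""
    if not zh:
        raise ValueError("Chinese sentence must be non-empty")
    return sentence_length(en) / sentence_length(zh)


def sld(en: str, zh: str) -> int:
    """Absolute sentence length difference."""
    return abs(sentence_length(en) - sentence_length(zh))


def cr(en: str, zh: str, compress: Compressor = zlib.compress) -> float:
    """Compression code length ratio (assumed: English / Chinese)."""
    return code_length(en, compress) / code_length(zh, compress)


def cd(en: str, zh: str, compress: Compressor = zlib.compress) -> int:
    """Absolute compression code length difference."""
    return abs(code_length(en, compress) - code_length(zh, compress))


if __name__ == "__main__":
    english = "Alignment of parallel corpora is a crucial step."
    chinese = "平行语料库的对齐是关键的一步。"
    for name, compress in (("zlib", zlib.compress), ("bz2", bz2.compress)):
        print(name, cr(english, chinese, compress), cd(english, chinese, compress))
    print("SLR", slr(english, chinese), "SLD", sld(english, chinese))
```

Passing the compressor as a parameter keeps the metric definitions independent of the coding scheme, so a PPM implementation that returns a code length could be swapped in without changing `cr` or `cd`.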

Author information

Corresponding author

Correspondence to Wei Liu.

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Liu, W., Chang, Z., Teahan, W.J. (2014). Experiments with a PPM Compression-Based Method for English-Chinese Bilingual Sentence Alignment. In: Besacier, L., Dediu, AH., Martín-Vide, C. (eds) Statistical Language and Speech Processing. SLSP 2014. Lecture Notes in Computer Science, vol 8791. Springer, Cham. https://doi.org/10.1007/978-3-319-11397-5_5

  • DOI: https://doi.org/10.1007/978-3-319-11397-5_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-11396-8

  • Online ISBN: 978-3-319-11397-5

  • eBook Packages: Computer Science, Computer Science (R0)
