Improving Word Alignment Using Alignment of Deep Structures

Mareček, David

doi:10.1007/978-3-642-04208-9_11

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5729)

Included in the following conference series:

International Conference on Text, Speech and Dialogue

Abstract

In this paper, we describe the differences between classical word alignment on the surface (word-layer alignment) and alignment of deep syntactic sentence representations (tectogrammatical alignment). The deep structures we use are dependency trees whose nodes are content (autosemantic) words. Most function words, such as prepositions, articles, and auxiliary verbs, are hidden. We introduce an algorithm that aligns such trees using a perceptron-based scoring function. For evaluation purposes, a set of parallel sentences was manually aligned. We show that using statistical word alignment (GIZA++) can improve the tectogrammatical alignment. Surprisingly, we also show that the tectogrammatical alignment can then be used to significantly improve the original word alignment.
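
The idea of scoring candidate node pairs with a perceptron can be illustrated with a minimal Python sketch. This is not the paper's algorithm: the abstract does not specify the feature set, the training regime, or the search procedure, so everything below is an assumption made for illustration. The `Node` class, the three features (`giza_link`, `dict_match`, `position_diff`), the plain binary perceptron update, and the greedy best-target search are all hypothetical, and tree-structure features such as parent-child agreement are left out for brevity.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

Pair = Tuple[int, int]  # (source node index, target node index)


@dataclass(frozen=True)
class Node:
    idx: int    # position of the node in its tree (hypothetical ordering)
    lemma: str  # lemma of the content word


def features(s: Node, t: Node, n_src: int, n_tgt: int,
             giza: Set[Pair], dictionary: Set[Tuple[str, str]]) -> Dict[str, float]:
    """Illustrative features for one candidate pair (all names are assumptions)."""
    return {
        "giza_link": 1.0 if (s.idx, t.idx) in giza else 0.0,
        "dict_match": 1.0 if (s.lemma, t.lemma) in dictionary else 0.0,
        "position_diff": abs(s.idx / n_src - t.idx / n_tgt),
        "bias": 1.0,
    }


def score(weights: Dict[str, float], feats: Dict[str, float]) -> float:
    """Linear score: dot product of the weight and feature vectors."""
    return sum(weights.get(name, 0.0) * value for name, value in feats.items())


def train(examples, epochs: int = 5) -> Dict[str, float]:
    """Binary perceptron: update the weights on every misclassified pair."""
    weights: Dict[str, float] = {}
    for _ in range(epochs):
        for src, tgt, giza, dictionary, gold in examples:
            for s in src:
                for t in tgt:
                    f = features(s, t, len(src), len(tgt), giza, dictionary)
                    predicted = score(weights, f) > 0
                    actual = (s.idx, t.idx) in gold
                    if predicted != actual:
                        sign = 1.0 if actual else -1.0
                        for name, value in f.items():
                            weights[name] = weights.get(name, 0.0) + sign * value
    return weights


def align(src: List[Node], tgt: List[Node], weights: Dict[str, float],
          giza: Set[Pair], dictionary: Set[Tuple[str, str]]) -> Set[Pair]:
    """Greedy search: link each source node to its best-scoring target node,
    provided the score is positive."""
    links: Set[Pair] = set()
    for s in src:
        scored = [(score(weights, features(s, t, len(src), len(tgt), giza, dictionary)), t.idx)
                  for t in tgt]
        best_score, best_idx = max(scored)
        if best_score > 0:
            links.add((s.idx, best_idx))
    return links


# Toy data (invented for this sketch): one Czech-English sentence pair.
src = [Node(0, "pes"), Node(1, "štěkat")]
tgt = [Node(0, "dog"), Node(1, "bark")]
giza = {(0, 0), (1, 1)}        # links proposed by a word aligner
dictionary = {("pes", "dog")}  # tiny bilingual lexicon
gold = {(0, 0), (1, 1)}        # manual alignment

weights = train([(src, tgt, giza, dictionary, gold)])
links = align(src, tgt, weights, giza, dictionary)
print(sorted(links))
```

Feeding the word-aligner links in as a feature mirrors the abstract's claim that statistical word alignment can improve the tectogrammatical alignment. How the resulting node links are projected back to improve the word alignment is not described in the abstract, so it is not sketched here.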

The work on this project was supported by the grants GAUK 9994/2009, GAČR 201/09/H057, and GAAV ČR 1ET101120503.

Author information

Authors and Affiliations

Institute of Formal and Applied Linguistics, Charles University in Prague, Czech Republic
David Mareček

Editor information

Editors and Affiliations

University of West Bohemia in Pilsen, Czech Republic
Václav Matoušek
Department of Computer Science, University of West Bohemia in Pilsen, Univerzitni 8, 30614, Plzen, Czech Republic
Pavel Mautner

About this paper

Cite this paper

Mareček, D. (2009). Improving Word Alignment Using Alignment of Deep Structures. In: Matoušek, V., Mautner, P. (eds) Text, Speech and Dialogue. TSD 2009. Lecture Notes in Computer Science, vol 5729. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04208-9_11

DOI: https://doi.org/10.1007/978-3-642-04208-9_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-04207-2
Online ISBN: 978-3-642-04208-9
eBook Packages: Computer Science, Computer Science (R0)
