Analyzing Parallelism and Domain Similarities in the MAREC Patent Corpus

Wäschle, Katharina; Riezler, Stefan

doi:10.1007/978-3-642-31274-8_2


Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7356)

Abstract

Statistical machine translation of patents requires large amounts of sentence-parallel data. Translations of patent text often exist for parts of the patent document, namely title, abstract and claims. However, there are no direct translations of the largest part of the document, the description or background of the invention. We document a twofold approach for extracting parallel data from all sections of the patent documents in a large multilingual patent corpus. Since language and style differ depending on document section (title, abstract, description, claims) and patent topic (according to the International Patent Classification), we sort the processed data into subdomains in order to enable its use in domain-oriented translation, e.g. when applying multi-task learning. We investigate several similarity metrics and apply them to the domains of patent topic and patent document sections. The product of our research is a corpus of 23 million parallel German-English sentences extracted from the MAREC patent corpus and a descriptive analysis of its subdomains.

Author information

Authors and Affiliations

Department of Computational Linguistics, Heidelberg University, Germany
Katharina Wäschle & Stefan Riezler

Editor information

Editors and Affiliations

Institute of Software Technology and Interactive Systems, Vienna University of Technology, Favoritenstr. 9-11/188, 1040, Vienna, Austria
Michail Salampasis
Royal School of Library and Information Science, 2300, Copenhagen, Denmark
Birger Larsen

About this paper

Cite this paper

Wäschle, K., Riezler, S. (2012). Analyzing Parallelism and Domain Similarities in the MAREC Patent Corpus. In: Salampasis, M., Larsen, B. (eds) Multidisciplinary Information Retrieval. IRFC 2012. Lecture Notes in Computer Science, vol 7356. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31274-8_2

DOI: https://doi.org/10.1007/978-3-642-31274-8_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-31273-1
Online ISBN: 978-3-642-31274-8
eBook Packages: Computer Science, Computer Science (R0)
