Analyzing Parallelism and Domain Similarities in the MAREC Patent Corpus

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7356)

Abstract

Statistical machine translation of patents requires large amounts of sentence-parallel data. Translations of patent text often exist for parts of the patent document, namely the title, abstract and claims. However, there are no direct translations of the largest part of the document, the description or background of the invention. We document a twofold approach for extracting parallel data from all sections of the patent documents in a large multilingual patent corpus. Since language and style differ depending on document section (title, abstract, description, claims) and patent topic (according to the International Patent Classification), we sort the processed data into subdomains to enable its use in domain-oriented translation, e.g. when applying multi-task learning. We investigate several similarity metrics and apply them to the domains of patent topic and patent document section. The product of our research is a corpus of 23 million parallel German-English sentences extracted from the MAREC patent corpus, together with a descriptive analysis of its subdomains.
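The sorting into subdomains described in the abstract can be pictured with a minimal sketch. It is not the authors' implementation: the `SentencePair` record layout and the `group_by_subdomain` helper are hypothetical names introduced for illustration. The only outside fact relied on is that the first letter of an International Patent Classification code (A to H) names its top-level section.

```python
from collections import defaultdict
from typing import NamedTuple


class SentencePair(NamedTuple):
    """Hypothetical record for one aligned German-English sentence pair."""
    de: str
    en: str
    section: str   # "title", "abstract", "description" or "claims"
    ipc_code: str  # e.g. "A61K 31/00"; the first letter is the IPC section


def group_by_subdomain(pairs):
    """Group sentence pairs along the two dimensions named in the abstract:
    document section and patent topic (top-level IPC section)."""
    by_section = defaultdict(list)
    by_ipc = defaultdict(list)
    for pair in pairs:
        by_section[pair.section].append(pair)
        # Assumption: pairs without an IPC code go into an "unknown" bucket.
        ipc_section = pair.ipc_code[:1].upper() or "unknown"
        by_ipc[ipc_section].append(pair)
    return dict(by_section), dict(by_ipc)


pairs = [
    SentencePair("Ein Verfahren", "A method", "claims", "A61K 31/00"),
    SentencePair("Vorrichtung", "Device", "title", "H04L 9/00"),
]
by_section, by_ipc = group_by_subdomain(pairs)
```

Each resulting bucket could then serve as one task in a multi-task learning setup, or as one side of a pairwise similarity comparison between subdomains.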



Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Wäschle, K., Riezler, S. (2012). Analyzing Parallelism and Domain Similarities in the MAREC Patent Corpus. In: Salampasis, M., Larsen, B. (eds) Multidisciplinary Information Retrieval. IRFC 2012. Lecture Notes in Computer Science, vol 7356. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31274-8_2

  • DOI: https://doi.org/10.1007/978-3-642-31274-8_2

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-31273-1

  • Online ISBN: 978-3-642-31274-8

  • eBook Packages: Computer Science, Computer Science (R0)
