Abstract

This chapter describes several approaches to using comparable corpora beyond the area of MT for under-resourced languages, the primary focus of the ACCURAT project. Section 7.1, which is based on Rapp and Zock (Automatic dictionary expansion using non-parallel corpora. In: A. Fink, B. Lausen, W. Seidel, & A. Ultsch (Eds.) Advances in Data Analysis, Data Handling and Business Intelligence. Proceedings of the 32nd Annual Meeting of the GfKl, 2008. Springer, Heidelberg, 2010), addresses the task of creating resources for bilingual dictionaries using a seed lexicon; Sect. 7.2 (based on Rapp et al., Identifying word translations from comparable documents without a seed lexicon. Proceedings of LREC 2012, Istanbul, 2012) develops and evaluates a novel methodology for creating bilingual dictionaries without an initial lexicon. Section 7.3 proposes a novel system that can extract Chinese–Japanese parallel sentences from quasi-comparable and comparable corpora.

Notes

  1. Examples are the parallel corpora derived from the proceedings of the European Parliament (Armstrong et al. 1998; Koehn 2005) and the JRC-Acquis corpus (Steinberger et al. 2006).

  2. For an overview of the availability of parallel texts for various languages, see Mike Maxwell’s posting on the corpora mailing list of February 27, 2008, with subject line ‘quantities of publicly available parallel text’, archived at http://listserv.linguistlist.org/archives/corpora.html

  3. This is an image converter allowing the exchange of camera lenses, thereby providing a shallow depth of field.

  4. In corpus-based studies, thresholds of e.g. 50 are sometimes recommended. However, as we here consider keywords that have a higher information content than an average token in a corpus, it makes sense to use a lower threshold.

  5. Note that the scores reported in Rapp (1999) were based on different corpora and a proprietary seed lexicon, which is why this work was replicated by Laws et al. (2010) using Wikipedia and a freely available lexicon.

  6. We could not easily compare with the TS1000 test set provided by Laws et al. (2010) as this adds some more sophistication (parts of speech and multiple translations) to the evaluation process, whereas we wanted to keep the evaluation process simple as we are dealing with many languages.

  7. Variable thresholds depending on word frequency might reduce the problem, but this has not been implemented.

  8. For better results, an evaluation method taking into account multiple translation possibilities might be desirable for Chinese. On the other hand (similar to BLEU scores in machine translation), it is better not to take these accuracy figures as absolute but instead as a means for comparing the performances of different algorithms. We think that, for this application, it is preferable to consider only the most salient translations, because the degree of arbitrariness (as inherent in the production of any gold standard) is minimised in this way.

  9. http://www.mandarintools.com/zhcode.html

  10. http://unicode.org/charts/unihan.html

  11. http://lotus.kuee.kyoto-u.ac.jp/ASPEC

  12. http://www.jst.go.jp

  13. http://www.nict.go.jp

  14. http://www.cnki.net

  15. http://ci.nii.ac.jp

  16. http://people.com.cn

  17. http://j.people.com.cn

  18. http://en.wikipedia.org/wiki/People’s_Daily

  19. http://www.lemurproject.org/indri

  20. http://code.google.com/p/giza-pp

  21. http://www.csie.ntu.edu.tw/~cjlin/libsvm

  22. http://www.speech.sri.com/projects/srilm

References

  • Abdul-Rauf, S., & Schwenk, H. (2011). Parallel sentence generation from comparable corpora for improved SMT. Machine Translation, 25(4), 341–375.

  • Adafre, S. F., & de Rijke, M. (2006). Finding similar sentences across multiple languages in Wikipedia. In Proceedings of EACL (pp. 62–69).

  • Armstrong, S., Kempen, M., McKelvie, D., Petitpierre, D., Rapp, R., & Thompson, H. (1998). Multilingual corpora for cooperation. In Proceedings of the 1st International Conference on Linguistic Resources and Evaluation (LREC) (Vol. 2, pp. 975–980), Granada.

  • Brants, T. (2000). TnT – A statistical part-of-speech tagger. In Proceedings of the 6th Applied Natural Language Processing Conference (pp. 224–231).

  • Chiao, Y.-C., Sta, J.-D., & Zweigenbaum, P. (2004). A novel approach to improve word translations extraction from non-parallel, comparable corpora. In Proceedings of the International Joint Conference on Natural Language Processing, Hainan, China, AFNLP, 2004.

  • Chu, C., Nakazawa, T., & Kurohashi, S. (2011). Japanese-Chinese phrase alignment using common Chinese characters information. In Proceedings of MT Summit XIII (pp. 475–482), Xiamen, China, September.

  • Chu, C., Nakazawa, T., Kawahara, D., & Kurohashi, S. (2012a, May). Exploiting shared Chinese characters in Chinese word segmentation optimization for Chinese-Japanese machine translation. In Proceedings of the 16th Annual Conference of the European Association for Machine Translation (EAMT2012) (pp. 35–42), Trento, Italy.

  • Chu, C., Nakazawa, T., Kawahara, D., & Kurohashi, S. (2012b, May). Chinese characters mapping table of Japanese, Traditional Chinese and Simplified Chinese. In Proceedings of the Eighth Conference on International Language Resources and Evaluation (LREC2012) (pp. 2149–2152), Istanbul, Turkey.

  • Chu, C., Nakazawa, T., Kawahara, D., & Kurohashi, S. (2013, August). Chinese–Japanese parallel sentence extraction from quasi-comparable corpora. In Proceedings of the Sixth Workshop on Building and Using Comparable Corpora (pp. 34–42). Association for Computational Linguistics, Sofia, Bulgaria.

  • Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1), 61–74.

  • Fung, P., & Cheung, P. (2004). Multi-level bootstrapping for extracting parallel sentences from a quasi-comparable corpus. In Proceedings of Coling 2004 (pp. 1051–1057), Geneva, Switzerland, Aug 23–Aug 27. COLING.

  • Fung, P., & McKeown, K. (1997). Finding terminology translations from non-parallel corpora. In Proceedings of the 5th Annual Workshop on Very Large Corpora (pp. 192–202), Hong Kong.

  • Fung, P., & Yee, L. Y. (1998). An IR approach for translating new words from nonparallel, comparable texts. In Proceedings of COLING-ACL 1998 (Vol. 1, pp. 414–420), Montreal.

  • Goh, C. L., Asahara, M., & Matsumoto, Y. (2005). Building a Japanese-Chinese dictionary using kanji/hanzi conversion. In Proceedings of the International Joint Conference on Natural Language Processing (pp. 670–681).

  • Jongejan, B., & Dalianis, H. (2009). Automatic training of lemmatization rules that handle morphological changes in pre-, in- and suffixes alike. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP (pp. 145–153).

  • Koehn, P. (2004). Statistical significance tests for machine translation evaluation. In D. Lin, & D. Wu (Eds.), Proceedings of EMNLP 2004 (pp. 388–395). Association for Computational Linguistics, Barcelona, Spain.

  • Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. In Proceedings of MT Summit (pp. 79–86), Phuket, Thailand.

  • Koehn, P., Hoang, H., Birch, A., et al. (2007, June). Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions (pp. 177–180), Association for Computational Linguistics, Prague, Czech Republic.

  • Kondrak, G., Marcu, D., & Knight, K. (2003). Cognates can improve statistical translation models. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (pp. 46–48).

  • Kurohashi, S., Nakamura, T., Matsumoto, Y., & Nagao, M. (1994). Improvements of Japanese morphological analyzer JUMAN. In Proceedings of the International Workshop on Sharable Natural Language (pp. 22–28).

  • Laws, F., Michelbacher, L., Dorow, B., Scheible, C., Heid, U., & Schütze, H. (2010). A linguistically grounded graph model for bilingual lexicon extraction. In Proceedings of Coling, Poster Volume (pp. 614–622).

  • Munteanu, D. S., & Marcu, D. (2005). Improving machine translation performance by exploiting non-parallel corpora. Computational Linguistics, 31(4), 477–504.

  • Munteanu, D. S., & Marcu, D. (2006, July). Extracting parallel sub-sentential fragments from non-parallel corpora. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (pp. 81–88). Association for Computational Linguistics, Sydney, Australia.

  • Och, F. J. (2003). Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (pp. 160–167). Association for Computational Linguistics, Sapporo, Japan.

  • Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (pp. 311–318). Association for Computational Linguistics, Philadelphia, PA.

  • Rapp, R. (1995). Identifying word translations in non-parallel texts. In Proceedings of the 33rd Meeting of the Association for Computational Linguistics (pp. 320–322), Cambridge, MA.

  • Rapp, R. (1996). Die Berechnung von Assoziationen. Hildesheim: Olms.

  • Rapp, R. (1999). Automatic identification of word translations from unrelated English and German corpora. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (pp. 519–526), College Park, MD.

  • Rapp, R., & Martin Vide, C. (2007). Statistical machine translation without parallel corpora. In G. Rehm, A. Witt, & L. Lemnitzer (Eds.), Datenstrukturen für linguistische Ressourcen und ihre Anwendungen/Data Structures for Linguistic Resources and Applications. Proceedings of the Biennial GLDV Conference 2007 (pp. 231–240). Gunter Narr Verlag, Tübingen.

  • Rapp, R., & Zock, M. (2010). Automatic dictionary expansion using non-parallel corpora. In A. Fink, B. Lausen, W. Seidel, & A. Ultsch (Eds.) Advances in Data Analysis, Data Handling and Business Intelligence. Proceedings of the 32nd Annual Meeting of the GfKl, 2008. Springer, Heidelberg.

  • Rapp, R., Sharoff, S., & Babych, B. (2012). Identifying word translations from comparable documents without a seed lexicon. In Proceedings of LREC 2012, Istanbul.

  • Rayson, P., & Garside, R. (2000). Comparing corpora using frequency profiling. In Proceedings of the Workshop on Comparing Corpora (WCC00) (Vol. 9, pp. 1–6).

  • Rumelhart, D. E., & McClelland, J. L. (1987). Parallel Distributed Processing. Explorations in the Microstructure of Cognition. Vol. 1: Foundations. MIT Press.

  • Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. International Conference on New Methods in Language Processing (pp. 44–49).

  • Sharoff, S., Kopotev, M., Erjavec, T., Feldman, A., & Divjak, D. (2008). Designing and evaluating a Russian tagset. In Proceedings of the Sixth Language Resources and Evaluation Conference, LREC 2008 (pp. 279–285), Marrakech.

  • Smith, J. R., Quirk, Ch., & Toutanova, K. (2010, June). Extracting parallel sentences from comparable corpora using document level alignment. Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (pp. 403–411), Association for Computational Linguistics, Los Angeles, CA.

  • Stefanescu, D., Ion, R., & Hunsicker, S. (2012, May). Hybrid parallel sentence mining from comparable corpora. In Proceedings of the 16th Annual Conference of the European Association for Machine Translation (EAMT2012) (pp. 117–128), Trento, Italy.

  • Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufiş, D., et al. (2006). The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. In Proceedings of the 5th International Conference on Language Resources and Evaluation, Genoa, Italy.

  • Tan, Ch. L., & Nagao, M. (1995). Automatic alignment of Japanese-Chinese bilingual texts. IEICE Transactions on Information and Systems, E78-D(1), 68–76.

  • Tillmann, Ch. (2009, August). A beam-search extraction algorithm for comparable data. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers (pp. 225–228), Association for Computational Linguistics, Suntec, Singapore.

  • Utiyama, M., & Isahara, H. (2003, July). Reliable measures for aligning Japanese-English news articles and sentences. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (pp. 72–79), Association for Computational Linguistics, Sapporo, Japan.

  • Wu, D., & Fung, P. (2005). Inversion transduction grammar constraints for mining parallel sentences from quasi-comparable corpora. In Proceedings of the Second International Joint Conference on Natural Language Processing (IJCNLP-2005), Jeju, Korea.

  • Zhao, B., & Vogel, S. (2002). Adaptive parallel sentences mining from web bilingual news collection. In Proceedings of the 2002 IEEE International Conference on Data Mining (pp. 745–748), IEEE Computer Society, Maebashi City, Japan.

Corresponding author

Correspondence to Bogdan Babych.

Additional information

Chapter editors: Bogdan Babych and Inguna Skadiņa

Copyright information

© 2019 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Rapp, R. et al. (2019). New Areas of Application of Comparable Corpora. In: Skadiņa, I., Gaizauskas, R., Babych, B., Ljubešić, N., Tufiş, D., Vasiļjevs, A. (eds) Using Comparable Corpora for Under-Resourced Areas of Machine Translation. Theory and Applications of Natural Language Processing. Springer, Cham. https://doi.org/10.1007/978-3-319-99004-0_7

  • DOI: https://doi.org/10.1007/978-3-319-99004-0_7

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-99003-3

  • Online ISBN: 978-3-319-99004-0

  • eBook Packages: Computer Science (R0)
