New Areas of Application of Comparable Corpora

Rapp, Reinhard; Xu, Vivian; Zock, Michael; Sharoff, Serge; Forsyth, Richard; Babych, Bogdan; Chu, Chenhui; Nakazawa, Toshiaki; Kurohashi, Sadao

doi:10.1007/978-3-319-99004-0_7

Part of the book series: Theory and Applications of Natural Language Processing (NLP)

Abstract

This chapter describes several approaches to using comparable corpora beyond machine translation (MT) for under-resourced languages, which is the primary focus of the ACCURAT project. Section 7.1, which is based on Rapp and Zock (2010), addresses the task of creating resources for bilingual dictionaries using a seed lexicon. Section 7.2, based on Rapp et al. (2012), develops and evaluates a novel methodology for creating bilingual dictionaries without an initial lexicon. Section 7.3 proposes a novel system that can extract Chinese–Japanese parallel sentences from quasi-comparable and comparable corpora.
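
As a rough illustration of the seed-lexicon idea mentioned for Sect. 7.1, the following Python sketch ranks candidate translations by comparing co-occurrence vectors after mapping source-side context words through a seed lexicon. It follows the general context-vector approach associated with Rapp (1999), not the chapter's actual implementation. The toy corpora, the seed lexicon, the window size and all function names are assumptions made for this example.

```python
from collections import Counter
from math import sqrt

# Toy "comparable" corpora (hypothetical data, tokenised sentences).
SRC_SENTS = [["the", "dog", "barks", "loudly"], ["the", "cat", "meows", "softly"],
             ["a", "dog", "barks"], ["a", "cat", "meows"]]
TGT_SENTS = [["der", "hund", "bellt", "laut"], ["die", "katze", "miaut", "leise"],
             ["ein", "hund", "bellt"], ["eine", "katze", "miaut"]]
# Hypothetical seed lexicon: known source -> target word pairs.
SEED = {"barks": "bellt", "meows": "miaut", "loudly": "laut", "softly": "leise"}


def cooccurrence_vector(word, sentences, window=2):
    """Count the words occurring within `window` tokens of `word`."""
    counts = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != word:
                continue
            start, end = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[tokens[j]] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def translate(word, src_sents, tgt_sents, seed, candidates):
    """Return the candidate whose context vector best matches `word`'s."""
    src_vec = cooccurrence_vector(word, src_sents)
    # Keep only context words covered by the seed lexicon, mapped to the target side.
    mapped = Counter()
    for ctx_word, count in src_vec.items():
        if ctx_word in seed:
            mapped[seed[ctx_word]] += count
    scored = [(cosine(mapped, cooccurrence_vector(c, tgt_sents)), c) for c in candidates]
    return max(scored, key=lambda pair: pair[0])[1]
```

With the toy data, `translate("dog", SRC_SENTS, TGT_SENTS, SEED, ["hund", "katze"])` selects the candidate whose contexts overlap with the mapped contexts of "dog". Real systems typically replace raw counts with an association measure and use far larger corpora.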

Notes

1. Examples are the parallel corpora derived from the proceedings of the European Parliament (Armstrong et al. 1998; Koehn 2005) and the JRC-Acquis corpus (Steinberger et al. 2006).
2. For an overview of the availability of parallel texts for various languages, see Mike Maxwell’s posting on the corpora mailing list of February 27, 2008, with subject line ‘quantities of publicly available parallel text’, archived at http://listserv.linguistlist.org/archives/corpora.html
3. This is an image converter allowing the exchange of camera lenses, thereby providing a shallow depth of field.
4. In corpus-based studies, thresholds of, e.g., 50 are sometimes recommended. However, as we consider here keywords that have a higher information content than an average token in a corpus, it makes sense to use a lower threshold.
5. Note that the scores reported in Rapp (1999) were based on different corpora and a proprietary seed lexicon, which is why this work was replicated by Laws et al. (2010) using Wikipedia and a freely available lexicon.
6. We could not easily compare with the TS1000 test set provided by Laws et al. (2010) as this adds some more sophistication (parts of speech and multiple translations) to the evaluation process, whereas we wanted to keep the evaluation process simple as we are dealing with many languages.
7. Variable thresholds depending on word frequency might reduce the problem, but this has not been implemented.
8. For better results, an evaluation method taking into account multiple translation possibilities might be desirable for Chinese. On the other hand (similar to BLEU scores in machine translation), it is better not to take these accuracy figures as absolute but instead as a means for comparing the performances of different algorithms. We think that, for this application, it is preferable to consider only the most salient translations, because the degree of arbitrariness (as inherent in the production of any gold standard) is minimised in this way.
9. http://www.mandarintools.com/zhcode.html
10. http://unicode.org/charts/unihan.html
11. http://lotus.kuee.kyoto-u.ac.jp/ASPEC
12. http://www.jst.go.jp
13. http://www.nict.go.jp
14. http://www.cnki.net
15. http://ci.nii.ac.jp
16. http://people.com.cn
17. http://j.people.com.cn
18. http://en.wikipedia.org/wiki/People’s_Daily
19. http://www.lemurproject.org/indri
20. http://code.google.com/p/giza-pp
21. http://www.csie.ntu.edu.tw/~cjlin/libsvm
22. http://www.speech.sri.com/projects/srilm

References

Abdul-Rauf, S., & Schwenk, H. (2011). Parallel sentence generation from comparable corpora for improved SMT. Machine Translation, 25(4), 341–375.
Adafre, S. F., & de Rijke, M. (2006). Finding similar sentences across multiple languages in Wikipedia. In Proceedings of EACL (pp. 62–69).
Armstrong, S., Kempen, M., McKelvie, D., Petitpierre, D., Rapp, R., & Thompson, H. (1998). Multilingual corpora for cooperation. In Proceedings of the 1st International Conference on Linguistic Resources and Evaluation (LREC) (Vol. 2, pp. 975–980), Granada.
Brants, T. (2000). TnT – A statistical part-of-speech tagger. In Proceedings of the 6th Applied Natural Language Processing Conference (pp. 224–231).
Chiao, Y.-C., Sta, J.-D., & Zweigenbaum, P. (2004). A novel approach to improve word translations extraction from non-parallel, comparable corpora. In Proceedings of the International Joint Conference on Natural Language Processing, Hainan, China, AFNLP, 2004.
Chu, C., Nakazawa, T., & Kurohashi, S. (2011). Japanese-Chinese phrase alignment using common Chinese characters information. In Proceedings of MT Summit XIII (pp. 475–482), Xiamen, China, September.
Chu, C., Nakazawa, T., Kawahara, D., & Kurohashi, S. (2012a, May). Exploiting shared Chinese characters in Chinese word segmentation optimization for Chinese-Japanese machine translation. In Proceedings of the 16th Annual Conference of the European Association for Machine Translation (EAMT2012) (pp. 35–42), Trento, Italy.
Chu, C., Nakazawa, T., Kawahara, D., & Kurohashi, S. (2012b, May). Chinese characters mapping table of Japanese, Traditional Chinese and Simplified Chinese. In Proceedings of the Eighth Conference on International Language Resources and Evaluation (LREC2012) (pp. 2149–2152), Istanbul, Turkey.
Chu, C., Nakazawa, T., Kawahara, D., & Kurohashi, S. (2013, August). Chinese–Japanese parallel sentence extraction from quasi-comparable corpora. In Proceedings of the Sixth Workshop on Building and Using Comparable Corpora (pp. 34–42). Association for Computational Linguistics, Sofia, Bulgaria.
Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1), 61–74.
Fung, P., & Cheung, P. (2004). Multi-level bootstrapping for extracting parallel sentences from a quasi-comparable corpus. In Proceedings of Coling 2004 (pp. 1051–1057), Geneva, Switzerland, Aug 23–Aug 27. COLING.
Fung, P., & McKeown, K. (1997). Finding terminology translations from non-parallel corpora. In Proceedings of the 5th Annual Workshop on Very Large Corpora (pp. 192–202), Hong Kong.
Fung, P., & Yee, L. Y. (1998). An IR approach for translating new words from nonparallel, comparable texts. In Proceedings of COLING-ACL 1998 (Vol. 1, pp. 414–420), Montreal.
Goh, C. L., Asahara, M., & Matsumoto, Y. (2005). Building a Japanese-Chinese dictionary using kanji/hanzi conversion. In Proceedings of the International Joint Conference on Natural Language Processing (pp. 670–681).
Jongejan, B., & Dalianis, H. (2009). Automatic training of lemmatization rules that handle morphological changes in pre-, in- and suffixes alike. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP (pp. 145–153).
Koehn, P. (2004). Statistical significance tests for machine translation evaluation. In D. Lin, & D. Wu (Eds.), Proceedings of EMNLP 2004 (pp. 388–395). Association for Computational Linguistics, Barcelona, Spain.
Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. In Proceedings of MT Summit (pp. 79–86), Phuket, Thailand.
Koehn, P., Hoang, H., Birch, A., et al. (2007, June). Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions (pp. 177–180), Association for Computational Linguistics, Prague, Czech Republic.
Kondrak, G., Marcu, D., & Knight, K. (2003). Cognates can improve statistical translation models. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (pp. 46–48).
Kurohashi, S., Nakamura, T., Matsumoto, Y., & Nagao, M. (1994). Improvements of Japanese morphological analyzer JUMAN. In Proceedings of the International Workshop on Sharable Natural Language (pp. 22–28).
Laws, F., Michelbacher, L., Dorow, B., Scheible, C., Heid, U., & Schütze, H. (2010). A linguistically grounded graph model for bilingual lexicon extraction. In Proceedings of Coling, Poster Volume (pp. 614–622).
Munteanu, D. S., & Marcu, D. (2005). Improving machine translation performance by exploiting non-parallel corpora. Computational Linguistics, 31(4), 477–504.
Munteanu, D. S., & Marcu, D. (2006, July). Extracting parallel sub-sentential fragments from non-parallel corpora. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (pp. 81–88). Association for Computational Linguistics, Sydney, Australia.
Och, F. J. (2003). Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (pp. 160–167). Association for Computational Linguistics, Sapporo, Japan.
Papineni, K., Roukos, S., Ward, T., & Zhu, W-J. (2002). BLEU: A method for automatic evaluation of machine translation. In Proceedings of 40th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics (pp. 311–318), Philadelphia, PA.
Rapp, R. (1995). Identifying word translations in non-parallel texts. In Proceedings of the 33rd Meeting of the Association for Computational Linguistics (pp. 320–322), Cambridge, MA.
Rapp, R. (1996). Die Berechnung von Assoziationen. Hildesheim: Olms.
Rapp, R. (1999). Automatic identification of word translations from unrelated English and German corpora. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (pp. 519–526), College Park, MD.
Rapp, R., & Martin Vide, C. (2007). Statistical machine translation without parallel corpora. In G. Rehm, A. Witt, & L. Lemnitzer (Eds.), Datenstrukturen für linguistische Ressourcen und ihre Anwendungen/Data Structures for Linguistic Resources and Applications. Proceedings of the Biennial GLDV Conference 2007 (pp. 231–240). Gunter Narr Verlag, Tübingen.
Rapp, R., & Zock, M. (2010). Automatic dictionary expansion using non-parallel corpora. In A. Fink, B. Lausen, W. Seidel, & A. Ultsch (Eds.) Advances in Data Analysis, Data Handling and Business Intelligence. Proceedings of the 32nd Annual Meeting of the GfKl, 2008. Springer, Heidelberg.
Rapp, R., Sharoff, S., & Babych, B. (2012). Identifying word translations from comparable documents without a seed lexicon. In Proceedings of LREC 2012, Istanbul.
Rayson, P., & Garside, R. (2000). Comparing corpora using frequency profiling. In Proceedings of the Workshop on Comparing Corpora (WCC ’00) (Vol. 9, pp. 1–6).
Rumelhart, D. E., & McClelland, J. L. (1987). Parallel Distributed Processing. Explorations in the Microstructure of Cognition. Vol. 1: Foundations. MIT Press.
Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. International Conference on New Methods in Language Processing (pp. 44–49).
Sharoff, S., Kopotev, M., Erjavec, T., Feldman, A., & Divjak, D. (2008). Designing and evaluating a Russian tagset. In Proceedings of the Sixth Language Resources and Evaluation Conference, LREC 2008 (pp. 279–285), Marrakech.
Smith, J. R., Quirk, Ch., & Toutanova, K. (2010, June). Extracting parallel sentences from comparable corpora using document level alignment. Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (pp. 403–411), Association for Computational Linguistics, Los Angeles, CA.
Stefanescu, D., Ion, R., & Hunsicker, S. (2012, May). Hybrid parallel sentence mining from comparable corpora. In Proceedings of the 16th Annual Conference of the European Association for Machine Translation (EAMT2012) (pp. 117–128), Trento, Italy.
Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufiş, D., et al. (2006). The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. In Proceedings of the 5th International Conference on Language Resources and Evaluation, Genoa, Italy.
Tan, Ch. L., & Nagao, M. (1995). Automatic alignment of Japanese-Chinese bilingual texts. IEICE Transactions on Information and Systems, E78-D(1), 68–76.
Tillmann, Ch. (2009, August). A beam-search extraction algorithm for comparable data. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers (pp. 225–228), Association for Computational Linguistics, Suntec, Singapore.
Utiyama, M., & Isahara, H. (2003, July). Reliable measures for aligning Japanese-English news articles and sentences. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (pp. 72–79), Association for Computational Linguistics, Sapporo, Japan.
Wu, D., & Fung, P. (2005). Inversion transduction grammar constraints for mining parallel sentences from quasi-comparable corpora. In Proceedings of the Second International Joint Conference on Natural Language Processing (IJCNLP-2005), Jeju, Korea.
Zhao, B., & Vogel, S. (2002). Adaptive parallel sentences mining from web bilingual news collections. In Proceedings of the 2002 IEEE International Conference on Data Mining (pp. 745–748), IEEE Computer Society, Maebashi City, Japan.

Author information

Authors and Affiliations

University of Mainz, Mainz, Germany
Reinhard Rapp
Beijing Foreign Studies University, Beijing, China
Vivian Xu
CNRS, Marseille, France
Michael Zock
University of Leeds, Leeds, UK
Serge Sharoff, Richard Forsyth & Bogdan Babych
Graduate School of Informatics, Kyoto University, Kyoto, Japan
Chenhui Chu, Toshiaki Nakazawa & Sadao Kurohashi

Corresponding author

Correspondence to Bogdan Babych.

Editor information

Editors and Affiliations

Tilde, Riga, Latvia
Inguna Skadiņa
Department of Computer Science, University of Sheffield, Sheffield, UK
Robert Gaizauskas
School of Modern Languages & Cultures, University of Leeds, Leeds, UK
Bogdan Babych
Faculty of Humanities & Social Sciences, University of Zagreb, Zagreb, Croatia
Nikola Ljubešić
Institute for Artificial Intelligence, Romanian Academy, Bucharest, Romania
Dan Tufiş
Tilde, Riga, Latvia
Andrejs Vasiļjevs

Additional information

Chapter editors: Bogdan Babych and Inguna Skadiņa

About this chapter

Cite this chapter

Rapp, R. et al. (2019). New Areas of Application of Comparable Corpora. In: Skadiņa, I., Gaizauskas, R., Babych, B., Ljubešić, N., Tufiş, D., Vasiļjevs, A. (eds) Using Comparable Corpora for Under-Resourced Areas of Machine Translation. Theory and Applications of Natural Language Processing. Springer, Cham. https://doi.org/10.1007/978-3-319-99004-0_7

DOI: https://doi.org/10.1007/978-3-319-99004-0_7
Published: 07 February 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99003-3
Online ISBN: 978-3-319-99004-0
eBook Packages: Computer Science, Computer Science (R0)
