3-Step Parallel Corpus Cleaning Using Monolingual Crowd Workers

  • Conference paper
Part of the book series: Communications in Computer and Information Science (CCIS, volume 593)

Abstract

Good machine translation in domains that lack sufficient existing resources requires a manually created, high-quality parallel corpus. Although hiring professional translators improves corpus quality to some extent, translation mistakes cannot be avoided completely. In this paper, we propose a framework for cleaning an existing professionally translated parallel corpus quickly and cheaply. The proposed method uses a 3-step crowdsourcing procedure to efficiently detect and edit translation flaws while also guaranteeing the reliability of the edits. Experiments on a fashion-domain e-commerce site (EC-site) parallel corpus show the effectiveness of the proposed method for parallel corpus cleaning.

Notes

  1. Unfortunately, this service has since been closed.

  2. Children's pants in which the inner thigh is not sewn up.

  3. In our experiments, we showed both the source and the translated sentences.

  4. http://crowdsourcing.yahoo.co.jp

  5. We excluded some sentences that were garbled.

  6. http://www.editage.com

Acknowledgments

This work was supported by Yahoo Japan Corporation. We thank the anonymous reviewers for their many useful comments.

Author information

Corresponding author

Correspondence to Toshiaki Nakazawa.

Copyright information

© 2016 Springer Science+Business Media Singapore

About this paper

Cite this paper

Nakazawa, T., Kurohashi, S., Kobayashi, H., Ishikawa, H., Sassano, M. (2016). 3-Step Parallel Corpus Cleaning Using Monolingual Crowd Workers. In: Hasida, K., Purwarianti, A. (eds) Computational Linguistics. PACLING 2015. Communications in Computer and Information Science, vol 593. Springer, Singapore. https://doi.org/10.1007/978-981-10-0515-2_6

  • DOI: https://doi.org/10.1007/978-981-10-0515-2_6

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-10-0514-5

  • Online ISBN: 978-981-10-0515-2

  • eBook Packages: Computer Science, Computer Science (R0)
