A Collaborative Platform to Collect Data for Developing Machine Translation Systems

  • Conference paper
  • In: Proceedings of International Joint Conference on Computational Intelligence

Abstract

The emergence of neural machine translation techniques has opened up a new era for developing translation systems. However, these techniques require very large parallel corpora, which are scarce for many under-resourced languages, e.g., Bangla. Currently, there is a lack of publicly available collaborative systems for developing such corpora. In this paper, we report an online collaborative system for developing parallel corpora. The system is designed to support any language; however, we evaluated it only for developing a Bangla–English parallel corpus. In a task completion evaluation experiment, the system outperformed the widely used offline system OmegaT.

Notes

  1. http://omegat.org
  2. http://zanata.org
  3. https://github.com/AridHasan/Data-Collection-System-for-Machine-Translation
  4. https://github.com/AridHasan/Data-Collection-System-for-Machine-Translation/tree/master/data
  5. A translation memory (TM) is a database of source- and target-language segment pairs. It is usually populated while translators translate the text corpus.
  6. Fuzzy matching is an approximate matching approach that tries to find a matching translated segment by comparing the input against previously translated sentences. The segment can be a phrase or a whole sentence.
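
As a rough illustration of the TM and fuzzy matching notes above, a lookup against a translation memory can be sketched as follows. This is a minimal sketch, not the system's actual implementation: the similarity measure (the ratio from Python's `difflib.SequenceMatcher`), the 0.7 threshold, the function name `fuzzy_lookup`, and the sample TM entries are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical translation memory: source segment -> target segment.
# These pairs are invented for illustration; they are not data from the paper.
translation_memory = {
    "I eat rice.": "আমি ভাত খাই।",
    "I drink water.": "আমি পানি পান করি।",
}


def fuzzy_lookup(segment, memory, threshold=0.7):
    """Return (score, source, target) for the most similar TM entry.

    Returns None when no entry reaches the similarity threshold.
    The threshold value is an assumption, not taken from the paper.
    """
    best = None
    for source, target in memory.items():
        # ratio() is a similarity score in [0, 1]; 1.0 means identical strings.
        score = SequenceMatcher(None, segment.lower(), source.lower()).ratio()
        if score >= threshold and (best is None or score > best[0]):
            best = (score, source, target)
    return best


# A new sentence that differs slightly from a stored one.
match = fuzzy_lookup("I eat rice today.", translation_memory)
if match is not None:
    score, source, target = match
    print(f"Closest TM entry: {source!r} -> {target!r} (similarity {score:.2f})")
```

A translator would then edit the suggested target segment rather than translate from scratch. Character-level similarity is only one possible choice here; an edit-distance or token-level measure could be substituted without changing the overall structure.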

Acknowledgements

We would like to extend our sincere thanks to A S M Humayun Morshed from Daffodil International University and the students from the Department of English of the same university for helping us with the data collection task. We would also like to thank our anonymous reviewers for their detailed and constructive comments.

Corresponding author

Correspondence to Md. Arid Hasan.

Copyright information

© 2020 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Hasan, M.A., Alam, F., Noori, S.R.H. (2020). A Collaborative Platform to Collect Data for Developing Machine Translation Systems. In: Uddin, M., Bansal, J. (eds) Proceedings of International Joint Conference on Computational Intelligence. Algorithms for Intelligent Systems. Springer, Singapore. https://doi.org/10.1007/978-981-13-7564-4_35
