Domain-specific machine translation with recurrent neural network for software localization

Wang, Xu; Chen, Chunyang; Xing, Zhenchang

doi:10.1007/s10664-019-09702-z

Published: 30 April 2019

Volume 24, pages 3514–3545 (2019)

Empirical Software Engineering

Abstract

Software localization is the process of adapting a software product to the linguistic, cultural and technical requirements of a target market. It allows software companies to access foreign markets that would otherwise be difficult to penetrate. Many studies have addressed locating need-to-translate strings in software and adapting the UI layout after text is translated into the new language. However, no work has been done on the most important and time-consuming step of the software localization process, i.e., the translation of software text. Software text has some unique characteristics, for example, application-specific meanings, context-sensitive translation, and domain-specific rare words. Because of these, general machine translation tools such as Google Translate cannot properly address the linguistic and technical nuances of translating software text for software localization. In this paper, we propose a neural-network-based translation model specifically designed and trained for mobile application text translation. We collect large-scale human-translated bilingual sentence pairs from different Android applications, which are crawled from the Google Play store. We customize the original RNN encoder-decoder neural machine translation model by adding categorical information to address the domain-specific rare word problem, which is a common phenomenon in software text. We evaluate our approach on translating the text of held-out test Android applications, using both BLEU score and exact match rate. The results show that our method outperforms the general machine translation tool, Google Translate, and generates more acceptable translations for software localization with less need for human revision. Our approach is language independent, and we show its generality between English and the other five official languages of the United Nations (UN).
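The two evaluation measures named in the abstract can be illustrated with a short sketch. This is not the authors' evaluation script: it assumes whitespace-tokenized sentences, a single reference translation per sentence, and unsmoothed corpus-level BLEU with n-grams up to length 4. The function names (ngrams, corpus_bleu, exact_match_rate) and the sample strings are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Unsmoothed corpus-level BLEU, assuming one reference per hypothesis."""
    matches = [0] * max_n  # clipped n-gram matches, one slot per n-gram order
    totals = [0] * max_n   # n-grams produced by the system, per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            ref_counts = ngrams(ref, n)
            # Counter intersection keeps the minimum count of each n-gram,
            # which is the "clipping" step of modified n-gram precision.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU = 0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # The brevity penalty punishes output that is shorter than the reference.
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


def exact_match_rate(hypotheses, references):
    """Fraction of translations identical to their reference, token for token."""
    if not references:
        return 0.0
    hits = sum(1 for hyp, ref in zip(hypotheses, references) if hyp == ref)
    return hits / len(references)


# Hypothetical system outputs and human reference translations.
system_output = ["open the settings menu".split(), "save file".split()]
human_reference = ["open the settings menu".split(), "save the file".split()]

print("BLEU:", corpus_bleu(system_output, human_reference))
print("Exact match rate:", exact_match_rate(system_output, human_reference))
```

Exact match rate is the stricter of the two measures: it credits a translation only when it needs no human revision at all, whereas BLEU gives partial credit for overlapping n-grams.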

Notes

https://translate.google.com
https://www.bing.com/translator
Note that “domain-specific” in this work refers to the software engineering domain, rather than to the app category.
Although Google Play distinguishes detailed game categories such as cards, racing, and puzzle, we treat them as a single game category.
https://developer.android.com/guide/topics/resources/localization
We do not use cross validation for evaluation as the training process takes a long time on our PC.
https://www.transifex.com/
https://crowdin.com/
https://www.smartling.com/
This limits the scale of our experiment, because large-scale translation through the Google Translate API is a paid service (https://cloud.google.com/translate/v2/pricing).
https://translate.google.com/
http://fanyi.youdao.com/
The detailed checking results can be found at https://sites.google.com/view/domainspecifictranslation/
https://en.wikipedia.org/wiki/World_population
https://en.wikipedia.org/wiki/English-speaking_world

Author information

Authors and Affiliations

College of Engineering & Computer Science, Australian National University, Canberra, Australia
Xu Wang & Zhenchang Xing
Faculty of Information Technology, Monash University, Clayton, VIC, 3800, Australia
Chunyang Chen

Corresponding author

Correspondence to Chunyang Chen.

Additional information

Communicated by: David Lo, Meiyappan Nagappan, Fabio Palomba and Sebastian Panichella

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Wang, X., Chen, C. & Xing, Z. Domain-specific machine translation with recurrent neural network for software localization. Empir Software Eng 24, 3514–3545 (2019). https://doi.org/10.1007/s10664-019-09702-z

Published: 30 April 2019
Issue Date: December 2019
DOI: https://doi.org/10.1007/s10664-019-09702-z
