Abstract
Software localization is the process of adapting a software product to the linguistic, cultural, and technical requirements of a target market. It allows software companies to access foreign markets that would otherwise be difficult to penetrate. Many studies have addressed locating need-to-translate strings in software and adapting the UI layout after the text is translated into the new language. However, no work has been done on the most important and time-consuming step of the software localization process, i.e., the translation of software text. Because software text has some unique characteristics, such as application-specific meanings, context-sensitive translation, and domain-specific rare words, general machine translation tools such as Google Translate cannot properly address the linguistic and technical nuances of translating software text for software localization. In this paper, we propose a neural-network-based translation model specifically designed and trained for mobile application text translation. We collect large-scale human-translated bilingual sentence pairs from different Android applications crawled from the Google Play store. We customize the original RNN encoder-decoder neural machine translation model by adding categorical information to address the domain-specific rare word problem, which is a common phenomenon in software text. We evaluate our approach on translating the text of test Android applications using both BLEU score and exact match rate. The results show that our method outperforms the general machine translation tool, Google Translate, and generates more acceptable translations for software localization with less need for human revision. Our approach is language independent, and we show its generality by translating between English and the other five official languages of the United Nations (UN).
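The two evaluation metrics named above can be illustrated with a minimal Python sketch. It is not the evaluation code used in this work: the function names, the whitespace tokenization, the single-reference setup, and the example strings are all assumptions made for illustration. The BLEU variant shown is the standard unsmoothed corpus-level form with uniform weights over 1- to 4-grams.

```python
import math
from collections import Counter


def exact_match_rate(candidates, references):
    """Fraction of candidates identical to their reference (whitespace-normalized)."""
    if len(candidates) != len(references):
        raise ValueError("candidates and references must have the same length")
    if not candidates:
        return 0.0
    matches = sum(
        " ".join(c.split()) == " ".join(r.split())
        for c, r in zip(candidates, references)
    )
    return matches / len(candidates)


def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(candidates, references, max_n=4):
    """Unsmoothed corpus-level BLEU, one reference per candidate, uniform weights."""
    clipped = [0] * max_n  # candidate n-grams also found in the reference (clipped)
    total = [0] * max_n    # all candidate n-grams
    cand_len = ref_len = 0
    for cand, ref in zip(candidates, references):
        c_tok, r_tok = cand.split(), ref.split()
        cand_len += len(c_tok)
        ref_len += len(r_tok)
        for n in range(1, max_n + 1):
            c_counts = ngram_counts(c_tok, n)
            r_counts = ngram_counts(r_tok, n)
            clipped[n - 1] += sum(min(k, r_counts[g]) for g, k in c_counts.items())
            total[n - 1] += sum(c_counts.values())
    # Without smoothing, any order with no matching n-gram gives a score of zero.
    if cand_len == 0 or min(clipped) == 0:
        return 0.0
    log_precision = sum(math.log(c / t) for c, t in zip(clipped, total)) / max_n
    brevity_penalty = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    return brevity_penalty * math.exp(log_precision)


# Made-up UI strings, for illustration only.
hypotheses = ["Settings", "Log in", "tap to retry the download"]
gold = ["Settings", "Sign in", "tap to retry the download"]
print(exact_match_rate(hypotheses, gold))
print(corpus_bleu(hypotheses, gold))
```

In this unsmoothed form, a test set made up only of strings shorter than four tokens scores zero regardless of quality, which is one reason an exact match rate is a useful companion metric for short UI text.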











Notes
Note that “domain-specific” in this work refers to the domain of software engineering rather than to the app category.
Although Google Play distinguishes detailed game categories such as cards, racing, and puzzle, we treat them as a single game category.
We do not use cross validation for evaluation as the training process takes a long time on our PC.
This limits the scale of our experiment, because using the Google Translate API for large-scale translation is a paid service (https://cloud.google.com/translate/v2/pricing).
The detailed checking results can be found at https://sites.google.com/view/domainspecifictranslation/
Additional information
Communicated by: David Lo, Meiyappan Nagappan, Fabio Palomba and Sebastian Panichella
Cite this article
Wang, X., Chen, C. & Xing, Z. Domain-specific machine translation with recurrent neural network for software localization. Empir Software Eng 24, 3514–3545 (2019). https://doi.org/10.1007/s10664-019-09702-z
DOI: https://doi.org/10.1007/s10664-019-09702-z