Abstract
We present a simple, efficient data augmentation approach that improves Chinese-Vietnamese neural machine translation by exploiting linguistic differences between the two languages. We first define a formal representation of modifier symmetry, one of the most representative linguistic differences between Chinese and Vietnamese. We then propose and evaluate two data augmentation strategies that exploit this difference and integrate naturally with different translation models. Results indicate that both strategies introduce linguistic rules that improve translation accuracy, and experiments on Chinese-Vietnamese benchmarks show significant accuracy improvements. To facilitate further work in this area, we also release an open-source toolkit with a flexible implementation of Chinese-Vietnamese linguistic difference tagging.
Index Terms
- Improving Chinese-Vietnamese Neural Machine Translation with Linguistic Differences