Abstract
In this work, we present a strategy for training a sequence-to-sequence (Seq2Seq) model with the policy gradient method from deep reinforcement learning. The strategy combines the classical cross-entropy loss with the policy gradient objective during the training phase. To evaluate its effectiveness, we compare two Seq2Seq models trained to translate from Arabic to English with two different methods: the first uses cross-entropy alone, and the second combines cross-entropy and policy gradient in varying proportions. Experimental results show that the second training method improves the performance of the Seq2Seq model by 0.71 BLEU points.
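The combined objective described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the helper name `mixed_loss`, the mixing coefficient `lam`, and the toy inputs are all assumptions.

```python
import numpy as np

def mixed_loss(log_probs_ref, log_probs_sampled, reward, lam):
    """Blend cross-entropy with a policy-gradient (REINFORCE) term.

    log_probs_ref: model log-probabilities of the reference tokens
    log_probs_sampled: log-probabilities of tokens sampled from the model
    reward: sentence-level reward for the sampled output (e.g. BLEU)
    lam: mixing weight; lam = 1.0 recovers pure cross-entropy training
    """
    ce = -np.sum(log_probs_ref)                # classical cross-entropy loss
    pg = -reward * np.sum(log_probs_sampled)   # REINFORCE surrogate loss
    return lam * ce + (1.0 - lam) * pg

# Toy example: two reference tokens, two sampled tokens
loss = mixed_loss(np.log([0.5, 0.4]), np.log([0.3, 0.2]), reward=0.7, lam=0.8)
```

Minimizing the `pg` term raises the log-probability of sampled translations in proportion to their reward, which is the gradient direction derived in the appendix.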
A Proof of Equation 11
Here we demonstrate Eq. (11).

We have Eq. (10), the objective to maximize, namely the expected reward of a trajectory:

\[ J(\theta ) = E_{\tau \sim p(\tau |\theta )}\left[ R(\tau ) \right] \]

Given a function \(f(x)\), \(p(x|\theta )\) a parametrized probability distribution, and its expectation \(E_{x \sim p(x|\theta )}\left[ f(x) \right] \), the log-derivative trick gives:

\[ \nabla _{\theta } E_{x \sim p(x|\theta )}\left[ f(x) \right] = E_{x \sim p(x|\theta )}\left[ f(x) \, \nabla _{\theta } \log p(x|\theta ) \right] \qquad (*) \]

Now, we suppose \(x = \tau \), \(f(x)=R(\tau )\), and \(p(x|\theta )=p(\tau |\theta )\):

\[ \nabla _{\theta } J(\theta ) = E_{\tau \sim p(\tau |\theta )}\left[ R(\tau ) \, \nabla _{\theta } \log p(\tau |\theta ) \right] \]

The trajectory \(\tau \) is a sequence of events \(a_t\) and \(s_{t+1}\), respectively sampled from the agent's policy \(\pi _{\theta }(a_{t}|s_{t})\) and the probability of transition \(p(s_{t+1}|s_{t}, \; a_{t})\). The probability of the complete trajectory is the product of the individual probabilities:

\[ p(\tau |\theta ) = p(s_0) \prod _{t=0}^{T-1} \pi _{\theta }(a_{t}|s_{t}) \; p(s_{t+1}|s_{t}, \; a_{t}) \]

Taking the logarithm turns this product into a sum; the initial-state and transition terms do not depend on \(\theta \), so their gradients vanish:

\[ \nabla _{\theta } \log p(\tau |\theta ) = \sum _{t=0}^{T-1} \nabla _{\theta } \log \pi _{\theta }(a_{t}|s_{t}) \qquad (**) \]

By putting (**) in (*), we find Eq. (11):

\[ \nabla _{\theta } J(\theta ) = E_{\tau \sim p(\tau |\theta )}\left[ R(\tau ) \sum _{t=0}^{T-1} \nabla _{\theta } \log \pi _{\theta }(a_{t}|s_{t}) \right] \]
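The log-derivative identity at the heart of this proof can be checked numerically. The sketch below uses assumed toy choices (a Bernoulli "policy" with parameter \(\sigma (\theta )\) and a linear reward \(f\), neither taken from the paper) and compares the Monte Carlo REINFORCE estimate of the gradient with the analytic value:

```python
import numpy as np

rng = np.random.default_rng(0)

theta = 0.3
p = 1.0 / (1.0 + np.exp(-theta))     # Bernoulli parameter: sigmoid of theta

def f(x):
    return 3.0 * x + 1.0             # arbitrary "reward" of the outcome

# Analytic gradient of E[f(x)] = p*f(1) + (1-p)*f(0) w.r.t. theta:
analytic = (f(1.0) - f(0.0)) * p * (1.0 - p)

# REINFORCE estimate: average of f(x) * d/dtheta log p(x|theta)
xs = rng.random(200_000) < p                     # samples x ~ Bernoulli(p)
score = np.where(xs, 1.0 - p, -p)                # score function for x in {1, 0}
estimate = np.mean(np.where(xs, f(1.0), f(0.0)) * score)
```

With 200,000 samples the estimate agrees with the analytic gradient to within Monte Carlo noise, which is exactly the substitution of (**) into (*) carried out in one dimension.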
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

Zouidine, M., Khalil, M., Farouk, A.I.E. (2022). Policy Gradient for Arabic to English Neural Machine Translation. In: Lazaar, M., Duvallet, C., Touhafi, A., Al Achhab, M. (eds) Proceedings of the 5th International Conference on Big Data and Internet of Things. BDIoT 2021. Lecture Notes in Networks and Systems, vol 489. Springer, Cham. https://doi.org/10.1007/978-3-031-07969-6_35