Abstract:
Statistical machine translation (SMT) is an important research branch in natural language processing (NLP). As in many other NLP applications, large-scale training data can potentially yield higher translation accuracy for SMT models. However, traditional single-node SMT model training systems can hardly cope with the fast-growing volume of training corpora in the big data era, which creates an urgent need for efficient large-scale machine translation model training systems. In this paper, we propose Seal, an efficient, scalable, and end-to-end offline SMT model training toolkit built on Apache Spark, a widely used distributed data-parallel platform. Seal parallelizes the training of all three key SMT models: the word alignment model, the translation model, and the N-gram language model. To further improve training performance in Seal, we also propose a number of system optimizations. In word alignment model training, tuning the block size greatly reduces the I/O and communication overhead. In translation model training, carefully encoding the training corpus significantly reduces the amount of data transferred over the network, thus improving the overall training efficiency. We also optimize the maximum likelihood estimation (MLE) algorithm to resolve the data skew issue in the join operation used in both translation model training and language model training. The experimental results show that Seal outperforms the well-known SMT training system Chaski with about a 5× speedup for word alignment model training. For syntactic translation model and language model training, Seal outperforms existing cutting-edge tools with about 9~18× and 8~9× speedups on average, respectively. Overall, Seal outperforms the existing distributed system with a 4~6× speedup and the single-node system with a 9~60× speedup on average, respectively.
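To make the MLE step referenced in the abstract concrete, the sketch below shows one common way to estimate phrase translation probabilities p(f | e) = count(e, f) / count(e) on Spark while sidestepping a skewed shuffle join: the (smaller) per-source marginal counts are collected and broadcast instead of being joined against the joint counts. This is an illustrative assumption about how such a skew issue can be handled in Spark, not Seal's actual implementation; the object name MleSketch and the toy phrase pairs are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object MleSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("mle-translation-prob-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Toy extracted phrase pairs (sourcePhrase, targetPhrase); in a real
    // pipeline these would come from word-aligned parallel corpora.
    val phrasePairs = sc.parallelize(Seq(
      ("la maison", "the house"),
      ("la maison", "the home"),
      ("la maison", "the house"),
      ("le chat",   "the cat")
    ))

    // count(e, f): joint counts over (source, target) phrase pairs.
    val jointCounts = phrasePairs
      .map { case (src, tgt) => ((src, tgt), 1L) }
      .reduceByKey(_ + _)

    // count(e): marginal counts per source phrase. Joining joint counts
    // against these marginals skews badly when a few source phrases
    // dominate the corpus; broadcasting the small marginal table is one
    // way to avoid that shuffle entirely (assumed here for illustration).
    val marginals = phrasePairs
      .map { case (src, _) => (src, 1L) }
      .reduceByKey(_ + _)
      .collectAsMap()
    val marginalsBc = sc.broadcast(marginals)

    // MLE estimate: p(f | e) = count(e, f) / count(e).
    val translationProbs = jointCounts.map { case ((src, tgt), c) =>
      (src, tgt, c.toDouble / marginalsBc.value(src))
    }

    translationProbs.collect().foreach { case (src, tgt, p) =>
      println(f"p($tgt%s | $src%s) = $p%.3f")
    }

    spark.stop()
  }
}
```

The broadcast variant assumes the marginal table fits in executor memory; when it does not, alternatives such as salting the skewed join keys serve the same purpose.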
Date of Conference: 11-13 December 2018
Date Added to IEEE Xplore: 21 February 2019
Print on Demand (PoD) ISSN: 1521-9097