A Hybrid Sentence Splitting Method by Comma Insertion for Machine Translation with CRF

Yang, Shuli; Feng, Chong; Huang, Heyan

doi:10.1007/978-3-319-25816-4_12

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9427)

Abstract

When writing formal articles, many English writers use long sentences with few punctuation marks. Because long sentences are difficult for machine translation systems, many researchers split them at punctuation marks before translation. Sentences with few punctuation marks, however, remain hard to handle. In this paper we use a log-linear model to insert commas at appropriate positions in order to split long sentences, shortening the resulting sub-sentences and benefiting machine translation. Experimental results show that our method segments long sentences reasonably and improves the quality of machine translation.
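
One way to picture the splitting step is as token-level labelling: a model marks each token as either followed by a comma or not, and the sentence is then divided at the inserted commas. The minimal sketch below assumes the labels have already been predicted by some trained sequence labeller; the label names `COMMA` and `O`, the function names, and the example sentence are illustrative assumptions and are not taken from the paper.

```python
from typing import List


def insert_commas(tokens: List[str], labels: List[str]) -> List[str]:
    """Insert a comma token after every token labelled 'COMMA' (hypothetical label set)."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    out: List[str] = []
    for token, label in zip(tokens, labels):
        out.append(token)
        if label == "COMMA":
            out.append(",")
    return out


def split_at_commas(tokens: List[str]) -> List[List[str]]:
    """Split a token list into sub-sentences at comma tokens, dropping empty pieces."""
    segments: List[List[str]] = [[]]
    for token in tokens:
        if token == ",":
            segments.append([])
        else:
            segments[-1].append(token)
    return [segment for segment in segments if segment]


# Illustrative input: the labels here are hand-written, not model output.
tokens = "when the model finishes training we evaluate it on held-out data".split()
labels = ["O", "O", "O", "O", "COMMA", "O", "O", "O", "O", "O", "O"]

with_commas = insert_commas(tokens, labels)
sub_sentences = split_at_commas(with_commas)
```

Each sub-sentence could then be passed to a translation system separately; how the pieces are translated and recombined is outside the scope of this sketch.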

Notes

1. http://www.statmt.org/wmt13/translation-task.html

Acknowledgements

This work was supported by the National Basic Research Program of China (973 Program, Grant No. 2013CB329303) and the National Natural Science Foundation of China (Grant Nos. 61201351 and 61132009).

Author information

Authors and Affiliations

Beijing Engineering Research Center of High Volume Language Information Processing and Cloud Computing Application, Beijing Institute of Technology, Beijing, China
Shuli Yang, Chong Feng & Heyan Huang

Corresponding author

Correspondence to Chong Feng.

Editor information

Editors and Affiliations

Tsinghua University, Beijing, China
Maosong Sun
Tsinghua University, Beijing, China
Zhiyuan Liu
Soochow University, Suzhou, Jiangsu, China
Min Zhang
Tsinghua University, Beijing, China
Yang Liu

Rights and permissions

Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 2.5 International License (http://creativecommons.org/licenses/by-nc/2.5/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

About this paper

Cite this paper

Yang, S., Feng, C., Huang, H. (2015). A Hybrid Sentence Splitting Method by Comma Insertion for Machine Translation with CRF. In: Sun, M., Liu, Z., Zhang, M., Liu, Y. (eds) Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. CCL 2015, NLP-NABD 2015. Lecture Notes in Computer Science, vol 9427. Springer, Cham. https://doi.org/10.1007/978-3-319-25816-4_12

DOI: https://doi.org/10.1007/978-3-319-25816-4_12
Published: 08 November 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25815-7
Online ISBN: 978-3-319-25816-4
eBook Packages: Computer Science, Computer Science (R0)