Unsupervised Joint Monolingual Character Alignment and Word Segmentation

Teng, Zhiyang; Xiong, Hao; Liu, Qun

doi:10.1007/978-3-319-12277-9_1

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8801)

Abstract

We propose a novel Bayesian model for fully unsupervised word segmentation based on monolingual character alignment. Adapted bilingual word alignment models and a Bayesian language model are combined through a product of experts to estimate the joint posterior distribution of a monolingual character alignment and the corresponding segmentation. Our approach improves on conventional hierarchical Pitman-Yor language models by adding richer character-level features. In our experiments, the model achieves an 88.6% word token F-score on the standard Brent version of the Bernstein-Ratner corpus. On standard Chinese segmentation datasets, it outperforms a baseline model by 1.9 to 2.9 F-score points.
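For readers unfamiliar with the metric reported above, the following is a minimal sketch of how a word token F-score is commonly computed for segmentation output. It is not taken from the paper: the function names `word_spans` and `token_f_score` and the toy utterance are illustrative assumptions, and the authors' own evaluation script may differ in detail. A predicted word counts as correct only if both of its boundaries match a gold word.

```python
def word_spans(words):
    """Map a segmented utterance (list of words) to a set of
    (start, end) character offsets, one per word token."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def token_f_score(gold, predicted):
    """Return (precision, recall, F-score) over word tokens.

    `gold` and `predicted` are parallel lists of utterances, each
    utterance being a list of words over the same character string.
    (Hypothetical helper for illustration only.)
    """
    correct = n_gold = n_pred = 0
    for gold_words, pred_words in zip(gold, predicted):
        gold_spans = word_spans(gold_words)
        pred_spans = word_spans(pred_words)
        correct += len(gold_spans & pred_spans)  # both boundaries match
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_gold if n_gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Toy example (not from the paper): one gold utterance, one prediction
# that wrongly merges the last two words.
gold = [["you", "want", "to"]]
predicted = [["you", "wantto"]]
precision, recall, f_score = token_f_score(gold, predicted)
```

In this toy case only "you" is recovered, so precision is 1/2 and recall is 1/3. Comparing character spans rather than word strings keeps repeated words in one utterance from being counted twice.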

Author information

Authors and Affiliations

University of Chinese Academy of Sciences, China
Zhiyang Teng
Institute of Computing Technology, Chinese Academy of Sciences, China
Zhiyang Teng, Hao Xiong & Qun Liu
Torangetek Information Technology (Beijing) Ltd., China
Hao Xiong
Centre for Next Generation Localisation, Faculty of Engineering and Computing, Dublin City University, Ireland
Qun Liu

Editor information

Editors and Affiliations

Department of Computer Science and Technology, Tsinghua University, Haidian District, 100084, Beijing, China
Maosong Sun & Yang Liu
Chinese Academy of Sciences, Institute of Automation, 100190, Beijing, China
Jun Zhao

About this paper

Cite this paper

Teng, Z., Xiong, H., Liu, Q. (2014). Unsupervised Joint Monolingual Character Alignment and Word Segmentation. In: Sun, M., Liu, Y., Zhao, J. (eds) Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. NLP-NABD CCL 2014. Lecture Notes in Computer Science, vol 8801. Springer, Cham. https://doi.org/10.1007/978-3-319-12277-9_1

DOI: https://doi.org/10.1007/978-3-319-12277-9_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-12276-2
Online ISBN: 978-3-319-12277-9
eBook Packages: Computer Science, Computer Science (R0)
