Abstract

We propose a novel Bayesian model for fully unsupervised word segmentation based on monolingual character alignment. Adapted bilingual word alignment models and a Bayesian language model are combined through a product of experts to estimate the joint posterior distribution of a monolingual character alignment and the corresponding segmentation. Our approach enhances conventional hierarchical Pitman-Yor language models with richer character-level features. In our experiments, the model achieves an 88.6% word token F-score on the standard Brent version of the Bernstein-Ratner corpus. Moreover, on standard Chinese segmentation datasets, our method outperforms a baseline model by 1.9 to 2.9 F-score points.
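For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how a word token F-score is commonly computed for segmentation output. It is not taken from the paper. The function names `word_spans` and `token_f_score` are hypothetical, and the sketch assumes the usual convention that a predicted word counts as correct only when both of its boundaries match the gold segmentation.

```python
def word_spans(words):
    """Map a segmented sentence to the set of (start, end) character spans of its words."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def token_f_score(gold_sentences, pred_sentences):
    """Return (precision, recall, F-score) over word tokens.

    Assumption: each sentence is a list of words, and the gold and predicted
    versions of a sentence concatenate to the same character string.
    """
    correct = n_gold = n_pred = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        if "".join(gold) != "".join(pred):
            raise ValueError("gold and predicted sentences differ in characters")
        gold_spans = word_spans(gold)
        pred_spans = word_spans(pred)
        # A token is correct only if both of its boundaries match the gold standard.
        correct += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_gold if n_gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Toy example (invented for illustration, not data from the paper).
gold = [["you", "want", "to"]]
pred = [["you", "wantto"]]
precision, recall, f_score = token_f_score(gold, pred)
```

In the toy example only "you" is segmented correctly, so precision is 1/2, recall is 1/3, and the F-score is 0.4.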


Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Teng, Z., Xiong, H., Liu, Q. (2014). Unsupervised Joint Monolingual Character Alignment and Word Segmentation. In: Sun, M., Liu, Y., Zhao, J. (eds) Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. NLP-NABD 2014, CCL 2014. Lecture Notes in Computer Science, vol 8801. Springer, Cham. https://doi.org/10.1007/978-3-319-12277-9_1

  • DOI: https://doi.org/10.1007/978-3-319-12277-9_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-12276-2

  • Online ISBN: 978-3-319-12277-9

  • eBook Packages: Computer Science, Computer Science (R0)
