Unsupervised Learning of Word Segmentation: Does Tone Matter?

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2018)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13396)

Abstract

In this paper, we investigate the usefulness of tonal features for unsupervised word discovery, taking Mboshi, a low-resource tonal language from the Bantu family, as our main target language. In a preliminary step, we show that tone annotation improves the performance of supervised learning when using a simplified representation of the data. To leverage this information in an unsupervised setting, we then present a probabilistic model based on a hierarchical Pitman-Yor process that incorporates tonal representations in its backoff structure. We compare our model with a tone-agnostic baseline and analyze if and how tone helps unsupervised segmentation on our small dataset.
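
To make the backoff idea concrete, here is a minimal sketch of the predictive probability at a single level of a Pitman-Yor hierarchy. It is not the model from the paper: the class name, the one-table-per-word-type simplification, and the uniform base distribution are all assumptions made for illustration, and no tonal information is modeled here.

```python
from collections import Counter


class PitmanYorRestaurant:
    """One level of a Pitman-Yor hierarchy (illustrative, hypothetical class).

    Simplifying assumption: every observed word type occupies exactly one
    table, so the number of tables equals the number of distinct types.
    """

    def __init__(self, discount, strength, base_prob):
        self.d = discount            # discount parameter, 0 <= d < 1
        self.theta = strength        # strength parameter, theta > -d
        self.base_prob = base_prob   # backoff distribution: word -> probability
        self.counts = Counter()

    def add(self, word):
        self.counts[word] += 1

    def prob(self, word):
        n = sum(self.counts.values())   # total number of observed tokens
        t = len(self.counts)            # number of tables (one per type)
        if n == 0:
            return self.base_prob(word)
        # Discounted count for seen words, zero for unseen words.
        seen = self.counts[word] - self.d if word in self.counts else 0.0
        # Probability mass redirected to the backoff distribution.
        backoff_mass = (self.theta + self.d * t) / (self.theta + n)
        return seen / (self.theta + n) + backoff_mass * self.base_prob(word)


# Usage: a toy four-word vocabulary with a uniform base distribution.
restaurant = PitmanYorRestaurant(discount=0.5, strength=1.0,
                                 base_prob=lambda w: 0.25)
for w in ["a", "a", "b"]:
    restaurant.add(w)
```

In a hierarchical model, `base_prob` would itself be the `prob` method of a coarser restaurant, and the abstract indicates that this backoff chain is where the paper brings in tonal representations.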

Notes

  1. The distinction between high and low tones is phonological (see [14]).

  2. We use scikit-learn’s implementation (http://scikit-learn.org/stable/modules/tree.html).

  3. With padding at the beginning and end of the sentence.

  4. We exclude word boundaries corresponding to the beginning and end of the sentence.

  5. Similar results are obtained for the larger corpora.

  6. Following [13], we use a forward filtering-backward sampling (FFBS) algorithm to sample segmentations. As this method only approximates the posterior distribution, we also perform a Metropolis-Hastings correction step.

  7. where we add initial padding symbols as needed.

  8. http://homepages.inf.ed.ac.uk/sgwater/resources.html.
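
The forward filtering-backward sampling procedure mentioned in note 6 can be sketched as follows. This is a simplified unigram version under stated assumptions, not the implementation used in the paper or in [13]: the function name, the `word_prob` callback, and the `max_len` cap are hypothetical, and the Metropolis-Hastings correction step is omitted.

```python
import random


def sample_segmentation(chars, word_prob, max_len=4, rng=random):
    """Sample one segmentation of `chars` under a unigram word model.

    `word_prob` maps a candidate word to a strictly positive probability.
    """
    n = len(chars)

    # Forward filtering: alpha[t] is the total probability of all
    # segmentations of the prefix chars[:t].
    alpha = [0.0] * (n + 1)
    alpha[0] = 1.0
    for t in range(1, n + 1):
        for k in range(1, min(max_len, t) + 1):
            alpha[t] += word_prob(chars[t - k:t]) * alpha[t - k]

    # Backward sampling: starting from the end, draw the length of the
    # last word in proportion to its contribution to alpha[t].
    words = []
    t = n
    while t > 0:
        lengths = list(range(1, min(max_len, t) + 1))
        weights = [word_prob(chars[t - k:t]) * alpha[t - k] for k in lengths]
        k = rng.choices(lengths, weights=weights)[0]
        words.append(chars[t - k:t])
        t -= k
    words.reverse()
    return words


# Usage: a toy lexicon with a small floor probability for unknown strings.
lexicon = {"ab": 0.5, "c": 0.5}
toy_prob = lambda w: lexicon.get(w, 1e-9)
segmentation = sample_segmentation("abc", toy_prob, rng=random.Random(0))
```

Passing a seeded `random.Random` instance makes the draw reproducible. Replacing the unigram `word_prob` with a context-dependent model would require extending the forward table to track the previous word.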

References

  1. Adda, G., et al.: Breaking the unwritten language barrier: the Bulb project. In: Proceedings of SLTU (Spoken Language Technologies for Under-Resourced Languages). Yogyakarta, Indonesia (2016)

  2. Austin, P.K., Sallabank, J. (eds.): The Cambridge Handbook of Endangered Languages. Cambridge University Press, Cambridge (2011)

  3. Beapami, R.P., Chatfield, R., Kouarata, G., Embengue-Waldschmidt, A.: Dictionnaire Mbochi-Français. SIL-Congo Publishers, Congo (Brazzaville) (2000)

  4. Bird, S., Hanke, F.R., Adams, O., Lee, H.: Aikuma: a mobile app for collaborative language documentation. ACL 2014, 1 (2014)

  5. Blachon, D., Gauthier, E., Besacier, L., Kouarata, G.N., Adda-Decker, M., Rialland, A.: Parallel speech collection for under-resourced language studies using the LIG-Aikuma mobile device app. In: Proceedings of SLTU (Spoken Language Technologies for Under-Resourced Languages). Yogyakarta, Indonesia, May 2016

  6. Bouquiaux, L., Thomas, J.M.C. (eds.): Enquête et description des langues à tradition orale. SELAF, Paris (1976)

  7. Godard, P., et al.: Preliminary experiments on unsupervised word discovery in Mboshi. In: Proceedings of the Interspeech (2016)

  8. Goldwater, S., Griffiths, T.L., Johnson, M.: Contextual dependencies in unsupervised word segmentation. In: Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Sydney, Australia, pp. 673–680. Association for Computational Linguistics, July 2006. http://www.aclweb.org/anthology/P06-1085

  9. Goldwater, S., Griffiths, T.L., Johnson, M.: Interpolating between types and tokens by estimating power-law generators. In: Advances in Neural Information Processing Systems 18, pp. 459–466. MIT Press, Cambridge (2006)

  10. Goldwater, S., Griffiths, T.L., Johnson, M.: A Bayesian framework for word segmentation: exploring the effects of context. Cognition 112(1), 21–54 (2009)

  11. Johnson, M., Demuth, K.: Unsupervised phonemic Chinese word segmentation using adaptor grammars. In: 23rd International Conference on Computational Linguistics (COLING) (2010)

  12. Ludusan, B., Synnaeve, G., Dupoux, E.: Prosodic boundary information helps unsupervised word segmentation. In: Annual Conference of the North American Chapter of the ACL, pp. 953–963. Denver, Colorado, USA (2015)

  13. Mochihashi, D., Yamada, T., Ueda, N.: Bayesian unsupervised word segmentation with nested Pitman-Yor language modeling. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1, pp. 100–108. Association for Computational Linguistics (2009)

  14. Rialland, A., Aborobongui, M.E.: How intonations interact with tones in Embosi (Bantu C25), a two-tone language without downdrift. In: Intonation in African Tone Languages, vol. 24. De Gruyter, Berlin, Boston (2016)

  15. Stücker, S., et al.: Innovative technologies for under-resourced language documentation: the Bulb project. In: Proceedings of CCURL (Collaboration and Computing for Under-Resourced Languages: toward an Alliance for Digital Language Diversity). Portorož, Slovenia (2016)

  16. Teh, Y.W., Jordan, M.I., Beal, M.J., Blei, D.M.: Hierarchical Dirichlet processes. J. Am. Stat. Assoc. 101(476), 1566–1581 (2006)

Acknowledgments

This work was partly funded by the French ANR and the German DFG under grant ANR-14-CE35-0002. We warmly thank Martine Adda-Decker and Annie Rialland (from LPP-CNRS) for the sketch on the Mboshi language, as well as Gilles Adda (from LIMSI-CNRS), for many meaningful conversations and contributions.

Author information

Corresponding author

Correspondence to François Yvon.

Copyright information

© 2023 Springer Nature Switzerland AG

About this paper

Cite this paper

Godard, P., Löser, K., Allauzen, A., Besacier, L., Yvon, F. (2023). Unsupervised Learning of Word Segmentation: Does Tone Matter?. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2018. Lecture Notes in Computer Science, vol 13396. Springer, Cham. https://doi.org/10.1007/978-3-031-23793-5_28

  • DOI: https://doi.org/10.1007/978-3-031-23793-5_28

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-23792-8

  • Online ISBN: 978-3-031-23793-5

  • eBook Packages: Computer Science, Computer Science (R0)