Unsupervised Learning of Word Segmentation: Does Tone Matter?

Godard, Pierre; Löser, Kevin; Allauzen, Alexandre; Besacier, Laurent; Yvon, François

doi:10.1007/978-3-031-23793-5_28


Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13396)

Included in the following conference series:

International Conference on Computational Linguistics and Intelligent Text Processing


Abstract

In this paper, we investigate the usefulness of tonal features for unsupervised word discovery, taking Mboshi, a low-resource tonal language from the Bantu family, as our main target language. In a preliminary step, we show that tone annotation improves the performance of supervised learning when using a simplified representation of the data. To leverage this information in an unsupervised setting, we then present a probabilistic model based on a hierarchical Pitman-Yor process that incorporates tonal representations in its backoff structure. We compare our model with a tone-agnostic baseline and analyze if and how tone helps unsupervised segmentation on our small dataset.
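As a rough illustration of the backoff idea mentioned in the abstract, the following minimal sketch computes the predictive probability of a word under a single Pitman-Yor restaurant in the Chinese restaurant representation. It is not the model of the paper: the function name, its parameters, and the toy counts are assumptions made for the example, and `base_prob` simply stands for whatever lower-level distribution the hierarchy backs off to (according to the abstract, the paper's backoff structure involves tonal representations, whose details are not given here).

```python
def pyp_predictive(count_w, tables_w, total_count, total_tables,
                   discount, strength, base_prob):
    """Predictive probability of a word in one Pitman-Yor restaurant.

    count_w / tables_w: customers and tables currently serving the word.
    total_count / total_tables: customers and tables in the restaurant.
    base_prob: probability of the word under the backoff distribution.
    All names are hypothetical and chosen for this sketch only.
    """
    if total_count == 0:
        # Empty restaurant: the prediction is the backoff distribution.
        return base_prob
    # Mass from tables already serving the word, after discounting.
    seated = max(count_w - discount * tables_w, 0.0)
    # Mass for opening a new table, weighted by the backoff probability.
    new_table = (strength + discount * total_tables) * base_prob
    return (seated + new_table) / (strength + total_count)


# Toy restaurant: "a" seen 3 times at 2 tables, "b" once at 1 table,
# with a uniform backoff over a four-symbol vocabulary (an assumption).
seating = {"a": (3, 2), "b": (1, 1), "c": (0, 0), "d": (0, 0)}
probs = {
    w: pyp_predictive(c, t, total_count=4, total_tables=3,
                      discount=0.5, strength=1.0, base_prob=0.25)
    for w, (c, t) in seating.items()
}
```

Unseen words ("c" and "d" in the toy example) receive probability only through the backoff term, which is where a more informative lower-level distribution can help.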


Notes

1. The distinction between high and low tones is phonological (see [14]).
2. We use scikit-learn’s implementation (http://scikit-learn.org/stable/modules/tree.html).
3. With padding at the beginning and end of the sentence.
4. We exclude word boundaries corresponding to the beginning and end of the sentence.
5. Similar results are obtained for the larger corpora.
6. Following [13], we use a forward filtering-backward sampling (FFBS) algorithm to sample segmentations. As this method only approximates the posterior distribution, we also perform a Metropolis-Hastings correction step.
7. Initial padding symbols are added as needed.
8. http://homepages.inf.ed.ac.uk/sgwater/resources.html
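The forward filtering-backward sampling procedure mentioned in note 6 can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the authors' implementation: it uses a unigram word model, the names `sample_segmentation`, `word_prob` and `max_len` are hypothetical, and the Metropolis-Hastings correction step is omitted. It also assumes that `word_prob` assigns non-zero probability to at least one segmentation of the input.

```python
import random


def sample_segmentation(chars, word_prob, max_len=4, rng=random):
    """Sample one segmentation of `chars` under a unigram word model."""
    n = len(chars)
    # Forward filtering: alpha[t] is the total probability of all
    # segmentations of the prefix chars[:t].
    alpha = [0.0] * (n + 1)
    alpha[0] = 1.0
    for t in range(1, n + 1):
        for k in range(1, min(t, max_len) + 1):
            alpha[t] += word_prob(chars[t - k:t]) * alpha[t - k]

    # Backward sampling: draw the length of the last word in proportion
    # to its contribution to alpha[t], then continue on the prefix.
    words = []
    t = n
    while t > 0:
        lengths = list(range(1, min(t, max_len) + 1))
        weights = [word_prob(chars[t - k:t]) * alpha[t - k] for k in lengths]
        k = rng.choices(lengths, weights=weights)[0]
        words.append(chars[t - k:t])
        t -= k
    words.reverse()
    return words


# Toy unigram model (an assumption): only "ab" and "c" have non-zero
# probability, so "ab | c" is the only possible segmentation of "abc".
toy_probs = {"ab": 0.5, "c": 0.5}
segmentation = sample_segmentation("abc", lambda w: toy_probs.get(w, 0.0))
```

In a Gibbs-style sampler, a routine of this kind would be called once per sentence, with `word_prob` derived from the current state of the language model.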

References

1. Adda, G., et al.: Breaking the unwritten language barrier: the Bulb project. In: Proceedings of SLTU (Spoken Language Technologies for Under-Resourced Languages). Yogyakarta, Indonesia (2016)
2. Austin, P.K., Sallabank, J. (eds.): The Cambridge Handbook of Endangered Languages. Cambridge University Press, Cambridge (2011)
3. Beapami, R.P., Chatfield, R., Kouarata, G., Embengue-Waldschmidt, A.: Dictionnaire Mbochi-Français. SIL-Congo Publishers, Congo (Brazzaville) (2000)
4. Bird, S., Hanke, F.R., Adams, O., Lee, H.: Aikuma: a mobile app for collaborative language documentation. ACL 2014, 1 (2014)
5. Blachon, D., Gauthier, E., Besacier, L., Kouarata, G.N., Adda-Decker, M., Rialland, A.: Parallel speech collection for under-resourced language studies using the LIG-Aikuma mobile device app. In: Proceedings of SLTU (Spoken Language Technologies for Under-Resourced Languages). Yogyakarta, Indonesia, May 2016
6. Bouquiaux, L., Thomas, J.M.C. (eds.): Enquête et description des langues à tradition orale. SELAF, Paris (1976)
7. Godard, P., et al.: Preliminary experiments on unsupervised word discovery in Mboshi. In: Proceedings of the Interspeech (2016)
8. Goldwater, S., Griffiths, T.L., Johnson, M.: Contextual dependencies in unsupervised word segmentation. In: Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Sydney, Australia, pp. 673–680. Association for Computational Linguistics, July 2006. http://www.aclweb.org/anthology/P06-1085
9. Goldwater, S., Griffiths, T.L., Johnson, M.: Interpolating between types and tokens by estimating power-law generators. In: Advances in Neural Information Processing Systems 18, pp. 459–466. MIT Press, Cambridge (2006)
10. Goldwater, S., Griffiths, T.L., Johnson, M.: A Bayesian framework for word segmentation: exploring the effects of context. Cognition 112(1), 21–54 (2009)
11. Johnson, M., Demuth, K.: Unsupervised phonemic Chinese word segmentation using adaptor grammars. In: 23rd International Conference on Computational Linguistics (COLING) (2010)
12. Ludusan, B., Synnaeve, G., Dupoux, E.: Prosodic boundary information helps unsupervised word segmentation. In: Annual Conference of the North American Chapter of the ACL, pp. 953–963. Denver, Colorado, USA (2015)
13. Mochihashi, D., Yamada, T., Ueda, N.: Bayesian unsupervised word segmentation with nested Pitman-Yor language modeling. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1, pp. 100–108. Association for Computational Linguistics (2009)
14. Rialland, A., Aborobongui, M.E.: How intonations interact with tones in Embosi (Bantu C25), a two-tone language without downdrift. In: Intonation in African Tone Languages, vol. 24. De Gruyter, Berlin, Boston (2016)
15. Stücker, S., et al.: Innovative technologies for under-resourced language documentation: the Bulb project. In: Proceedings of CCURL (Collaboration and Computing for Under-Resourced Languages: toward an Alliance for Digital Language Diversity). Portorož, Slovenia (2016)
16. Teh, Y.W., Jordan, M.I., Beal, M.J., Blei, D.M.: Hierarchical Dirichlet processes. J. Am. Stat. Assoc. 101(476), 1566–1581 (2006)


Acknowledgments

This work was partly funded by the French ANR and the German DFG under grant ANR-14-CE35-0002. We warmly thank Martine Adda-Decker and Annie Rialland (from LPP-CNRS) for the sketch on the Mboshi language, as well as Gilles Adda (from LIMSI-CNRS), for many meaningful conversations and contributions.

Author information

Authors and Affiliations

LIMSI, CNRS, Université Paris-Saclay, Orsay, France
Pierre Godard, Kevin Löser, Alexandre Allauzen & François Yvon
Laboratoire d’Informatique de Grenoble (LIG), Université Grenoble Alpes, Grenoble, France
Laurent Besacier


Corresponding author

Correspondence to François Yvon.

Editor information

Editors and Affiliations

Instituto Politécnico Nacional, Mexico City, Mexico
Alexander Gelbukh


Cite this paper

Godard, P., Löser, K., Allauzen, A., Besacier, L., Yvon, F. (2023). Unsupervised Learning of Word Segmentation: Does Tone Matter?. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2018. Lecture Notes in Computer Science, vol 13396. Springer, Cham. https://doi.org/10.1007/978-3-031-23793-5_28


DOI: https://doi.org/10.1007/978-3-031-23793-5_28
Published: 26 February 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-23792-8
Online ISBN: 978-3-031-23793-5
eBook Packages: Computer Science, Computer Science (R0)
