ISCA Archive Interspeech 2021

Unsupervised Neural-Based Graph Clustering for Variable-Length Speech Representation Discovery of Zero-Resource Languages

Shun Takahashi, Sakriani Sakti, Satoshi Nakamura

Discovering symbolic units from unannotated speech data is fundamental in zero-resource speech technology. Previous studies focused on learning fixed-length frame units based on acoustic features. Although they achieve high quality, they also suffer from a high bit rate due to time-frame encoding. In this work, to discover a variable-length, low-bit-rate speech representation from a limited amount of unannotated speech data, we propose an approach based on graph neural networks (GNNs), and we study the temporal closeness of salient speech features. Our approach is built upon vector-quantized neural networks (VQNNs), which learn a discrete encoding by contrastive predictive coding (CPC). We exploit the predetermined finite set of embeddings (a codebook) used by VQNNs to encode input data. We treat the codebook as the set of nodes in a directed graph, where each arc represents the transition from one feature to another. Subsequently, we extract and encode the topological features of the nodes in the graph and cluster them using graph convolution. Through this process, we obtain a coarsened speech representation. We evaluated our model on the English dataset of the Track 2019 task of the ZeroSpeech 2020 challenge. Our model successfully reduces the bit rate while achieving high unit quality.
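To make the graph construction in the abstract concrete, the following is a minimal, hypothetical sketch: the nodes are the VQ codebook entries, and directed arcs count the transitions observed in a quantized code sequence. As a stand-in for the paper's graph-convolution clustering (not the authors' actual method), frequently co-occurring codes are merged with a union-find pass, and consecutive repeats are collapsed into variable-length units; all names and thresholds here are illustrative assumptions.

```python
# Illustrative sketch (not the paper's implementation): build the directed
# transition graph over a VQ codebook and coarsen the code sequence.
from collections import Counter

def transition_graph(codes):
    """Count directed arcs between consecutive codebook indices."""
    return Counter(zip(codes, codes[1:]))

def coarsen(codes, codebook_size, min_count=2):
    """Merge codes linked by frequent transitions (union-find) and
    collapse consecutive repeats into variable-length units."""
    arcs = transition_graph(codes)
    parent = list(range(codebook_size))

    def find(x):
        # path-halving union-find lookup
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (u, v), count in arcs.items():
        if count >= min_count and u != v:
            parent[find(u)] = find(v)

    merged = [find(c) for c in codes]
    # collapsing repeats yields fewer, longer units -> a lower bit rate
    out = [merged[0]]
    for c in merged[1:]:
        if c != out[-1]:
            out.append(c)
    return out
```

For example, the sequence `[0, 1, 0, 1, 2, 3, 2, 3]` contains the frequent arcs 0→1 and 2→3, so codes {0, 1} and {2, 3} merge and the eight frames collapse to two variable-length units. The real model instead clusters nodes by their encoded topological features via graph convolution, but the input/output contract is the same: frame-level codes in, a shorter symbol sequence out.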


doi: 10.21437/Interspeech.2021-1340

Cite as: Takahashi, S., Sakti, S., Nakamura, S. (2021) Unsupervised Neural-Based Graph Clustering for Variable-Length Speech Representation Discovery of Zero-Resource Languages. Proc. Interspeech 2021, 1559-1563, doi: 10.21437/Interspeech.2021-1340

@inproceedings{takahashi21_interspeech,
  author={Shun Takahashi and Sakriani Sakti and Satoshi Nakamura},
  title={{Unsupervised Neural-Based Graph Clustering for Variable-Length Speech Representation Discovery of Zero-Resource Languages}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={1559--1563},
  doi={10.21437/Interspeech.2021-1340}
}