Abstract
This paper proposes a new method that detects repeated keyword/phrase patterns in speech utterances by performing pattern discovery at the phoneme level. In earlier work, we developed a pattern discovery method based on frame-level features; although its performance is decent, it produces a considerable number of false positives. This paper aims to extract the desired keywords by examining the match between speech utterances at the phoneme level instead of the frame level, thereby reducing false positives and improving accuracy. In this work, we first segment the speech utterances into phoneme-like regions using an affinity matrix. Then, the matched phoneme regions present in a pair of speech utterances are identified. A new 3-neighbor depth-first search traversal technique is proposed to discover sequences of phoneme matches. Finally, the distance scores in each sequence of phoneme matches are validated to identify the desired keyword patterns. The performance of the proposed method is evaluated on Hindi and Bengali news databases and compared with state-of-the-art techniques. Based on the detected keyword patterns, the speech utterances are divided into groups using a standard clustering algorithm. The derived clusters represent broader domain-specific groups that are useful for efficient speech retrieval tasks.
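The 3-neighbor depth-first search traversal mentioned above can be illustrated with a minimal sketch. It assumes a binary phoneme-match matrix between two utterances and follows the three forward neighbors (down, right, diagonal) from each matched cell to chain matches into candidate keyword sequences. All function and parameter names (`discover_match_sequences`, `min_len`) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def discover_match_sequences(match, min_len=3):
    """Sketch of a 3-neighbor depth-first traversal over a binary
    phoneme-match matrix.  match[i, j] == True means phoneme region i of
    utterance A matched phoneme region j of utterance B.  From each
    matched cell we follow the three forward neighbors (i+1, j),
    (i, j+1), (i+1, j+1) and keep the longest chain; chains of at least
    min_len matched cells are returned as candidate keyword patterns.
    Names and the min_len threshold are illustrative, not the paper's.
    """
    n, m = match.shape
    memo = {}
    moves = ((1, 0), (0, 1), (1, 1))  # the three forward neighbors

    def longest_chain(i, j):
        # Moves are strictly forward, so the recursion is acyclic.
        if (i, j) in memo:
            return memo[(i, j)]
        best = [(i, j)]
        for di, dj in moves:
            ni, nj = i + di, j + dj
            if ni < n and nj < m and match[ni, nj]:
                cand = [(i, j)] + longest_chain(ni, nj)
                if len(cand) > len(best):
                    best = cand
        memo[(i, j)] = best
        return best

    # Start only from matched cells with no matched 3-neighbor predecessor,
    # so each chain is reported once from its first cell.
    starts = [(i, j) for i in range(n) for j in range(m)
              if match[i, j] and not any(
                  i - di >= 0 and j - dj >= 0 and match[i - di, j - dj]
                  for di, dj in moves)]
    return [c for c in (longest_chain(i, j) for i, j in starts)
            if len(c) >= min_len]
```

In the full method, the surviving chains would then be validated against their distance scores before being accepted as keyword patterns; that validation step is omitted here.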
Data Availability
The datasets used in the current study are available at the IIT Kharagpur speech group repository, http://cse.iitkgp.ac.in/~ksrao/res.html
Acknowledgements
The authors would like to thank Annu Debnath and Sutapa Bhattacharya (speakers) for their support in the creation of Hindi and Bengali speech corpora.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Ravi, K.K., Krothapalli, S.R. Phoneme Segmentation-Based Unsupervised Pattern Discovery and Clustering of Speech Signals. Circuits Syst Signal Process 41, 2088–2117 (2022). https://doi.org/10.1007/s00034-021-01876-6