Clustering of triphones using phoneme similarity estimation for the definition of a multilingual set of triphones

doi:10.1016/S0167-6393(02)00048-1

Speech Communication

Volume 39, Issues 3–4, February 2003, Pages 353-366

https://doi.org/10.1016/S0167-6393(02)00048-1

Abstract

This paper addresses the problem of multilingual acoustic modelling for the design of multilingual speech recognisers. An agglomerative clustering algorithm for the definition of a multilingual set of triphones is proposed. The clustering algorithm is based on an indirect distance measure for triphones, defined as a weighted sum of explicit estimates of the context similarity at the monophone level. The monophone similarity estimation method is based on the algorithm of Houtgast. The new clustering algorithm was tested in a multilingual speech recognition experiment for three languages. The algorithm was applied to the monolingual triphone sets of the language-specific recognisers for all languages. To evaluate the clustering algorithm, the performance of the multilingual set of triphones was compared with that of a reference system composed of all three language-specific recognisers operating in parallel, and with that of the multilingual set of triphones produced by the tree-based clustering algorithm. All experiments were based on the 1000 FDB SpeechDat(II) databases (Slovenian, Spanish and German). The experiments have shown that the use of the clustering algorithm results in a significant reduction in the number of triphones with only minor degradation of the recognition rate.

Summary (German abstract, translated)

This contribution deals with the problem of multilingual acoustic modelling for automatic speech recognition. The use of an agglomerative clustering algorithm for defining a set of multilingual context-dependent phonetic units (triphones) is introduced. The algorithm is based on an indirect distance measure for triphones, defined as a weighted sum of the estimated similarities of the monophones belonging to the triphones. The similarity estimation of the monophones is based on the algorithm of Houtgast. The new clustering algorithm was applied to multilingual speech recognition experiments for three different languages. For this purpose, language-specific recognition systems with monolingual triphones were used for all three languages. To evaluate the clustering algorithm, the performance of the system based on the multilingual triphones was compared with two reference systems. In the first reference system the language-specific models were used simultaneously (in parallel), whereas the second reference system used multilingual models created with a decision-tree-based clustering algorithm. All experiments are based on the 1000 FDB SpeechDat(II) databases (Slovenian, Spanish and German). The investigations have shown that the proposed agglomerative clustering algorithm leads to a clear reduction in the number of triphone parameters with only a slight decrease in the recognition rate.

Introduction

During the last few years, the development of speech technology has raised interest in research on multilingual speech recognition. The goal of research in this area is twofold. On the one hand, a multilingual speech recognition system should extend the functionality of language-specific recognisers to a number of languages without degrading recognition accuracy or considerably increasing computational complexity. On the other hand, it should serve as a tool for porting speech technology from one or more languages to another language with little or no spoken language resources. The work presented in this paper focuses on the first goal: the development of methods for defining multilingual phonetic inventories that reduce the complexity of multilingual speech recognisers.

The definition of multilingual phonetic inventories by exploiting similarities amongst the sounds of different languages is a promising approach. One of the first attempts was reported in (Andersen et al., 1993), where a multilingual phonetic inventory, consisting of language-dependent and language-independent speech units, was defined using a data-driven clustering technique. Other attempts based on different distance measures and clustering techniques followed (Koehler, 1996; Berkling, 1996; Bonaventura et al., 1997; Weng et al., 1997). However, most of the work so far has focused on context-independent (monophone) phoneme modelling. These experiments show that the transition from a language-dependent monophone set to a multilingual inventory of monophones may degrade recognition accuracy because the multilingual monophone set lacks acoustic resolution.

A transition from context-independent to context-dependent phoneme modelling seems inevitable if the performance of multilingual speech recognition systems is to improve. Context-dependent modelling has already been shown to enhance the performance of monolingual recognisers (Young, 1996; Bourlard, 1995). A similar improvement can be expected in the multilingual case. The crucial problem, however, is how to define the multilingual set of context-dependent phoneme models, that is, how to define the clustering procedure.

Even though the introduction of context-dependent phoneme modelling is a self-evident step in the evolution of multilingual speech recognition research, only a few studies have been reported so far. Most of them are by Schultz and Waibel (1998, 1999), who applied a decision tree-based clustering procedure to generate a multilingual set of context-dependent phoneme models. The purpose of the work reported in this paper is to complement the previous work on “top-down” clustering procedures by proposing a “bottom-up” approach. A new technique for defining the multilingual set of context-dependent phoneme models, based on agglomerative clustering, is introduced. The clustering algorithm has been implemented for triphones and is based on an indirect similarity estimate for two triphones, defined as the weighted sum of explicit similarity estimates for the phonemes of the left context, the right context and the centre position.
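The idea of an indirect triphone distance followed by bottom-up merging can be illustrated with a minimal Python sketch. It is not the authors' implementation: the similarity table `PHONE_SIM`, the position weights, the conversion of similarity to distance as one minus the weighted sum, the complete-linkage rule and the stopping threshold are all assumptions chosen for illustration, and the Houtgast-based similarity estimation is not reproduced here.

```python
from itertools import combinations

# Hypothetical monophone similarities in [0, 1]. These values are invented
# for illustration and are NOT taken from the paper.
PHONE_SIM = {
    frozenset(("a", "a:")): 0.8,
    frozenset(("s", "z")): 0.6,
    frozenset(("t", "d")): 0.7,
}


def phone_similarity(p, q):
    """Return 1.0 for identical monophones, else a table lookup (default 0.0)."""
    if p == q:
        return 1.0
    return PHONE_SIM.get(frozenset((p, q)), 0.0)


def triphone_distance(t1, t2, weights=(0.25, 0.5, 0.25)):
    """Indirect distance between two (left, centre, right) triphones.

    Assumed form: 1 minus the weighted sum of the monophone similarities
    at the left, centre and right positions. The weights are assumptions.
    """
    sim = sum(w * phone_similarity(p, q) for w, p, q in zip(weights, t1, t2))
    return 1.0 - sim


def cluster_distance(c1, c2):
    """Complete linkage (an assumption): largest pairwise member distance."""
    return max(triphone_distance(a, b) for a in c1 for b in c2)


def agglomerate(triphones, threshold):
    """Merge the two closest clusters until the closest pair exceeds threshold."""
    clusters = [[t] for t in triphones]
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        if cluster_distance(clusters[i], clusters[j]) > threshold:
            break
        # i < j is guaranteed by combinations, so deleting j keeps i valid.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


if __name__ == "__main__":
    pool = [("s", "a", "t"), ("z", "a", "t"), ("s", "a:", "d"), ("k", "i", "p")]
    for cluster in agglomerate(pool, threshold=0.3):
        print(cluster)
```

In this toy pool the three triphones with similar contexts and centres end up sharing one cluster, while the unrelated triphone stays on its own. Raising the threshold merges more aggressively and yields fewer models, which mirrors the trade-off between model count and acoustic resolution discussed in the text.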

This paper is organized as follows. The formulation of the triphone distance measure and the clustering algorithm, as well as the description of the reference tree-based clustering algorithm, are given in Section 2. The experimental multilingual recognition systems and the speech databases used for the experiments are described in Sections 3 and 4, respectively. Experimental results are presented in Section 5, while the discussion and the conclusion are given in Sections 6 and 7.

Section snippets

Definition of multilingual triphone set

A multilingual set of triphone units and corresponding acoustic models should be defined in a way to preserve a high acoustic resolution that is specific for triphone modelling, without an excessive increase in the number of triphone units. This means that identical or similar triphone units should be identified across different languages to be represented by the same acoustic model. Clusters of similar triphones can be defined either by the agglomerative clustering procedure, where similar

Baseline recogniser

The speech recognition system used for the experiments was developed in the framework of the “SpeechDat task force” within the COST 249 project (Johansen et al., 2000). It relies on the HTK toolkit and is basically an extension of the HTK tutorial example (Young et al., 1997). The frontend module was, however, modified in order to enhance speech recognition robustness. The acoustic feature vector produced by the frontend module consisted of 24 mel-scaled cepstral, 12 Δ-cepstral, 12 ΔΔ-cepstral,

Speech databases

The experiments were carried out using the speech databases produced in the framework of the SpeechDat(II) project (Hoege et al., 1997). These databases provide a realistic basis for the development of voice-driven teleservices and are especially useful for the research of multilingual systems. The SpeechDat(II) databases were recorded under the same recording conditions for all languages: telephone speech, 1000 or 5000 speakers, ≈2 min of speech per speaker, each speaker recorded in a separate

Experimental results

The recognition task was limited to the recognition of isolated words from a medium-size vocabulary. The number of test utterances and the size of the test vocabulary for each language are given in Table 2.

Discussion

These experiments have shown that the agglomerative clustering algorithm can produce a multilingual set of triphones that achieves a recognition rate similar to that of the language-specific triphone sets operating in parallel. At the same time, the number of triphones in the multilingual set is significantly smaller than the total number of triphones in the language-specific triphone sets. In the best case, using the clustering algorithm resulted in a reduction of the number of

Conclusion

The experiments described in this paper have confirmed that triphone clustering can be performed efficiently by estimating similarity on the phoneme level. The agglomerative clustering technique shows a promising way to cope with the problems of multilingual speech recognition. It is capable of maintaining the recognition rate of the triphone based recognisers, whilst significantly reducing the computational complexity of the multilingual speech recognition system. The full potential of this

Acknowledgements

The authors wish to acknowledge Central Corporate Research Laboratories, Siemens AG, Munich, Germany, and the Universitat Politècnica de Catalunya, Barcelona, Spain, for providing the German and Spanish SpeechDat(II) databases.

References (24)

Andersen, O., Dalsgaard, P., Barry, W., 1993. Data-driven identification of poly- and mono-phonemes for four European...
Berkling, K.M., 1996. Automatic language identification with sequences of language-independent phoneme clusters. PhD...
Berkling, K.M., Barnard, E., 1994. Language identification with multilingual phoneme clusters. In: Proceedings...
Bonaventura, P., Gallocchio, F., Micca, G., 1997. Multilingual speech recognition for flexible vocabularies. In:...
Bourlard, H., 1995. Towards increasing speech recognition error rates. In: Proceedings Eurospeech’95, Madrid, pp....
Harbeck, S., Nöth, E., Niemann, H., 1997. Multilingual speech recognition. In: Multilingual Information Retrieval,...
Haunstein, A., Marschall, E., 1995. Methods for improved speech recognition over the telephone lines. In: Proceedings...
Hoege, H., Tropf, H., Winski, R., van den Heuvel, H., Haeb-Umbach, R., 1997. European speech databases for telephone...
Johansen, F.T., Warakagoda, N., Lindberg, B., Lehtinen, G., Kačič, Z., Žgank, A., Elenius, K., Salvi, G., 2000. The...
Kadambe, S., Hieronymus, J.L., 1994. Spontaneous speech language identification with a knowledge of linguistics. In:...
Kaiser, J., Kačič, Z., 1998. Development of the Slovenian SpeechDat database. In: Speech Database Development for...
Koehler, J., 1996. Multi-lingual phoneme recognition exploiting acoustic–phonetic similarities of sounds. In:...

