Elsevier

Speech Communication

Volume 39, Issues 3–4, February 2003, Pages 353-366
Clustering of triphones using phoneme similarity estimation for the definition of a multilingual set of triphones

https://doi.org/10.1016/S0167-6393(02)00048-1

Abstract

This paper addresses the problem of multilingual acoustic modelling for the design of multilingual speech recognisers. An agglomerative clustering algorithm for the definition of a multilingual set of triphones is proposed. The clustering algorithm is based on an indirect distance measure for triphones, defined as a weighted sum of explicit estimates of context similarity at the monophone level. The monophone similarity estimation method is based on the algorithm of Houtgast. The new clustering algorithm was tested in a multilingual speech recognition experiment for three languages. The algorithm was applied to the monolingual triphone sets of the language-specific recognisers for all languages. To evaluate the clustering algorithm, the performance of the multilingual set of triphones was compared with the performance of a reference system composed of all three language-specific recognisers operating in parallel, and with the performance of the multilingual set of triphones produced by a tree-based clustering algorithm. All experiments were based on the 1000 FDB SpeechDat(II) databases (Slovenian, Spanish and German). The experiments have shown that the use of the clustering algorithm results in a significant reduction of the number of triphones with only a minor degradation of the recognition rate.

Summary

This contribution addresses the problem of multilingual acoustic modelling for automatic speech recognition. It introduces the use of an agglomerative clustering algorithm for defining a set of multilingual context-dependent phonetic units (triphones). The algorithm is based on an indirect distance measure for triphones, defined as a weighted sum of the estimated similarities of the monophones belonging to the triphones. The monophone similarity estimation is based on the algorithm of Houtgast. The new clustering algorithm was applied in multilingual speech recognition experiments for three different languages. For this purpose, language-specific recognition systems with monolingual triphones were used for all three languages. To evaluate the clustering algorithm, the performance of the system based on the multilingual triphones was compared with two reference systems. In the first reference system the language-specific models were run simultaneously (in parallel), whereas the second reference system used multilingual models created with a decision-tree-based clustering algorithm. All experiments are based on the 1000 FDB SpeechDat(II) databases (Slovenian, Spanish and German). The investigations showed that using the proposed agglomerative clustering algorithm leads to a marked reduction in the number of triphone parameters with only a slight decrease in the recognition rate.

Introduction

Over the last few years, the development of speech technology has raised interest in research on multilingual speech recognition. The goal of research in this area is twofold. On the one hand, a multilingual speech recognition system should extend the functionality of language-specific recognisers to a number of languages without degrading the recognition accuracy or considerably increasing the computational complexity. On the other hand, a multilingual speech recognition system should serve as a tool for porting speech technology from one or more languages to another language with little or no spoken language resources. The work presented in this paper focuses on the first goal: the development of methods for defining multilingual phonetic inventories that reduce the complexity of multilingual speech recognisers.

The definition of multilingual phonetic inventories by exploiting similarities amongst the sounds of different languages is a promising approach. One of the first attempts was reported in (Andersen et al., 1993), where a multilingual phonetic inventory, consisting of language-dependent and language-independent speech units, was defined using a data-driven clustering technique. Other attempts based on different distance measures and clustering techniques followed (Koehler, 1996; Berkling, 1996; Bonaventura et al., 1997; Weng et al., 1997). However, most of the work so far has focused on context-independent phoneme modelling (monophones). These experiments show that the transition from a language-dependent monophone set to a multilingual inventory of monophones may result in a degradation of recognition accuracy due to the lack of acoustic resolution of the multilingual monophone set.

A transition from context-independent to context-dependent phoneme modelling seems inevitable in order to improve the performance of multilingual speech recognition systems. Context-dependent modelling has already proved that it can enhance the performance of monolingual recognisers (Young, 1996; Bourlard, 1995). Such an improvement can also be expected in the multilingual case. The crucial problem, however, is how to define the multilingual set of context-dependent phoneme models, that is, how to define the clustering procedure.

Even though the introduction of context-dependent phoneme modelling is a self-evident step in the evolution of multilingual speech recognition research, only a few research reports have been published so far. Most of these are by Schultz and Waibel (1998, 1999), who applied a decision tree-based clustering procedure to generate a multilingual set of context-dependent phoneme models. The purpose of the work reported in this paper is to complement the previous work on “top-down” clustering procedures by proposing a “bottom-up” approach. A new technique for the definition of the multilingual set of context-dependent phoneme models, based on agglomerative clustering, is introduced. The clustering algorithm has been implemented for triphones and is based on an indirect similarity estimate for two triphones, defined as the weighted sum of explicit similarity estimates for the phonemes of the left and right contexts and for the center phonemes.
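The bottom-up idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the similarity table, the weights, the complete-linkage criterion, the stopping threshold and all function names are hypothetical choices made for illustration, and the Houtgast-based monophone similarity estimation is replaced here by a hand-written lookup table.

```python
from itertools import combinations

# Hypothetical monophone similarities in [0, 1]; in the paper these are
# estimated from data, here they are invented for illustration only.
SIMILARITY = {frozenset(("t", "d")): 0.8}

def phone_similarity(a, b, table=SIMILARITY):
    """Similarity of two monophones; identical phones score 1, unknown pairs 0."""
    if a == b:
        return 1.0
    return table.get(frozenset((a, b)), 0.0)

def triphone_distance(t1, t2, weights=(0.25, 0.5, 0.25)):
    """Weighted sum of (1 - similarity) over left context, centre, right context.

    The weights are an assumption; the excerpt does not state the actual values.
    """
    return sum(w * (1.0 - phone_similarity(a, b))
               for w, a, b in zip(weights, t1, t2))

def agglomerate(triphones, threshold):
    """Merge the closest pair of clusters until no pair is within the threshold.

    Complete linkage (largest pairwise distance) is an assumed criterion.
    """
    clusters = [[t] for t in triphones]

    def linkage(c1, c2):
        return max(triphone_distance(a, b) for a in c1 for b in c2)

    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        if linkage(clusters[i], clusters[j]) > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j is safe
        del clusters[j]
    return clusters

# Triphones written as (left context, centre phone, right context).
triphones = [("a", "t", "a"), ("a", "d", "a"), ("i", "s", "u")]
clusters = agglomerate(triphones, threshold=0.3)
print(clusters)
```

With these made-up values, the two triphones that differ only in the similar centre phones end up in one cluster, while the unrelated triphone stays on its own. The threshold controls the trade-off between the number of resulting units and their acoustic resolution.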

This paper is organized as follows. The formulation of the triphone distance measure and the clustering algorithm, as well as the description of the reference tree-based clustering algorithm, are given in Section 2. The experimental multilingual recognition systems and the speech databases used for the experiments are described in Sections 3 and 4, respectively. Experimental results are presented in Section 5, while the discussion and the conclusion are given in Sections 6 and 7.

Section snippets

Definition of multilingual triphone set

A multilingual set of triphone units and corresponding acoustic models should be defined so as to preserve the high acoustic resolution that is specific to triphone modelling, without an excessive increase in the number of triphone units. This means that identical or similar triphone units should be identified across different languages and represented by the same acoustic model. Clusters of similar triphones can be defined either by the agglomerative clustering procedure, where similar

Baseline recogniser

The speech recognition system used for the experiments was developed in the framework of the “SpeechDat task force” within the COST 249 project (Johansen et al., 2000). It relies on the HTK toolkit and is basically an extension of the HTK tutorial example (Young et al., 1997). The frontend module was, however, modified in order to enhance speech recognition robustness. The acoustic feature vector produced by the frontend module consisted of 24 mel-scaled cepstral, 12 Δ-cepstral, 12 ΔΔ-cepstral,

Speech databases

The experiments were carried out using the speech databases produced in the framework of the SpeechDat(II) project (Hoege et al., 1997). These databases provide a realistic basis for the development of voice-driven teleservices and are especially useful for research on multilingual systems. The SpeechDat(II) databases were recorded under the same recording conditions for all languages: telephone speech, 1000 or 5000 speakers, ≈2 min of speech per speaker, each speaker recorded in a separate

Experimental results

The recognition task was limited to the recognition of isolated words from a medium-size vocabulary. The number of test utterances and the size of the test vocabulary for each language are given in Table 2.

Discussion

These experiments have shown that the agglomerative clustering algorithm can produce a multilingual set of triphones that achieves a recognition rate similar to that of the language-specific triphone sets operating in parallel. At the same time, the number of triphones in the multilingual set is significantly smaller than the total number of triphones in the language-specific triphone sets. In the best case, using the clustering algorithm resulted in a reduction of the number of

Conclusion

The experiments described in this paper have confirmed that triphone clustering can be performed efficiently by estimating similarity on the phoneme level. The agglomerative clustering technique shows a promising way to cope with the problems of multilingual speech recognition. It is capable of maintaining the recognition rate of the triphone based recognisers, whilst significantly reducing the computational complexity of the multilingual speech recognition system. The full potential of this

Acknowledgements

The authors wish to acknowledge Central Corporate Research Laboratories, Siemens AG, Munich, Germany, and the Universitat Politècnica de Catalunya, Barcelona, Spain, for providing the German and Spanish SpeechDat(II) databases.

References (24)

  • Andersen, O., Dalsgaard, P., Barry, W., 1993. Data-driven identification of poly- and mono-phonemes for four European...
  • Berkling, K.M., 1996. Automatic language identification with sequences of language-independent phoneme clusters. PhD...
  • Berkling, K.M., Barnard, E., 1994. Language identification with multilingual phoneme clusters. In: Proceedings...
  • Bonaventura, P., Gallocchio, F., Micca, G., 1997. Multilingual speech recognition for flexible vocabularies. In:...
  • Bourlard, H., 1995. Towards increasing speech recognition error rates. In: Proceedings Eurospeech’95, Madrid, pp....
  • Harbeck, S., Nöth, E., Niemann, H., 1997. Multilingual speech recognition. In: Multilingual Information Retrieval,...
  • Haunstein, A., Marschall, E., 1995. Methods for improved speech recognition over the telephone lines. In: Proceedings...
  • Hoege, H., Tropf, H., Winski, R., van den Heuvel, H., Haeb-Umbach, R., 1997. European speech databases for telephone...
  • Johansen, F.T., Warakagoda, N., Lindberg, B., Lehtinen, G., Kačič, Z., Žgank, A., Elenius, K., Salvi, G., 2000. The...
  • Kadambe, S., Hieronymus, J.L., 1994. Spontaneous speech language identification with a knowledge of linguistics. In:...
  • Kaiser, J., Kačič, Z., 1998. Development of the Slovenian SpeechDat database. In: Speech Database Development for...
  • Koehler, J., 1996. Multi-lingual phoneme recognition exploiting acoustic–phonetic similarities of sounds. In:...