
Speech Communication

Volume 48, Issue 8, August 2006, Pages 913-926

Language identification using acoustic log-likelihoods of syllable-like units

https://doi.org/10.1016/j.specom.2005.12.003

Abstract

Automatic spoken language identification (LID) is the task of identifying the language from a short utterance of speech produced by an unknown speaker. The most successful approach to LID uses phone recognizers of several languages in parallel [Zissman, M.A., 1996. Comparison of four approaches to automatic language identification of telephone speech. IEEE Trans. Speech Audio Process. 4 (1), 31–44]. The basic requirement for building a parallel phone recognition (PPR) system is segmented and labeled speech corpora. In this paper, a novel approach is proposed for the LID task which uses parallel syllable-like unit recognizers, in a framework similar to the PPR approach in the literature. The difference is that the sub-word unit models for each of the languages to be recognized are generated in an unsupervised manner, without the use of segmented and labeled speech corpora. The training data of each language is first segmented into syllable-like units, and a language-dependent syllable-like unit inventory is created. These syllable-like units are then clustered using an incremental approach, resulting in a set of syllable-like unit models for each language. Using these language-dependent syllable-like unit models, language identification is performed based on accumulated acoustic log-likelihoods. Our initial results on the Oregon Graduate Institute Multi-language Telephone Speech Corpus [Muthusamy, Y.K., Cole, R.A., Oshika, B.T., 1992. The OGI multi-language telephone speech corpus. In: Proceedings of Internat. Conf. Spoken Language Process., October 1992, pp. 895–898] show a performance of 72.3%. We further show that if only a subset of syllable-like unit models that are unique (in some sense) is considered, the performance improves to 75.9%.

Introduction

Automatic spoken language identification without any knowledge about the languages to be identified is a challenging problem. In the spoken language identification task, it must be assumed that no spectral or other information about the test speaker is present in the training set. Consequently, the comparison between the test utterance and the reference models of a language is effectively a comparison between unconstrained utterances of two different speakers (Li, 1994). The differences between two such utterances therefore encompass text differences, speaker differences, environment differences, and language differences. The main problem is how to isolate the language differences from the text, speaker, and environment differences so as to build a reliable spoken language identification system.

The main features of an ideal spoken language identification system are:

  • The computation time required to determine the identity of a test utterance must be small.

  • Performance should degrade gracefully as the length of the test utterance is reduced.

  • The system should not be biased towards any particular language or group of languages.

  • The system should not be complex, in the sense that:

    • the amount of language-specific information required to develop the system should be small,

    • adding a new language to the existing system should be easy.

  • The system should tolerate:

    • channel and environment variations,

    • noisy/low SNR speech signals,

    • accent variations.

Humans are the best LID systems in the world today. Just by hearing one or two seconds of speech in a familiar language, they can easily identify the language, and they draw on several sources of information to do so. Speech in a language is a sequence of phones/sound units, and the differences among languages can exist at several levels. Hierarchically, these levels are the frame level (10–30 ms), the phone level, the consonant–vowel (CV unit) level, the syllable level, the word level, and the phrase level. The possible differences among languages at these levels are the unit inventory, the frequency of occurrence of the different units in each inventory, the sequence of units (phonotactics) and their frequencies of occurrence, the acoustic signatures, the duration of the same sound unit in different languages, and the intonation patterns of units at the higher levels. The performance of any LID system depends on the amount and reliability of the information extracted from the speech signal and on how efficiently that information is incorporated into the system.

Existing spoken language identification systems can be broadly classified into two groups, namely, explicit and implicit LID systems. LID systems that require speech recognizers of one or several languages, in other words, systems that require a segmented and labeled speech corpus, are termed here explicit LID systems. Language identification systems which do not require phone recognizers (or, rather, segmented and labeled speech data) are termed here implicit LID systems. In other words, these systems require only the raw speech data along with the true identity of the language spoken (Zissman, 1996). The language models or the language-specific information are derived only from the raw speech data. Both types of systems have received significant attention in the literature.

A number of researchers have used phone recognizers (either language-dependent or language-independent) as a front-end for language identification (Lamel and Gauvain, 1994, Berkling et al., 1994, Hazen and Zue, 1994, Kadambe and Hieronymus, 1995, Yan and Barnard, 1995, Navratil and Zuhlke, 1997). The most successful approach to LID (in terms of performance) uses phone recognizers of several languages in parallel (Zissman, 1996). In (Zissman, 1996), it is shown that a language identification system can be built even with a single-language phone recognizer. But the analysis in (Zissman, 1996) also indicates that the performance of the system improves considerably as the number of front-end phone recognizers increases. The basic requirement for building a parallel phone recognition (PPR) system is a segmented and labeled speech corpus. Building segmented and labeled speech corpora for all the languages to be recognized is both time consuming and expensive, requiring trained human annotators and a substantial amount of supervision (Greenberg, 1999). Further, in (Singer et al., 2003), GMM-based and SVM-based implicit LID systems are shown to perform better than conventional explicit LID systems. The unavailability of segmented and labeled speech corpora, together with these recent developments in implicit LID systems, therefore makes implicit LID systems more attractive.

In (Jayaram et al., 2003, Ramasubramanian et al., 2003), a parallel sub-word recognition system for the LID task is proposed, in a framework similar to the parallel phone recognition (PPR) approach in the literature (Zissman, 1996). The difference is that this approach does not require segmented and labeled speech corpora. Since most phonemes are shared among languages, the sources of information that may be used for LID are the variation in the frequency of occurrence of the same phoneme in different languages, and the variation in its acoustic realization across languages. Only very few phonemes are unique to a particular language. If a longer sound unit, say a syllable-like unit, is used, then the number of unique syllable-like units in any language is very high, which is potentially useful information for discriminating between languages. Li (1994) proposed a system based on features extracted at the syllable level. In this system, the syllable nuclei (vowels) of each speech utterance are located automatically. Next, feature vectors containing spectral information are computed for regions near the syllable nuclei. Each of these vectors consists of spectral sub-vectors computed on neighboring frames of speech data. Rather than collecting and modeling these vectors over all training speech, Li keeps separate collections of feature vectors for each training speaker. During testing, syllable nuclei of the test utterance are located and feature vectors are extracted. Each speaker-dependent set of training feature vectors is compared to the feature vectors of the test utterance, and the most similar speaker-dependent set of training vectors is found.
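Li's matching step can be viewed as a nearest-neighbour search over speaker-dependent collections of syllable-nucleus features. The sketch below illustrates only that search, not Li's actual implementation; the function name, the use of Euclidean distance, and the averaging of per-vector minimum distances are assumptions made for illustration.

```python
import numpy as np

def rank_training_speakers(test_vectors, speaker_sets):
    """Rank training speakers by similarity to the test utterance.

    test_vectors : (n_test, dim) array of spectral features computed around
                   the automatically located syllable nuclei of the test
                   utterance.
    speaker_sets : dict mapping a training-speaker id to an (n_i, dim) array
                   of the same kind of features for that speaker.
    """
    scores = {}
    for speaker, train_vectors in speaker_sets.items():
        # pairwise Euclidean distances between test and training vectors
        diff = test_vectors[:, None, :] - train_vectors[None, :, :]
        dists = np.linalg.norm(diff, axis=-1)
        # match each test vector to its closest vector from this speaker
        scores[speaker] = dists.min(axis=1).mean()
    # smaller average distance means a more similar speaker
    return sorted(scores, key=scores.get)
```

Under this view, the language hypothesis would simply be the language spoken by the highest-ranked training speaker.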

One of the major reasons for considering the syllable as a basic unit for speech recognition systems is its better representational and durational stability relative to the phoneme (Wu et al., 1998). The syllable was proposed as a unit for ASR as early as 1975 (Fujimura, 1975), in a work that discussed irregularities in the phonetic manifestations of phonemes and argued that the syllable would serve as an effective minimal unit in the time domain. In (Prasad, 2003), it is demonstrated that segmentation into syllable-like units followed by isolated-style recognition of continuous speech performs well.

Many languages of the world possess a relatively simple syllable structure consisting of several canonical forms (Greenberg, 1999). Most of the syllables in such languages contain just two phonetic segments, typically of the CV type (for example, Japanese). The remaining syllabic forms are generally of the V or VC variety. In contrast, English and German possess a more heterogeneous syllable structure, in which the onset and/or coda constituents often contain two or more consonants. But a salient property shared by stress-timed and syllable-timed languages is the preference for CV syllabic forms in spontaneous speech: nearly half of the syllables in English and over 70% of the syllables in Japanese are of this variety. There is also a substantial proportion of CVC syllables in the spontaneous speech of both languages (Greenberg, 1999). This shows that even for languages which are not syllable-timed, the syllable can be defined using a simple structure. Further, a definition of the syllable in terms of the short-term energy function is suitable for almost all languages in the case of spontaneous speech.

In this paper, a novel approach is proposed for the LID task which uses parallel syllable-like unit recognizers (Nagarajan and Murthy, 2004), in a framework similar to the PPR approach in the literature, with one significant difference: the sub-word unit models (syllable-like unit models) for each of the languages to be recognized are generated in an unsupervised manner, without the use of segmented and labeled speech corpora.

The basic requirement for building syllable-like unit recognizers for all the languages to be identified is an efficient segmentation algorithm. Earlier, an algorithm (Prasad et al., 2004) was proposed which segments the speech signal into syllable-like units. Recently, several refinements (Nagarajan et al., 2003) have been made to improve the segmentation performance of this baseline algorithm. Using the refined algorithm (Nagarajan et al., 2003), the training utterances of each language are first segmented into syllable-like units. Similar syllable segments are then grouped together and syllable models are trained incrementally. These language-dependent syllable models are then used to identify the language of unknown test utterances.

The rest of the paper is organized as follows. In Section 2, the speech corpus used in this study is described. In Section 3, the segmentation approach used to segment the speech signal into syllable-like units is briefly described. Section 4 describes the unsupervised and incremental clustering procedure used to cluster similar syllable-like units. In Section 5, the different methods used to identify the language of an unknown utterance are described in detail. The performance of these LID systems is analyzed in Section 6.


Speech corpus

The Oregon Graduate Institute Multi-language Telephone Speech (OGI_MLTS) Corpus (Muthusamy et al., 1992), which is designed specifically for LID research, is used for both training and testing. This corpus currently consists of spontaneous utterances in 11 languages: English (En), Farsi (Fa), French (Fr), German (Ge), Hindi (Hi), Japanese (Ja), Korean (Ko), Mandarin (Ma), Spanish (Sp), Tamil (Ta) and Vietnamese (Vi). The utterances were produced by approximately 90 male and 40 female speakers in each language over

Segmentation of speech into syllable-like units

Researchers have tried different ways of segmenting the speech signal either at the phoneme level or at the syllable level (Mermelstein, 1975, Schmidbauer, 1987, Nakagawa and Hashimoto, 1988, Noetzel, 1991, Shastri et al., 1999), with or without the use of phonetic transcription. These segmentation methods can further be classified into two categories, namely, time-domain methods, where the short-term energy function, zero-crossing rate, etc. are used, and frequency-domain methods, where
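As an illustration of the time-domain view, the sketch below hypothesizes syllable-like boundaries at valleys of a smoothed short-term energy contour. It is not the algorithm of (Prasad et al., 2004) or (Nagarajan et al., 2003), which is considerably more refined; the frame size, hop size, and smoothing span shown are assumed values.

```python
import numpy as np

def syllable_like_boundaries(signal, fs, frame_ms=20.0, hop_ms=10.0, smooth_frames=15):
    """Hypothesize syllable-like boundaries at valleys of the short-term energy.

    signal is a 1-D float array, fs the sampling rate in Hz. The frame size,
    hop size and smoothing span are illustrative values only.
    Returns boundary positions in seconds.
    """
    frame = int(fs * frame_ms / 1000.0)
    hop = int(fs * hop_ms / 1000.0)
    # short-term energy contour of the signal
    energy = np.array([np.sum(signal[i:i + frame] ** 2)
                       for i in range(0, len(signal) - frame, hop)])
    # smooth the contour so that only syllable-scale valleys survive
    kernel = np.ones(smooth_frames) / smooth_frames
    smoothed = np.convolve(energy, kernel, mode="same")
    # local minima of the smoothed contour are taken as boundaries
    minima = [i for i in range(1, len(smoothed) - 1)
              if smoothed[i] < smoothed[i - 1] and smoothed[i] <= smoothed[i + 1]]
    return [i * hop / fs for i in minima]
```

The stretch of speech between two successive boundaries is then treated as one syllable-like unit.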

Unsupervised and incremental clustering

The main objective of this work is to derive a minimal set of syllable-like unit models for each language, to carry out the language identification task. Here, the hidden Markov modeling (HMM) technique is used to model the automatically segmented syllable-like units and to reduce the number of syllable-like unit models of each language. To derive sub-word unit models, the conventional batch training technique can be used, in which all the training examples which belong to a particular class,
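As a rough illustration of how such a model set can be grown without labels, the sketch below assigns each new segment to its best-scoring existing cluster when the log-likelihood exceeds a threshold, and otherwise seeds a new cluster. Single diagonal Gaussians over one fixed-length vector per segment are used here purely for brevity, in place of the HMMs used in the paper; the threshold value and the update rule are assumptions, not the paper's procedure.

```python
import numpy as np

class GaussianCluster:
    """Running diagonal-Gaussian model of one syllable-like cluster."""

    def __init__(self, vector):
        self.n = 1
        self.mean = vector.astype(float).copy()
        self.var = np.ones_like(self.mean)   # unit-variance prior for a new cluster

    def log_likelihood(self, vector):
        return -0.5 * np.sum(np.log(2 * np.pi * self.var)
                             + (vector - self.mean) ** 2 / self.var)

    def update(self, vector):
        # recursive (Welford-style) update of the running mean and variance
        self.n += 1
        delta = vector - self.mean
        self.mean += delta / self.n
        self.var += (delta * (vector - self.mean) - self.var) / self.n
        self.var = np.maximum(self.var, 1e-3)   # variance floor

def incremental_cluster(segment_vectors, threshold=-40.0):
    """Assign each segment vector to an existing cluster or start a new one.

    segment_vectors : iterable of fixed-length float vectors, one per
                      syllable-like segment (e.g. the mean cepstral vector).
    threshold       : illustrative log-likelihood acceptance threshold.
    """
    clusters = []
    for v in segment_vectors:
        if clusters:
            scores = [c.log_likelihood(v) for c in clusters]
            best = int(np.argmax(scores))
            if scores[best] > threshold:
                clusters[best].update(v)
                continue
        clusters.append(GaussianCluster(v))
    return clusters
```

In the paper, the analogous step operates on whole segments with HMMs, and the resulting clusters become the language-dependent syllable-like unit models.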

Language identification (LID) systems

One of the important language identification cues that can be used in parallel sub-word unit recognition based systems is n-gram statistics. Even if the speech data used during training is limited, n-gram statistics can very well be derived from digital text and used for the language identification task. But if the training process is unsupervised and the sub-word unit models do not have any identity, n-gram statistics derived from digital text cannot be of any use. An
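Without usable n-gram statistics, the decision can instead rest on accumulated acoustic log-likelihoods, as stated in the abstract. The sketch below shows one plausible form of that decision rule; the per-segment scoring function is a hypothetical callable standing in for HMM scoring, and the take-the-best-model-per-segment accumulation is an assumption rather than the paper's exact procedure.

```python
def identify_language(test_segments, language_models, score_segment):
    """Pick the language whose syllable-like unit models best explain the utterance.

    test_segments   : list of feature sequences, one per automatically
                      segmented syllable-like unit of the test utterance.
    language_models : dict mapping a language name to the list of that
                      language's syllable-like unit models.
    score_segment   : callable (segment, model) -> acoustic log-likelihood;
                      hypothetical stand-in for HMM (e.g. Viterbi) scoring.
    """
    totals = {}
    for language, models in language_models.items():
        # accumulate, over all segments, the log-likelihood of the best model
        totals[language] = sum(max(score_segment(seg, m) for m in models)
                               for seg in test_segments)
    # the language with the largest accumulated log-likelihood is hypothesized
    return max(totals, key=totals.get)
```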

Discussion

A thorough analysis of the errors made by the above-described LID methods was performed. It is observed that the errors in identifying the languages are due either to the low quality of the speech signal or to accent variation. In particular, for Tamil, the majority of the failures occur on utterances of Sri Lankan Tamil speakers. Even though the performance of the syllable-like unit based LID system is reasonably good, it is strongly biased towards some languages.


Conclusion

In this paper, a novel approach is proposed for spoken language identification which uses features derived from syllable-like units. Even though the framework used here is similar to the PPR approach in the literature, the main difference is that this approach does not require a segmented and labeled speech corpus of any language. Using the automatically segmented speech data, it is shown that syllable-like unit models can be generated without any supervision. For this, a clustering technique

References (28)

  • Greenberg, S., 1999. Speaking in short hand—a syllable-centric perspective for understanding pronunciation variation. Speech Comm.
  • Berkling, K.M., Arai, T., Bernard, E., 1994. Analysis of phoneme based features for language identification. In: ...
  • Fujimura, O., 1975. Syllable as a unit of speech recognition. IEEE Trans. Acoust. Speech Signal Process.
  • Godfrey, J.J., Holliman, E.C., McDaniel, J., 1992. SWITCHBOARD: telephone speech corpus for research and development. ...
  • Hazen, T.J., Zue, V.W., 1994. Recent improvements in an approach to segment-based automatic language identification. ...
  • Jayaram, A.K.V.S., Ramasubramanian, V., Sreenivas, T.V., 2003. Language identification using parallel sub-word ...
  • Kadambe, S., Hieronymus, J.L., 1995. Language identification with phonological and lexical models. In: Proceedings of ...
  • Lamel, L.F., Gauvain, J.L., 1994. Language identification using phone-based acoustic likelihoods. In: Proceedings of ...
  • Li, K.P., 1994. Automatic language identification using syllabic spectral features. In: Proceedings of IEEE Internat. ...
  • Mermelstein, P., 1975. Automatic segmentation of speech into syllabic units. J. Acoust. Soc. Amer.
  • Muthusamy, Y.K., Cole, R.A., Oshika, B.T., 1992. The OGI multilanguage telephone speech corpus. In: Proceedings of ...
  • Muthusamy, Y.K., et al., 1994. Reviewing automatic language identification. IEEE Signal Process. Mag.
  • Nagarajan, T., Murthy, H.A., 2004. Language identification using parallel syllable-like unit recognition. In: ...
  • Nagarajan, T., Murthy, H.A., Hegde, R.M., 2003. Segmentation of speech into syllable-like units. In: Proceedings of ...