Very low bit-rate F0 coding for phonetic vocoders using MSD-HMM with quantized F0 symbols
Highlights
► We propose very low bit-rate F0 coding based on quantized F0 symbols and MSD-HMM.
► F0 symbols are used as the contextual factors in F0 modeling.
► Model parameters can be robustly estimated using tree-based parameter tying.
► Experimental results show a significant improvement in coding performance over MSD-VQ.
Introduction
Segment-based coding is one of the most popular approaches to very low bit-rate speech coding at rates on the order of 100 bits/sec. In segment-based coding, several frames are regarded as an acoustic segment and encoded into a discrete symbol using a codebook trained in advance. A typical segment-based coder is the phonetic vocoder, in which a phone is used as the acoustic unit for the encoding and decoding processes.
The basic idea of the phonetic vocoder was introduced in the 1950s by Dudley (1958), where the number of phonemes was very limited and the quality of the decoded speech was not very satisfactory. With improvements in computational performance and the development of phoneme recognition and speech synthesis, a number of techniques were studied in the 1980s (e.g., Schwartz et al., 1980; Soong, 1989; Picone and Doddington, 1989). In these techniques, the encoding was based on phoneme recognition and the decoding was based on concatenative synthesis of phone units. In the late 1990s, a parameter generation algorithm for HMM-based speech synthesis (Tokuda et al., 1995b) was incorporated into the decoder (Tokuda et al., 1998) in place of unit selection. In the HMM-based phonetic vocoder, a spectral feature sequence is generated from the HMMs that are used as acoustic models in the phoneme recognition.
The HMM-based phonetic vocoder is a promising approach to very low bit-rate spectral coding and can generate natural-sounding speech from a relatively smaller amount of speech data than unit selection requires. However, most related studies have focused on spectral coding, and prosodic coding, especially of the fundamental frequency (F0), has not been well discussed. F0 is also an essential factor in speech representation because it expresses linguistic information such as accent and tone. Moreover, F0 often carries para-linguistic cues such as emotion and speaking style, which enhance speech communication. One of the difficulties of F0 coding at very low bit-rates is that such linguistic and para-linguistic information is not easy to extract automatically, whereas spectral features can be represented effectively by the phonetic information obtained with phoneme recognition.
Several techniques have been proposed to overcome the problem with F0 coding. Picone and Doddington (1987) proposed contour quantization, where the F0 contour was normalized by a nominal value and then vector-quantized. Lee and Cox (2001) proposed an alternative technique using piecewise linear approximation (Scheffers, 1988), which is similar to polygon approximation (Katsaggelos et al., 2002) in image coding. However, the decoded F0 contour was linear within each segment, and F0 variations at segment boundaries were not smooth. In addition, neither of these two techniques used phonetic information in the F0 coding process, which could enhance the coding efficiency. To address these problems, Hoshiya et al. (2003) introduced a statistical approach into segment-based F0 coding. They proposed vector quantization based on the multi-space probability distribution (MSD-VQ). In this technique, F0 values are modeled with the MSD, where the observation space of F0 features is represented by a union of voiced and unvoiced spaces. Although this approach makes it feasible to treat F0 values statistically, codebooks are trained separately for the respective phonemes, and codewords depend only on the current phoneme. This means that the context of the preceding and succeeding phonemes is not taken into account in MSD-VQ. In contrast, since the synthesis unit in speech synthesis is generally modeled with a phonetically and prosodically context-dependent HMM, there is an inconsistency between the encoding and decoding processes in MSD-VQ.
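The MSD idea mentioned above can be illustrated with a minimal sketch. It assumes the simplest two-space case: a one-dimensional Gaussian over log F0 for the voiced space and a zero-dimensional unvoiced space, with space weights that sum to one. The function name and parameter names are hypothetical, and the sketch is not the implementation used by Hoshiya et al. or in this paper.

```python
import math


def msd_log_likelihood(obs, w_voiced, mean, var):
    """Log output probability of one frame under a two-space MSD.

    obs      -- log F0 of a voiced frame, or None for an unvoiced frame
    w_voiced -- weight of the voiced space (unvoiced weight is 1 - w_voiced)
    mean/var -- Gaussian parameters of the voiced space (assumed 1-D)
    """
    if not 0.0 < w_voiced < 1.0:
        raise ValueError("w_voiced must lie strictly between 0 and 1")
    if var <= 0.0:
        raise ValueError("var must be positive")
    if obs is None:
        # Unvoiced space is zero-dimensional: only the space weight counts.
        return math.log(1.0 - w_voiced)
    # Voiced space: space weight times a Gaussian density over log F0.
    log_gauss = -0.5 * (math.log(2.0 * math.pi * var) + (obs - mean) ** 2 / var)
    return math.log(w_voiced) + log_gauss
```

A voiced frame is scored by both the space weight and the density, while an unvoiced frame is scored by the weight alone, which is what lets a single model cover both kinds of frame.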
In this paper, an F0 coding technique for HMM-based phonetic vocoders is proposed in which MSD-HMM (Tokuda et al., 1999) is used for F0 modeling. MSD-HMM is generally used in HMM-based speech synthesis to model F0, which takes a continuous value in voiced frames and a discrete symbol in unvoiced frames. In the proposed technique, we model the F0 values of each phone unit using a phonetically and prosodically context-dependent MSD-HMM. As described above, it is difficult to automatically extract prosodic contextual factors such as accent and tone with high reliability, and inaccurate prosodic context could degrade the performance of F0 coding. To overcome this problem, we employ quantized F0 symbols (Nose et al., 2010a), which were originally proposed for unsupervised prosodic labeling in HMM-based speech synthesis. We obtain the quantized F0 symbol for each phone by quantizing the average log F0 value of the phone. The F0 symbol sequence represents a rough shape of the original F0 contour, and these symbols are used as the prosodic context for the current phone alongside the phonetic context, i.e., the triphone. This means that we can use not only the contextual factors of the current phone but also those of the preceding and succeeding phones, which is one of the advantages of the proposed F0 coding technique over the conventional techniques above.
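As a rough illustration of how the phonetic and prosodic contexts could be combined, the sketch below builds one label per phone from the triphone and the F0 symbols of the preceding, current, and succeeding phones. The label layout, the "sil" padding phone, and the "x" padding symbol are assumptions made for this example; the paper does not specify a label format.

```python
def context_labels(phones, f0_symbols):
    """Build one context label per phone from triphone and F0-symbol context.

    phones     -- phone identities, one per segment
    f0_symbols -- quantized F0 symbols, one per segment
    The 'L-C+R/Q:l_c_r' layout is hypothetical, chosen only for illustration.
    """
    if len(phones) != len(f0_symbols):
        raise ValueError("phones and f0_symbols must have the same length")
    # Pad both sequences so the first and last phones have neighbours.
    padded_phones = ["sil"] + list(phones) + ["sil"]
    padded_syms = ["x"] + [str(q) for q in f0_symbols] + ["x"]
    labels = []
    for i in range(1, len(phones) + 1):
        triphone = f"{padded_phones[i - 1]}-{padded_phones[i]}+{padded_phones[i + 1]}"
        prosody = f"{padded_syms[i - 1]}_{padded_syms[i]}_{padded_syms[i + 1]}"
        labels.append(f"{triphone}/Q:{prosody}")
    return labels
```

For example, `context_labels(["a", "k", "i"], [2, 0, 3])` yields labels in which the middle phone carries both its neighbours' identities and their F0 symbols.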
The contributions of the proposed technique to very low bit-rate F0 coding are summarized as follows. The first is the use of phonetic and prosodic contexts. In the conventional coding technique with MSD-VQ, the codebook is trained for each phoneme separately, and the phonetic and F0 information of the preceding and succeeding phone segments is not taken into account. However, it is well known in unit-selection-based speech synthesis that a lack of information about adjacent phone segments often causes discontinuities at the boundaries of the concatenated segments, and the quality of the resulting synthetic speech is not always satisfactory. In contrast, the proposed technique takes advantage of HMM-based speech synthesis by using a phonetically and prosodically context-dependent model, where the contexts capture the supra-segmental characteristics of spectral and F0 trajectories as well as the segmental ones within each phone. The second is robust parameter estimation using parameter tying with decision trees. In MSD-VQ, the appropriate codebook size generally differs from phoneme to phoneme, so the codebook size of every phoneme must be adjusted manually to reduce the bit-rate as much as possible. In the proposed technique, on the other hand, the number of parameters is determined automatically using state-based decision trees with the minimum description length (MDL) criterion (Shinoda and Watanabe, 2000), where the phonemes and F0 symbols are taken into account as the contextual factors. This means that the parameters are tied across phonemes and F0 symbols, which should make the parameter estimation and F0 modeling more robust.
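The MDL-based stopping rule can be sketched for a single candidate split. This is a simplified illustration and not the procedure of Shinoda and Watanabe (2000) or of this paper: it assumes one-dimensional Gaussians fitted to raw samples, ignores the MSD space weights and HMM state occupancies, and uses hypothetical function names. A split is kept only if its log-likelihood gain exceeds the description-length penalty for the extra Gaussian (two parameters, mean and variance, each costing half the log of the total sample count).

```python
import math


def gaussian_log_likelihood(values):
    """Log-likelihood of values under their own maximum-likelihood 1-D Gaussian."""
    n = len(values)
    mean = sum(values) / n
    # Variance floor (an assumption) avoids log(0) for constant clusters;
    # when the floor is active the closed form below is only approximate.
    var = max(sum((v - mean) ** 2 for v in values) / n, 1e-6)
    return -0.5 * n * (math.log(2.0 * math.pi * var) + 1.0)


def mdl_accepts_split(parent, yes, no, total_count, alpha=1.0):
    """Return True if splitting parent into yes/no reduces description length.

    total_count -- number of samples at the root of the tree
    alpha       -- penalty weight; larger values give smaller trees
    """
    gain = (gaussian_log_likelihood(yes) + gaussian_log_likelihood(no)
            - gaussian_log_likelihood(parent))
    # One extra Gaussian adds 2 parameters; MDL charges (2 / 2) * log(N).
    penalty = alpha * math.log(total_count)
    return gain > penalty
```

Because the penalty grows with the amount of training data and not with any per-phoneme setting, the tree size follows from the data, which is the property the paragraph above contrasts with manually sized MSD-VQ codebooks.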
This paper is organized as follows. In Section 2, we propose the F0 coding technique based on quantized F0 symbols and context-dependent MSD-HMMs. In Section 3, we conduct objective and subjective experiments and discuss the results. Finally, Section 4 summarizes our findings.
Section snippets
F0 encoding using phone-level F0 quantization
In the encoder, an extracted F0 contour is converted into an F0 symbol sequence at the phone level. Each F0 symbol is obtained by roughly quantizing the average log F0 value of each phone. The resulting F0 symbol sequence represents the outline of the original F0 contour. In our previous studies on unsupervised F0 modeling (Nose et al., 2010a) and voice conversion (Nose et al., 2010b), we found that these F0 symbols could be used as a prosodic context for HMM-based speech synthesis.
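The phone-level quantization described above can be sketched as follows. The sketch assumes uniform quantization of the mean log F0 over a fixed range, averages over voiced frames only, and returns `None` for a phone with no voiced frame (which would need its own symbol in practice). The function name, the range parameters, and these design choices are assumptions for illustration, not details given in the paper.

```python
import math


def quantize_phone_f0(f0_frames, phone_bounds, bits, log_f0_min, log_f0_max):
    """Map each phone to a quantized F0 symbol.

    f0_frames    -- F0 per frame in Hz, with 0.0 marking unvoiced frames
    phone_bounds -- (start, end) frame indices per phone, end exclusive
    bits         -- number of quantization bits per phone
    log_f0_min/max -- assumed log F0 range covered by the quantizer
    Returns an int in [0, 2**bits - 1] per phone, or None if fully unvoiced.
    """
    levels = 2 ** bits
    step = (log_f0_max - log_f0_min) / levels
    symbols = []
    for start, end in phone_bounds:
        voiced = [math.log(f) for f in f0_frames[start:end] if f > 0.0]
        if not voiced:
            symbols.append(None)  # no voiced frame in this phone
            continue
        mean_log_f0 = sum(voiced) / len(voiced)
        index = int((mean_log_f0 - log_f0_min) / step)
        # Clamp values that fall outside the assumed range.
        symbols.append(min(max(index, 0), levels - 1))
    return symbols
```

With two bits, each voiced phone is reduced to one of four levels, so the symbol sequence keeps only the coarse outline of the contour.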
For the F0
Experimental conditions
We used read-style speech of three male (MHT, MSH, and MYI) and three female (FKN, FTK, and FYM) speakers from the ATR Japanese speech database set B in the following experiments. Each speaker uttered 503 phonetically balanced Japanese sentences. We used 450 sentences for model training and the remaining 53 sentences for evaluation. The average phoneme recognition rate, including insertion errors, for the test data of the six speakers was 78.3%. The average phoneme rate computed from the
Conclusions
We proposed a technique for very low bit-rate F0 coding for HMM-based phonetic vocoders. The proposed technique uses roughly quantized F0 symbols as the prosodic context for HMM-based speech synthesis in the decoding process. By taking the preceding and succeeding F0 symbols into account as contexts, the decoded F0 contour is very smooth and similar to the original even when the number of quantization bits is very small, such as two or three. Experimental results demonstrated
Acknowledgment
Part of this work was supported by JSPS Grant-in-Aid for Scientific Research 21300063 and 21800020.
References (18)
- et al. Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds. Speech Communication (1999)
- Phonetic pattern recognition vocoder for narrow-band speech transmission. The Journal of the Acoustical Society of America (1958)
- Hoshiya, T., Sako, S., Zen, H., Tokuda, K., Masuko, T., Kobayashi, T., Kitamura, T., 2003. Improving the performance...
- et al. MPEG-4 and rate-distortion-based shape-coding techniques. Proceedings of the IEEE, Special Issue Part Two: Multimedia Signal Processing (2002)
- et al. A very low bit rate speech coder based on a recognition/synthesis paradigm. IEEE Transactions on Speech and Audio Processing (2001)
- Nose, T., Ooki, K., Kobayashi, T., 2010a. HMM-based speech synthesis with unsupervised labeling of accentual context...
- et al. HMM-based voice conversion using quantized F0 context. IEICE Transactions on Information and Systems (2010)
- Picone, J., Doddington, G., 1987. Low rate speech coding using contour quantization. In: Proceedings of the ICASSP’87,...
- Picone, J., Doddington, G., 1989. A phonetic vocoder. In: Proceedings of the ICASSP’89, pp....