Elsevier

Signal Processing

Volume 85, Issue 1, January 2005, Pages 37-50
A novel method to represent speech signals

https://doi.org/10.1016/j.sigpro.2004.08.012

Abstract

In this work, speech signals are modeled by means of pre-defined “signature functions”. These signature functions are generated from the statistical properties of speech signals. It is shown that, with a few basic signature functions, any speech signal can be generated within a tolerable error.

Introduction

The major objective of speech coding, an important application of speech processing, is to represent the signal with a minimum number of bits while maintaining perceptual quality. Coding the speech makes it possible to use bandwidth efficiently during transmission and to store the signal efficiently on a variety of magnetic and optical media [7]. Over the decades, a variety of speech coding methods have been proposed and developed, such as LPC, CELP, RELP, VSELP, PCM, DPCM, ADPCM, Sub-band Coding, Transform Coding, Adaptive Transform Coding [4], [10], [15], [18], [19] and projection pursuit techniques [8], [9], [13]. Beyond coding, numerous methods are used to represent signals for compression, recognition, classification and secure communication. These signal representation techniques include frequency domain, time domain, transform domain, fuzzy logic and artificial neural network techniques [5], [6], [10], [11], [16], [17]. In the late 1990s, new methods of signal representation by means of special feature functions, the so-called “signature functions”, were introduced [1], [2], [3], [12].

The main idea of the new methods was to represent signals via pre-defined signature functions obtained directly from the source. In this view, the human vocal tract is considered a one-of-a-kind source; in other words, all speech signals constitute a specific family of signals, the “Family of the Speech Signals”. Similarly, any sort of music stems from a family of signals called the “Family of the Music Signals”, etc.

In the new techniques of [1], [2], [3], [12], signature functions were created experimentally on an ad hoc basis. In this work, however, a statistical method is proposed to generate the signature functions, taking into account the quasi-stationary behavior of speech signals. To this end, we ran several thousand experiments to analyze speech signals using the signal representation method. In the experiments, each signal piece was divided into small frames (Fig. 1). For each frame, the correlation matrix was constructed and its eigenvalues and eigenvectors were computed. The eigenvector associated with the largest eigenvalue was then picked out and stored for further evaluation. These experiments eventually produced a large store of eigenvectors, akin to a data warehouse. Using a comparison algorithm, eigenvectors with similar shapes were eliminated. As a result, for the family of speech signals, we ended up with a set that contains only 15 or 16 different eigenvector shapes.

In this approach, each eigenvector is regarded as a time sequence, and its continuous form is named a “signature function”. These time sequences, or signature functions, are used to model the signals. In the modeling process, each frame of the speech signal is represented by only one signature function multiplied by a coefficient. Each frame is therefore represented by an index number, which identifies a pre-defined signature function or signature sequence, together with a coefficient. Hence, a substantial signal compression rate is achieved.

In the following section, the generation of the pre-defined signature sequences or signature functions is explained. Based on our experimental results, some selected signature functions for speech modeling are depicted. Examples are given to show the practical implementation of the new method. It is expected that the new idea presented in this paper to model speech signals may be used for speech coding, for efficient storage with a high compression rate, and for transmission purposes.
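The generation procedure described above (framing, per-frame correlation matrix, dominant eigenvector, removal of similar shapes) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the correlation matrix is taken to be the Toeplitz autocorrelation matrix of each frame, the "comparison algorithm" is replaced by a simple greedy test on the absolute inner product, and the function names and the `threshold` value are hypothetical.

```python
import numpy as np

def frame_signal(x, frame_len):
    """Split a 1-D signal into non-overlapping frames, dropping any leftover tail."""
    x = np.asarray(x, dtype=float)
    n_frames = len(x) // frame_len
    return x[: n_frames * frame_len].reshape(n_frames, frame_len)

def dominant_eigenvector(frame):
    """Unit-norm eigenvector belonging to the largest eigenvalue of the
    frame's autocorrelation matrix (assumed Toeplitz form)."""
    n = len(frame)
    # Biased autocorrelation estimate r[k] for lags k = 0 .. n-1.
    r = np.correlate(frame, frame, mode="full")[n - 1:] / n
    lags = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])
    R = r[lags]                      # symmetric Toeplitz matrix R[i, j] = r[|i - j|]
    _, vecs = np.linalg.eigh(R)      # eigh returns eigenvalues in ascending order
    return vecs[:, -1]

def prune_similar(vectors, threshold=0.9):
    """Greedy stand-in for the comparison step: keep a vector only if its
    absolute inner product with every vector kept so far is below threshold."""
    kept = []
    for v in vectors:
        v = v / np.linalg.norm(v)
        if all(abs(np.dot(v, k)) < threshold for k in kept):
            kept.append(v)
    return np.array(kept)

def build_signature_set(x, frame_len=24, threshold=0.9):
    """Candidate signature sequences for signal x, one row per retained shape."""
    frames = frame_signal(x, frame_len)
    candidates = [dominant_eigenvector(f) for f in frames]
    return prune_similar(candidates, threshold)
```

The number of shapes that survive depends entirely on the chosen threshold and on the training material, so this sketch should not be expected to reproduce the 15 or 16 shapes reported in the text.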

Section snippets

A statistical method to generate signature functions

In this method, a quasi-stationary signal, given over a long period of time, is divided into “frames” as shown in Fig. 1. Assume that $N$ samples are equally spaced over the long but finite interval. Then, the sampled signal $x(n)$ is given by

$$x(n)=\sum_{i=1}^{N} x_i\,\delta(n-i).$$

Here $\delta(n)$ is the unit impulse and $x_i$ is the height of sample $i$. Let us assume that the long signal train $x(n)$ is divided into equal-length frames of $L_F$ samples. Then, the time sequence of each frame can be represented by a

Selection of frame length $L_F$ by means of the hearing quality test “MOS”

In this section, we present our experimental results for selecting the optimum frame length $L_F$ using the above algorithm. In this process, first, several speech pieces obtained from 10 male and 10 female speakers were recorded at an 8 kHz sampling rate. The speakers were given random texts in Turkish to read. For each person, correlation matrices with different frame lengths were computed, and the corresponding eigenvectors (or signature sequences) were then generated. In this phase of the

Summary of the speech reconstruction process and discussion on the compression ability of the new technique

  • The most important conclusion of this research work is that any speech signal can be modeled by means of a pre-defined signature sequence set $S=\{S_1,S_2,S_3,\ldots,S_{N_s}\}$ which contains only $N_s=15$ (or 16) different signal shapes or waveforms constructed with $L_F=24$ (or 40) samples.

  • It is shown that any random speech frame $X_{Fk}$ consisting of $L_F$ samples can be expressed as $X_{Fk} \approx C_{kr} S_r$ such that $C_{kr} = X_{Fk}^T S_r$, where $S_r=[s_{1r}, s_{2r}, \ldots, s_{L_F r}]^T$ is pulled from the signature sequence set $S$ which yields the minimum value for the
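To make the single-signature representation in the list above concrete, here is a minimal sketch of encoding one frame as an index and a coefficient and then reconstructing it. It assumes that the signature sequences are unit-norm rows of an array and that the quantity being minimised is the squared reconstruction error; the snippet above is cut off before it names the criterion, so that choice is an assumption, as are the function names.

```python
import numpy as np

def encode_frame(frame, signatures):
    """Return (r, c): index of the best signature and its coefficient.

    signatures: array of shape (Ns, LF) whose rows are unit-norm sequences S_r.
    The coefficient is the inner product c = X^T S_r.
    """
    frame = np.asarray(frame, dtype=float)
    coeffs = signatures @ frame          # one inner product per signature
    # For unit-norm S_r, ||X - c S_r||^2 = ||X||^2 - c^2, so the smallest
    # squared error corresponds to the largest |c|.
    r = int(np.argmax(np.abs(coeffs)))
    return r, float(coeffs[r])

def decode_frame(r, c, signatures):
    """Rebuild the frame approximation c * S_r."""
    return c * signatures[r]
```

For example, with the three unit vectors of `np.eye(3)` as a toy signature set, the frame `[0.1, -2.0, 0.5]` is encoded as index 1 with coefficient -2.0, and decoding returns `[0, -2, 0]`. Only the index and the coefficient need to be stored or transmitted per frame.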

Examples

In this section, 10 different Turkish sentences were read by 20 speakers. Ten of these speakers were male and the remaining ten were female. Each sentence was sampled at 8 kHz and recorded on the computer. Then, using the pre-defined signature sequence sets for the frame lengths $L_F=24$ and $L_F=40$, as depicted in Figs. 2 and 3 respectively, the original sentences were reconstructed. The hearing quality of the reconstructed sentences was evaluated by 20 listeners. Ten of these listeners were

Comparative results

For $L_F=24$, the transmission rate of our method corresponds to 64 kb/s ÷ 16 = 4 kb/s, and the average MOS computed from Table 1 is 2.76. Similarly, for $L_F=40$, the transmission rate of the proposed method corresponds to 64 kb/s ÷ 26.6 ≈ 2.4 kb/s, with an average MOS of 2.5 as given in Table 2. In this case, it is fair to compare the hearing quality of the proposed method with that of “FS1015 LPC-10E” at 2.4 kb/s, for which the MOS is given as 2.6 [14]. This means that the hearing quality of our proposed
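The rates quoted above can be checked with a little arithmetic. The sketch below assumes that the 64 kb/s reference is 8-bit PCM at the 8 kHz sampling rate; under that assumption both quoted rates work out to 12 bits per frame. How those 12 bits are divided between the signature index and the coefficient is not stated in this excerpt; a 4-bit index (enough for 16 signatures) plus an 8-bit coefficient would be one possibility, and is purely a guess.

```python
SAMPLE_RATE_HZ = 8000      # sampling rate given in the text
PCM_BITS_PER_SAMPLE = 8    # assumption: 8 kHz x 8 bits = 64 kb/s reference

def bit_rate(frame_len, bits_per_frame, sample_rate=SAMPLE_RATE_HZ):
    """Bits per second when each frame of frame_len samples costs bits_per_frame bits."""
    return bits_per_frame * sample_rate / frame_len

def compression_ratio(frame_len, bits_per_frame):
    """Ratio of the PCM reference bits in a frame to the coded bits per frame."""
    return frame_len * PCM_BITS_PER_SAMPLE / bits_per_frame

for frame_len in (24, 40):
    rate = bit_rate(frame_len, bits_per_frame=12)
    ratio = compression_ratio(frame_len, bits_per_frame=12)
    print(f"LF={frame_len}: {rate / 1000:.1f} kb/s, compression {ratio:.2f}:1")
```

With 12 bits per frame this gives 4.0 kb/s (16:1) for 24-sample frames and 2.4 kb/s (about 26.67:1, which the text quotes as 26.6) for 40-sample frames.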

Conclusion

In this paper, a novel method to model speech signals is presented by means of so-called pre-defined signature sequences $S_r$, each of which consists of $L_F$ samples. The pre-defined signature sequences are collected in a set called the “signature sequence set $S$”. The set $S$ is generated using the statistical properties of speech signals. It is shown that any speech sequence (or frame) $X_{Fk}$, which consists of $L_F$ samples, can be expressed as $X_{Fk} \approx C_{kr} S_r$. In this representation, $S_r$ is pulled from the pre-defined

Acknowledgements

The authors would like to thank the reviewers of this paper for their helpful and constructive comments, which were both guiding and motivating. Useful discussions with Prof. Dr. Erdal Panayirci, Dr. Umit Guz and Hakan Gurkan of ISIK University, Istanbul, Turkey are also acknowledged.

References (19)

  • B. Bigi et al., A fuzzy decision strategy for topic identification and dynamic selection of language models, Signal Process. (EURASIP) (June 2000)
  • R. Akdeniz, A.M. Karas, B.S. Yarman, Turkish speech coding by signature base sequences, Proceedings of the...
  • R. Akdeniz, B.S. Yarman, Temel Tanim Dizileri ile Konuşma Kodlama, Proceedings of the SIU'98 -6. Sinyal İşleme ve...
  • R. Akdeniz, B.S. Yarman, Generation of optimum signature base sequences for speech signals, Proceedings...
  • T.P. Barnwell et al., Speech Coding: A Computer Laboratory Textbook (1996)
  • H. Bourlard et al., A training algorithm for statistical sequence recognition with applications to transition-based speech recognition, IEEE Signal Proc. Lett. (July 1996)
  • J.R. Deller et al., Discrete-Time Processing of Speech Signals (1993)
  • C.Ö. Etemoglu et al., Matching pursuits sinusoidal speech coding, IEEE Trans. Speech Audio Proc. (September 2003)
  • J.H. Friedman et al., Projection pursuit regression, J. Am. Stat. Assoc. (1981)
There are more references available in the full text version of this article.
