Abstract
The significant growth in mobile devices and services is fuelling an increasing demand for voice-activated applications. In this context, it is important to capture individual speaker characteristics in addition to the salient information in the speech signal. Efficient speech coders are therefore attractive if they achieve two goals: a compact speech representation that maintains intelligibility and quality, and preservation of speaker-specific characteristics. This paper proposes a wideband, scalable bit rate, mixed excitation linear prediction-enhanced speech coder that represents the excitation efficiently using glottal instants and applies linear predictive coding on the mel scale. The instantaneous pitch, or epoch, is included in the excitation to obtain an accurate estimate of the glottal instants, a vital parameter in speaker recognition. By optimizing the bit requirement using speech category-based coding, the proposed wideband coder operates at bit rates ranging from 3.3 to 5.1 kbps, with an average bit rate of 3.6 kbps. At 3.6 kbps, the proposed coder provides perceptual quality, as measured by mean opinion score and perceptual evaluation of speech quality, similar to that of code excited linear prediction operating at 6.4 kbps. In a speaker recognition evaluation, the proposed coder yields an equal error rate of 12.5%, which is very promising.
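For readers unfamiliar with the metric, the equal error rate is the operating point at which the false acceptance rate and the false rejection rate of a speaker verification system coincide. The following is a minimal sketch of how it can be estimated from trial scores; the function name `equal_error_rate` and the toy scores are illustrative assumptions and are not taken from the paper or its experiments.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Estimate (eer, threshold) from similarity scores.

    Assumes higher scores mean "more likely the same speaker".
    A trial is accepted when its score is >= threshold.
    """
    # Candidate thresholds: every distinct observed score.
    thresholds = sorted(set(genuine_scores) | set(impostor_scores))
    best = None
    for t in thresholds:
        # False rejection rate: genuine trials scoring below the threshold.
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        # False acceptance rate: impostor trials scoring at or above it.
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Hypothetical toy scores, for illustration only.
genuine = [0.9, 0.8, 0.7, 0.3]
impostor = [0.6, 0.2, 0.1, 0.05]
eer, threshold = equal_error_rate(genuine, impostor)
print(f"EER = {eer:.2%} at threshold {threshold}")
```

With a finite set of trials the two error rates rarely match exactly, so this sketch reports their mean at the closest threshold; evaluation toolkits often interpolate instead.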
Acknowledgements
The authors would like to thank the Department of Science and Technology, Government of India, for supporting this work under FIST scheme No. SR/FST/ET-I/2017/68.
About this article
Cite this article
Sankar, M.S.A., Sathidevi, P.S. A Wideband Scalable Bit Rate Mixed Excitation Linear Prediction-Enhanced Speech Coder by Preserving Speaker-Specific Features. Circuits Syst Signal Process 42, 3437–3463 (2023). https://doi.org/10.1007/s00034-022-02277-z
DOI: https://doi.org/10.1007/s00034-022-02277-z