Abstract
Most speech processing models begin with feature extraction and then pass the feature vector to the primary processing model. The performance of such a solution depends mainly on the quality of the feature representation and on the model architecture. In the deep neural network era, much research has focused on designing robust network architectures while overlooking the important role of feature representation. This work explores a new approach to designing a speech signal representation in the time-frequency domain via the Linear Chirplet Transform (LCT). The proposed method yields a feature vector that is sensitive to frequency changes within human speech and rests on a solid mathematical foundation. This is a promising direction for many applications, such as speaker gender recognition or emotion recognition. The experimental results show that the LCT-based feature improves on MFCC and Fourier Transform features. In particular, the proposed method achieves accuracies of \(95.56\%\) and \(97.28\%\) for speaker gender recognition in English and Vietnamese, respectively. This result also implies that the LCT-based feature is independent of language, so it can be used in a wide range of applications.
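To make the idea of a chirp-sensitive time-frequency feature concrete, the following is a minimal sketch of a generic Linear Chirplet Transform, computed as a short-time Fourier transform whose windowed frames are first multiplied by a quadratic-phase (chirp) demodulation term. It is an illustration under stated assumptions, not the authors' implementation: the function name `linear_chirplet_transform`, the Hann window, the frame length, the hop size, and the synthetic test chirp are all hypothetical choices that do not come from the abstract.

```python
import numpy as np


def linear_chirplet_transform(x, fs, chirp_rate, frame_len=512, hop=128):
    """Return |LCT| with shape (n_frames, frame_len // 2 + 1).

    chirp_rate is in Hz per second; chirp_rate = 0 reduces to an STFT magnitude.
    All parameter defaults are illustrative assumptions, not values from the paper.
    """
    x = np.asarray(x, dtype=float)
    if len(x) < frame_len:
        raise ValueError("signal is shorter than one frame")
    window = np.hanning(frame_len)
    # Time offsets (seconds) measured from the centre of each frame.
    tau = (np.arange(frame_len) - frame_len / 2) / fs
    # Chirp demodulation: cancels a linear frequency sweep of slope chirp_rate.
    demod = np.exp(-1j * np.pi * chirp_rate * tau ** 2)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])
    # The demodulated frames are complex, so use the full FFT and keep
    # only the non-negative frequency bins.
    spectrum = np.fft.fft(frames * window * demod, axis=1)
    return np.abs(spectrum[:, :frame_len // 2 + 1])


# Toy input (assumption): a synthetic chirp sweeping 200 Hz -> 3200 Hz in 0.5 s.
fs = 8000
t = np.arange(int(0.5 * fs)) / fs
sweep_rate = 6000.0  # Hz per second
chirp = np.cos(2 * np.pi * (200.0 * t + 0.5 * sweep_rate * t ** 2))

stft_mag = linear_chirplet_transform(chirp, fs, chirp_rate=0.0)
lct_mag = linear_chirplet_transform(chirp, fs, chirp_rate=sweep_rate)
```

When the chirp rate matches the sweep in the signal, each frame's energy concentrates into fewer frequency bins than in the plain Fourier case, which is the sense in which such a representation responds to frequency change. A feature vector could, for instance, stack the magnitudes obtained for several candidate chirp rates, although how the paper builds its features is not stated in the abstract.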
Acknowledgement
Hao D. Do was funded by Vingroup JSC and supported by the PhD Scholarship Programme of Vingroup Innovation Foundation (VINIF), Institute of Big Data, code VINIF.2021.TS.120. The authors would like to thank OLLI Technology JSC for their support.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Do, H.D., Chau, D.T., Tran, S.T. (2022). Speech Representation Using Linear Chirplet Transform and Its Application in Speaker-Related Recognition. In: Nguyen, N.T., Manolopoulos, Y., Chbeir, R., Kozierkiewicz, A., Trawiński, B. (eds) Computational Collective Intelligence. ICCCI 2022. Lecture Notes in Computer Science, vol 13501. Springer, Cham. https://doi.org/10.1007/978-3-031-16014-1_56
DOI: https://doi.org/10.1007/978-3-031-16014-1_56
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-16013-4
Online ISBN: 978-3-031-16014-1
eBook Packages: Computer Science, Computer Science (R0)