Abstract
Fricatives are characterized by two principal acoustic properties: a concentration of spectral energy at high frequencies and a noise-like nature. Spectral-domain approaches for detecting fricatives employ a time–frequency representation to compute acoustic cues such as the band energy ratio, spectral centroid, and dominant resonant frequency, so their detection accuracy depends on the quality of the chosen time–frequency representation. This work explores an approach that requires no time–frequency representation for detecting fricatives from speech. A time-domain operation is proposed that implicitly emphasizes the high-frequency spectral characteristics of fricatives: the spectrum of the speech signal is scaled by a weighting function \(k^2\), where \(k\) is the discrete frequency. This spectral weighting can be approximated as a cascaded temporal difference operation on the speech signal, and the emphasized regions of the spectrally weighted signal are quantified to detect fricative regions. In contrast to spectral-domain approaches, the predictability-measure-based approaches in the literature rely on capturing the noise-like nature of fricatives. Since the proposed approach and the predictability-measure-based approaches exploit two complementary properties, a combination of the two is also put forth in this work. The proposed approach outperforms state-of-the-art fricative detectors. To study the significance of the proposed evidence, an early fusion of the proposed evidence with feature-space maximum log-likelihood transform features is explored for developing speech recognition systems.
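The intuition behind the time-domain approximation can be illustrated with a short sketch (not the paper's implementation; signal, sampling rate, and band edges below are illustrative assumptions). A single first difference y[n] = x[n] − x[n−1] has magnitude response |1 − e^{−jω}| = 2|sin(ω/2)| ≈ ω at low frequencies, so cascading two differences weights the spectrum approximately by ω², analogous to the \(k^2\) scaling:

```python
import numpy as np

fs = 16000                       # assumed sampling rate (Hz)
n = np.arange(1024)
# toy signal: strong low-frequency tone plus a weak high-frequency,
# "frication-like" component
x = np.sin(2 * np.pi * 200 * n / fs) + 0.1 * np.sin(2 * np.pi * 6000 * n / fs)

# cascaded temporal difference: two first differences in sequence,
# approximating the k^2 spectral weighting in the time domain
y = np.diff(x, n=2)

X = np.abs(np.fft.rfft(x, 1024))
Y = np.abs(np.fft.rfft(y, 1024))

# high-band vs. low-band energy ratio before and after weighting
# (illustrative band edges: <1 kHz vs. 5-8 kHz)
low, high = slice(1, 64), slice(320, 512)
print(X[high].sum() / X[low].sum())   # low band dominates the raw signal
print(Y[high].sum() / Y[low].sum())   # high band emphasized after weighting
```

The second print is larger by orders of magnitude: the difference cascade attenuates the 200 Hz tone by roughly 4 sin²(π·200/16000) ≈ 0.006 while amplifying the 6 kHz component by roughly 3.4, which is the emphasis of high-frequency (fricative-like) regions the abstract describes.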
Data Availability
The data that support the findings of this study are available from the corresponding author, Hari Krishna Vydana upon reasonable request.
Acknowledgements
The authors would like to thank MeitY (Ministry of Electronics and Information Technology) for supporting the research under the Visvesvaraya PhD fellowship scheme.
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
About this article
Cite this article
Vydana, H.K., Vuppala, A.K. Detection of Fricative Landmarks Using Spectral Weighting: A Temporal Approach. Circuits Syst Signal Process 40, 2376–2399 (2021). https://doi.org/10.1007/s00034-020-01576-7