Abstract
This paper describes a study of the art of mimicry as practiced by professional mimicry artists when imitating the speech characteristics of known persons, and explores the possibility of detecting whether a given speech sample is genuine or an imitation. A systematic approach was followed for collecting three categories of speech data, namely the original speech of the mimicry artists, their speech while mimicking chosen celebrities, and the original speech of those celebrities, in order to analyze variations in prosodic features. A method is described for automatically extracting the relevant prosodic features used to model speaker characteristics. Speech is first segmented into intonation phrases using speech/nonspeech classification, and each phrase is further segmented at valleys in the energy contour. Intonation, duration, and energy features are extracted for each of these segments. The intonation curve is approximated using Legendre polynomials. Other useful prosodic features include average jitter, average shimmer, total duration, voiced duration, and change in energy. The prosodic features extracted from the original speech of the celebrities and the mimicry artists are used to create speaker models with a Support Vector Machine (SVM), and detection of a given speech sample as genuine or impostor is attempted within an SVM-based speaker verification framework.
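The Legendre-polynomial approximation of the intonation (F0) contour mentioned above can be sketched as follows. This is a minimal illustration assuming NumPy; the function name `intonation_coeffs`, the polynomial order, and the synthetic contour are illustrative choices, not details taken from the paper.

```python
import numpy as np

def intonation_coeffs(f0, order=3):
    """Approximate a segment's F0 (intonation) contour with a
    low-order Legendre polynomial expansion; the coefficients
    serve as a compact shape descriptor of the contour."""
    # Map the segment's time axis onto [-1, 1], the natural
    # domain of the Legendre polynomials.
    x = np.linspace(-1.0, 1.0, len(f0))
    # Least-squares fit; coefficient 0 reflects the mean level,
    # coefficient 1 the overall slope, higher orders the curvature.
    return np.polynomial.legendre.legfit(x, f0, order)

# Hypothetical rising-falling F0 contour in Hz (synthetic data).
f0 = 120 + 30 * np.sin(np.linspace(0, np.pi, 50))
coeffs = intonation_coeffs(f0, order=3)
```

The handful of coefficients returned for each segment can then be concatenated with the duration and energy features to form the prosodic feature vector for SVM modeling.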
Acknowledgement
The authors thank the Kerala State Council for Science, Technology and Environment, India, for financial support for the study described in this paper.
Mary, L., Anish Babu, K.K. & Joseph, A. Analysis and detection of mimicked speech based on prosodic features. Int J Speech Technol 15, 407–417 (2012). https://doi.org/10.1007/s10772-012-9163-3