
Analysis and detection of mimicked speech based on prosodic features

International Journal of Speech Technology

Abstract

This paper describes work aimed at understanding how professional mimicry artists imitate the speech characteristics of known persons, and explores the possibility of detecting whether a given speech sample is genuine or an imitation. A systematic approach is followed for collecting three categories of speech data, namely the original speech of the mimicry artists, their speech while mimicking chosen celebrities, and the original speech of those celebrities, in order to analyze the variations in prosodic features. A method is described for the automatic extraction of relevant prosodic features to model speaker characteristics. Speech is automatically segmented into intonation phrases using speech/nonspeech classification, and further segmentation is done using valleys in the energy contour. Intonation, duration and energy features are extracted for each of these segments, and the intonation curve is approximated using Legendre polynomials. Other useful prosodic features include average jitter, average shimmer, total duration, voiced duration and change in energy. The prosodic features extracted from the original speech of the celebrities and the mimicry artists are used to create speaker models with a Support Vector Machine (SVM), and detection of a given speech sample as genuine or impostor is attempted using a speaker verification framework based on these SVM models.
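For readers who want a concrete picture of the kind of processing the abstract describes, the following is a minimal sketch, not the authors' implementation: it approximates a segment's F0 contour with Legendre polynomial coefficients and scores test segments against an SVM speaker model. The helper names, the simplified jitter/shimmer definitions, the frame shift, and the SVM settings are all illustrative assumptions, using NumPy and scikit-learn.

```python
# Illustrative sketch only (not the paper's code): per-segment prosodic
# features and an SVM-based speaker verification score, under assumed
# frame-level F0 and energy inputs.
import numpy as np
from numpy.polynomial import legendre
from sklearn.svm import SVC


def intonation_features(f0_segment, order=3):
    """Approximate a segment's F0 contour with Legendre polynomials and
    return the coefficients as intonation features."""
    voiced = f0_segment > 0                      # unvoiced frames carry F0 == 0
    if voiced.sum() < order + 1:                 # too few voiced frames to fit
        return np.zeros(order + 1)
    x = np.linspace(-1.0, 1.0, voiced.sum())     # map the segment onto [-1, 1]
    return legendre.legfit(x, f0_segment[voiced], order)


def segment_features(f0_segment, energy_segment, frame_shift=0.01):
    """Combine intonation, duration and energy descriptors for one segment."""
    f0_voiced = f0_segment[f0_segment > 0]
    jitter = (np.mean(np.abs(np.diff(f0_voiced))) / np.mean(f0_voiced)
              if f0_voiced.size > 1 else 0.0)    # crude F0-variation proxy
    shimmer = (np.mean(np.abs(np.diff(energy_segment))) /
               np.mean(energy_segment))          # crude amplitude-variation proxy
    total_dur = f0_segment.size * frame_shift    # segment duration in seconds
    voiced_dur = f0_voiced.size * frame_shift    # voiced portion in seconds
    delta_energy = float(energy_segment[-1] - energy_segment[0])
    return np.concatenate([
        intonation_features(f0_segment),
        [jitter, shimmer, total_dur, voiced_dur, delta_energy],
    ])


def train_speaker_model(target_feats, background_feats):
    """Train a binary SVM: the claimed speaker's segments vs. background segments."""
    X = np.vstack([target_feats, background_feats])
    y = np.concatenate([np.ones(len(target_feats)),
                        np.zeros(len(background_feats))])
    return SVC(kernel="rbf", gamma="scale").fit(X, y)


def verify(model, test_feats, threshold=0.0):
    """Average the SVM decision scores over the test segments and accept the
    claim if the mean score exceeds a (hypothetical) threshold."""
    score = float(model.decision_function(test_feats).mean())
    return score, score > threshold
```

In a verification setting, the averaged decision score of a test utterance's segments would be compared against a threshold tuned on development data; the paper's actual feature definitions, segmentation and scoring are given in the full text.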



Acknowledgement

The authors would like to thank the Kerala State Council for Science, Technology and Environment, India, for providing financial support to carry out the study described in this paper.

Author information

Corresponding author

Correspondence to Leena Mary.


Cite this article

Mary, L., Anish Babu, K.K. & Joseph, A. Analysis and detection of mimicked speech based on prosodic features. Int J Speech Technol 15, 407–417 (2012). https://doi.org/10.1007/s10772-012-9163-3
