Automatic prominent syllable detection with machine learning classifiers

International Journal of Speech Technology

Abstract

In this paper, we examine the performance of five machine learning classifiers in automatically detecting Brazil’s prominent syllables, using seven feature sets built from three features (pitch, intensity, and duration) taken one at a time, two at a time, and all three together. Prominent syllables are the foundation of Brazil’s prosodic intonation model. We found that using pitch, intensity, and duration together as features produces the best results. Our findings also revealed that, in terms of accuracy, F-measure, and Cohen’s kappa coefficient, bagging an ensemble of decision tree learners performed best (accuracy = 95.9 ± 0.2 %; F-measure = 93.7 ± 0.4; κ = 0.907 ± 0.005). The performance of our current model proves to be significantly better than that of any other existing automatic detection software or of human expert transcribers of prosody.
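To make the experimental design concrete, the following is a minimal Python sketch of two pieces the abstract mentions: how three features yield seven feature sets, and how accuracy, F-measure, and Cohen’s kappa can be computed for a binary prominent/non-prominent labelling. It is not the authors’ implementation; the function names and the 0/1 label encoding are assumptions made for illustration, and the classifiers themselves are not shown.

```python
from itertools import combinations

# The three acoustic features named in the abstract.
FEATURES = ("pitch", "intensity", "duration")


def feature_sets(features=FEATURES):
    """Return every non-empty subset of the features, smallest first."""
    return [subset
            for size in range(1, len(features) + 1)
            for subset in combinations(features, size)]


def binary_metrics(y_true, y_pred):
    """Accuracy, F-measure and Cohen's kappa for 0/1 labels.

    Assumed encoding (hypothetical): 1 = prominent, 0 = not prominent.
    All three values are returned as fractions, not percentages.
    """
    pairs = list(zip(y_true, y_pred))
    n = len(pairs)
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / n
    f_measure = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0

    # Agreement expected by chance, from the marginal label frequencies.
    p_chance = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / (n * n)
    kappa = (accuracy - p_chance) / (1 - p_chance) if p_chance < 1 else 0.0
    return accuracy, f_measure, kappa


if __name__ == "__main__":
    for subset in feature_sets():
        print(subset)
    # Toy labels, invented purely to exercise the function.
    print(binary_metrics([1, 1, 0, 0, 1, 0, 0, 0],
                         [1, 0, 0, 0, 1, 0, 1, 0]))
```

Three features taken one, two, and three at a time give 3 + 3 + 1 = 7 sets, which matches the count in the abstract. Kappa corrects raw accuracy for the agreement two labellers would reach by chance, which matters when prominent syllables are the minority class.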

References

  • Ananthakrishnan, S., & Narayanan, S. S. (2008). Automatic prosodic event detection using acoustic, lexical, and syntactic evidence. IEEE Transactions on Audio, Speech, and Language Processing, 16(1), 216–228.

  • Avanzi, M., Lacheret-Dujour, A., & Victorri, B. (2010). A corpus-based learning method for prominence detection in spontaneous speech. In Proceedings of prosodic prominence, speech prosody 2010 satellite workshop, Chicago, 10 May.

  • Beckman, M., & Elam, G. (1997). Guidelines for ToBI labelling. http://www.ling.ohio-state.edu/research/phonetics/E_ToBI.

  • Bocklet, T., & Shriberg, E. (2009, April). Speaker recognition using syllable-based constraints for cepstral frame selection. In IEEE international conference on acoustics, speech and signal processing, 2009 (ICASSP 2009) (pp. 4525–4528). IEEE.

  • Boersma, P., & Weenink, D. (2014). Praat: Doing phonetics by computer (version 5.3.83). [Computer program]. Retrieved August 19, 2014.

  • Brazil, D. (1997). The communicative value of intonation in English. Cambridge: Cambridge University Press.

  • Breen, M., Dilley, L. C., Kraemer, J., & Gibson, E. (2012). Inter-transcriber reliability for two systems of prosodic annotation: ToBI (Tones and Break Indices) and RaP (Rhythm and Pitch).

  • Breiman, L. (1994). Bagging predictors. Technical Report 421. Department of Statistics, University of California at Berkeley.

  • Breiman, L. (1996). Bias, variance, and arcing classifiers. Technical Report 460. Department of Statistics, University of California at Berkeley.

  • Cauldwell, R. (2012). Rias van den Doel, How friendly are the natives? An evaluation of native-speaker judgements of foreign-accented British and American English. Utrecht: Netherlands Graduate School of Linguistics (LOT), 2006. pp. xii + 341. ISBN-10: 90-78328-09-6, ISBN-13: 978-90-78328-09-4. Journal of the International Phonetic Association, 42(2), 213–215.

  • Christodoulides, G., & Avanzi, M. (2014). An evaluation of machine learning methods for prominence detection in French. In Fifteenth annual conference of the International Speech Communication Association.

  • Chun, D. M. (2002). Discourse intonation in L2: From theory and research to practice. Amsterdam: John Benjamins.

  • Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46.

  • Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297.

  • Cutugno, F., Leone, E., Ludusan, B., & Origlia, A. (2012). Investigating syllabic prominence with conditional random fields and latent-dynamic conditional random fields. In INTERSPEECH.

  • Dilley, L. C. (2005). The phonetics and phonology of tonal systems. Doctoral dissertation, Massachusetts Institute of Technology.

  • Dilley, L. C., & Brown, M. (2005). The RaP (Rhythm and Pitch) labeling system. Unpublished manuscript.

  • Escudero-Mancebo, D., González-Ferreras, C., Vivaracho-Pascual, C., & Cardeñoso-Payo, V. (2014). A fuzzy classifier to deal with similarity between labels on automatic prosodic labeling. Computer Speech & Language, 28(1), 326–341.

  • Fine, J., Bartolucci, G., Ginsberg, G., & Szatmari, P. (1991). The use of intonation to communicate in pervasive developmental disorders. Journal of Child Psychology and Psychiatry, 32(5), 771–782.

  • Frith, U., & Happé, F. (1994). Language and communication in autistic disorders. Philosophical Transactions of the Royal Society B: Biological Sciences, 346(1315), 97–104.

  • Garofolo, J. S., Lamel, L. F., Fisher, W. M., Fiscus, J. G., & Pallett, D. S. (1993). DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1. NASA STI/Recon Technical Report N, 93, 27403.

  • González-Ferreras, C., Escudero-Mancebo, D., Vivaracho-Pascual, C., & Cardeñoso-Payo, V. (2012). Improving automatic classification of prosodic events by pairwise coupling. IEEE Transactions on Audio, Speech, and Language Processing, 20(7), 2045–2058.

  • Hämäläinen, A., Boves, L., de Veth, J., & Bosch, L. T. (2007). On the utility of syllable-based acoustic models for pronunciation variation modelling. EURASIP Journal on Audio, Speech, and Music Processing, 2007(2), 3.

  • Happel, B. L., & Murre, J. M. (1994). Design and evolution of modular neural network architectures. Neural Networks, 7(6), 985–1004.

  • Jeon, J. H., & Liu, Y. (2009). Automatic prosodic events detection using syllable-based acoustic and syntactic features. In IEEE international conference on acoustics, speech and signal processing, 2009 (ICASSP 2009) (pp. 4565–4568). IEEE.

  • Juslin, P. N., & Laukka, P. (2004). Expression, perception, and induction of musical emotions: A review and a questionnaire study of everyday listening. Journal of New Music Research, 33(3), 217–238.

  • Kang, O. (2010). Relative salience of suprasegmental features on judgments of L2 comprehensibility and accentedness. System, 38(2), 301–315.

  • Kang, O., & Pickering, L. (2013). Using acoustic and temporal analysis for assessing speaking. In A. Kunnan (Ed.), Companion to language assessment (pp. 1047–1062). Hoboken: Wiley-Blackwell.

  • Kang, O., Rubin, D., & Pickering, L. (2010). Suprasegmental measures of accentedness and judgments of language learner proficiency in oral English. The Modern Language Journal, 94(4), 554–566.

  • Kang, O., & Wang, L. (2014). Impact of different task types on candidates’ speaking performances and interactive features that distinguish between CEFR levels. ISSN 1756-509X, 40.

  • KayPENTAX. (2008). Multi-speech and CSL software. Lincoln Park, NJ: KayPENTAX.

  • Kochanski, G., Grabe, E., Coleman, J., & Rosner, B. (2005). Loudness predicts prominence: Fundamental frequency lends little. The Journal of the Acoustical Society of America, 118(2), 1038–1054.

  • Litman, D. J., Hirschberg, J. B., & Swerts, M. (2000). Predicting automatic speech recognition performance using prosodic cues. In Proceedings of the 1st North American chapter of the Association for Computational Linguistics conference (pp. 218–225). Association for Computational Linguistics.

  • Ludusan, B., & Dupoux, E. (2014). Towards low-resource prosodic boundary detection.

  • Ludusan, B., Origlia, A., & Cutugno, F. (2011). On the use of the rhythmogram for automatic syllabic prominence detection. In INTERSPEECH (pp. 2424–2427).

  • Mahrt, T., Cole, J., Fleck, M. M., & Hasegawa-Johnson, M. (2012a). F0 and the perception of prominence. In INTERSPEECH.

  • Mahrt, T., Cole, J., Fleck, M., & Hasegawa-Johnson, M. (2012b). Modeling speaker variation in cues to prominence using the Bayesian information criterion. In Speech prosody 2012.

  • Mahrt, T., Huang, J. T., Mo, Y., Fleck, M. M., Hasegawa-Johnson, M., & Cole, J. (2011). Optimal models of prosodic prominence using the Bayesian information criterion. In INTERSPEECH (pp. 2037–2040).

  • MathWorks, Inc. (2013). MATLAB release 2013a. [Computer program]. Retrieved February 15, 2013.

  • McCann, J., & Peppé, S. (2003). Prosody in autism spectrum disorders: A critical review. International Journal of Language & Communication Disorders, 38(4), 325–350.

  • Nadel, J., Simon, M., Canet, P., Soussignan, R., Blancard, P., Canamero, L., & Gaussier, P. (2006). Human responses to an expressive robot. In Proceedings of the sixth international workshop on epigenetic robotics. Lund University.

  • Ni, C. J., Liu, W., & Xu, B. (2011). Automatic prosodic events detection by using syllable-based acoustic, lexical and syntactic features. In INTERSPEECH (pp. 2017–2020).

  • Ni, C., Liu, W., & Xu, B. (2012). From English pitch accent detection to Mandarin stress detection, where is the difference? Computer Speech & Language, 26(3), 127–148.

  • Obin, N., Rodet, X., & Lacheret-Dujour, A. (2009). A syllable-based prominence detection model based on discriminant analysis and context-dependency. In SPECOM (pp. 97–100).

  • Opitz, D., & Maclin, R. (1999). Popular ensemble methods: An empirical study. Journal of Artificial Intelligence Research, 11, 169–198.

  • Ostendorf, M. (1999, December). Moving beyond the ‘beads-on-a-string’ model of speech. In Proceedings of IEEE ASRU workshop (pp. 79–84). Piscataway, NJ: IEEE.

  • Ostendorf, M., Price, P. J., & Shattuck-Hufnagel, S. (1995). The Boston University radio news corpus. Linguistic Data Consortium, 1–19.

  • Paul, R., Augustyn, A., Klin, A., & Volkmar, F. R. (2005). Perception and production of prosody by speakers with autism spectrum disorders. Journal of Autism and Developmental Disorders, 35(2), 205–220.

  • Pickering, L. (1999). An analysis of prosodic systems in the classroom discourse of native speaker and nonnative speaker teaching assistants. Unpublished doctoral dissertation, University of Florida, Gainesville.

  • Pickering, L. (2009). Intonation as a pragmatic resource in ELF interaction. Intercultural Pragmatics, 6(2), 235–255.

  • Pierrehumbert, J. B. (1980). The phonology and phonetics of English intonation. Doctoral dissertation, Massachusetts Institute of Technology.

  • Pierrehumbert, J., & Beckman, M. (1988). Japanese tone structure. Linguistic Inquiry Monographs, 15, 1–282.

  • Price, P., Ostendorf, M., Shattuck-Hufnagel, S., & Veilleux, N. (1988). A methodology for analyzing prosody. The Journal of the Acoustical Society of America, 84(S1), S99.

  • Quinlan, J. R. (1999). Simplifying decision trees. International Journal of Human-Computer Studies, 51(2), 497–510.

  • Rosenberg, A., & Hirschberg, J. (2006). On the correlation between energy and pitch accent in read English speech. In INTERSPEECH.

  • Rosenberg, A., & Hirschberg, J. (2009). Detecting pitch accents at the word, syllable and vowel level. In Proceedings of human language technologies: The 2009 annual conference of the North American chapter of the Association for Computational Linguistics, Companion Volume: Short Papers (pp. 81–84). Association for Computational Linguistics.

  • Rosenberg, A., & Hirschberg, J. B. (2010). Production of English prominence by native Mandarin Chinese speakers.

  • Shriberg, E., Ferrer, L., Kajarekar, S., Venkataraman, A., & Stolcke, A. (2005). Modeling prosodic feature sequences for speaker recognition. Speech Communication, 46(3), 455–472.

  • Shriberg, L. D., Paul, R., McSweeny, J. L., Klin, A., Cohen, D. J., & Volkmar, F. R. (2001). Speech and prosody characteristics of adolescents and adults with high-functioning autism and Asperger syndrome. Journal of Speech, Language, and Hearing Research, 44(5), 1097–1115.

  • Silipo, R., & Greenberg, S. (1999). Automatic transcription of prosodic stress for spontaneous English discourse. In Proceedings of the XIVth international congress of phonetic sciences (ICPhS) (Vol. 3, p. 2351).

  • Silipo, R., & Greenberg, S. (2000). Prosodic stress revisited: Reassessing the role of fundamental frequency. In Proceedings of NIST speech transcription workshop.

  • Sridhar, V. R., Bangalore, S., & Narayanan, S. S. (2008). Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE Transactions on Audio, Speech, and Language Processing, 16(4), 797–811.

  • Streefkerk, B. M., Pols, L. C., & Ten Bosch, L. F. (1997). Prominence in read aloud sentences, as marked by listeners and classified automatically. In Proceedings of the Institute of Phonetic Sciences, University of Amsterdam (Vol. 21, pp. 101–116).

  • Syrdal, A. K., & McGory, J. T. (2000). Inter-transcriber reliability of ToBI prosodic labeling. In INTERSPEECH (pp. 235–238).

  • Tamburini, F. (2006). Reliable prominence identification in English spontaneous speech. In Proceedings of speech prosody 2006.

  • Terken, J. (1991). Fundamental frequency and perceived prominence of accented syllables. The Journal of the Acoustical Society of America, 89(4), 1768–1776.

  • Wightman, C., Price, P., Pierrehumbert, J., & Hirschberg, J. (1992). ToBI: A standard for labeling English prosody. In Proceedings of the 1992 international conference on spoken language processing, ICSLP (pp. 12–16).

  • Xu, Y. (2012). Speech prosody: A methodological review. Journal of Speech Sciences, 1(1), 85–115.

  • Yoon, T., Chavarria, S., Cole, J., & Hasegawa-Johnson, M. (2004). Intertranscriber reliability of prosodic labeling on telephone conversation using ToBI. In INTERSPEECH.

Author information

Corresponding author

Correspondence to Okim Kang.

About this article

Cite this article

Johnson, D.O., Kang, O. Automatic prominent syllable detection with machine learning classifiers. Int J Speech Technol 18, 583–592 (2015). https://doi.org/10.1007/s10772-015-9299-z

  • DOI: https://doi.org/10.1007/s10772-015-9299-z
