Abstract
In this paper, we examine the performance of five machine learning classifiers in automatically detecting Brazil’s prominent syllables. The classifiers were tested with seven feature sets built from three features (pitch, intensity, and duration), taken one at a time, two at a time, and all three together. Prominent syllables are the foundation of Brazil’s prosodic intonation model. We found that using pitch, intensity, and duration together as features produces the best results. Our findings also revealed that, in terms of accuracy, F-measure, and Cohen’s kappa coefficient, bagging an ensemble of decision tree learners performed best (accuracy = 95.9 ± 0.2 %; F-measure = 93.7 ± 0.4; κ = 0.907 ± 0.005). The performance of our current model is significantly better than that of any other existing automatic detection software and that of expert human transcribers of prosody.
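As an illustration only, the following minimal sketch shows how three features yield seven non-empty feature sets, and how accuracy, F-measure, and Cohen’s kappa can be computed from a two-class confusion matrix using their standard definitions. The function names and the confusion counts in the usage note are hypothetical and are not taken from the paper; the sketch is not the authors’ implementation and does not reproduce their results.

```python
from itertools import combinations

# The three acoustic features named in the abstract.
FEATURES = ("pitch", "intensity", "duration")


def feature_subsets(features=FEATURES):
    """Return every non-empty subset of the features, smallest subsets first."""
    return [
        combo
        for size in range(1, len(features) + 1)
        for combo in combinations(features, size)
    ]


def binary_metrics(tp, fp, fn, tn):
    """Return (accuracy, F-measure, Cohen's kappa) for a 2x2 confusion matrix.

    tp/fp/fn/tn are counts of true positives, false positives,
    false negatives, and true negatives. This sketch assumes the
    denominators are non-zero and does not guard against empty classes.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)

    # Chance agreement: product of the marginal proportions for each class,
    # summed over the two classes (prominent / non-prominent).
    p_pos = ((tp + fp) / total) * ((tp + fn) / total)
    p_neg = ((fn + tn) / total) * ((fp + tn) / total)
    p_chance = p_pos + p_neg
    kappa = (accuracy - p_chance) / (1 - p_chance)
    return accuracy, f_measure, kappa
```

Calling `feature_subsets()` returns seven tuples, from `("pitch",)` up to all three features. With hypothetical counts such as `binary_metrics(40, 10, 10, 40)`, observed agreement is compared against the agreement expected by chance from the row and column totals, which is what distinguishes kappa from plain accuracy.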
Cite this article
Johnson, D.O., Kang, O. Automatic prominent syllable detection with machine learning classifiers. Int J Speech Technol 18, 583–592 (2015). https://doi.org/10.1007/s10772-015-9299-z
DOI: https://doi.org/10.1007/s10772-015-9299-z