Abstract
The performance of machine learning classifiers in automatically scoring the English proficiency of unconstrained speech has been explored. Suprasegmental measures were computed by software that identifies the basic elements of Brazil’s model in human discourse. This paper explores machine learning training with multiple corpora to improve two of those algorithms: prominent syllable detection and tone choice classification. The results show that machine learning training with the Boston University Radio News Corpus can improve automatic English proficiency scoring of unconstrained speech from a Pearson’s correlation of 0.677 to 0.718. This correlation is higher than that of any other existing computer program for automatically scoring the proficiency of unconstrained speech, and it approaches the inter-rater reliability of human raters.
References
Attali, Y., & Burstein, J. (2006). Automated essay scoring with e-rater® V. 2. The Journal of Technology, Learning and Assessment, 4(3), 3–30.
Bernstein, J. (1999). PhonePass testing: Structure and construct. Menlo Park: Ordinate Corporation.
Bernstein, J., Van Moere, A., & Cheng, J. (2010). Validating automated speaking tests. Language Testing, 27(3), 355–377.
Boersma, P., & Weenink, D. (2014). Praat: doing phonetics by computer (Version 5.3.83), [Computer program]. Retrieved August 19, 2014.
Brazil, D. (1997). The communicative value of intonation in English. Cambridge: Cambridge University Press.
Burstein, J., Kukich, K., Braden-Harder, L., Chodorow, M., Hua, S., Kaplan, B., et al. (1998). Computer analysis of essay content for automated score prediction: A prototype automated scoring system for GMAT analytical writing assessment essays. ETS Research Report Series, 1998(1), i-67.
Cambridge English Language Assessment (2015). Retrieved March 29, 2015 from www.cambridgeenglish.org.
Černý, V. (1985). Thermodynamical approach to the traveling salesman problem: An efficient simulation algorithm. Journal of Optimization Theory and Applications, 45(1), 41–51.
Chodorow, M., & Burstein, J. (2004). Beyond essay length: Evaluating e-rater®’s performance on TOEFL® essays. ETS Research Report Series, 2004(1), i-38.
Chun, D. M. (2002). Discourse intonation in L2: From theory and research to practice (Vol. 1). Philadelphia: John Benjamins Publishing.
Evanini, K., & Wang, X. (2013). Automated speech scoring for non-native middle school students with multiple task types. In INTERSPEECH (pp. 2435–2439).
Garofolo, J. S., Lamel, L. F., Fisher, W. M., Fiscus, J. G., & Pallett, D. S. (1993). DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1. NASA STI/Recon Technical Report N, 93, 27403.
Hastie, T., & Tibshirani, R. (1998). Classification by pairwise coupling. The Annals of Statistics, 26(2), 451–471.
Johnson, D. O., & Kang, O. (2015). Automatic prominent syllable detection with machine learning classifiers. International Journal of Speech Technology, 18(4), 583–592.
Johnson, D. O., & Kang, O. (2016). Automatic prosodic tone choice classification with Brazil’s intonation model. International Journal of Speech Technology, 19(1), 95–109.
Kahn, D. (1976). Syllable-based generalizations in English phonology (Vol. 156). Bloomington: Indiana University Linguistics Club.
Kang, O. (2010). Relative salience of suprasegmental features on judgments of L2 comprehensibility and accentedness. System, 38(2), 301–315.
Kang, O., & Johnson, D. O. (2015). Comparison of inter-rater reliability of human and computer prosodic annotation using Brazil’s prosody model. English Linguistics Research, 4(4), 58.
Kang, O., & Johnson, D. O. (2016). Systems and Methods for Automated Evaluation of Human Speech. U.S. Patent Application No. 15/054,128. Washington, DC: U.S. Patent and Trademark Office.
Kang, O., Rubin, D., & Pickering, L. (2010). Suprasegmental measures of accentedness and judgments of language learner proficiency in oral English. The Modern Language Journal, 94(4), 554–566.
Kang, O., & Wang, L. (2014). Impact of different task types on candidates’ speaking performances and interactive features that distinguish between CEFR levels. ISSN 1756-509X, 40.
KayPENTAX. (2008). Multi-speech and CSL software. Lincoln Park: KayPENTAX.
Kirkpatrick, S., Gelatt, C. D., & Vecchi, M. P. (1983). Optimization by simulated annealing. Science, 220(4598), 671–680.
Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2), 211.
Leacock, C. (2004). Scoring free-responses automatically: A case study of a large-scale assessment. Examens, 1(3).
Leacock, C., & Chodorow, M. (2003). C-rater: Automated scoring of short-answer questions. Computers and the Humanities, 37(4), 389–405.
Longman, P. (2013). Official guide to Pearson Test of English Academic.
MathWorks, Inc. (2013). MATLAB Release 2013a. [Computer program]. Retrieved February 15, 2013.
Ostendorf, M., Price, P. J., & Shattuck-Hufnagel, S. (1995). The Boston University radio news corpus. Linguistic Data Consortium, pp. 1–19.
Pearson Education, Inc. (2015). Versant English Test. Retrieved from https://www.versanttest.com/products/english.jsp.
Pickering, L. (1999). An analysis of prosodic systems in the classroom discourse of native speaker and nonnative speaker teaching assistants (Doctoral dissertation, University of Florida).
Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlíček, P., Qian, Y., Schwarz, P., Silovsky, J., Stemmer, G., & Veselý, K. (2011). The Kaldi speech recognition toolkit. In IEEE 2011 workshop on automatic speech recognition and understanding (ASRU).
Rudner, L. M., Garcia, V., & Welch, C. (2006). An evaluation of IntelliMetric™ essay scoring system. The Journal of Technology, Learning and Assessment, 4(4), 1–22.
Zechner, K., Higgins, D., Xi, X., & Williamson, D. M. (2009). Automatic scoring of non-native spontaneous speech in tests of spoken English. Speech Communication, 51(10), 883–895.
Appendix
The 35 suprasegmental measures are computed as follows:
Definitions
1. # (i.e., number of) syllables includes coughs, laughs, etc., and filled pauses.
2. # runs = # silent pauses + 1.
3. Duration of utterance includes silent and filled pauses (in seconds).
4. (Prominent) syllable pitch = maximum F0 of the low-pass-filtered Praat pitch contour.
5. Tone choice is calculated only for termination prominent syllables.
6. Relative pitch is calculated only for key and termination prominent syllables.
7. Paratone boundary = a termination followed by a key with higher relative pitch.
8. Lexical item = a prominent syllable that occurs more than once in an utterance.
9. New lexical item = the first occurrence of a lexical item.
10. Given lexical item = any subsequent occurrence of a lexical item.
Calculations
SYPS = # syllables/duration of utterance.
PHTR = (duration of utterance − duration of silent pauses)/duration of utterance.
ARTI = # syllables/(duration of utterance − duration of silent pauses) = SYPS/PHTR.
RNLN = # syllables/# runs.
SPRT = # silent pauses/duration of utterance * 60.
SPLN = duration of silent pauses/# silent pauses.
FPRT = # filled pauses/duration of utterance * 60.
FPLN = duration of filled pauses/# filled pauses.
SPAC = # prominent syllables/# syllables.
PACE = # prominent syllables/# runs = SPAC * RNLN.
PCHR = # tone units (termination prominent syllables) / # runs.
RISL = % of termination prominent syllables with rising tone choice & low relative pitch.
RISM = % of termination prominent syllables with rising tone choice & mid relative pitch.
RISH = % of termination prominent syllables with rising tone choice & high relative pitch.
NEUL = % of termination prominent syllables with neutral tone choice & low relative pitch.
NEUM = % of termination prominent syllables with neutral tone choice & mid relative pitch.
NEUH = % of termination prominent syllables with neutral tone choice & high relative pitch.
FALL = % of termination prominent syllables with falling tone choice & low relative pitch.
FALM = % of termination prominent syllables with falling tone choice & mid relative pitch.
FALH = % of termination prominent syllables with falling tone choice & high relative pitch.
FRSL = % of termination prominent syllables with fall-rise tone choice & low relative pitch.
FRSM = % of termination prominent syllables with fall-rise tone choice & mid relative pitch.
FRSH = % of termination prominent syllables with fall-rise tone choice & high relative pitch.
RFAL = % of termination prominent syllables with rise-fall tone choice & low relative pitch.
RFAM = % of termination prominent syllables with rise-fall tone choice & mid relative pitch.
RFAH = % of termination prominent syllables with rise-fall tone choice & high relative pitch.
PRAN = maximum prominent syllable pitch of utterance – minimum prominent syllable pitch of utterance.
AVNP = average non-prominent syllable pitch.
AVPP = average prominent syllable pitch.
PARA = # paratone boundaries / duration of utterance.
TPTH = average pitch of termination prominent syllables at paratone boundaries.
OPTH = average pitch of key prominent syllables at paratone boundaries.
PPLN = average duration of silent pauses at paratone boundaries (if present).
NEWP = average pitch of new lexical items.
GIVP = average pitch of given lexical items.
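The rate and pause measures above follow directly from the counts and durations given in the definitions. The sketch below is a minimal worked example under those definitions (the function and parameter names are assumptions for illustration, not the authors’ implementation):

```python
# Sketch of the fluency-related measures (SYPS, PHTR, ARTI, RNLN, SPRT,
# SPLN, FPRT, FPLN, SPAC, PACE) from basic counts and durations.
# Parameter names are assumed for this example.

def fluency_measures(n_syllables, n_silent_pauses, silent_pause_dur,
                     n_filled_pauses, filled_pause_dur, utterance_dur,
                     n_prominent):
    """Return a dict of the rate and pause measures defined above."""
    n_runs = n_silent_pauses + 1                      # definition 2
    phonation = utterance_dur - silent_pause_dur      # speaking time in seconds
    m = {}
    m["SYPS"] = n_syllables / utterance_dur           # syllables per second
    m["PHTR"] = phonation / utterance_dur             # phonation-time ratio
    m["ARTI"] = n_syllables / phonation               # articulation rate
    m["RNLN"] = n_syllables / n_runs                  # mean run length
    m["SPRT"] = n_silent_pauses / utterance_dur * 60  # silent pauses per minute
    m["SPLN"] = silent_pause_dur / n_silent_pauses    # mean silent-pause length
    m["FPRT"] = n_filled_pauses / utterance_dur * 60  # filled pauses per minute
    m["FPLN"] = filled_pause_dur / n_filled_pauses    # mean filled-pause length
    m["SPAC"] = n_prominent / n_syllables             # prominence density
    m["PACE"] = n_prominent / n_runs                  # = SPAC * RNLN
    return m

# Example: a 60-second utterance with 120 syllables, 9 silent pauses
# totalling 6 s, 3 filled pauses totalling 1.5 s, and 30 prominent syllables.
measures = fluency_measures(n_syllables=120, n_silent_pauses=9,
                            silent_pause_dur=6.0, n_filled_pauses=3,
                            filled_pause_dur=1.5, utterance_dur=60.0,
                            n_prominent=30)
```

In this example SYPS is 2.0 syllables per second and PHTR is 0.9, i.e., 90 % of the utterance is phonation.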
Cite this article
Johnson, D.O., Kang, O. & Ghanem, R. Improved automatic English proficiency rating of unconstrained speech with multiple corpora. Int J Speech Technol 19, 755–768 (2016). https://doi.org/10.1007/s10772-016-9366-0