A three-stage approach to the automated scoring of spontaneous spoken responses

https://doi.org/10.1016/j.csl.2010.06.001

Abstract

This paper presents a description and evaluation of SpeechRater℠, a system for automated scoring of non-native speakers’ spoken English proficiency, based on tasks which elicit spontaneous monologues on particular topics. This system builds on much previous work in the automated scoring of test responses, but differs in that the highly unpredictable nature of responses to this task type makes accurate scoring considerably more difficult.

SpeechRater uses a three-stage architecture. Responses are first processed by a filtering model to ensure that no exceptional conditions exist which might prevent them from being scored by SpeechRater. Responses not filtered out at this stage are then processed by the scoring model to estimate the proficiency rating which a human might assign to them, on the basis of features related to fluency, pronunciation, vocabulary diversity, and grammar. Finally, an aggregation model combines an examinee’s scores for multiple items to calculate a total score, as well as an interval in which the examinee’s score is predicted to reside with high confidence.

SpeechRater’s current level of accuracy and construct representation have been deemed sufficient for low-stakes practice exercises, and it has been used in a practice exam for the TOEFL since late 2006. In such a practice environment, it offers a number of advantages compared to human raters, including system load management and the facilitation of immediate feedback to students. However, it must be acknowledged that SpeechRater presently fails to measure many important aspects of speaking proficiency (such as intonation and appropriateness of topic development), and its agreement with human ratings of proficiency does not yet approach the level of agreement between two human raters.

Introduction

The last decade has seen a proliferation of new applications of speech technology in the educational domain, in particular in support of English language learner (ELL) populations. Computer-based learning tools for spoken English give learners who may have limited access to native or high-proficiency English speakers an opportunity to practice their English and receive feedback on their performance. Automated systems also have the potential to make educational materials for spoken English practice accessible to a wider range of learners, by reducing their cost (compared to a human tutor) and by making them available on a more flexible schedule.

Speech-enabled dialogue systems allow learners to practice their speaking and listening in an exchange with a virtual interlocutor—e.g., SpeakESL and Auralog’s Tell Me More system. Automated tutoring systems for speaking practice provide feedback to learners on their pronunciation, vocabulary, and grammar—e.g., Carnegie Speech NativeAccent (Eskenazi et al., 2007), Saybot (Chevalier, 2007) and SRI’s EduSpeak (Franco et al., 2000). The contribution of the work described in this paper is to the field of automated scoring of spoken responses: assessing the level of English speaking proficiency demonstrated in some speaking task used in a testing context.

In particular, this paper describes SpeechRater, an automated system for scoring spontaneous spoken responses. SpeechRater uses automatically derived measures of fluency, pronunciation, vocabulary diversity, and grammar to score spoken responses to tasks similar to those found on the Test of English as a Foreign Language (TOEFL®), a test of English proficiency for academic settings, used internationally for college admissions. Only the internet-delivered version of the TOEFL test (TOEFL iBT) includes tasks to assess speaking proficiency; the paper-based version of the test assesses only reading, listening, and writing. SpeechRater is currently used operationally to score spoken responses to the low-stakes TOEFL Practice Online (TPO) test, but not for the high-stakes TOEFL iBT test.

Whereas earlier spoken response scoring systems relied on restricted speaking tasks, such as reading a passage aloud or answering questions for which the range of responses is narrowly circumscribed (Bernstein, 1999, Bernstein et al., 2000, Balogh et al., 2007), SpeechRater addresses the more challenging task of scoring responses which are relatively unstructured, unrestricted, and spontaneous, produced by speakers varying considerably in English proficiency and first language. While much work remains to be done to improve SpeechRater’s agreement with human raters and the coverage of important aspects of speaking proficiency in its feature set, the current capability has advanced sufficiently to allow its use in scoring practice tests taken in preparation for the operational TOEFL test.

Section 2 surveys previous work in automated scoring of test items, especially speaking tasks. Section 3 then provides a brief overview of the SpeechRater system architecture, and the speech recognition system used as the basis for feature computation. After an overview of the spoken response data is provided in Section 4, Section 5 describes the first component of SpeechRater, which filters out non-scoreable responses. Section 6 introduces the scoring model itself, and Section 7 describes the way in which scores are aggregated across tasks, and bounds on prediction error are calculated. Finally, a brief summary of the major findings of the paper is presented in Section 8.

Section snippets

Automated scoring of constructed response items

Constructed response test items are those which require the examinee to generate an answer productively (in contrast to selected response or multiple-choice items). This section will discuss some of the motivation for using constructed-response tasks and scoring them by automated means, and review the previous work in this area.

System architecture

The SpeechRater automated speech scoring system consists of the components outlined in Fig. 1. (The output values depicted in Fig. 1 for each stage of the process are meant solely as illustrative examples.) Responses to each of the six speaking tasks administered in a TPO test are first analyzed by a speech recognizer, the output of which is then processed by a set of feature extraction routines, which derive a set of measures meant to be indicative of speaking proficiency. Based on these features, the filtering and scoring models described below determine whether each response can be scored and, if so, estimate its score; an aggregation model then combines the item scores into a section score.
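
As a concrete illustration of this flow, the short Python sketch below strings the stages together, taking recognizer output as given. Every function, feature, and threshold in it is an invented stand-in; SpeechRater’s actual recognizer, feature set, and statistical models are not reproduced here.

    # Minimal sketch of the Fig. 1 pipeline; all components are illustrative stubs.
    from typing import Dict, List, Optional

    def extract_features(hypothesis: Dict) -> Dict[str, float]:
        # SpeechRater's features relate to fluency, pronunciation, vocabulary
        # diversity, and grammar; two toy proxies are computed here.
        words, dur = hypothesis["words"], hypothesis["duration_sec"]
        return {
            "words_per_sec": len(words) / dur if dur else 0.0,
            "type_token_ratio": len(set(words)) / len(words) if words else 0.0,
        }

    def filtering_model(feats: Dict[str, float]) -> bool:
        # Stage 1: screen out responses that cannot be scored (e.g., silence).
        return feats["words_per_sec"] > 0.2

    def scoring_model(feats: Dict[str, float]) -> float:
        # Stage 2: a toy predictor of the human item rating (a 0-4 scale is assumed).
        return min(4.0 * feats["words_per_sec"] + 2.0 * feats["type_token_ratio"], 4.0)

    def score_section(hypotheses: List[Dict]) -> Optional[float]:
        # Stage 3: aggregate the scoreable item scores into an unrounded raw section score.
        scores = [scoring_model(f) for f in map(extract_features, hypotheses)
                  if filtering_model(f)]
        return sum(scores) if scores else None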

Spoken response data

In developing the candidate models to identify non-scoreable responses (Section 5) and to assign scores to all other responses (Section 6), two sources of spoken response data were used. Responses from the TPO practice test were the main focus of development, as the model was intended for actual use on that test. Responses from the operational TOEFL test were also used for comparison, to determine whether the same features had predictive value in both the practice test condition and the operational test condition.

Filtering model

Capturing spoken responses to test items can be complicated by technical problems of many sorts: equipment malfunctions, excessive levels of ambient noise, and data transmission errors, to name a few. In addition, there is the possibility that certain examinees may fail to respond to particular test items (by remaining silent). Both of these issues are aggravated in a practice test environment, where the equipment and testing environment are less tightly controlled, and examinees’ motivation to respond seriously to every item cannot be taken for granted.
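
A rule-based filter in the spirit of this description might look like the minimal sketch below; the statistics and thresholds are invented for illustration and are not those used by SpeechRater, which develops statistical models for this purpose (Section 5).

    # Hypothetical non-scoreability filter; feature names and thresholds are illustrative.
    from dataclasses import dataclass

    @dataclass
    class ResponseStats:
        speech_duration_sec: float   # time the recognizer attributes to speech
        mean_level_db: float         # average signal level of the recording
        num_recognized_words: int

    def is_scoreable(stats: ResponseStats) -> bool:
        if stats.num_recognized_words < 5:      # (near-)silent or empty response
            return False
        if stats.speech_duration_sec < 3.0:     # too little speech to score
            return False
        if stats.mean_level_db < -45.0:         # likely equipment or transmission problem
            return False
        return True

    print(is_scoreable(ResponseStats(0.8, -50.0, 2)))   # -> False (filtered out)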

Scoring model

The model for estimating the score to be assigned to a spoken response is the core of the SpeechRater application. The usefulness of the system’s feedback depends primarily on this model’s success in providing scores which are similar to those human raters would provide, and which accurately reflect the important aspects of speaking proficiency described in the rubrics for the speaking tasks.

These criteria for success imply two different sets of evaluation criteria for SpeechRater. The first is empirical agreement with the scores assigned by human raters; the second is how fully its feature set covers the aspects of speaking proficiency described in the scoring rubrics.
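
One standard way to realize such a model, sketched below under the assumption of a linear regression learner and a handful of invented features, is to regress human holistic ratings on the automatically extracted measures; this is illustrative only and is not claimed to be SpeechRater’s actual model or feature set.

    # Fit a toy score predictor from human-rated responses (all data invented).
    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Columns: words_per_sec, pause_rate, type_token_ratio, grammar_score
    X_train = np.array([
        [2.1, 0.10, 0.62, -1.8],
        [1.4, 0.25, 0.48, -2.6],
        [2.6, 0.07, 0.70, -1.5],
        [0.9, 0.40, 0.35, -3.1],
    ])
    y_train = np.array([3.5, 2.5, 4.0, 1.5])    # human holistic ratings of the same responses

    model = LinearRegression().fit(X_train, y_train)
    estimate = model.predict(np.array([[1.8, 0.15, 0.55, -2.2]]))[0]
    print(f"predicted rating: {estimate:.2f}")

Evaluating such a model then involves both empirical agreement with held-out human ratings (e.g., correlation or weighted kappa) and a judgment of how well the feature set covers the scoring rubrics.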

Construction of prediction intervals

Once the raw scores for each of an examinee’s six responses have been calculated by the scoring model, they are summed together to produce the complete TPO Speaking section score. Note that these item scores and the complete section raw score are unrounded values; rounding occurs only when the section raw score is converted to a scaled score (on the TPO 0–30 scale), in order to make the most efficient use possible of the score estimates provided by SpeechRater.

In addition to the score itself, SpeechRater reports an interval within which the examinee’s score is predicted to lie with high confidence.
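
The arithmetic described in this section can be illustrated with the sketch below; the maximum raw score, the raw-to-scaled conversion, and the symmetric interval based on a standard error of prediction are assumptions made for the example, not the paper’s actual conversion table or interval construction.

    # Sketch of section-score aggregation and a symmetric prediction interval.
    from typing import List, Tuple

    def scaled_section_score(item_scores: List[float],
                             max_raw: float = 24.0,     # assumed: six items on a 0-4 scale
                             scale_max: float = 30.0) -> float:
        raw = sum(item_scores)                           # unrounded raw section score
        return float(round(raw / max_raw * scale_max))   # rounding only at the scaled level

    def prediction_interval(scaled: float, se_pred: float,
                            z: float = 1.96) -> Tuple[float, float]:
        # Interval in which the examinee's score is predicted to lie with ~95% confidence.
        return max(0.0, scaled - z * se_pred), min(30.0, scaled + z * se_pred)

    score = scaled_section_score([3.0, 2.5, 3.5, 3.0, 2.0, 3.0])
    print(score, prediction_interval(score, se_pred=2.5))   # -> 21.0 (~16.1, ~25.9)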

Discussion

This paper has described the development and structure of SpeechRater, a system for automated scoring of spontaneous spoken responses such as those provided to TOEFL Speaking tasks.

The system processes responses in three stages. First, a filtering model screens out responses which are not scoreable, because of technical difficulties or the examinee’s failure to respond appropriately to the question. Second, the scoring model uses a set of features related to fluency, pronunciation, vocabulary diversity, and grammar to estimate the proficiency rating a human would assign. Third, an aggregation model combines the item scores into a section score and an associated prediction interval.

Acknowledgements

The authors would like to thank an anonymous reviewer, and their ETS colleagues Isaac Bejar, Yeonsuk Cho, and Dan Eignor for their thoughtful reviews which helped to improve the content and readability of this paper. We are also indebted to the Technical and Content Advisory Committees at ETS, whose advice was invaluable in developing the models described here. Mike Wagner and Ramin Hemat deserve special mention for their help in collecting and processing the data for this project, and Pam

References (50)

  • K. Zechner et al.

    Automatic scoring of non-native spontaneous speech in tests of spoken English

    Speech Communication

    (2009)
  • American Educational Research Association, National Council on Measurement in Education

    Standards for Educational and Psychological Testing

    (1999)
  • Y. Attali et al.

    Automated essay scoring with e-rater v.2

    Journal of Technology, Learning and Assessment

    (2006)
  • Bailey, K.M., 1999. Washback in Language Testing. Tech. Rep. MS-15, TOEFL...
  • J. Balogh et al.

    Automated evaluation of reading accuracy: assessing machine scores

  • R.E. Bennett et al.

    Validity and automated scoring: it’s not just the scoring.

    Educational Measurement, Issues and Practice

    (1998)
  • Bernstein, J., 1999. PhonePass testing: Structure and Construct. Tech. Rep., Ordinate Corporation, Menlo Park,...
  • J. Bernstein et al.

    Two experiments in automated scoring of spoken language proficiency

  • L. Breiman et al.

    Classification and Regression Trees

    (1984)
  • J. Burstein

    The e-rater® scoring engine: Automated essay scoring with natural language processing

  • D. Callear et al.

    CAA of short non-MCQ answers

  • D. Charney

    The validity of using holistic scoring to evaluate writing: a critical overview

    Research in the Teaching of English

    (1984)
  • S. Chevalier

    Speech interaction with Saybot player, a CALL software to help Chinese learners of English

  • Clark, J.L.D., Swinton, S.S., 1979. An exploration of speaking proficiency measures in the TOEFL context. Tech. Re...
  • J. Cohen

    A coefficient of agreement for nominal scales

    Educational and Psychological Measurement

    (1960)
  • C. Cucchiarini et al.

    Automatic evaluation of Dutch pronunciation by using speech recognition technology

  • Cucchiarini, C., Strik, H., Boves, L., 1997b. Using speech recognition technology to assess foreign speakers’...
  • S. Elliott

    Intellimetric™: From here to validity

  • M. Eskenazi et al.

    The Native Accent pronunciation tutor: measuring success in the real world

  • U.M. Fayyad et al.

    Multi-interval discretization of continuous-valued attributes for classification learning

  • L.S. Feldt et al.

    Reliability

  • H. Franco et al.

    The SRI EduSpeak system: recognition and pronunciation scoring for language learning

  • E. Frank et al.

    Using model trees for classification

    Machine Learning

    (1998)
  • J.S. Garofolo et al.

    Design and preparation of the 1996 HUB-4 broadcast news benchmark test corpora

  • Hall, M.A., 1998. Correlation-based feature selection for machine learning. Ph.D. thesis, University of Waikato,...