A three-stage approach to the automated scoring of spontaneous spoken responses
Introduction
The last decade has seen a proliferation of new applications of speech technology in the educational domain, in particular in support of English language learner (ELL) populations. Computer-based learning tools for spoken English give learners who may have limited access to native or high-proficiency English speakers an opportunity to practice their English and receive feedback on their performance. Automated systems also have the potential to make educational materials for spoken English accessible to a wider range of learners, by reducing their cost (compared to a human tutor) and by making them available on a more flexible schedule.
Speech-enabled dialogue systems allow learners to practice their speaking and listening in an exchange with a virtual interlocutor—e.g., SpeakESL and Auralog’s Tell Me More system. Automated tutoring systems for speaking practice provide feedback to learners on their pronunciation, vocabulary, and grammar—e.g., Carnegie Speech NativeAccent (Eskenazi et al., 2007), Saybot (Chevalier, 2007) and SRI’s EduSpeak (Franco et al., 2000). The contribution of the work described in this paper is to the field of automated scoring of spoken responses: assessing the level of English speaking proficiency demonstrated in some speaking task used in a testing context.
In particular, this paper describes SpeechRater, an automated system for scoring spontaneous spoken responses. SpeechRater uses automatically derived measures of fluency, pronunciation, vocabulary diversity, and grammar to score spoken responses to tasks similar to those found on the Test of English as a Foreign Language (TOEFL®), a test of English proficiency for academic settings, used internationally for college admissions. Only the internet-delivered version of the TOEFL test (TOEFL iBT) includes tasks to assess speaking proficiency; the paper-based version of the test assesses only reading, listening, and writing. SpeechRater is currently used operationally to score spoken responses to the low-stakes TOEFL Practice Online (TPO) test, but not for the high-stakes TOEFL iBT test.
Where other spoken response scoring systems had previously relied on restricted speaking tasks such as reading a passage aloud, or answering questions to which the range of responses is narrowly circumscribed (Bernstein, 1999; Bernstein et al., 2000; Balogh et al., 2007), SpeechRater addresses the more challenging task of scoring responses which are relatively unstructured, unrestricted, and spontaneous (for a set of speakers varying considerably in English proficiency and first language). While much work remains to be done to improve SpeechRater’s agreement with human raters and the coverage of important aspects of speaking proficiency in its feature set, the current capability has advanced sufficiently to allow it to be used for the scoring of practice tests used in preparing for the operational TOEFL test.
Section 2 surveys previous work in automated scoring of test items, especially speaking tasks. Section 3 then provides a brief overview of the SpeechRater system architecture, and the speech recognition system used as the basis for feature computation. After an overview of the spoken response data is provided in Section 4, Section 5 describes the first component of SpeechRater, which filters out non-scoreable responses. Section 6 introduces the scoring model itself, and Section 7 describes the way in which scores are aggregated across tasks, and bounds on prediction error are calculated. Finally, a brief summary of the major findings of the paper is presented in Section 8.
Section snippets
Automated scoring of constructed response items
Constructed response test items are those which require the examinee to generate an answer productively (in contrast to selected response or multiple-choice items). This section will discuss some of the motivation for using constructed-response tasks and scoring them by automated means, and review the previous work in this area.
System architecture
The SpeechRater automated speech scoring system consists of the components outlined in Fig. 1. (The output values depicted in Fig. 1 for each stage of the process are meant solely as illustrative examples.) Responses to each of the six speaking tasks administered in a TPO test are first analyzed by a speech recognizer, the output of which is then processed by a set of feature extraction routines, which derive a set of measures meant to be indicative of speaking proficiency. Based on these
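The flow described above can be sketched in Python. This is a minimal illustration only: the callables `recognize`, `extract_features`, `filter_model`, and `scoring_model` are hypothetical stand-ins for SpeechRater's actual components, not its real interfaces.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ScoredResponse:
    scoreable: bool          # False if the filtering model rejected the response
    score: Optional[float]   # unrounded item score, or None if not scoreable

def score_response(audio, recognize, extract_features, filter_model, scoring_model):
    """Illustrative pipeline: recognize the speech, derive proficiency
    features from the recognizer output, screen out non-scoreable
    responses, and finally predict a score."""
    hypotheses = recognize(audio)            # speech recognizer output
    features = extract_features(hypotheses)  # fluency, pronunciation, vocabulary, grammar
    if not filter_model(features):           # filtering model rejects the response
        return ScoredResponse(scoreable=False, score=None)
    return ScoredResponse(scoreable=True, score=scoring_model(features))
```

Passing each stage in as a callable keeps the sketch agnostic about the underlying recognizer and models, while making the order of the stages explicit.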
Spoken response data
In developing the candidate models to identify non-scoreable responses (Section 5) and to assign scores to all other responses (Section 6), two sources of spoken response data were used. Responses from the TPO practice test were the main focus of development, as the model was intended for actual use on that test. Responses from the operational TOEFL test were also used for comparison, to determine whether the same features had predictive value in both the practice test condition and the
Filtering model
Capturing spoken responses to test items can be complicated by technical problems of many sorts: equipment malfunctions, excessive levels of ambient noise, and data transmission errors, to name a few. In addition, there is the possibility that certain examinees may fail to respond to particular test items (by remaining silent). Both of these issues are aggravated in a practice test environment, where the equipment and testing environment are less tightly controlled, and examinees’ motivation to
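A toy version of such a filter can be written as a pair of threshold checks. The features and threshold values below are illustrative assumptions for the sketch, not the ones SpeechRater actually uses.

```python
def is_scoreable(speech_seconds: float, snr_db: float,
                 min_speech: float = 1.0, min_snr: float = 5.0) -> bool:
    """Toy filtering rule: reject a response when almost no speech was
    detected (an examinee who remained silent) or when the
    signal-to-noise ratio is too low (ambient noise, equipment
    malfunction, transmission errors)."""
    if speech_seconds < min_speech:
        return False
    if snr_db < min_snr:
        return False
    return True
```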
Scoring model
The model for estimating the score to be assigned to a spoken response is the core of the SpeechRater application. The usefulness of the system’s feedback depends primarily on this model’s success in providing scores similar to those which human raters would provide, and which accurately reflect the aspects of speaking proficiency set out in the rubrics for the speaking tasks.
These criteria for success imply two different sets of evaluation criteria for SpeechRater. The
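Both kinds of criteria can be made concrete with a small sketch: a linear combination of features as the scoring model, and Pearson correlation as one common measure of agreement with human raters. Neither the feature names nor the weights here are SpeechRater's own; they are placeholders for illustration.

```python
def predict_score(features, weights, intercept=0.0):
    """Linear scoring model sketch: a weighted combination of
    automatically derived proficiency features."""
    return intercept + sum(weights[name] * value
                           for name, value in features.items())

def pearson(machine_scores, human_scores):
    """Pearson correlation between machine and human scores, one way to
    quantify how closely the model tracks human raters."""
    n = len(machine_scores)
    mx = sum(machine_scores) / n
    my = sum(human_scores) / n
    cov = sum((x - mx) * (y - my)
              for x, y in zip(machine_scores, human_scores))
    var_x = sum((x - mx) ** 2 for x in machine_scores)
    var_y = sum((y - my) ** 2 for y in human_scores)
    return cov / (var_x * var_y) ** 0.5
```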
Construction of prediction intervals
Once the raw scores for each of an examinee’s six responses have been calculated by the scoring model, they are summed together to produce the complete TPO Speaking section score. Note that these item scores and the complete section raw score are unrounded values; rounding occurs only when the section raw score is converted to a scaled score (on the TPO 0–30 scale), in order to make the most efficient use possible of the score estimates provided by SpeechRater.
In addition to the score
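The aggregation step can be sketched as follows, under two stated assumptions: item scores lie on a 0–4 rubric scale (so six items give a raw maximum of 24), and the raw-to-scaled conversion is a simple linear map standing in for the operational scaling, which is not specified here. A normal-error prediction interval is included as one simple way to express bounds on prediction error.

```python
def section_scaled_score(item_scores, raw_max=24.0, scale_max=30.0):
    """Sum the six unrounded item scores into a section raw score, then
    convert to the 0-30 scale; rounding happens only at the very end.
    The linear conversion here is an illustrative stand-in."""
    raw = sum(item_scores)  # unrounded section raw score
    return round(raw * scale_max / raw_max)

def prediction_interval(scaled_score, see, z=1.96):
    """Approximate 95% bounds on prediction error, assuming normally
    distributed errors with standard error of estimate `see`."""
    return (scaled_score - z * see, scaled_score + z * see)
```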
Discussion
This paper has described the development and structure of SpeechRater, a system for automated scoring of spontaneous spoken responses such as those provided to TOEFL Speaking tasks.
The system processes responses according to a three-stage process. First, a filtering model screens out responses which are not scoreable, because of technical difficulties or the examinee’s failure to respond appropriately to the question. Second, the scoring model uses a set of features related to fluency,
Acknowledgements
The authors would like to thank an anonymous reviewer, and their ETS colleagues Isaac Bejar, Yeonsuk Cho, and Dan Eignor for their thoughtful reviews which helped to improve the content and readability of this paper. We are also indebted to the Technical and Content Advisory Committees at ETS, whose advice was invaluable in developing the models described here. Mike Wagner and Ramin Hemat deserve special mention for their help in collecting and processing the data for this project, and Pam
References

- et al., 2009. Automatic scoring of non-native spontaneous speech in tests of spoken English. Speech Communication.
- National Council on Measurement in Education, 1999. Standards for Educational and Psychological Testing.
- et al., 2006. Automated essay scoring with e-rater v.2. Journal of Technology, Learning and Assessment.
- Bailey, K.M., 1999. Washback in Language Testing. Tech. Rep. MS-15, TOEFL...
- et al. Automated evaluation of reading accuracy: assessing machine scores.
- et al., 1998. Validity and automated scoring: it’s not just the scoring. Educational Measurement: Issues and Practice.
- Bernstein, J., 1999. PhonePass testing: Structure and Construct. Tech. Rep., Ordinate Corporation, Menlo Park,...
- et al. Two experiments in automated scoring of spoken language proficiency.
- et al., 1984. Classification and Regression Trees.
- The e-rater® scoring engine: Automated essay scoring with natural language processing.