Spoken English Grading: Machine Learning with Crowd Intelligence

ABSTRACT
In this paper, we address the problem of grading spontaneous speech using a combination of machine learning and crowdsourcing. Traditional machine learning techniques solve this problem inadequately because automatic speaker-independent speech transcription is inaccurate; the features derived from the transcription, and hence the evaluation model built on them, inherit this inaccuracy. We propose a framework that combines machine learning with crowdsourcing. This entails identifying human intelligence tasks in the feature derivation step and getting them completed through crowdsourcing. We post the task of speech transcription to a large community of online workers (the crowd), and we also collect spoken English grades from the crowd. By combining transcriptions from multiple crowd workers, we achieve 95% transcription accuracy. Speech and prosody features are derived by force-aligning the speech samples against these highly accurate transcriptions. Additionally, we derive surface-level and semantic-level features directly from the transcription. We demonstrate the efficacy of our approach by predicting expert grades for the speech samples of 566 adult non-native speakers from two countries, India and the Philippines. Using regression modeling, we achieve a Pearson correlation with expert grades of 0.79 on the Philippine set and 0.74 on the Indian set, an accuracy much higher than any previously reported machine learning approach and one that rivals inter-expert agreement. We show the value of the system through a case study in a real-world industrial deployment. This work is timely given the huge demand for spoken English training and assessment.
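The combination step described above (merging transcriptions from multiple crowd workers) can be illustrated with a minimal word-level majority-vote sketch. This is an assumption-laden simplification: it presumes the crowd transcriptions are already aligned word-for-word, whereas the ROVER-style approach the paper relies on first performs a dynamic-programming alignment of the candidate word sequences.

```python
from collections import Counter

def combine_transcriptions(transcripts):
    """Word-level majority vote over crowd transcriptions.

    Simplified sketch: assumes all transcripts are already aligned
    word-for-word (same length, corresponding positions). A real
    ROVER-style combiner aligns the sequences first and also handles
    insertions/deletions via a null-word token.
    """
    combined = []
    # zip groups the i-th word from every worker's transcript
    for words in zip(*(t.split() for t in transcripts)):
        # keep the word most workers agree on at this position
        combined.append(Counter(words).most_common(1)[0][0])
    return " ".join(combined)

# Three workers disagree on one word; the majority wins.
print(combine_transcriptions([
    "the cat sat on the mat",
    "the cat sat in the mat",
    "the cat sat on the mat",
]))  # → "the cat sat on the mat"
```

With more workers per sample, positions where a majority agrees become increasingly reliable, which is what allows the combined transcription accuracy to exceed that of any single worker.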