Spoken English Grading: Machine Learning with Crowd Intelligence

ABSTRACT
In this paper, we address the problem of grading spontaneous speech using a combination of machine learning and crowdsourcing. Traditional machine learning techniques solve this problem inadequately because automatic speaker-independent speech transcription is inaccurate; the features derived from the transcription, and hence the evaluation model built on them, inherit this inaccuracy. We propose a framework that combines machine learning with crowdsourcing. This entails identifying human intelligence tasks in the feature derivation step and getting them completed through crowdsourcing. We post the task of speech transcription to a large community of online workers (the crowd), and we also collect spoken English grades from the crowd. By combining transcriptions from multiple crowd workers, we achieve 95% transcription accuracy. Speech and prosody features are derived by force-aligning the speech samples against these highly accurate transcriptions. Additionally, we derive surface-level and semantic-level features directly from the transcription. We demonstrate the efficacy of our approach by predicting expert grades for the speech samples of 566 adult non-native speakers from two countries, India and the Philippines. Using regression modeling, we achieve a Pearson correlation with expert grades of 0.79 on the Philippine set and 0.74 on the Indian set, an accuracy much higher than any previously reported machine learning approach and one that rivals inter-expert agreement. We show the value of the system through a case study in a real-world industrial deployment. This work is timely given the huge demand for spoken English training and assessment.
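The combination step described above (merging transcriptions from multiple crowd workers) can be illustrated with a minimal word-level majority-vote sketch. This is an assumption-laden simplification: it presumes the crowd transcriptions are already aligned word-for-word, whereas the ROVER-style approach the paper relies on first performs a dynamic-programming alignment of the candidate word sequences.

```python
from collections import Counter

def combine_transcriptions(transcripts):
    """Word-level majority vote over crowd transcriptions.

    Simplified sketch: assumes all transcripts are already aligned
    word-for-word (same length, corresponding positions). A real
    ROVER-style combiner aligns the sequences first and also handles
    insertions/deletions via a null-word token.
    """
    combined = []
    # zip groups the i-th word from every worker's transcript
    for words in zip(*(t.split() for t in transcripts)):
        # keep the word most workers agree on at this position
        combined.append(Counter(words).most_common(1)[0][0])
    return " ".join(combined)

# Three workers disagree on one word; the majority wins.
print(combine_transcriptions([
    "the cat sat on the mat",
    "the cat sat in the mat",
    "the cat sat on the mat",
]))  # → "the cat sat on the mat"
```

With more workers per sample, positions where a majority agrees become increasingly reliable, which is what allows the combined transcription accuracy to exceed that of any single worker.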