DOI: 10.1145/2783258.2788595

Spoken English Grading: Machine Learning with Crowd Intelligence

Published: 10 August 2015

ABSTRACT

In this paper, we address the problem of grading spontaneous speech using a combination of machine learning and crowdsourcing. Traditional machine learning techniques solve the stated problem inadequately because automatic speaker-independent speech transcription is inaccurate: the features derived from it are inaccurate, and so is the machine learning model built on them for speech evaluation. We propose a framework that combines machine learning with crowdsourcing. This entails identifying human intelligence tasks in the feature-derivation step and using crowdsourcing to complete them. We post the task of speech transcription to a large community of online workers (the crowd), and we also obtain spoken English grades from the crowd. By combining transcriptions from multiple crowd workers, we achieve 95% transcription accuracy. Speech and prosody features are derived by force-aligning the speech samples against these highly accurate transcriptions. Additionally, we derive surface- and semantic-level features directly from the transcriptions. We demonstrate the efficacy of our approach by predicting expert-assigned grades for the speech samples of 566 adult non-native speakers from two countries, India and the Philippines. Using regression modeling, we achieve a Pearson correlation with expert grades of 0.79 on the Philippines set and 0.74 on the Indian set, an accuracy much higher than any previously reported machine learning approach and one that rivals inter-expert agreement. We show the value of the system through a case study of a real-world industrial deployment. This work is timely given the large demand for spoken English training and assessment.
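To make the pipeline concrete, here is a minimal, hypothetical Python sketch of the transcription-combination step. The abstract reports only that transcriptions from multiple crowd workers are combined; the per-position majority vote below (a simplified, ROVER-style scheme) and every name in the code are our own illustration, not the authors' implementation.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher

def combine_transcriptions(transcriptions):
    """Return a consensus word sequence by per-position majority voting.

    Each worker's transcription (a list of words) is aligned against a
    reference (here, simply the longest transcription) and every aligned
    word casts a vote for the reference position it overlaps.
    """
    ref = max(transcriptions, key=len)
    votes = defaultdict(Counter)
    for hyp in transcriptions:
        opcodes = SequenceMatcher(a=ref, b=hyp, autojunk=False).get_opcodes()
        for tag, i1, i2, j1, j2 in opcodes:
            if tag == "equal":
                for k in range(i2 - i1):
                    votes[i1 + k][ref[i1 + k]] += 1
            elif tag == "replace":
                # a substituted word votes for the slot it overlaps
                for k in range(min(i2 - i1, j2 - j1)):
                    votes[i1 + k][hyp[j1 + k]] += 1
    return [votes[p].most_common(1)[0][0]
            for p in range(len(ref)) if votes[p]]

workers = [
    "the quick brown fox jumps over the lazy dog".split(),
    "the quick brown fox jumped over a lazy dog".split(),
    "quick brown fox jumps over the lazy dog".split(),
]
print(" ".join(combine_transcriptions(workers)))
# -> the quick brown fox jumps over the lazy dog
```

The evaluation step can be sketched in the same hedged spirit: fit a regression model on the derived features and report the Pearson correlation of its predictions against expert grades. The choice of ridge regression and the synthetic placeholder data below are assumptions for illustration only; the abstract does not specify either.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(566, 20))        # placeholder feature matrix
y = X @ rng.normal(size=20) + rng.normal(scale=0.5, size=566)  # placeholder grades

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = Ridge(alpha=1.0).fit(X_tr, y_tr)
r, _ = pearsonr(model.predict(X_te), y_te)
print(f"Pearson correlation with expert grades: {r:.2f}")
```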


Supplemental Material

p2089.mp4 (MP4, 244.6 MB)


Published in

KDD '15: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2015
2378 pages
ISBN: 9781450336642
DOI: 10.1145/2783258

      Copyright © 2015 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 10 August 2015


      Qualifiers

      • research-article

      Acceptance Rates

KDD '15 paper acceptance rate: 160 of 819 submissions, 20%
Overall acceptance rate: 1,133 of 8,635 submissions, 13%
