
Creating a ground truth multilingual dataset of news and talk show transcriptions through crowdsourcing

Original Paper
Published in Language Resources and Evaluation

Abstract

This paper describes the development of a multilingual and multigenre manually annotated speech dataset, freely available to the research community as ground truth for the evaluation of automatic transcription systems and spoken language translation systems. The dataset includes two video genres—television broadcast news and talk shows—and covers Flemish, English, German, and Italian, for a total of about 35 h of television speech. Besides segmentation and orthographic transcription, we added rich annotations of the audio signal, both at the linguistic level (e.g. filled pauses, pronunciation errors, disfluencies, speech in a foreign language) and at the acoustic level (e.g. background noise and different types of non-speech events). Furthermore, a subset of the transcriptions was translated in four directions, namely Flemish to English, German to English, German to Italian, and English to Italian. The development of this dataset was organized in several phases, relying on expert transcribers as well as involving non-expert contributors through crowdsourcing. We first conducted a feasibility study to test and compare two methods for crowdsourcing speech transcription on broadcast news data. These methods are based on different transcription processes (i.e. parallel vs. iterative) and incorporate two different quality control mechanisms. With both methods, we achieved near-expert transcription quality—in terms of word error rate—for the English, German, and Italian data. For Flemish, by contrast, we were unable to obtain a sufficient response from the crowd to complete the offered transcription tasks. These results demonstrate that the viability of crowdsourcing methods for speech transcription depends significantly on the target language. This paper provides a detailed comparison of the results obtained with the two crowdsourcing methods, and describes the main characteristics of the final ground truth resource, the methodology adopted, and the guidelines prepared for its development.
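
The abstract reports transcription quality in terms of word error rate (WER). For readers unfamiliar with the metric, the sketch below shows the conventional word-level Levenshtein computation of WER; it is illustrative only, not code from the paper or the TOSCA-MP project, and the function name and example strings are our own.

```python
# Conventional word error rate (WER): the minimum number of word-level
# substitutions, deletions, and insertions needed to turn the hypothesis
# into the reference, divided by the number of reference words.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution/match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# One substitution over five reference words gives WER = 0.2 (20 %).
print(wer("the talk show starts now", "the talk snow starts now"))
```

In practice, both reference and hypothesis are normalized first (case, punctuation, spelling variants such as those in the VarCon word list cited in note 14), so that the comparison against expert transcriptions measures genuine errors rather than formatting differences.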


Notes

  1. https://www.mturk.com/mturk/.

  2. http://crowdflower.com/.

  3. Flemish is the variant of Dutch spoken in Flanders.

  4. https://hlt-mt.fbk.eu/technologies/tosca-mp-speech-ground-truth.

  5. http://tosca-mp.eu.

  6. The results achieved within the TOSCA-MP project for both types of evaluation can be found in Deliverable 2.3, available from the project website http://tosca-mp.eu/publications/public-deliverables/.

  7. http://lands.let.ru.nl/cgn/ehome.htm.

  8. Many authors contributed to Eskénazi et al. (2013), a how-to manual addressed both to experienced users, already confident with crowdsourcing techniques and willing to explore new directions, and to beginners who need guidance on using crowdsourcing effectively.

  9. Please note that CastingWords uses experts to do the transcription.

  10. http://www.crowdguru.de/.

  11. http://www.get-paid.com/.

  12. http://coinworker.com/.

  13. http://lands.let.ru.nl/cgn/ehome.htm.

  14. http://wordlist.aspell.net/varcon/.

  15. An analogous distinction is reflected in the final dataset, in which 54.2 % and 27.1 % of the topics in the English and Italian newscasts, respectively, are international.

  16. A table summarizing the results of different works on crowdsourcing speech transcription is available in Parent (2013).

  17. VRT is the Flemish public broadcasting company that broadcasts in Flemish for the Flemish Region and Community of Belgium.

  18. DW-TV is a set of television channels, provided by the international German broadcaster Deutsche Welle, that broadcasts in several languages including English.

  19. https://hlt-mt.fbk.eu/technologies/tosca-mp-speech-ground-truth.

  20. http://trans.sourceforge.net/en/presentation.php.

  21. The interface is available for research purposes upon request.

References

  • Adda, G., Mariani, J. J., Besacier, L., & Gelas, H. (2013). Crowdsourcing for speech processing: Applications to data collection, transcription and assessment (1st ed.). Oxford: Wiley.

  • Audhkhasi, K., Georgiou, P., & Narayanan, S. (2011a). Accurate transcription of broadcast news speech using multiple noisy transcribers and unsupervised reliability metrics. In Proceedings of ICASSP, Prague, Czech Republic (pp. 4980–4983).

  • Audhkhasi, K., Georgiou, P., & Narayanan, S. (2011b). Reliability-weighted acoustic model adaptation using crowd sourced transcriptions. In Proceedings of INTERSPEECH, Florence, Italy (pp. 3045–3048).

  • Baum, D., Schneider, D., Bardeli, R., Schwenninger, J., Samlowski, B., Winkler, T., & Köhler, J. (2010). DiSCo—a German evaluation corpus for challenging problems in the broadcast domain. In Proceedings of the language resources and evaluation conference (LREC), Valletta, Malta (pp. 1695–1699).

  • Brugnara, F., Falavigna, D., Giuliani, D., & Gretter, R. (2012). Analysis of the characteristics of talk-show TV programs. In Proceedings of INTERSPEECH’12, Portland, OR.

  • Cohen, J. (2007). The GALE project: A description and an update. In Proceedings of IEEE workshop on automatic speech recognition and understanding (ASRU) (pp. 237–237).

  • Eickhoff, C., & de Vries, A. (2011). How crowdsourcable is your task? In Proceedings of the WSDM workshop on crowdsourcing for search and datamining, Hong Kong (pp. 11–14).

  • Eskénazi, M., Levow, G. A., Meng, H., Parent, G., & Suendermann, D. (2013). Crowdsourcing for speech processing: Applications to data collection, transcription and assessment (1st ed.). Oxford: Wiley.

  • Evanini, K., Higgins, D., & Zechner, K. (2010). Using Amazon mechanical turk for transcription of non-native speech. In Proceedings of NAACL HLT 2010 workshop on creating speech and language data with Amazon’s mechanical turk, Los Angeles, USA (pp. 53–56).

  • Federico, M., Giordani, D., & Coletti, P. (2000). Development and evaluation of an Italian broadcast news corpus. In Proceedings of the language resources and evaluation conference (LREC), Athens, Greece.

  • Fiscus, J. (1997). A post-processing system to yield reduced word error rates: Recogniser output voting error reduction (ROVER). In Proceedings of ASRU, Santa Barbara, CA (pp. 347–352).

  • Fort, K., Adda, G., Sagot, B., Mariani, J., & Couillault, A. (2011). Crowdsourcing for language resource development: Criticisms about Amazon mechanical turk overpowering use. In Human language technology challenges for computer science and linguistics (pp. 303–314). New York: Springer.

  • Fort, K., Adda, G., & Cohen, K. B. (2011). Amazon mechanical turk: Gold mine or coal mine? Computational Linguistics, 37(2), 413–420.

  • Furui, S., Nakamura, M., Ichiba, T., & Iwano, K. (2005). Why is the recognition of spontaneous speech so hard? In Text, speech and dialogue, series of Lecture Notes in Artificial Intelligence, Springer (pp. 9–22).

  • Gelas, H., Abate, S., Besacier, L., & Pellegrino, F. (2011). Quality assessment of crowdsourcing transcriptions for African languages. In Proceedings of INTERSPEECH’11, Florence, Italy (pp. 3065–3068).

  • Gelas, H., Besacier, L., & Pellegrino, F. (2012). Developments of Swahili resources for an automatic speech recognition system. In Proceedings of SLTU workshop on spoken language technologies for under-resourced languages, Cape Town, South Africa.

  • Glenn, M. L., Strassel, S., Lee, H., Maeda, K., Zakhary, R., & Li, X. (2010). Transcription methods for consistency, volume and efficiency. In Proceedings of LREC’10, Valletta, Malta.

  • Graff, D. (2002). An overview of broadcast news corpora. Speech Communication, 37(1–2), 15–26.

  • Gruenstein, A., McGraw, I., & Sutherland, A. M. (2009). A self-transcribing speech corpus: Collecting continuous speech with an online educational game. In SLaTE (pp. 109–112).

  • Lamel, L., Gauvain, J.-L., & Adda, G. (2002). Lightly supervised and unsupervised acoustic model training. Computer Speech and Language, 16(1), 115–129.

  • Lee, C. Y., & Glass, J. R. (2011). A transcription task for crowdsourcing with automatic quality control. In Proceedings of INTERSPEECH, Florence, Italy (pp. 3041–3044).

  • Liem, B., Zhang, H., & Chen, Y. (2011). An iterative dual pathway structure for speech-to-text transcription. In Proceedings of human computation AAAI workshop, Toronto, Canada.

  • Little, G., Chilton, L. B., Goldman, M., & Miller, R. C. (2010). Exploring iterative and parallel human computation processes. In Proceedings of HCOMP 10, Washington DC, USA (pp. 68–76).

  • Marge, M., Banerjee, S., & Rudnicky, A. (2010a). Using the Amazon mechanical turk for transcription of spoken language. In Proceedings of ICASSP, Dallas, USA (pp. 5270–5273).

  • Marge, M., Banerjee, S., & Rudnicky, A.I. (2010b). Using the Amazon mechanical turk to transcribe and annotate meeting speech for extractive summarization. In Proceedings of NAACL-HLT 2010 workshop on creating speech and language data with Amazon’s mechanical turk, Los Angeles, USA (pp. 99–107).

  • Negri, M., & Mehdad, Y. (2010). Creating a bi-lingual entailment corpus through translations with mechanical turk: $100 for a 10-day rush. In Proceedings of the NAACL HLT 2010 workshop on creating speech and language data with Amazon’s mechanical turk, Los Angeles, USA (pp. 212–216).

  • Negri, M., Bentivogli, L., Mehdad, Y., Giampiccolo, D., & Marchetti, A. (2011). Divide and conquer: Crowdsourcing the creation of cross-lingual textual entailment corpora. In Proceedings of the 2011 conference on empirical methods in natural language processing (EMNLP), Edinburgh, Scotland, UK (pp. 670–679).

  • Novotney, S., & Callison-Burch, C. (2010). Cheap, fast and good enough: Automatic speech recognition with non-expert transcription. In Proceedings of NAACL-HLT 2010 workshop on creating speech and language data with Amazon’s mechanical turk, Los Angeles, California, USA (pp. 207–215).

  • Parent, G., & Eskenazi, M. (2010). Toward better crowdsourced transcription: Transcription of a year of the Let’s Go bus information system data. In Proceedings of SLT, Berkeley, USA (pp. 312–317).

  • Parent, G. (2013). Crowdsourcing for speech transcription. In M. Eskénazi, G. A. Levow, H. Meng, G. Parent, & D. Suendermann (Eds.), Crowdsourcing for speech processing: Applications to data collection, transcription and assessment (1st ed., pp. 72–105). Oxford: Wiley.

  • Riccardi, G., Ghosh, A., Chowdhury, S., & Bayer, A. O. (2013). Motivational feedback in crowdsourcing: A case study in speech transcription. In Proceedings of INTERSPEECH, Lyon, France (pp. 1111–1115).

  • Sprugnoli, R., Moretti, G., Fuoli, M., Giuliani, D., Bentivogli, L., Pianta, E., Gretter, R., & Brugnara, F. (2013). Comparing two methods for crowdsourcing speech transcription. In Proceedings of the IEEE international conference on acoustics, speech and signal processing (ICASSP), Vancouver, Canada (pp. 8116–8120).

  • Sundermeyer, M., Nussbaum-Thom, M., Wiesler, S., Plahl, C., Mousa, A. D., Hahn, S., Nolden, D., Schlüter, R., & Ney, H. (2011). The RWTH 2010 Quaero ASR evaluation system for English, French, and German. In Proceedings of IEEE ICASSP (pp. 2212–2215).

  • Williams, J. D., Melamed, I. D., Alonso, T., Hollister, B., & Wilpon, J. (2011). Crowd-sourcing for difficult transcription of speech. In Proceedings of IEEE ASRU workshop, Hawaii, USA (pp. 535–540).

  • Zhou, H., Baskov, D., & Lease, M. (2013). Crowdsourcing transcription beyond mechanical turk. In First AAAI conference on human computation and crowdsourcing (pp. 9–16).


Acknowledgements

The research leading to these results has received funding from the European Union’s Seventh Framework Programme (FP7/2007-2013) under Grant Agreement No. 287532, TOSCA-MP: Task-oriented search and content annotation for media production (http://www.tosca-mp.eu). This paper is in memory of our colleague and friend Emanuele Pianta.

Author information

Corresponding author

Correspondence to Rachele Sprugnoli.


About this article


Cite this article

Sprugnoli, R., Moretti, G., Bentivogli, L. et al. Creating a ground truth multilingual dataset of news and talk show transcriptions through crowdsourcing. Lang Resources & Evaluation 51, 283–317 (2017). https://doi.org/10.1007/s10579-016-9372-5

