A system for high quality crowdsourced indigenous language transcription

Published in: International Journal on Digital Libraries

Abstract

This article proposes a crowdsourcing method for transcribing manuscripts from the Bleek and Lloyd Collection, in which non-expert volunteers transcribe pages of handwritten text using an online tool. The digital Bleek and Lloyd Collection is a rare collection that contains artwork, notebooks and dictionaries of the indigenous people of Southern Africa. The notebooks, in particular, contain stories that encode the language, culture and beliefs of these people, handwritten in now-extinct languages with a specialized notation system. Previous attempts to convert the approximately 20,000 pages of text to machine-readable form using machine learning algorithms achieved low recognition accuracy because of the complexity of the text. This article presents details of the system used to enable transcription by volunteers, as well as results from experiments conducted to determine the quality and consistency of the transcriptions. The results show that volunteers are able to produce reliable transcriptions of high quality. Inter-transcriber agreement is 80% for |Xam text and 95% for English text. When the |Xam transcriptions produced by volunteers are compared with a gold standard, the volunteers achieve an average accuracy of 64.75%, exceeding that of previous work. Finally, the degree of transcription agreement correlates with the degree of transcription accuracy, which suggests that the quality of unseen data can be assessed from the degree of agreement among transcribers.
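The abstract quantifies transcription quality with two measures: pairwise agreement among volunteers and accuracy against a gold standard. As a rough illustration only, the sketch below computes both from a normalized Levenshtein (edit) distance between transcription strings; this choice of metric, and every function name in the sketch, is an assumption made for illustration rather than the paper's confirmed implementation.

```python
# Illustrative sketch: inter-transcriber agreement and gold-standard
# accuracy derived from normalized Levenshtein distance. This assumes
# an edit-distance-based metric; it is not the paper's actual code.
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def inter_transcriber_agreement(transcriptions: list[str]) -> float:
    """Mean pairwise similarity across all transcriptions of one page."""
    pairs = list(combinations(transcriptions, 2))
    if not pairs:
        return 1.0  # a single transcription trivially agrees with itself
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


def accuracy(transcription: str, gold: str) -> float:
    """Similarity of one volunteer transcription to the gold standard."""
    return similarity(transcription, gold)
```

Under this reading, the reported correlation between agreement and accuracy means that pages whose volunteer transcriptions show high pairwise similarity also tend to score well against the gold standard, so agreement can serve as a quality proxy for pages that have no gold standard.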

Acknowledgments

This research was partially funded by the National Research Foundation of South Africa (grant numbers 85470 and 83998), the Citizen Cyberscience Centre and the University of Cape Town. The authors acknowledge that the opinions, findings and conclusions or recommendations expressed in this publication are those of the authors and that the NRF accepts no liability whatsoever in this regard.

Author information

Corresponding author

Correspondence to Hussein Suleman.

About this article

Cite this article

Munyaradzi, N., Suleman, H. A system for high quality crowdsourced indigenous language transcription. Int J Digit Libr 14, 117–125 (2014). https://doi.org/10.1007/s00799-014-0112-4
