A system for high quality crowdsourced indigenous language transcription

Published in: International Journal on Digital Libraries

Abstract

This article proposes a crowdsourcing method for transcribing manuscripts from the Bleek and Lloyd Collection, in which non-expert volunteers transcribe pages of handwritten text using an online tool. The digital Bleek and Lloyd Collection is a rare collection that contains artwork, notebooks and dictionaries of the indigenous people of Southern Africa. The notebooks, in particular, contain stories that encode the language, culture and beliefs of these people, handwritten in now-extinct languages with a specialized notation system. Previous attempts to convert the approximately 20,000 pages of text to machine-readable form using machine learning algorithms achieved low recognition accuracy because of the complexity of the text. This article presents details of the system used to enable transcription by volunteers, as well as results from experiments conducted to determine the quality and consistency of the transcriptions. The results show that volunteers are able to produce reliable transcriptions of high quality. Inter-transcriber agreement is 80% for |Xam text and 95% for English text. When the |Xam transcriptions produced by volunteers are compared with a gold standard, the volunteers achieve an average accuracy of 64.75%, exceeding that of previous work. Finally, the degree of transcription agreement correlates with the degree of transcription accuracy, which suggests that the quality of unseen data can be assessed from the degree of agreement among transcribers.
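The abstract quantifies transcription quality with two measures: pairwise agreement among volunteers and accuracy against a gold standard. As a rough illustration only, the sketch below computes both from a normalized Levenshtein (edit) distance between transcription strings; this choice of metric, and every function name in the sketch, is an assumption made for illustration rather than the paper's confirmed implementation.

```python
# Illustrative sketch: inter-transcriber agreement and gold-standard
# accuracy derived from normalized Levenshtein distance. This assumes
# an edit-distance-based metric; it is not the paper's actual code.
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def inter_transcriber_agreement(transcriptions: list[str]) -> float:
    """Mean pairwise similarity across all transcriptions of one page."""
    pairs = list(combinations(transcriptions, 2))
    if not pairs:
        return 1.0  # a single transcription trivially agrees with itself
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


def accuracy(transcription: str, gold: str) -> float:
    """Similarity of one volunteer transcription to the gold standard."""
    return similarity(transcription, gold)
```

Under this reading, the reported correlation between agreement and accuracy means that pages whose volunteer transcriptions show high pairwise similarity also tend to score well against the gold standard, so agreement can serve as a quality proxy for pages that have no gold standard.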

Acknowledgments

This research was partially funded by the National Research Foundation of South Africa (grant numbers 85470 and 83998), the Citizen Cyberscience Centre and the University of Cape Town. The authors acknowledge that the opinions, findings and conclusions or recommendations expressed in this publication are those of the authors and that the NRF accepts no liability whatsoever in this regard.

Author information

Corresponding author

Correspondence to Hussein Suleman.

About this article

Cite this article

Munyaradzi, N., Suleman, H. A system for high quality crowdsourced indigenous language transcription. Int J Digit Libr 14, 117–125 (2014). https://doi.org/10.1007/s00799-014-0112-4
