Abstract
Automatic speech recognition (ASR) technology has matured over the past few decades and has made a significant impact in a variety of fields, from assistive technologies to commercial products. However, ASR system development is a resource-intensive activity that requires language resources in the form of text-annotated audio recordings and pronunciation dictionaries. Unfortunately, many languages found in the developing world are resource-scarce, and this scarcity severely inhibits the deployment of ASR systems there. One approach to assisting resource-scarce ASR system development is to select “useful” training samples, which could reduce the resources needed to collect new corpora. In this work, we propose a new data selection framework that can be used to design a speech recognition corpus. We show that, for limited data sets and independent of language and bandwidth, the most effective data selection strategy is frequency-matched selection, and that the widely used maximum entropy methods generally produce the least promising results. In our model, the frequency-matched selection method corresponds to a logarithmic relationship between accuracy and corpus size; we also investigated other model relationships and found that a hyperbolic relationship (as suggested by simple asymptotic arguments in learning theory) may lead to somewhat better performance under certain conditions.
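The link between the assumed accuracy model and the resulting selection strategy can be illustrated with a small sketch. It rests on assumptions that are not spelled out in the abstract: each sub-word unit i occurs with relative frequency p_i, its recognition accuracy depends only on the number of training examples n_i it receives, and the goal is to maximize frequency-weighted accuracy for a fixed corpus size N. Under those assumptions, a logarithmic accuracy curve (log n_i) gives n_i proportional to p_i, which is frequency matching, while a hyperbolic curve (a - b/n_i) gives n_i proportional to the square root of p_i. The function name `allocate` and the example counts are hypothetical and are not taken from the paper.

```python
import math


def allocate(unit_freqs, corpus_size, model="log"):
    """Split corpus_size training examples across units.

    Illustrative only: assumes the objective is frequency-weighted accuracy
    and that per-unit accuracy depends solely on that unit's example count.
    unit_freqs may be raw counts or probabilities; only ratios matter.
    """
    if model == "log":
        # d/dn [p * log(n)] = p / n  ->  optimum has n proportional to p
        weights = [float(f) for f in unit_freqs]
    elif model == "hyperbolic":
        # d/dn [p * (a - b/n)] = p*b / n^2  ->  n proportional to sqrt(p)
        weights = [math.sqrt(f) for f in unit_freqs]
    else:
        raise ValueError("model must be 'log' or 'hyperbolic'")
    total = sum(weights)
    return [corpus_size * w / total for w in weights]


# Hypothetical unit counts in a reference text, and a budget of 800 examples.
counts = [16, 9, 1]
log_alloc = allocate(counts, 800, "log")
hyp_alloc = allocate(counts, 800, "hyperbolic")
print(log_alloc)
print(hyp_alloc)
```

In this toy setting the hyperbolic model shifts examples from the most frequent unit toward the rarest one, compared with strict frequency matching.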









Kleynhans, N.T., Barnard, E. Efficient data selection for ASR. Lang Resources & Evaluation 49, 327–353 (2015). https://doi.org/10.1007/s10579-014-9285-0
DOI: https://doi.org/10.1007/s10579-014-9285-0