KALAKA-3: a database for the assessment of spoken language recognition technology on YouTube audios

  • Original Paper
  • Language Resources and Evaluation

Abstract

KALAKA-3 is a speech database designed for the development and evaluation of Spoken Language Recognition (SLR) systems. It provides TV broadcast speech for training and audio extracted from YouTube videos for tuning and testing. The database was created to support the Albayzin 2012 Language Recognition Evaluation (LRE), which featured two language recognition tasks, both dealing with European languages. The first task involved six target languages (Basque, Catalan, English, Galician, Portuguese and Spanish) for which ample training data was available, whereas the second involved four target languages (French, German, Greek and Italian) for which no training data was provided. The second task was intended to simulate the use case of low-resource languages. Two separate sets of YouTube audio files were provided to test the performance of language recognition systems on the two tasks. To allow open-set tests, these datasets also included speech in 11 additional (Out-Of-Set) European languages. In this paper, we first discuss the design issues considered when creating the database and describe the data collection procedure. We then present the results attained in the Albayzin 2012 LRE, along with the performance of state-of-the-art systems on the four evaluation tracks defined on the database. Both sets of results demonstrate the usefulness of KALAKA-3 as a challenging benchmark for the advancement of SLR technology. As far as we know, this is the first database specifically designed to benchmark SLR technology on YouTube audio.
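
As an illustration of the task layout described in the abstract, the following minimal Python sketch encodes the two tasks and their target languages. It assumes that the four evaluation tracks are the combination of the two tasks with a closed-set and an open-set condition, which the abstract suggests but does not state. The task names `with_training` and `without_training`, the `OOS` label, and the `Track` class are hypothetical names chosen for this example and are not part of the database.

```python
from dataclasses import dataclass
from itertools import product
from typing import Tuple

# Target languages per task, as listed in the abstract.
# The task names are hypothetical labels used only in this sketch.
TASK_TARGETS = {
    "with_training": ("Basque", "Catalan", "English",
                      "Galician", "Portuguese", "Spanish"),
    "without_training": ("French", "German", "Greek", "Italian"),
}

OOS_LABEL = "OOS"  # hypothetical label for Out-Of-Set speech


@dataclass(frozen=True)
class Track:
    task: str                  # key of TASK_TARGETS
    condition: str             # "closed" or "open"
    targets: Tuple[str, ...]   # target languages of the task

    def reference_label(self, spoken_language: str) -> str:
        """Map the language spoken in a test file to its reference label."""
        if spoken_language in self.targets:
            return spoken_language
        if self.condition == "open":
            # In an open-set test, any non-target language is Out-Of-Set.
            return OOS_LABEL
        raise ValueError(
            f"{spoken_language!r} is not a target of closed-set task {self.task!r}"
        )


# Assumption: four tracks = two tasks x {closed-set, open-set}.
TRACKS = [
    Track(task, condition, TASK_TARGETS[task])
    for task, condition in product(TASK_TARGETS, ("closed", "open"))
]
```

For example, `TRACKS[1].reference_label("Basque")` returns `"Basque"`, while any language outside the six targets maps to `"OOS"` in that open-set track and raises `ValueError` in the corresponding closed-set track.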

Notes

  1. http://www.nist.gov/itl/iad/mig/lre.cfm.

  2. http://www.rthabla.es/.

  3. http://aspell.net/.

  4. YouTube API v2.0: https://developers.google.com/youtube/2.0/developers_guide_protocol_audience.

  5. http://rg3.github.io/youtube-dl/.

  6. http://www.ffmpeg.org/.

  7. http://sox.sourceforge.net/.

  8. https://sites.google.com/site/albayzinlre2012.

  9. The EER (equal error rate) of a system for a given task is the common value of \(P_{miss}\) and \(P_{fa}\) at the threshold \(\theta_{EER}\) where both error rates are equal.
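
The EER definition in note 9 can be illustrated with a minimal Python sketch. It is not the scoring tool used in the evaluation. It assumes that a trial is accepted when its score is greater than or equal to the threshold, and, because on a finite set of trials the two error rates rarely coincide exactly, it approximates \(\theta_{EER}\) by the observed score at which \(|P_{miss} - P_{fa}|\) is smallest and reports the mean of the two rates there. The function names are hypothetical.

```python
from typing import Sequence, Tuple


def error_rates(target_scores: Sequence[float],
                nontarget_scores: Sequence[float],
                threshold: float) -> Tuple[float, float]:
    """Return (P_miss, P_fa), accepting trials with score >= threshold."""
    # Miss: a target trial that falls below the threshold.
    p_miss = sum(s < threshold for s in target_scores) / len(target_scores)
    # False alarm: a non-target trial that reaches the threshold.
    p_fa = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
    return p_miss, p_fa


def equal_error_rate(target_scores: Sequence[float],
                     nontarget_scores: Sequence[float]) -> Tuple[float, float]:
    """Approximate (EER, theta_EER) by sweeping over the observed scores."""
    best_gap, best_eer, best_theta = None, None, None
    for theta in sorted(set(target_scores) | set(nontarget_scores)):
        p_miss, p_fa = error_rates(target_scores, nontarget_scores, theta)
        gap = abs(p_miss - p_fa)
        if best_gap is None or gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, best_eer, best_theta = gap, (p_miss + p_fa) / 2, theta
    return best_eer, best_theta


if __name__ == "__main__":
    targets = [2.0, 1.5, 0.5, -0.2]       # scores of target trials
    nontargets = [1.0, 0.0, -1.0, -2.0]   # scores of non-target trials
    print(equal_error_rate(targets, nontargets))
```

With these toy scores, one target trial falls below the threshold 0.5 and one non-target trial reaches it, so both error rates equal 0.25 at that threshold.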

Author information

Corresponding author

Correspondence to Luis Javier Rodríguez-Fuentes.

Additional information

This work was supported by the University of the Basque Country under grant GIU13/28.

About this article

Cite this article

Rodríguez-Fuentes, L.J., Penagarikano, M., Varona, A. et al. KALAKA-3: a database for the assessment of spoken language recognition technology on YouTube audios. Lang Resources & Evaluation 50, 221–243 (2016). https://doi.org/10.1007/s10579-015-9324-5

  • DOI: https://doi.org/10.1007/s10579-015-9324-5
