The Corpus DIMEx100: transcription and evaluation

Language Resources and Evaluation

Abstract

This paper presents the transcription and evaluation of DIMEx100, a corpus of Mexican Spanish. We first describe the corpus and explain the linguistic and computational motivation for its design and collection process. We then present the phonetic antecedents and the alphabet adopted for the transcription task. The corpus has been transcribed at three levels of granularity, each of which is specified in detail, and corpus statistics are reported for each level. The transcription is also used to validate a set of phonetic rules describing phonetic contexts observed empirically in spontaneous conversation. Finally, the corpus has been used to build the acoustic models and the phonetic dictionary of a speech recognition system. Initial performance results suggest that the data can be used to train good-quality acoustic models.

Notes

  1. http://leibniz.iimas.unam.mx/~luis/DIME/.

  2. Perplexity is a commonly used measure of the quality of a language model. Intuitively, it can be thought of as the average number of word choices available at each prediction step; the lower the number, the better.

  3. http://www.steinberg.net/.

  4. SALA includes a speech corpus of Mexican Spanish with orthographic transcriptions and a pronunciation lexicon with a phonemic transcription (i.e., canonical pronunciations), and it is targeted for the construction of ASR systems for mobile telephone applications. SALA is available as an ELRA resource at: http://catalog.elra.info/index.php.

  5. Indeed, we verified experimentally that word recognition performance on unseen data may be up to 50% worse when all pronunciation alternatives are included in the dictionary.
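
The intuition behind perplexity in Note 2 can be illustrated with a minimal sketch. It assumes the language model has already assigned a probability to each word of a test sequence; the function name `perplexity` and the example probabilities are hypothetical and are not taken from the paper or from any toolkit it cites.

```python
import math


def perplexity(word_probabilities):
    """Return the perplexity of a word sequence.

    `word_probabilities` holds the probability the model assigned to each
    word given its history (hypothetical input, for illustration only).
    Perplexity is 2 raised to the negative mean log2-probability.
    """
    if not word_probabilities:
        raise ValueError("need at least one probability")
    log_sum = sum(math.log2(p) for p in word_probabilities)
    return 2 ** (-log_sum / len(word_probabilities))


# A model that is always undecided among 4 equally likely words
# behaves as if it had 4 word choices at every step.
uniform_four = perplexity([0.25] * 10)

# A more confident model assigns higher probabilities and scores lower.
confident = perplexity([0.5, 0.5, 0.5, 0.5])
```

In this sketch a uniform choice among four words gives a perplexity of 4, which matches the reading of perplexity as an average number of word choices per step.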

References

  • Alarcos, E. (1950/1965). Fonología española. Madrid: Gredos.

  • Canfield, D. L. (1981/1992). Spanish pronunciation in the Americas. Chicago: The University of Chicago Press.

  • Clarkson, P., & Rosenfeld, R. (1997). Statistical language modeling using the CMU-Cambridge Toolkit. In Proceedings of Eurospeech’97, Rhodes, Greece, pp. 2707–2710.

  • Cuétara, J. (2004). Fonética de la ciudad de México. Aportaciones desde las tecnologías del habla. MSc. Dissertation, Universidad Nacional Autónoma de México, México.

  • Fetter, P. (1998). Detection and transcription of out-of-vocabulary words in continuous-speech recognition. PhD thesis, Daimler-Benz AG, August 1998. Verbmobil Report 231.

  • Guirao, M., & Borzone, A. M. (1972). Fonemas, sílabas y palabras en el español de Buenos Aires. Filología, 16, 135–165.

  • Hieronymus, J. L. (1997). Worldbet phonetic symbols for multilanguage speech recognition and synthesis. New Jersey: AT&T and Bell Labs.

  • Kirschning, I. (2001). Research and Development of Speech Technology and Applications for Mexican Spanish at the Tlatoa Group (Development Consortium at CHI 2001, Seattle, WA).

  • Lander, T. (1997). The CSLU labeling guide. Oregon: Oregon Graduate Institute of Science and Technology. http://cslu.cse.ogi.edu/corpora/docs/labeling.pdf.

  • Llisterri, J., Machuca, M. J., de la Mota, C., Riera, M., & Ríos, A. (2003). The perception of lexical stress in Spanish. In Proceedings of the 15th International Congress of Phonetic Sciences, Barcelona, 3–9 August 2003 (pp. 2023–2026). http://liceu.uab.es/~joaquim/publicacions/Llisterri_Machuca_Mota_Riera_Rios_03_Perception_Stress_Spanish.pdf.

  • Llisterri, J., Machuca, M. J., de la Mota, C., Riera, M., & Ríos, A. (2005). Corpus orales para el desarrollo de las tecnologías del habla en español. Oralia. Análisis del discurso oral, 8, 289–325. http://liceu.uab.es/~joaquim/publicacions/Llisterri_Machuca_Mota_Riera_Rios_05_Corpus_Orales_Tecnologias_Habla_Espanol.pdf.

  • Llisterri, J., & Mariño, J. B. (1993). Spanish adaptation of SAMPA and automatic phonetic transcription. Technical Report. SAM-A/UPC/001/v1 – ESPRIT PROJECT 6819 (SAM-A) Speech Technology Assessment in Multilingual Applications. http://liceu.uab.es/~joaquim/publicacions/SAMPA_Spanish_93.pdf.

  • Lope Blanch, J. M. (1963–1964/1983). En torno a las vocales caedizas del español mexicano. In Estudios sobre el español de México (pp. 57–77). México: Universidad Nacional Autónoma de México.

  • Moreno, A., Comeyne, R., Haslam, K., van den Heuvel, H., Höge, H., Horbach, S., et al. (2000). SALA: Speechdat Across Latin America. Results of the first phase. In Proceedings of the second international conference on language resources and evaluation. Athens, Greece.

  • Moreno de Alba, J. (1994). La Pronunciación del Español de México. México: El Colegio de México.

  • Moreno, A., & Mariño, J. (1998). Spanish dialects: Phonetic transcription. In Proceedings of ICSLP’98, the fifth international conference on spoken language processing. Rundle Mall: Causal Productions.

  • Navarro Tomás, T. (1918/1970). Manual de pronunciación española. Madrid: Consejo Superior de Investigaciones Científicas.

  • Navarro Tomás, T. (1946/1966). Escala de frecuencia de fonemas españoles. In Estudios de fonología española (pp. 15–30). New York: Las Américas Publishing Company.

  • NIST (2007). Speech recognition scoring toolkit (SCTK) Version 2.2.4. http://www.nist.gov/speech/tools.

  • Pérez, E. H. (2003). Frecuencia de fonemas. e-rthabla, Revista electrónica de Tecnología del Habla 1. http://lorien.die.upm.es/~lapiz/e-rthabla/numeros/N1/N1_A4.pdf.

  • Perissinotto, G. (1975). Fonología del español hablado en la Ciudad de México. Ensayo de un método sociolingüístico. México: El Colegio de México.

  • Pineda, L. A., Massé, A., Meza, I., Salas, M., Schwarz, E., Uraga, E., & Villaseñor, L. (2002). The DIME Project. In Proceedings of MICAI2002, Lecture Notes in Artificial Intelligence, vol. 2313, pp. 166–175. Springer-Verlag.

  • Pineda, L. A., Villaseñor, L., Cuétara, J., Castellanos, H., & López, I. (2004). DIMEx100: A new phonetic and speech corpus for Mexican Spanish. In C. Lemaitre, C. A. Reyes, & J. A. Gonzalez (Eds.), Advances in artificial intelligence, Iberamia-2004, lecture notes in artificial intelligence (vol. 3315, pp. 974–983). Springer-Verlag.

  • Quilis, A. (1981/1988). Fonética acústica de la lengua española. Madrid: Gredos.

  • Quilis, A., & Esgueva, M. (1980). Frecuencia de fonemas en el español hablado. Lingüística Española Actual, 2(1), 1–25.

  • Ríos Mestre, A. (1999). La transcripción fonética automática del diccionario electrónico de formas simples flexivas del español: estudio fonológico del léxico, Estudios de Lingüística Española, vol. 4. http://elies.rediris.es/elies4/.

  • Rojo, G. (1991). Frecuencia de fonemas en español actual. In M. Brea & F. M. Fernández Rei (Eds.), Homenaxe ó profesor Constantino García (pp. 451–467). Santiago de Compostela: Universidade de Santiago de Compostela, Servicio de Publicación e Intercambio Científico.

  • Sphinx (2006). The CMU sphinx open source speech recognition engines. http://cmusphinx.sourceforge.net/html/cmusphinx.php.

  • Strik, H., & Cucchiarini, C. (1998). Modeling pronunciation variation for ASR: Overview and comparison of methods. In H. Strik, J. M. Kessens, & M. Wester (Eds.), Proceedings of the ESCA workshop ‘modeling pronunciation variation for automatic speech recognition’, Rolduc, Kerkrade, 4–6 May 1998, pp. 137–144.

  • Sutton, S., Cole, R., et al. (1998). Universal speech tools: The CSLU toolkit. In Proceedings of the International Conference on Spoken Language Processing (ICSLP), pp. 3221–3224, Sydney, Australia, November 1998. http://www.cslu.ogi.edu.

  • Villaseñor, L., Massé, A. & Pineda, L. (2000). The DIME Corpus, Memorias 3º. Proceedings of Encuentro Internacional de Ciencias de la Computación ENC01, Tomo II, C. Zozaya, M. Mejía, P. Noriega y A. Sánchez (Eds.), SMCC, Aguascalientes, Ags. México, September, 2001.

  • Villaseñor, L., Montes y Gómez, M., Vaufreydaz, D. & Serignat, J. F. (2004). Experiments on the Construction of a Phonetically Balanced Corpus from the WEB, Proceedings of CICLING2004, LNCS, Springer-Verlag, vol. 2945, 416–419.

  • Wells, J. (1998). SAMPA. Computer readable phonetic alphabet. University College London, http://www.phon.ucl.ac.uk/home/sampa.

Acknowledgments

The corpus DIMEx100 was developed within the DIME Project at IIMAS, UNAM, in collaboration with the Facultad de Filosofía y Letras, UNAM, and INAOE in Tonantzintla, Puebla. The authors wish to thank all members of the project involved in the collection and transcription of the corpus for their enthusiastic participation: Fernanda López, Varinia Estrada, Sergio Coria, Iván Moreno, Ivonne López, Arturo Wong, Laura Pérez, René López, Alejandro Acosta, Alejandro Carrasco, Rafael Torres, Gerardo Mendoza, Ana Ceballos, Alejandra Espinosa and Isabel López. Special thanks go to Alejandro Reyes for technical support at INAOE, and to the 100 speakers who lent their voices to the corpus. We also thank James Allen for his continuous collaboration and encouragement throughout the development of this project. The authors also acknowledge the support of CONACyT grant 39380-U and PAPIIT-UNAM grant IN121206.

Author information

Corresponding author

Correspondence to Luis A. Pineda.

Appendices

Appendix 1

See Table 9.

Table 9 Transcription level T-54

Appendix 2

See Table 10.

Table 10 Transcription level T-44

Appendix 3

See Table 11.

Table 11 Transcription level T-22

Appendix 4

See Table 12.

Table 12 Mean duration of phonetic units (in milliseconds) at the levels T-54, T-44 and T-22

Appendix 5

See Table 13.

Table 13 Equivalent symbols between IPA and Mexbet

About this article

Cite this article

Pineda, L.A., Castellanos, H., Cuétara, J. et al. The Corpus DIMEx100: transcription and evaluation. Lang Resources & Evaluation 44, 347–370 (2010). https://doi.org/10.1007/s10579-009-9109-9
