Temporal Issues and Recognition Errors on the Capitalization of Speech Transcriptions

Batista, Fernando; Mamede, Nuno; Trancoso, Isabel

doi:10.1007/978-3-540-87391-4_8

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5246)

Abstract

This paper investigates the capitalization task over Broadcast News speech transcriptions. Most of the capitalization information is provided by two large newspaper corpora, and the spoken language model is produced by retraining the newspaper language models with spoken data. Three corpora subsets from different time periods are used for evaluation, revealing the importance of having training data available from nearby time periods. Results are provided for both manual and automatic transcriptions, which also shows the impact of recognition errors on the capitalization task. Our approach is based on maximum entropy models and uses an unlimited vocabulary. The language model produced with this approach can be sorted and then pruned to reduce the computational resources required, without much impact on the final results.
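To make the last two sentences of the abstract concrete, the following is a minimal sketch of a log-linear (maximum entropy) capitalizer whose weights are sorted and pruned. It is not the authors' implementation: the label set, the feature templates (`w=` for the current word, `prev=` for the previous word), the toy weights, and the choice of absolute weight as the sorting criterion are all assumptions made for illustration.

```python
import math

# Hypothetical capitalization classes; the paper's label set is not given here.
LABELS = ("lower", "first_cap", "all_caps")


def maxent_probs(weights, features):
    """Return p(label | features) under a log-linear (maximum entropy) model."""
    scores = {
        label: sum(weights.get((feat, label), 0.0) for feat in features)
        for label in LABELS
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(score - top) for label, score in scores.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


def prune(weights, keep):
    """Sort weights by absolute value (an assumed criterion) and keep the largest."""
    ranked = sorted(weights.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return dict(ranked[:keep])


def apply_case(word, label):
    """Render a lowercase word in the predicted capitalization form."""
    if label == "first_cap":
        return word.capitalize()
    if label == "all_caps":
        return word.upper()
    return word.lower()


def capitalize(words, weights):
    """Label each word independently using current-word and previous-word features."""
    out = []
    prev = "<s>"
    for word in words:
        feats = [f"w={word}", f"prev={prev}"]
        probs = maxent_probs(weights, feats)
        best = max(LABELS, key=lambda label: probs[label])
        out.append(apply_case(word, best))
        prev = word
    return out


# Toy weights, invented for illustration only (not trained on any corpus).
WEIGHTS = {
    ("w=lisboa", "first_cap"): 2.5,
    ("w=lisboa", "lower"): -1.0,
    ("prev=em", "first_cap"): 0.4,
    ("w=de", "lower"): 1.8,
    ("w=rtp", "all_caps"): 3.0,
    ("prev=a", "all_caps"): 0.05,
}

PRUNED = prune(WEIGHTS, keep=4)
print(capitalize(["em", "lisboa"], WEIGHTS))
print(capitalize(["em", "lisboa"], PRUNED))
```

Because features are plain strings looked up in a dictionary, any word can be scored, including one never seen in training; it simply contributes no weight. In this toy setting, pruning drops the two smallest weights and the prediction for "lisboa" stays the same, which mirrors (but does not demonstrate) the abstract's claim that pruning has little impact on the results.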

This work was funded by PRIME National Project TECNOVOZ number 03/165 and supported by ISCTE.

Author information

Authors and Affiliations

L2F - Spoken Language Systems Laboratory - INESC ID Lisboa, R. Alves Redol, 9, 1000-029, Lisboa, Portugal
Fernando Batista, Nuno Mamede & Isabel Trancoso
ISCTE - Instituto de Ciências do Trabalho e da Empresa, Portugal
Fernando Batista
IST - Instituto Superior Técnico, Portugal
Nuno Mamede & Isabel Trancoso

Editor information

Petr Sojka, Aleš Horák, Ivan Kopeček, Karel Pala

Cite this paper

Batista, F., Mamede, N., Trancoso, I. (2008). Temporal Issues and Recognition Errors on the Capitalization of Speech Transcriptions. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech and Dialogue. TSD 2008. Lecture Notes in Computer Science, vol 5246. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-87391-4_8

DOI: https://doi.org/10.1007/978-3-540-87391-4_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-87390-7
Online ISBN: 978-3-540-87391-4
eBook Packages: Computer Science, Computer Science (R0)
