Transfer Learning for Tandem ASR Feature Extraction

Frankel, Joe; Çetin, Özgür; Morgan, Nelson

doi:10.1007/978-3-540-78155-4_20

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4892)

Abstract

Tandem automatic speech recognition (ASR), in which a single multi-layer perceptron (MLP) or an ensemble of MLPs provides a non-linear transform of the acoustic parameters, has become a standard technique in a number of state-of-the-art systems. In this paper, we examine the question of how to transfer learning from out-of-domain data to new tasks.
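To make the idea of an MLP acting as a non-linear feature transform concrete, the sketch below follows the usual tandem recipe from the literature: phone posteriors from an MLP are log-compressed, decorrelated with PCA, and appended to the original acoustic features. It is only an illustration under stated assumptions. The abstract does not give the network sizes, the number of retained components, or the post-processing used in the paper, so every dimension and name here (`mlp_posteriors`, `tandem_features`, 39 inputs, 46 phones, 25 components) is hypothetical, and the weights are random rather than trained.

```python
import numpy as np

def mlp_posteriors(x, w1, b1, w2, b2):
    """One-hidden-layer MLP: sigmoid hidden units, softmax phone posteriors."""
    h = 1.0 / (1.0 + np.exp(-(x @ w1 + b1)))
    z = h @ w2 + b2
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def tandem_features(acoustic, posteriors, n_components):
    """Log posteriors -> PCA decorrelation -> append to the acoustic features."""
    logp = np.log(posteriors + 1e-10)
    centered = logp - logp.mean(axis=0)
    # PCA via SVD of the centered data; rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    reduced = centered @ vt[:n_components].T
    return np.hstack([acoustic, reduced])

# Hypothetical sizes and random (untrained) weights, for illustration only.
rng = np.random.default_rng(0)
n_frames, n_in, n_hidden, n_phones = 200, 39, 64, 46
acoustic = rng.normal(size=(n_frames, n_in))
w1 = rng.normal(scale=0.1, size=(n_in, n_hidden))
b1 = np.zeros(n_hidden)
w2 = rng.normal(scale=0.1, size=(n_hidden, n_phones))
b2 = np.zeros(n_phones)

post = mlp_posteriors(acoustic, w1, b1, w2, b2)
feats = tandem_features(acoustic, post, n_components=25)
```

In a real system the PCA transform would be estimated once on training data and then applied unchanged to test data; the sketch estimates it on the same frames only to stay short.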

Our primary focus is to develop tandem features for recognition of speech from the meetings domain. We show that adapting MLPs originally trained on conversational telephone speech leads to lower word error rates than training MLPs solely on the target data. Multi-task learning, in which a single MLP is also trained to perform a secondary task (in this case a speech enhancement mapping from farfield to nearfield signals), is likewise shown to be advantageous.
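The multi-task setup described above can be sketched as a network with one shared hidden layer and two output heads: a softmax head for phone classification and a linear head that regresses nearfield features from farfield input. This is a minimal sketch in the spirit of Caruana-style multi-task learning, not the authors' implementation. The loss weighting `alpha`, the layer sizes, the tanh non-linearity, and the synthetic data are all assumptions introduced for illustration.

```python
import numpy as np

def init_params(rng, n_in, n_hidden, n_phones, n_enh, scale=0.1):
    """Shared hidden layer (w1, b1) plus a classification and an enhancement head."""
    return {
        "w1": rng.normal(scale=scale, size=(n_in, n_hidden)), "b1": np.zeros(n_hidden),
        "wc": rng.normal(scale=scale, size=(n_hidden, n_phones)), "bc": np.zeros(n_phones),
        "we": rng.normal(scale=scale, size=(n_hidden, n_enh)), "be": np.zeros(n_enh),
    }

def forward(p, x):
    h = np.tanh(x @ p["w1"] + p["b1"])           # shared representation
    z = h @ p["wc"] + p["bc"]
    z = z - z.max(axis=1, keepdims=True)
    post = np.exp(z)
    post = post / post.sum(axis=1, keepdims=True)  # phone posteriors
    enh = h @ p["we"] + p["be"]                    # predicted nearfield features
    return h, post, enh

def loss(p, x, labels, near, alpha):
    """Cross-entropy on phones plus alpha times squared error on nearfield targets."""
    _, post, enh = forward(p, x)
    n = x.shape[0]
    ce = -np.log(post[np.arange(n), labels] + 1e-12).mean()
    mse = 0.5 * ((enh - near) ** 2).sum(axis=1).mean()
    return ce + alpha * mse

def gradient_step(p, x, labels, near, alpha, lr):
    """One full-batch gradient descent step on the combined loss."""
    h, post, enh = forward(p, x)
    n = x.shape[0]
    d_c = post.copy()
    d_c[np.arange(n), labels] -= 1.0
    d_c /= n                                       # gradient at the softmax inputs
    d_e = alpha * (enh - near) / n                 # gradient at the regression outputs
    d_h = (d_c @ p["wc"].T + d_e @ p["we"].T) * (1.0 - h ** 2)  # both tasks reach w1
    grads = {
        "w1": x.T @ d_h, "b1": d_h.sum(axis=0),
        "wc": h.T @ d_c, "bc": d_c.sum(axis=0),
        "we": h.T @ d_e, "be": d_e.sum(axis=0),
    }
    return {k: p[k] - lr * grads[k] for k in p}

# Synthetic stand-ins: "farfield" inputs, phone labels, and "nearfield" targets.
rng = np.random.default_rng(0)
far = rng.normal(size=(100, 39))
labels = rng.integers(0, 46, size=100)
near = 0.5 * far[:, :13] + 0.1 * rng.normal(size=(100, 13))

params = init_params(rng, n_in=39, n_hidden=64, n_phones=46, n_enh=13)
loss_before = loss(params, far, labels, near, alpha=0.5)
for _ in range(20):
    params = gradient_step(params, far, labels, near, alpha=0.5, lr=0.05)
loss_after = loss(params, far, labels, near, alpha=0.5)
```

The point of the design is that gradients from both heads flow into the shared weights `w1`, so the secondary task shapes the hidden representation. For tandem feature extraction, only the phone-posterior head would presumably be used after training.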

We also present recognition experiments on broadcast news data which suggest that structure learned from English speech can be adapted to Mandarin Chinese. Adapted MLPs using about 97 hours of target-language data matched the performance of tandem MLPs trained from a random initialization on 440 hours of Mandarin speech.
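One common way to adapt an MLP across languages is to start target-language training from the source network's weights rather than from a random initialization, replacing the output layer when the phone sets differ. The sketch below shows only that initialization step. The abstract does not say which layers were retained or how adaptation was run, so the choice to copy the input-to-hidden weights and re-initialize the output layer is an assumption, as are the helper names and the phone-set sizes (46 and 71).

```python
import numpy as np

def init_mlp(rng, n_in, n_hidden, n_out, scale=0.1):
    """Randomly initialized one-hidden-layer MLP parameters."""
    return {
        "w1": rng.normal(scale=scale, size=(n_in, n_hidden)), "b1": np.zeros(n_hidden),
        "w2": rng.normal(scale=scale, size=(n_hidden, n_out)), "b2": np.zeros(n_out),
    }

def adapt_from_source(source, n_target_out, rng, scale=0.1):
    """Build a starting point for target training from a trained source MLP.

    The hidden layer is always copied. The output layer is copied only if the
    source and target have the same number of output classes; otherwise it is
    re-initialized at random (an assumed policy, not taken from the paper).
    """
    n_hidden = source["w1"].shape[1]
    target = {"w1": source["w1"].copy(), "b1": source["b1"].copy()}
    if source["w2"].shape[1] == n_target_out:
        target["w2"] = source["w2"].copy()
        target["b2"] = source["b2"].copy()
    else:
        target["w2"] = rng.normal(scale=scale, size=(n_hidden, n_target_out))
        target["b2"] = np.zeros(n_target_out)
    return target

rng = np.random.default_rng(0)
english_mlp = init_mlp(rng, n_in=39, n_hidden=64, n_out=46)   # stands in for a trained net
mandarin_init = adapt_from_source(english_mlp, n_target_out=71, rng=rng)
```

Training would then continue on the target-language data from `mandarin_init` with an ordinary gradient-based procedure.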

Author information

Authors and Affiliations

University of Edinburgh: Joe Frankel
International Computer Science Institute: Joe Frankel, Özgür Çetin, Nelson Morgan

Editor information

Andrei Popescu-Belis, Steve Renals, Hervé Bourlard

About this paper

Cite this paper

Frankel, J., Çetin, Ö., Morgan, N. (2008). Transfer Learning for Tandem ASR Feature Extraction. In: Popescu-Belis, A., Renals, S., Bourlard, H. (eds) Machine Learning for Multimodal Interaction. MLMI 2007. Lecture Notes in Computer Science, vol 4892. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78155-4_20

DOI: https://doi.org/10.1007/978-3-540-78155-4_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-78154-7
Online ISBN: 978-3-540-78155-4
eBook Packages: Computer Science, Computer Science (R0)