Abstract
A common technique in visual object recognition is to sparsely encode low-level input with a feature dictionary and then spatially pool over local neighbourhoods. While some methods stack these stages in alternating layers within hierarchies, the two stages alone can also produce state-of-the-art results. Following its success in vision, this framework is moving into speech and audio processing tasks. We investigate the effect of architectural choices when the framework is applied to a spoken digit recognition task. We find that unsupervised learning of the features has a negligible effect on classification, with the number and size of the features being a greater determinant of recognition performance. Finally, we show that, given an optimised architecture, sparse coding performs comparably with Hidden Markov Models (HMMs) and outperforms K-means clustering.
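The two-stage pipeline the abstract describes can be sketched minimally as follows: frames of an utterance are sparsely encoded against a fixed dictionary (here via a greedy Orthogonal Matching Pursuit, one of the encoders the paper's framework covers), and the resulting activations are max-pooled over time into a fixed-length feature vector for a linear classifier. The dictionary is simply random unit-norm atoms, consistent with the finding that unsupervised dictionary learning has little effect; all sizes (16-dimensional frames, 64 atoms, sparsity 4, 100 frames) are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def omp_encode(x, D, k):
    """Greedily select at most k atoms of D to approximate x (Orthogonal Matching Pursuit)."""
    residual = x.copy()
    selected = []
    code = np.zeros(D.shape[1])
    for _ in range(k):
        # pick the atom most correlated with the current residual
        j = int(np.argmax(np.abs(D.T @ residual)))
        selected.append(j)
        # re-fit coefficients of all selected atoms by least squares
        coef, *_ = np.linalg.lstsq(D[:, selected], x, rcond=None)
        residual = x - D[:, selected] @ coef
    code[selected] = coef
    return code

n_dim, n_atoms, sparsity = 16, 64, 4          # illustrative sizes
D = rng.standard_normal((n_dim, n_atoms))
D /= np.linalg.norm(D, axis=0)                # unit-norm random atoms

frames = rng.standard_normal((n_dim, 100))    # stand-in for mel-spectral frames of one utterance
codes = np.stack([omp_encode(frames[:, t], D, sparsity)
                  for t in range(frames.shape[1])], axis=1)

# max-pool absolute activations over time: one fixed-length vector per utterance
feature = np.abs(codes).max(axis=1)
```

The pooled `feature` vector would then be fed to a linear SVM (e.g. LIBLINEAR) for digit classification; swapping the encoder, dictionary size, or pooling region corresponds to the architectural choices the paper varies.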
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
Cite this paper
O’Donnell, F., Triefenbach, F., Martens, JP., Schrauwen, B. (2012). Effects of Architecture Choices on Sparse Coding in Speech Recognition. In: Villa, A.E.P., Duch, W., Érdi, P., Masulli, F., Palm, G. (eds) Artificial Neural Networks and Machine Learning – ICANN 2012. ICANN 2012. Lecture Notes in Computer Science, vol 7552. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33269-2_79
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33268-5
Online ISBN: 978-3-642-33269-2