A comparative analysis of pooling strategies for convolutional neural network based Hindi ASR

  • Original Research
Journal of Ambient Intelligence and Humanized Computing

Abstract

Speech recognition is in a golden era as the convolutional neural network (CNN) has become the leading model in this domain. CNN-based acoustic models have shown significant improvements in speech recognition tasks. These improvements are attributed to three components of the CNN: local filters, weight sharing, and pooling. However, a limited understanding of its inner workings leaves this powerful model a black box. Although CNNs already perform well in speech recognition, further investigation can help achieve a better recognition rate. Pooling is an important component of the CNN that reduces the dimensionality of the feature map and yields a compact feature representation. Pooling methods such as max pooling, average pooling, stochastic pooling, mixed pooling, \(L_p\) pooling, multi-scale orderless pooling, and spectral pooling each have their own advantages and disadvantages. In this paper, we explore state-of-the-art pooling methods in depth for speech recognition tasks and investigate which method performs well under which conditions. We evaluate different pooling methods across different architectures on a Hindi speech dataset. The experimental results show that max pooling performs well on clean speech, whereas stochastic pooling performs well in noisy environments.
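To make the comparison concrete, the following is a minimal NumPy sketch of four of the pooling operators named in the abstract (max, average, \(L_p\), and stochastic pooling), applied to a one-dimensional feature map. It is an illustration under stated assumptions, not the implementation used in the paper: the function name `pool_1d`, the non-overlapping window layout, the normalized form of \(L_p\) pooling, and the handling of all-zero windows are all hypothetical choices made for this example.

```python
import numpy as np


def pool_1d(feature_map, size, mode="max", p=2.0, rng=None):
    """Pool a 1-D feature map over non-overlapping windows of length `size`.

    Illustrative only: the window layout and parameter names are assumptions,
    not the configuration used in the paper.
    """
    x = np.asarray(feature_map, dtype=float)
    n_windows = len(x) // size  # any incomplete trailing window is dropped
    windows = x[: n_windows * size].reshape(n_windows, size)

    if mode == "max":
        return windows.max(axis=1)
    if mode == "average":
        return windows.mean(axis=1)
    if mode == "lp":
        # Normalized L_p pooling: p = 1 is the mean magnitude, and the
        # result approaches max pooling as p grows.
        return (np.abs(windows) ** p).mean(axis=1) ** (1.0 / p)
    if mode == "stochastic":
        # Training-time stochastic pooling: sample one activation per window
        # with probability proportional to its value. Assumes non-negative
        # activations (for example, after a ReLU).
        rng = np.random.default_rng() if rng is None else rng
        totals = windows.sum(axis=1, keepdims=True)
        safe_totals = np.where(totals > 0, totals, 1.0)
        # All-zero windows fall back to a uniform choice (an assumption).
        probs = np.where(totals > 0, windows / safe_totals, 1.0 / size)
        picks = np.array([rng.choice(size, p=row) for row in probs])
        return windows[np.arange(n_windows), picks]
    raise ValueError(f"unknown pooling mode: {mode}")


if __name__ == "__main__":
    activations = [1.0, 3.0, 2.0, 0.0, 4.0, 4.0]
    for mode in ("max", "average", "lp", "stochastic"):
        print(mode, pool_1d(activations, size=2, mode=mode))
```

Each mode halves the length of the example feature map, which is the dimensionality reduction the abstract refers to. Max pooling keeps only the strongest activation in each window, while stochastic pooling can also select weaker activations, so its output varies from run to run unless a seeded generator is passed as `rng`.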

Author information

Corresponding author

Correspondence to Rajesh Kumar Aggarwal.

Ethics declarations

Conflict of interest

There is no conflict of interest for this paper.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (DOCX 22 kb)

About this article

Cite this article

Passricha, V., Aggarwal, R.K. A comparative analysis of pooling strategies for convolutional neural network based Hindi ASR. J Ambient Intell Human Comput 11, 675–691 (2020). https://doi.org/10.1007/s12652-019-01325-y

  • DOI: https://doi.org/10.1007/s12652-019-01325-y