
Emergent spatio-temporal multimodal learning using a developmental network

Abstract

Conventional machine learning requires humans to train each module manually with hand-crafted data and symbols, and the resulting systems are confined to particular tasks. To address this limitation, in this paper we design a multimodal autonomous learning architecture based on a developmental network for the co-development of audition and vision. The developmental network is a biologically inspired mechanism that enables an agent to develop and integrate audition and vision simultaneously. Furthermore, synapse maintenance is introduced into the visual learning pathway to improve the video recognition rate, and a neuron regenesis mechanism is implemented to improve the network's usage efficiency. In the experiments, a number of fundamental words are acquired and identified using the proposed learning methodology, without any prior knowledge of the objects or the verbal questions before running. The experiments show that the proposed method achieves significantly higher recognition rates than a state-of-the-art method.
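To make the mechanism concrete, the following is a minimal Python/NumPy sketch of a developmental-network-style hidden layer with winner-take-all competition, amnesic Hebbian updates, and a toy synapse-maintenance rule. This is our own illustrative reading of the abstract, not the authors' implementation: the class name `DNLayerSketch`, the top-1 simplification, and the deviation threshold `tau` are all assumptions.

```python
import numpy as np

class DNLayerSketch:
    """Illustrative developmental-network-style hidden (Y) layer.

    Hypothetical simplification: each neuron holds bottom-up weights
    from a sensory (X) vector and top-down weights from a motor/label
    (Z) vector; neurons compete top-1, and only the winner is updated
    by an amnesic-average Hebbian rule.
    """

    def __init__(self, n_neurons, x_dim, z_dim, tau=0.5, seed=0):
        rng = np.random.default_rng(seed)
        self.wx = rng.random((n_neurons, x_dim))   # bottom-up (vision/audio)
        self.wz = rng.random((n_neurons, z_dim))   # top-down (motor/label)
        self.dev = np.zeros((n_neurons, x_dim))    # per-synapse deviation
        self.age = np.ones(n_neurons)              # per-neuron firing age
        self.tau = tau                             # maintenance threshold

    @staticmethod
    def _unit(v):
        n = np.linalg.norm(v)
        return v / n if n > 0.0 else v

    def step(self, x, z):
        """One learning step: compete, fire, Hebbian-update the winner."""
        x, z = self._unit(x), self._unit(z)
        # Synapse maintenance (toy rule): mute synapses whose running
        # deviation stays large, so noisy inputs stop driving competition.
        mask = (self.dev < self.tau).astype(float)
        pre = np.array([
            self._unit(self.wx[i] * mask[i]) @ (x * mask[i])
            + self._unit(self.wz[i]) @ z
            for i in range(len(self.age))
        ])
        w = int(np.argmax(pre))                    # top-1 competition
        lr = 1.0 / self.age[w]                     # amnesic learning rate
        self.dev[w] = (1 - lr) * self.dev[w] + lr * np.abs(x - self.wx[w])
        self.wx[w] = (1 - lr) * self.wx[w] + lr * x
        self.wz[w] = (1 - lr) * self.wz[w] + lr * z
        self.age[w] += 1
        return w

# Toy usage: repeatedly associate a "visual" pattern with a one-hot label.
net = DNLayerSketch(n_neurons=20, x_dim=64, z_dim=5)
x = np.random.default_rng(1).random(64)
z = np.eye(5)[2]
for _ in range(10):
    winner = net.step(x, z)
```

The published developmental network is richer than this (for example, top-k firing and explicit neuron regenesis to reuse rarely winning neurons); the sketch only fixes the flavor of the competitive Hebbian update described above.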

Acknowledgements

This research was supported by the China Postdoctoral Science Foundation under Grant 2016M592311, the National Natural Science Foundation of China under Grants 61603343 and 61703372, the Key Scientific Research Project of Henan Higher Education under Grant 18A413012, and the Science & Technology Innovation Team Project of Henan Province under Grant 17IRTSTHN013.

Author information

Corresponding author

Correspondence to Jianbin Xin.

About this article

Cite this article

Wang, D., Xin, J. Emergent spatio-temporal multimodal learning using a developmental network. Appl Intell 49, 1306–1323 (2019). https://doi.org/10.1007/s10489-018-1337-5
