Abstract
Conventional machine learning requires humans to train each module with hand-crafted data and symbols, and the resulting models are confined to particular tasks. To address this limitation, in this paper we design a multimodal autonomous learning architecture based on a developmental network for the co-development of audition and vision. The developmental network is a biologically inspired mechanism that enables an agent to develop and integrate audition and vision simultaneously. Furthermore, synapse maintenance is introduced into visual information learning to improve the video recognition rate, and a neuron regenesis mechanism is implemented to improve the network's usage efficiency. In the experiments, a number of fundamental words are acquired and identified using the proposed learning method without any prior knowledge of the objects or the verbal questions. The experiments show that the proposed method achieves significantly higher recognition rates than the state-of-the-art method.
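To make the mechanism concrete, the core loop of a developmental-network-style layer can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `dn_layer_update`, the single-winner competition, and the amnesic (age-dependent) learning rate are assumptions chosen to show the general pattern of top-k competition followed by Hebbian winner updates.

```python
import numpy as np

def dn_layer_update(x, W, ages, k=1):
    """One update step of a developmental-network-style layer (sketch).

    x    : input vector (e.g. a concatenated audio-visual feature)
    W    : (n_neurons, dim) synaptic weight matrix
    ages : per-neuron firing counts, used for age-dependent learning rates
    k    : number of winners in top-k competition
    """
    # Normalize input and weights so responses are cosine similarities.
    xn = x / (np.linalg.norm(x) + 1e-12)
    Wn = W / (np.linalg.norm(W, axis=1, keepdims=True) + 1e-12)
    r = Wn @ xn                      # pre-responses of all neurons

    winners = np.argsort(r)[-k:]     # top-k competition: only winners fire
    y = np.zeros_like(r)
    y[winners] = 1.0

    # Hebbian update for winners with an amnesic-average learning rate:
    # young neurons adapt quickly, mature neurons change slowly.
    for i in winners:
        ages[i] += 1
        lr = 1.0 / ages[i]
        W[i] = (1.0 - lr) * W[i] + lr * xn
    return y, W, ages
```

A synapse-maintenance step, as described in the paper, would additionally prune or down-weight synapses whose match to the winner's inputs is consistently poor; that refinement is omitted here for brevity.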
Acknowledgements
This research is supported by the China Postdoctoral Science Foundation under Grant 2016M592311, the National Natural Science Foundation of China under Grants 61603343 and 61703372, the Key Scientific Research Project of Henan Higher Education under Grant 18A413012, and the Science & Technology Innovation Team Project of Henan Province under Grant 17IRTSTHN013.
Cite this article
Wang, D., Xin, J. Emergent spatio-temporal multimodal learning using a developmental network. Appl Intell 49, 1306–1323 (2019). https://doi.org/10.1007/s10489-018-1337-5