Abstract
Conventional machine learning requires humans to train each module with hand-crafted data and symbols, and the resulting models are confined to particular tasks. To address this limitation, in this paper we design a multimodal autonomous learning architecture based on a developmental network for the co-development of audition and vision. The developmental network is a biologically inspired mechanism that enables an agent to develop and integrate audition and vision simultaneously. Furthermore, synapse maintenance is introduced into visual information learning to improve the video recognition rate, and a neuron regenesis mechanism is implemented to improve the network's usage efficiency. In the experiments, a number of fundamental words are acquired and identified using the proposed learning method without any prior knowledge of the objects or the verbal questions. The experiments show that the proposed method achieves significantly higher recognition rates than the state-of-the-art method.
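To make the mechanism concrete, the core loop of a developmental-network-style layer can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `dn_layer_update`, the single-winner competition, and the amnesic (age-dependent) learning rate are assumptions chosen to show the general pattern of top-k competition followed by Hebbian winner updates.

```python
import numpy as np

def dn_layer_update(x, W, ages, k=1):
    """One update step of a developmental-network-style layer (sketch).

    x    : input vector (e.g. a concatenated audio-visual feature)
    W    : (n_neurons, dim) synaptic weight matrix
    ages : per-neuron firing counts, used for age-dependent learning rates
    k    : number of winners in top-k competition
    """
    # Normalize input and weights so responses are cosine similarities.
    xn = x / (np.linalg.norm(x) + 1e-12)
    Wn = W / (np.linalg.norm(W, axis=1, keepdims=True) + 1e-12)
    r = Wn @ xn                      # pre-responses of all neurons

    winners = np.argsort(r)[-k:]     # top-k competition: only winners fire
    y = np.zeros_like(r)
    y[winners] = 1.0

    # Hebbian update for winners with an amnesic-average learning rate:
    # young neurons adapt quickly, mature neurons change slowly.
    for i in winners:
        ages[i] += 1
        lr = 1.0 / ages[i]
        W[i] = (1.0 - lr) * W[i] + lr * xn
    return y, W, ages
```

A synapse-maintenance step, as described in the paper, would additionally prune or down-weight synapses whose match to the winner's inputs is consistently poor; that refinement is omitted here for brevity.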
Acknowledgements
This research is supported by the China Postdoctoral Science Foundation under Grant 2016M592311, the National Natural Science Foundation of China under Grants 61603343 and 61703372, the Key Scientific Research Project of Henan Higher Education under Grant 18A413012, and the Science & Technology Innovation Team Project of Henan Province under Grant 17IRTSTHN013.
Cite this article
Wang, D., Xin, J. Emergent spatio-temporal multimodal learning using a developmental network. Appl Intell 49, 1306–1323 (2019). https://doi.org/10.1007/s10489-018-1337-5