Elsevier

Neurocomputing

Volume 311, 15 October 2018, Pages 164-175

Information-dense actions as contexts

https://doi.org/10.1016/j.neucom.2018.05.056

Abstract

In artificial intelligence, many temporal processing tasks, such as speech recognition, video analysis, and natural language processing, depend not only on the spatial contents of the current sensory input frame but also on the relevant context in the attended past. It remains elusive how brains use temporal contexts. Many computational methods, such as hidden Markov models and recurrent neural networks, require the human programmer to handcraft contexts as symbols. It has been proved that our Developmental Networks (DN) are capable of learning any emergent Turing Machine (TM); they can learn patterns as their states under human teachers’ scrupulous supervision. In this paper, we explain why contexts are important for temporal processing. We study how agent actions are natural sources of contexts, and we enable muscle neurons to autonomously generate actions as contexts. In humans, muscle actions correspond to the firings of muscle neurons. They are information-dense in time and correlated with the cognitive and motor skills of the individual. Some actions are meant to handle time warping, while others are not (e.g., those for time-duration counting). We model actions as information-dense action patterns. We use entropy to define the new concept of information-denseness. We also introduce the free-of-labeling technique. We experiment with DN for the recognition of audio sequences as an example modality, but the principles are modality independent. Our experimental results show how the information-dense actions and the free-of-labeling mechanism help DN to generate temporal contexts. This work is a necessary step toward our goal of enabling machines to autonomously abstract contexts from actions through life-long development.

Introduction

At least five factors [1] contribute to intelligence: (1) the genome (or developmental program), (2) the sensors, (3) the effectors, (4) the computational resources, and (5) the environment (body, teachers, and other physical facts) that each individual lives through. All five factors are important to the development of an agent. The goal of this work mainly concerns (3) the effectors and (5) the environment that facilitates the development of the agent’s simple-to-complex skills. In particular, we study how the information in the effectors facilitates the development of the agent’s behaviors in the environment. By doing so, we do not mean that the other factors are irrelevant.

There are two types of skills [2]: declarative skills (e.g., verbal), which can be expressed using a language, and non-declarative skills (e.g., bike riding), which are typically not demonstrated using a language. For classification tasks, actions often correspond to declarative skills. For robotic navigation tasks, actions often correspond to non-declarative skills. Throughout this paper, the term “action” covers both declarative and non-declarative skills, not just an action that an arm carries out.

Symbolic labels are commonly used in speech recognition. They are handcrafted by human programmers as labels for phonemes, words, and sentences.

Compared to symbolic labels, actions are information-dense in time and provide much information for temporal processing tasks.

Much effort has been spent on temporal processing problems in recent years. Many existing methods are task-specific because they involve handcrafted symbols. These methods cannot learn new concepts after the symbol design, because the symbols must be fully handcrafted in advance. Instead of symbols, DN uses a framework of autonomous development with patterns as representations. Such representations automatically emerge from experience and can be learned incrementally.

Time warping means that two sequences are treated as equivalent if they are similar but the durations of some segments differ. Handling time warping is a common objective in temporal processing and must be treated carefully.

Dynamic time warping (DTW) is an algorithm widely used in the analysis of video, audio, and graphics sequences. It measures the similarity between two sequences whose segments vary in time duration. For example, both the voice recognition in [3] and the spoken-word recognition in [4] used DTW. DTW itself does not use probabilities.
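As a sketch of the idea (our own minimal illustration, not the exact algorithms in [3] or [4]), the classic DTW recurrence fills a cost table in which each cell may extend a match, an insertion, or a deletion:

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic-programming DTW distance between two 1-D sequences."""
    n, m = len(a), len(b)
    # cost[i, j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # match, insertion, or deletion: this is what absorbs warping
            cost[i, j] = d + min(cost[i - 1, j - 1],
                                 cost[i - 1, j],
                                 cost[i, j - 1])
    return cost[n, m]

# Same shape, different segment durations: DTW still aligns them perfectly
print(dtw_distance([0, 0, 1, 1, 1, 2, 2], [0, 1, 2]))  # 0.0
```

Because the recurrence only compares frame values, no probabilities are involved, which matches the observation above that DTW itself is non-probabilistic.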

Hidden Markov Models (HMMs) use probabilities that deal with time warping implicitly rather than modeling it explicitly. HMMs are often used hierarchically; for example, phonemes, letters, and words form three levels of a hierarchy. At each level, each state in an HMM corresponds to a stage. For example, in [5], HMMs were used to model two-handed tracking from videos. In [6], Vogler and Metaxas used HMMs to recognize American Sign Language (ASL) sentences. HMMs were utilized to recognize Arabic handwriting in [7].
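The implicit handling of time warping comes from self-transitions: a state may keep emitting for a variable number of frames. A minimal forward-algorithm sketch with a hypothetical two-state model (all probabilities below are invented for illustration):

```python
import numpy as np

def forward_likelihood(pi, A, B, obs):
    """Forward algorithm: P(observations), summed over all hidden state paths.

    pi : (S,)   initial state probabilities
    A  : (S, S) transition matrix, A[i, j] = P(next state j | state i)
    B  : (S, O) emission matrix, B[i, o] = P(observation o | state i)
    obs: sequence of observation indices
    """
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        # self-transitions in A absorb time warping: a state may
        # account for a variable number of consecutive frames
        alpha = (alpha @ A) * B[:, o]
    return float(alpha.sum())

# Hypothetical two-state, two-observation model
pi = np.array([1.0, 0.0])
A = np.array([[0.7, 0.3],    # state 0 tends to repeat (duration flexibility)
              [0.0, 1.0]])
B = np.array([[0.9, 0.1],
              [0.1, 0.9]])
print(forward_likelihood(pi, A, B, [0, 0, 1]))
```

Note that the state inventory (here, two states) is fixed by hand in advance, which is exactly the handcrafted-symbol property discussed below.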

Both DTW and HMMs use symbolic representations, because their states, often formed by a clustering technique, are handcrafted.

Time duration is the opposite of time warping: the durations in two different sequences are the key factor that distinguishes them. In English, phone duration helps to distinguish several words from each other, such as “pitch” and “peach” or “ship” and “sheep”. In some other languages, such as Finnish, phone durations can be the only clue for discriminating between certain words [8]. Good time-duration modeling can therefore be a major issue in temporal processing. The hidden semi-Markov model (HSMM) in [9] and the expanded-state HMM (ESHMM) in [10] extend the HMM framework by explicitly approximating state-duration distributions. The work in [11] compares and evaluates the performance of these extended HMM methods with duration-modeling techniques.

Neural networks use connections to reach a certain type of flexibility in temporal trajectories. They at least partially use emergent representations (i.e., patterns of neuronal firing instead of a series of symbols), but the emergent representations have often been mixed with symbolic representations, e.g., handcrafted internal representations (states like those in Kalman filters).

Neural networks use natural and discriminative training to estimate the probabilities of frames in the temporal stream. Many successful neural-network-based methods are adept at handling short-time units, such as individual phonemes in speech recognition and isolated words in language processing. The work in [12] utilized neural networks for phoneme classification. A sliding-mode neural network was presented for the tracking control of a robot manipulator in [13]. Collobert and Weston proposed a convolutional neural network model for sentence analysis (e.g., chunks, semantic roles) in [14]. Recently, Long Short-Term Memory (LSTM) networks and Recurrent Neural Networks (RNNs) have been used in this field [15]. They can detect latent temporal dependencies. To make neural networks more efficient, some methods have been proposed for intermittent measurement and dynamic analysis of neural networks (e.g., [16], [17]).

These systems require sophisticated design by a human programmer who typically focuses on one specific task. Without a general-purpose framework and fully emergent representations, they do not fully use temporal contexts during processing, and they do not directly take actions from the output side as contexts.

A developmental method is task-nonspecific because it never involves symbols in any of its internal representations and is never restricted to the representations of one specific task. Even if a teacher uses symbols, the networks only use emergent patterns (in sensors or effectors). Unlike symbols, which are either identical or different, patterns from the same sensors or effectors have distances in the neuronal inner-product spaces, so patterns never observed before can be dealt with based on their similarities to observed patterns.
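This graded similarity is just an inner-product (cosine) comparison; a small sketch with made-up pattern vectors:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: the inner product of the normalized vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Symbols are either equal or not; patterns have graded similarity.
pattern_seen  = np.array([0.9, 0.1, 0.0])
pattern_novel = np.array([0.8, 0.2, 0.1])  # never observed, but close
pattern_far   = np.array([0.0, 0.1, 0.9])

print(cosine(pattern_seen, pattern_novel))  # near 1: treated as similar
print(cosine(pattern_seen, pattern_far))    # near 0: treated as different
```

A novel pattern thus inherits behavior from its nearest observed patterns, something a symbol table cannot do.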

Unlike symbolic methods, DN is skull-closed: it does not need human designers to manually adjust its internal representations after “birth”, and it can learn incrementally during interaction with natural environments. According to Weng [18], networks that generate emergent representations can easily deal with natural inputs and motor actions and can learn incrementally. DN can handle information-dense natural sensory patterns and has motor areas representing information-dense actions. DN performs well in pattern-recognition tasks: it was used as an object-recognition network for a mobile navigation application in [19], and its implementation in a visual parking-assistance system is presented in [20].

Our goal is to use DN to process temporal streams in real time. DN has been proved to learn any TM immediately, one transition at a time, free of any errors [18]. That TM was assumed to be handcrafted upfront. Although the human common-sense knowledge base can be considered a grand TM at a coarse language level, the states at a fine-grained time level (e.g., 20 ms–100 ms) are typically unavailable.

This work investigates how actions at a fine-grained time level are useful as states, where we regard states and actions as the same: both declarative and non-declarative skills can be expressed as fine-grained actions.

In this work, we use automatically generable, temporally information-dense patterns as temporally information-dense actions. Natural robotic actions would be better, since no tangible restrictions are imposed on such action patterns, but natural robotic actions are difficult to come by before the method here has been sufficiently investigated. We plan to use natural action patterns from the robot body to replace the current temporally information-dense patterns when the DN-equipped autonomous robot body is completed.

Our original work was accepted by the 2017 International Joint Conference on Neural Networks (IJCNN) [21]. This archival journal version extends that work by more than 40%. The main additional novel parts of this journal version are:

  • 1.

    The information entropy of a temporal sequence is now used to mathematically define the density of actions, which the conference version did not do. The entropy values have been computed for all experimental settings to contrast their different values in terms of the mathematically defined information-dense concept.

  • 2.

    The patterns of the concept-2 (dense) motor neurons are now far more numerous and are automatically generated; previously they were much fewer and handcrafted. In other words, the concept-2 actions now automatically emerge as patterns, whereas before they were handcrafted labels. This greatly reduces the cost of system development, because the programmer does not need to specify which patterns correspond to a label: the vector representation and the inner-product space automatically take care of the similarity among patterns.

  • 3.

    A volume-information representation is added to replace the original method of appending an “energy-component” element (a batch method) when processing the waveform. This mechanism corresponds to a hypothesized “genes-prepositioned” but “partially emergent” feature in the sensory input of every neuron that might be present among many “purely emergent” features. The experimental results show that this volume feature performs better than the “energy component” of the conference version.

  • 4.

    We refine the definition of “hair cells” so that each covers shorter frames, with a certain overlap between consecutive frames. This refinement provides more information-dense contexts and improves the performance.

  • 5.

    The hidden neurons now have locations inside the “skull”. Previously, all hidden neurons were location-free, which is common in many artificial neural networks. The new mechanism encourages smoothness of the hidden representations: nearby neurons detect similar features. It also allows the recruitment of neurons during life-long learning implemented by Hebbian learning. This process of recruitment gradually adapts a hierarchically smooth representation to better fit the changing distribution.

The remainder of the paper is organized as follows: we first discuss the theory part in Section 2. The DN algorithm and some key details are listed in Section 3. In Section 4 we present the implementation details and analysis of experimental results. Concluding remarks are offered in Section 5.


Theory: information-dense

How does an autonomous agent generate actions that are sufficient for not only declarative skills but also non-declarative skills [2]? It is arguable that these two categories of skills are both driven by muscle neurons at temporally dense, fine-grained levels. By “temporally dense” we mean that the signal must change its value at a high temporal frequency. Although a written language is often documented at a temporally sparse word level (i.e., declarative), the continuous pronunciation of each word
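As an illustrative sketch (plain Shannon entropy over an invented action stream, not necessarily the exact definition used in this paper), a temporally sparse label carries zero entropy per frame, while a temporally dense action stream does not:

```python
import math
from collections import Counter

def entropy_per_frame(seq):
    """Shannon entropy (bits) of the empirical frame-value distribution."""
    counts = Counter(seq)
    n = len(seq)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# A sparse symbolic labeling: one label held for the whole utterance
sparse = ["word"] * 8
# A temporally dense action stream: the value changes frame to frame
dense = ["a1", "a2", "a1", "a3", "a2", "a4", "a3", "a1"]

print(entropy_per_frame(sparse))  # 0.0 bits: no information per frame
print(entropy_per_frame(dense))   # > 0 bits: information-dense in time
```

A signal that changes value at a high temporal frequency tends toward a flatter frame distribution, and hence higher entropy, than a label held constant over many frames.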

Developmental networks

In DN, the neurons in the X area receive and transfer sensory information; the neurons in the Z area generate and transfer actions or concepts. The skull-closed Y area bi-directionally connects the X and Z areas like a bridge. The firing neurons in the Y area are the winners whose weights best match the (X, Z) pattern of the last moment. The Z area then generates the next firing pattern according to the firing Y neurons. The general algorithm of DN is shown in Algorithm 1.

DN is in asynchronous mode and
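The bridge-like X-Y-Z update described above can be sketched as follows. This is a simplified illustration, not the full DN algorithm of [18]; the top-k competition size, learning rate, and area sizes are illustrative assumptions of ours:

```python
import numpy as np

def dn_step(x, z, Wy, Wz, lr=0.1, k=1):
    """One simplified DN-style update: hidden competition, then motor output.

    x, z : current sensory (X) and motor (Z) firing patterns
    Wy   : (n_hidden, len(x)+len(z)) weights of Y, matching (X, Z) jointly
    Wz   : (len(z), n_hidden) weights from firing Y neurons to the Z area
    """
    p = np.concatenate([x, z])
    p = p / (np.linalg.norm(p) + 1e-12)
    # Y neurons compete: the top-k inner-product matches of (X, Z) fire
    resp = Wy @ p
    winners = np.argsort(resp)[-k:]
    y = np.zeros(len(resp))
    y[winners] = 1.0
    # Hebbian update: only the winners move toward the pattern they matched
    Wy[winners] += lr * (p - Wy[winners])
    # Z fires from the winning Y neurons, producing the next action/context
    z_next = Wz @ y
    Wz[:, winners] += lr * (z[:, None] - Wz[:, winners])
    return y, z_next, Wy, Wz

rng = np.random.default_rng(0)
Wy = rng.random((5, 6))              # 5 hidden neurons, 6 = |X| + |Z|
Wz = rng.random((2, 5))
x = np.array([1.0, 0.0, 0.0, 0.0])   # sensory pattern
z = np.array([0.0, 1.0])             # supervised action/context pattern
y, z_next, Wy, Wz = dn_step(x, z, Wy, Wz)
```

The key point mirrored here is that Y matches the joint (X, Z) pattern of the last moment, so the motor pattern itself serves as the temporal context for the next prediction.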

Experiments

As temporally fine-grained examples, the experiments in this work used DN for phoneme recognition through time series only as examples. Our methods are modality independent and potentially applicable to vision, audition, natural language, etc. In addition, we are also running experiments using video as input for autonomous navigation and using words as input for natural language acquisition. However, we do not discuss vision and natural language in this paper due to space limitations, and plan

Conclusions and discussions

We argued that actions may serve as information-dense states and provide rich context information for the learning agent. Such density is quantified and measured by the entropy of the information in actions. Furthermore, the discrete, fine-grained, handcrafted labels typically used by Markov models, such as phoneme stages aided by k-means clustering, which are popular in speech-recognition and object-recognition research, are not always necessary. We introduced the free-of-labeling property of


References (40)

  • H. Sakoe et al.

    Dynamic programming algorithm optimization for spoken word recognition

    IEEE Trans. Acoust. Speech Signal Process.

    (1978)
  • M. Brand et al.

    Coupled hidden Markov models for complex action recognition

    Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Juan, Puerto Rico, USA

    (1997)
  • C. Vogler et al.

    ASL recognition based on a coupling between HMMs and 3D motion analysis

    Proceedings of the Sixth International Conference on Computer Vision, Bombay, India

    (1998)
  • K. Jayech et al.

    Synchronous multi-stream hidden Markov model for offline Arabic handwriting recognition without explicit segmentation

    Neurocomputing

    (2016)
  • M. Russell et al.

    Explicit modelling of state occupancy in hidden Markov models for automatic speech recognition

    Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Tampa, FL, USA

    (1985)
  • A. Bonafonte et al.

    Duration modeling with expanded HMM applied to speech recognition

    Proceedings of the Fourth International Conference on Spoken Language, Philadelphia, PA, USA

    (1996)
  • M. Russell et al.

    Experimental evaluation of duration modelling techniques for automatic speech recognition

    Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Dallas, TX, USA

    (1987)
  • A. Waibel et al.

    Phoneme recognition using time-delay neural networks

    IEEE Trans. Acoust. Speech. Signal Process.

    (1989)
  • R. Wai

    Tracking control based on neural network strategy for robot manipulator

    Neurocomputing

    (2003)
  • A. Graves et al.

    Speech recognition with deep recurrent neural networks

    Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada

    (2013)

    Xiang Wu was born in Jiangsu, China, in 1989. He received the B.S. degree in electrical engineering and automation from Nanjing University of Science and Technology, Nanjing, China, in 2012. He is currently pursuing the Ph.D. degree in control science and engineering at the School of Automation, Nanjing University of Science and Technology, Nanjing, China. From 2016 to 2017, he was a visiting Ph.D. student with the Department of Computer Science and Engineering, Michigan State University, East Lansing, USA. His current research interests include neural networks, pattern recognition, and auditory processing.

    Yuming Bo received the B.S., M.S., and Ph.D. degrees in navigation, guidance and control from Nanjing University of Science and Technology, Nanjing, China. He is a professor in control science and engineering at the School of Automation, Nanjing University of Science and Technology, Nanjing, China. He is a member of the Chinese Association of Automation and Vice Chairman of its Jiangsu Branch. His research interests include guidance, navigation and control, filtering and system optimization, and image processing.

    Juyang Weng received the B.S. degree from Fudan University in 1982, and the M.S. and Ph.D. degrees from the University of Illinois at Urbana-Champaign in 1985 and 1989, respectively, all in computer science. He is currently a professor of Computer Science and Engineering, a faculty member of the Cognitive Science Program, and a faculty member of the Neuroscience Program at Michigan State University, East Lansing. He was a visiting professor at the Computer Science School of Fudan University, Nov. 2003 - March 2014. Since the work of Cresceptron (ICCV 1993), he has expanded his research interests in biologically inspired systems to developmental learning, including perception, cognition, behaviors, motivation, and abstract reasoning skills. He has published over 300 research articles on related subjects, including task muddiness, intelligence metrics, mental architectures, vision, audition, touch, attention, recognition, autonomous navigation, and natural language understanding. He coauthored with T. S. Huang and N. Ahuja a research monograph titled Motion and Structure from Image Sequences, and authored a book titled Natural and Artificial Intelligence: Introduction to Computational Brain-Mind. Dr. Weng is an Editor-in-Chief of the International Journal of Humanoid Robotics, the Editor-in-Chief of the Brain-Mind Magazine, and an associate editor of the IEEE Transactions on Autonomous Mental Development (now Cognitive and Developmental Systems). With others, he helped to create the series of International Conferences on Development and Learning (ICDL), the IEEE Transactions on Autonomous Mental Development, and the startup GENISAMA LLC. He was an associate editor of the IEEE Transactions on Pattern Analysis and Machine Intelligence and the IEEE Transactions on Image Processing. He is a Fellow of the IEEE.
