Elsevier

Neurocomputing

Volume 311, 15 October 2018, Pages 164-175

Information-dense actions as contexts

https://doi.org/10.1016/j.neucom.2018.05.056

Abstract

In artificial intelligence, many temporal processing tasks, such as speech recognition, video analysis, and natural language processing, depend not only on the spatial contents of the current sensory input frame but also on the relevant context in the attended past. It remains elusive how brains use temporal contexts. Many computational methods, such as hidden Markov models and recurrent neural networks, require the human programmer to handcraft contexts as symbols. It has been proved that our Developmental Networks (DN) are capable of learning any emergent Turing Machine (TM); they can learn patterns as their states under human teachers’ scrupulous supervision. In this paper, we explain why contexts are important for temporal processing. We study how agent actions are natural sources of contexts, and we enable muscle neurons to autonomously generate actions as contexts. In humans, muscle actions correspond to the firings of muscle neurons. They are information-dense in time and correlated with the cognitive and motor skills of the individual. Some actions are meant to handle time warping, while others are not (e.g., those for time-duration counting). We model actions as information-dense action patterns. We use entropy to define the new concept of information-denseness. We also introduce the free-of-labeling technique. We experiment with DN for the recognition of audio sequences as an example modality, but the principles are modality independent. Our experimental results show how the information-dense actions and the free-of-labeling mechanism help DN to generate temporal contexts. This work is a necessary step toward our goal of enabling machines to autonomously abstract contexts from actions through life-long development.

Introduction

At least five factors [1] contribute to intelligence: (1) the genome (or developmental program), (2) the sensors, (3) the effectors, (4) the computational resources, and (5) the environment (body, teachers, and other physical facts) that each individual lives through. All five factors are important to the development of an agent. The goal of this work mainly concerns (3) the effectors and (5) the environment that facilitates the development of the agent’s simple-to-complex skills. In particular, we study how the information in the effectors facilitates the development of the agent’s behaviors in the environment. By doing so, we do not mean that the other factors are irrelevant.

There are two types of skills [2]: declarative skills (e.g., verbal), which can be expressed using a language, and non-declarative skills (e.g., bike riding), which are typically not demonstrated using a language. For classification tasks, actions often correspond to declarative skills. For robotic navigation tasks, actions often correspond to non-declarative skills. Throughout this paper, the term “action” covers both declarative and non-declarative skills, not just an action that an arm carries out.

Symbolic labels are commonly used in speech recognition. They are handcrafted by human programmers as labels for phonemes, words, and sentences.

Compared to symbolic labels, actions are information-dense in time and provide much information for temporal processing tasks.

Much effort has been spent on temporal processing problems in recent years. Many existing methods are task-specific because they involve handcrafted symbols. These methods cannot learn new concepts after the symbol design, because the symbols must be fully handcrafted in advance. Instead of symbols, DN uses a framework of autonomous development with patterns as representations. Such representations automatically emerge from experience and can be learned incrementally.

Time warping means that two sequences are treated as equivalent if they are similar but the durations of some segments differ. Handling time warping is a common objective in temporal processing and must be treated carefully.

Dynamic time warping (DTW) is an algorithm widely used in the analysis of video, audio, and graphics sequences. It measures the similarity between two sequences whose segments vary in time duration. For example, both the voice recognition in [3] and the spoken-word recognition in [4] used DTW. DTW itself does not use probabilities.
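As a sketch of the idea (our own minimal illustration, not the exact algorithms in [3] or [4]), the classic DTW recurrence fills a cost table in which each cell may extend a match, an insertion, or a deletion:

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic-programming DTW distance between two 1-D sequences."""
    n, m = len(a), len(b)
    # cost[i, j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # match, insertion, or deletion: this is what absorbs warping
            cost[i, j] = d + min(cost[i - 1, j - 1],
                                 cost[i - 1, j],
                                 cost[i, j - 1])
    return cost[n, m]

# Same shape, different segment durations: DTW still aligns them perfectly
print(dtw_distance([0, 0, 1, 1, 1, 2, 2], [0, 1, 2]))  # 0.0
```

Because the recurrence only compares frame values, no probabilities are involved, which matches the observation above that DTW itself is non-probabilistic.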

Hidden Markov Models (HMMs) use probabilities that deal with time warping implicitly rather than modeling it explicitly. HMMs are often used hierarchically; for example, phonemes, letters, and words form three levels of a hierarchy. At each level, each state in an HMM corresponds to a stage. For example, in [5], HMMs were used to model two-handed tracking from videos. In [6], Vogler and Metaxas used HMMs to recognize American Sign Language (ASL) sentences. HMMs were utilized to recognize Arabic handwriting in [7].
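The implicit handling of time warping comes from self-transitions: a state may keep emitting for a variable number of frames. A minimal forward-algorithm sketch with a hypothetical two-state model (all probabilities below are invented for illustration):

```python
import numpy as np

def forward_likelihood(pi, A, B, obs):
    """Forward algorithm: P(observations), summed over all hidden state paths.

    pi : (S,)   initial state probabilities
    A  : (S, S) transition matrix, A[i, j] = P(next state j | state i)
    B  : (S, O) emission matrix, B[i, o] = P(observation o | state i)
    obs: sequence of observation indices
    """
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        # self-transitions in A absorb time warping: a state may
        # account for a variable number of consecutive frames
        alpha = (alpha @ A) * B[:, o]
    return float(alpha.sum())

# Hypothetical two-state, two-observation model
pi = np.array([1.0, 0.0])
A = np.array([[0.7, 0.3],    # state 0 tends to repeat (duration flexibility)
              [0.0, 1.0]])
B = np.array([[0.9, 0.1],
              [0.1, 0.9]])
print(forward_likelihood(pi, A, B, [0, 0, 1]))
```

Note that the state inventory (here, two states) is fixed by hand in advance, which is exactly the handcrafted-symbol property discussed below.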

Both DTW and HMMs use symbolic representations, because their states, often formed by a clustering technique, are handcrafted.

Time duration is the opposite of time warping: the durations in two different sequences are the key factor that distinguishes them. In English, phone duration helps to distinguish several words from each other, such as “pitch” and “peach” or “ship” and “sheep”. In some other languages, such as Finnish, phone durations can be the only clue for discriminating between certain words [8]. Good time-duration modeling can therefore be a major issue in temporal processing. The hidden semi-Markov model (HSMM) in [9] and the expanded-state HMM (ESHMM) in [10] extend the HMM framework by explicitly approximating state-duration distributions. The work in [11] compares and evaluates the performance of these extended HMM methods with duration-modeling techniques.

Neural networks use connections to reach a certain type of flexibility in temporal trajectories. They at least partially use emergent representations (i.e., patterns of neuronal firing instead of a series of symbols), but the emergent representations have often been mixed with symbolic representations, e.g., handcrafted internal representations (states like those in Kalman filters).

Neural networks use natural and discriminative training to estimate the probabilities of frames in the temporal stream. Many successful neural-network-based methods are adept at handling short-time units, such as individual phonemes in speech recognition and isolated words in language processing. The work in [12] utilized neural networks for phoneme classification. A sliding-mode neural network was presented for the tracking control of a robot manipulator in [13]. Collobert and Weston proposed a convolutional neural network model for sentence analysis (e.g., chunks, semantic roles) in [14]. Recently, Long Short-Term Memory (LSTM) networks and Recurrent Neural Networks (RNNs) have been used in this field [15]. They can detect latent temporal dependencies. To make neural networks more efficient, some methods have been proposed for intermittent measurement and dynamic analysis of neural networks (e.g., [16], [17]).

These systems require sophisticated design by a human programmer who typically focuses on one specific task. Without a general-purpose framework and fully emergent representations, they do not fully use temporal contexts during processing, and they do not directly take actions from the output side as contexts.

A developmental method is task-nonspecific because it never involves symbols in any of its internal representations and is never restricted to the representations of one specific task. Even if a teacher uses symbols, the networks only use emergent patterns (in sensors or effectors). Unlike symbols, which are either identical or different, patterns from the same sensors or effectors have distances in the neuronal inner-product spaces, so patterns never observed before can be dealt with based on their similarities to observed patterns.
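This graded similarity is just an inner-product (cosine) comparison; a small sketch with made-up pattern vectors:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: the inner product of the normalized vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Symbols are either equal or not; patterns have graded similarity.
pattern_seen  = np.array([0.9, 0.1, 0.0])
pattern_novel = np.array([0.8, 0.2, 0.1])  # never observed, but close
pattern_far   = np.array([0.0, 0.1, 0.9])

print(cosine(pattern_seen, pattern_novel))  # near 1: treated as similar
print(cosine(pattern_seen, pattern_far))    # near 0: treated as different
```

A novel pattern thus inherits behavior from its nearest observed patterns, something a symbol table cannot do.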

Unlike symbolic methods, DN is skull-closed: it does not need human designers to manually adjust its internal representations after “birth”, and it can learn incrementally during interaction with natural environments. According to Weng [18], networks that generate emergent representations can easily deal with natural inputs and motor actions and can learn incrementally. DN can handle information-dense natural sensory patterns and has motor areas representing information-dense actions. DN performs well in pattern-recognition tasks: it was used as an object-recognition network for a mobile navigation application in [19], and its implementation in a visual parking-assistance system is presented in [20].

Our goal is to use DN to process temporal streams in real time. DN has been proved to learn any TM immediately, one transition at a time, free of any errors [18]. That TM was assumed to be handcrafted upfront. Although the human common-sense knowledge base can be considered a grand TM at a coarse language level, the states at a fine-grained time level (e.g., 20 ms–100 ms) are typically unavailable.

This work investigates how actions at a fine-grained time level are useful as states, where we regard states and actions as the same: both declarative and non-declarative skills can be expressed as fine-grained actions.

In this work, we use automatically generable, temporally information-dense patterns as temporally information-dense actions. Natural robotic actions would be better, since no tangible restrictions are imposed on such action patterns, but natural robotic actions are difficult to come by before the method here has been sufficiently investigated. We plan to use natural action patterns from the robot body to replace the current temporally information-dense patterns when the DN-equipped autonomous robot body is completed.

Our original work was accepted by the 2017 International Joint Conference on Neural Networks (IJCNN) [21]. This archival journal version extends that work by more than 40%. The main additional novel parts of this journal version are:

  • 1.

    The information entropy of a temporal sequence is now used to mathematically define the density of actions, which the conference version did not do. The entropy values have been computed for all experimental settings to contrast their different values in terms of the mathematically defined information-dense concept.

  • 2.

    The patterns of the concept-2 (dense) motor neurons are now far more numerous and are automatically generated; previously they were much fewer and handcrafted. In other words, the concept-2 actions now automatically emerge as patterns, whereas before they were handcrafted labels. This greatly reduces the cost of system development, because the programmer does not need to specify which patterns correspond to a label: the vector representation and the inner-product space automatically take care of the similarity among patterns.

  • 3.

    A volume-information representation is added to replace the original method of appending an “energy-component” element (a batch method) when processing the waveform. This mechanism corresponds to a hypothesized “genes-prepositioned” but “partially emergent” feature in the sensory input of every neuron that might be present among many “purely emergent” features. The experimental results show that this volume feature performs better than the “energy component” of the conference version.

  • 4.

    We refine the definition of “hair cells” so that each covers shorter frames, with a certain overlap between consecutive frames. This refinement provides more information-dense contexts and improves the performance.

  • 5.

    The hidden neurons now have locations inside the “skull”. Previously, all hidden neurons were location-free, which is common in many artificial neural networks. The new mechanism encourages smoothness of the hidden representations: nearby neurons detect similar features. It also allows the recruitment of neurons during life-long learning implemented by Hebbian learning. This process of recruitment gradually adapts a hierarchically smooth representation to better fit the changing distribution.

The remainder of the paper is organized as follows: we first discuss the theory part in Section 2. The DN algorithm and some key details are listed in Section 3. In Section 4 we present the implementation details and analysis of experimental results. Concluding remarks are offered in Section 5.


Theory: information-dense

How does an autonomous agent generate actions that are sufficient for not only declarative skills but also non-declarative skills [2]? It is arguable that these two categories of skills are both driven by muscle neurons at temporally dense, fine-grained levels. By “temporally dense” we mean that the signal must change its value at a high temporal frequency. Although a written language is often documented at a temporally sparse word level (i.e., declarative), the continuous pronunciation of each word
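As an illustrative sketch (plain Shannon entropy over an invented action stream, not necessarily the exact definition used in this paper), a temporally sparse label carries zero entropy per frame, while a temporally dense action stream does not:

```python
import math
from collections import Counter

def entropy_per_frame(seq):
    """Shannon entropy (bits) of the empirical frame-value distribution."""
    counts = Counter(seq)
    n = len(seq)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# A sparse symbolic labeling: one label held for the whole utterance
sparse = ["word"] * 8
# A temporally dense action stream: the value changes frame to frame
dense = ["a1", "a2", "a1", "a3", "a2", "a4", "a3", "a1"]

print(entropy_per_frame(sparse))  # 0.0 bits: no information per frame
print(entropy_per_frame(dense))   # > 0 bits: information-dense in time
```

A signal that changes value at a high temporal frequency tends toward a flatter frame distribution, and hence higher entropy, than a label held constant over many frames.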

Developmental networks

In DN, the neurons in the X area receive and transfer sensory information; the neurons in the Z area generate and transfer actions or concepts. The skull-closed Y area bi-directionally connects the X and Z areas like a bridge. The firing neurons in the Y area are the winners whose weights best match the (X, Z) pattern of the last moment. The Z area then generates the next firing pattern according to the firing Y neurons. The general algorithm of DN is shown in Algorithm 1.

DN is in asynchronous mode and
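The bridge-like X-Y-Z update described above can be sketched as follows. This is a simplified illustration, not the full DN algorithm of [18]; the top-k competition size, learning rate, and area sizes are illustrative assumptions of ours:

```python
import numpy as np

def dn_step(x, z, Wy, Wz, lr=0.1, k=1):
    """One simplified DN-style update: hidden competition, then motor output.

    x, z : current sensory (X) and motor (Z) firing patterns
    Wy   : (n_hidden, len(x)+len(z)) weights of Y, matching (X, Z) jointly
    Wz   : (len(z), n_hidden) weights from firing Y neurons to the Z area
    """
    p = np.concatenate([x, z])
    p = p / (np.linalg.norm(p) + 1e-12)
    # Y neurons compete: the top-k inner-product matches of (X, Z) fire
    resp = Wy @ p
    winners = np.argsort(resp)[-k:]
    y = np.zeros(len(resp))
    y[winners] = 1.0
    # Hebbian update: only the winners move toward the pattern they matched
    Wy[winners] += lr * (p - Wy[winners])
    # Z fires from the winning Y neurons, producing the next action/context
    z_next = Wz @ y
    Wz[:, winners] += lr * (z[:, None] - Wz[:, winners])
    return y, z_next, Wy, Wz

rng = np.random.default_rng(0)
Wy = rng.random((5, 6))              # 5 hidden neurons, 6 = |X| + |Z|
Wz = rng.random((2, 5))
x = np.array([1.0, 0.0, 0.0, 0.0])   # sensory pattern
z = np.array([0.0, 1.0])             # supervised action/context pattern
y, z_next, Wy, Wz = dn_step(x, z, Wy, Wz)
```

The key point mirrored here is that Y matches the joint (X, Z) pattern of the last moment, so the motor pattern itself serves as the temporal context for the next prediction.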

Experiments

As temporally fine-grained examples, the experiments in this work used DN for phoneme recognition through time series only as examples. Our methods are modality independent and potentially applicable to vision, audition, natural language, etc. In addition, we are also running experiments using video as input for autonomous navigation and using words as input for natural language acquisition. However, we do not discuss vision and natural language in this paper due to space limitations, and plan

Conclusions and discussions

We argued that actions may serve as information-dense states and provide rich context information for the learning agent. Such density is quantified and measured by the entropy of the information in actions. Furthermore, the discrete, fine-grained, handcrafted labels typically used by Markov models, such as phoneme stages aided by k-means clustering, which are popular in speech-recognition and object-recognition research, are not always necessary. We introduced the free-of-labeling property of


References (40)

  • H. Sakoe et al.

    Dynamic programming algorithm optimization for spoken word recognition

    IEEE Trans. Acoust. Speech Signal Process.

    (1978)
  • M. Brand et al.

    Coupled hidden Markov models for complex action recognition

    Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Juan, Puerto Rico, USA

    (1997)
  • C. Vogler et al.

    ASL recognition based on a coupling between HMMs and 3D motion analysis

    Proceedings of the Sixth International Conference on Computer Vision, Bombay, India

    (1998)
  • K. Jayech et al.

    Synchronous multi-stream hidden Markov model for offline Arabic handwriting recognition without explicit segmentation

    Neurocomputing

    (2016)
  • M. Russell et al.

    Explicit modelling of state occupancy in hidden Markov models for automatic speech recognition

    Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Tampa, FL, USA

    (1985)
  • A. Bonafonte et al.

    Duration modeling with expanded HMM applied to speech recognition

    Proceedings of the Fourth International Conference on Spoken Language, Philadelphia, PA, USA

    (1996)
  • M. Russell et al.

    Experimental evaluation of duration modelling techniques for automatic speech recognition

    Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Dallas, TX, USA

    (1987)
  • A. Waibel et al.

    Phoneme recognition using time-delay neural networks

    IEEE Trans. Acoust. Speech. Signal Process.

    (1989)
  • R. Wai

    Tracking control based on neural network strategy for robot manipulator

    Neurocomputing

    (2003)
  • A. Graves et al.

    Speech recognition with deep recurrent neural networks

    Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada

    (2013)

    Xiang Wu was born in Jiangsu, China, in 1989. He received the B.S. degree in electrical engineering and automation from Nanjing University of Science and Technology, Nanjing, China, in 2012. He is currently pursuing the Ph.D. degree in control science and engineering at the School of Automation, Nanjing University of Science and Technology, Nanjing, China. From 2016 to 2017, he was a visiting Ph.D. student with the Department of Computer Science and Engineering, Michigan State University, East Lansing, USA. His current research interests include neural networks, pattern recognition, and auditory processing.

    Yuming Bo received the B.S., M.S., and Ph.D. degrees in navigation, guidance and control from Nanjing University of Science and Technology, Nanjing, China. He is a professor in control science and engineering at the School of Automation, Nanjing University of Science and Technology, Nanjing, China. He is a member of the Chinese Association of Automation and Vice Chairman of its Jiangsu Branch. His research interests include guidance, navigation and control, filtering and system optimization, and image processing.

    Juyang Weng received the B.S. degree from Fudan University in 1982, and the M.S. and Ph.D. degrees from the University of Illinois at Urbana-Champaign in 1985 and 1989, respectively, all in computer science. He is currently a professor of Computer Science and Engineering, a faculty member of the Cognitive Science Program, and a faculty member of the Neuroscience Program at Michigan State University, East Lansing. He was a visiting professor at the Computer Science School of Fudan University, Nov. 2003 - March 2014. Since the work of Cresceptron (ICCV 1993), he has expanded his research interests in biologically inspired systems to developmental learning, including perception, cognition, behaviors, motivation, and abstract reasoning skills. He has published over 300 research articles on related subjects, including task muddiness, intelligence metrics, mental architectures, vision, audition, touch, attention, recognition, autonomous navigation, and natural language understanding. He coauthored with T. S. Huang and N. Ahuja a research monograph titled Motion and Structure from Image Sequences, and authored a book titled Natural and Artificial Intelligence: Introduction to Computational Brain-Mind. Dr. Weng is an Editor-in-Chief of the International Journal of Humanoid Robotics, the Editor-in-Chief of the Brain-Mind Magazine, and an associate editor of the IEEE Transactions on Autonomous Mental Development (now Cognitive and Developmental Systems). With others, he helped to create the series of International Conferences on Development and Learning (ICDL), the IEEE Transactions on Autonomous Mental Development, and the startup GENISAMA LLC. He was an associate editor of the IEEE Transactions on Pattern Analysis and Machine Intelligence and the IEEE Transactions on Image Processing. He is a Fellow of the IEEE.
