Spatial and temporal features unified self-supervised representation learning networks

https://doi.org/10.1016/j.robot.2022.104256

Highlights

  • Learning a sample-efficient unified state representation from observations using the proposed network, ShivNet.

  • Demonstrating that the unified approach is robust to embodiment difference, domain shift, and viewpoint difference.

  • Incorporating RL with ShivNet for self-supervised robotic manipulation tasks.

  • Learning a mapping between the state representation and motor commands for robotic tasks.

Abstract

Robot manipulation tasks can be carried out effectively provided the state representation is sufficiently detailed. Embodiment difference, viewpoint difference, and domain difference are among the challenges in learning from human demonstration. This work proposes a self-supervised, multi-viewpoint representation learning method that unifies spatial and temporal features. The algorithm consists of two components: (a) a spatial component, which learns the setting of the environment, i.e., which pixels to focus on to obtain the best representation of the image regardless of point of view, and (b) a temporal component, which learns how snapshots taken simultaneously from multiple viewpoints (i.e., at the same time step but from different viewpoints) are similar, and how they differ from snapshots taken at a different time step from the same viewpoint. These representations are then integrated into a Reinforcement Learning (RL) framework to learn accurate behaviors from videos of humans performing the manipulation task. The effectiveness of this approach is illustrated by training robots to learn various manipulation tasks from expert human demonstrations: (a) grabbing objects, (b) lifting objects, and (c) opening and closing drawers. The algorithm is highly successful across all the manipulation tasks. The robot learns to pick up objects of various shapes, sizes, and colors, with different orientations and placements on the table, and also learns to open and close drawers. The method is highly sample efficient and addresses the challenges of embodiment, viewpoint, and domain difference.

Introduction

Applications of Reinforcement Learning range from path planning [1] and gaming [2] to guiding agents in unseen situations [2]. However, when it comes to applying reinforcement learning to robotic tasks [3], the need for a suitable state representation becomes a fundamental challenge. Faithful state representation has always been an essential requirement for generating accurate behaviors in robots. Traditionally, it is obtained by manually designing the state representation; however, hand-designing state representations for complex manipulation tasks is troublesome. When learning passes directly from humans to robots, the fundamental embodiment difference between human experts and robots poses a difficult challenge for representation design. Differences between the real world and the simulated environment make hand-designing representations even harder, and the problem is compounded when the points of view of the human expert demonstrations and of the robot differ. Thus, there are three challenges to address: embodiment difference, domain shift, and viewpoint difference. For a robot to imitate directly from third-person-view expert demonstrations, the ideal representation must encode two aspects: (1) the relevant parameters describing the interplay among objects and the human, solely from visual inputs of the demonstration, and (2) the mapping of this interplay between the human expert and the objects onto the robot.

Time-Contrastive Networks (TCN) [4] use temporal features, and Deep Spatial Autoencoders (DSA) for visuomotor learning [5] use spatial features, to represent the state. TCN [4] also addresses the three issues of embodiment difference, domain shift, and viewpoint difference; however, it requires a large amount of training data from human experts. DSA [5] learns state representations in a sample-efficient way but does not address embodiment difference, domain shift, or viewpoint difference. Moreover, as shown in the experiments section, the success of these methods drops significantly when the objects seen at test time differ (in shape, size, or color) from the objects seen during training.
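For concreteness, the following is a minimal sketch of the time-contrastive (triplet) objective that TCN [4] builds on, assuming embeddings have already been computed by some network; the function name and the margin value are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def time_contrastive_loss(anchor: torch.Tensor,
                          positive: torch.Tensor,
                          negative: torch.Tensor,
                          margin: float = 0.2) -> torch.Tensor:
    """Triplet loss over embeddings of shape [batch, dim]: the anchor
    (view 1, time t) is pulled toward the positive (view 2, same time t)
    and pushed at least `margin` farther from the negative (view 1, a
    different time t'), in the spirit of time-contrastive learning [4]."""
    d_pos = (anchor - positive).pow(2).sum(dim=1)  # squared distance to positive
    d_neg = (anchor - negative).pow(2).sum(dim=1)  # squared distance to negative
    return F.relu(d_pos - d_neg + margin).mean()
```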

The approach proposed here addresses these challenges by unifying the benefits of the two previous works, TCN [4] and DSA [5]. The unified representation encodes the task-relevant parameters and a mapping from human to robot; it handles objects whose shape, size, or color differ at test time from those seen during training; and it is sample efficient while addressing the three challenges of embodiment difference, domain shift, and viewpoint difference. The experiments show that unifying TCN and DSA yields representations that outperform both methods and are robust to dissimilarity between train-time and test-time objects. The main contribution lies in the proposed framework, Self-Supervised and Multi-Viewpoint Spatial and Temporal Features Unified Representation learning, which can be integrated into any RL algorithm to generate the desired behavior in the robot. Through experiments, the sample efficiency and effectiveness of this approach are demonstrated for various robotic manipulation tasks, such as grasping and lifting objects of different shapes, sizes, and colors, and opening and closing drawers of different colors and types.
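The preview does not give an implementation, but the described unification can be pictured as a single encoder in which a DSA-style spatial-softmax head extracts feature points that feed a time-contrastive embedding. The sketch below is a hypothetical PyTorch rendering under that assumption; the layer sizes and names are illustrative, not the authors' architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnifiedEncoder(nn.Module):
    """Hypothetical sketch: conv features -> spatial-softmax keypoints
    (spatial component) -> L2-normalized embedding (temporal component)."""

    def __init__(self, embed_dim: int = 32, channels: int = 16):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, channels, kernel_size=5, stride=2), nn.ReLU(),
        )
        self.fc = nn.Linear(2 * channels, embed_dim)  # one (x, y) per channel

    @staticmethod
    def spatial_softmax(feat: torch.Tensor) -> torch.Tensor:
        # feat: [B, C, H, W] -> expected (x, y) activation position per channel
        b, c, h, w = feat.shape
        probs = F.softmax(feat.view(b, c, -1), dim=-1).view(b, c, h, w)
        xs = torch.linspace(-1.0, 1.0, w, device=feat.device)
        ys = torch.linspace(-1.0, 1.0, h, device=feat.device)
        ex = (probs.sum(dim=2) * xs).sum(dim=-1)  # [B, C] expected x
        ey = (probs.sum(dim=3) * ys).sum(dim=-1)  # [B, C] expected y
        return torch.cat([ex, ey], dim=1)         # [B, 2C] feature points

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        points = self.spatial_softmax(self.conv(img))  # DSA-style keypoints
        return F.normalize(self.fc(points), dim=1)     # unit-norm embedding
```

The spatial softmax converts each feature map into the expected image coordinates of its activation, which is how DSA [5] extracts object positions; embeddings of two synchronized views at the same time step would then form the anchor/positive pair for the triplet loss sketched above.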

Related Work

Imitation Learning (IL) [6] can be framed as the problem of learning a policy from expert demonstrations [7], [8], [9] and has proved powerful in many applications, including helicopter flight [10], placing an object inside a cup and playing table tennis [11], and performing human-like reaching motions [12]. IL can be divided into behavioral cloning and inverse reinforcement learning (IRL). Behavioral cloning treats IL as supervised learning where expert behavior is provided in the form of state–action pairs [13], while the objective of IRL is to recover a reward function from expert demonstrations [14], [15], [16], [17] that can then be used to extract a policy via reinforcement learning [18], [19]. Both IL approaches require expert demonstrations in the same context, i.e., environment, viewpoint, etc., as the learner. In robotics, this can be accomplished by means of kinesthetic demonstrations [20] or teleoperation [21], but these strategies call for substantial expertise. They also differ from the way humans and animals imitate: humans and animals acquire new skills by observing others, without relying on egocentric observations or ground-truth actions, and the embodiment of the actor and the learner is rarely the same. Various works have studied and proposed strategies for imitating an observed demonstration in different settings, such as demonstrations captured from different viewpoints or with a different embodiment, such as a human [22], [23], [24]. Liu et al. [25] proposed a method to learn an imitation policy by translating the context from expert demonstrations to the learner and minimizing the distance to the translated demonstrations; however, that work explicitly excludes demonstrations with embodiment differences (e.g., human fingers vs. robot grippers). Other approaches have included predictive modeling [26], [27], context translation [25], [28], learning reward representations [24], meta-learning [29], and the use of explicit pose and object detection [30], [31], [32], [33], [34], resolving the correspondence problem [35] by instrumenting paired data collection or by manually matching hand-specified key points. Here, an approach to IL is proposed that relies only on visual inputs. This is accomplished through the combination of two components: a spatial component and a temporal component. The spatial component is trained to learn features representing the setting of the environment, including the positions of objects within the scene. The temporal component, in contrast, discovers attributes that do not change across viewpoints but change across time, while ignoring variables such as background, motion blur, and lighting. The two components are unified to carry out self-supervised robotic control.

To summarize, in this work, a unified state representation, termed ShivNet, is first learned from the observations. Then, an RL algorithm is combined with the learned state representation to carry out self-supervised robotic manipulation tasks by learning a mapping between the state representation and the motor commands required to perform the task.
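The preview does not specify how the representation is wired into RL; one common instantiation (used by TCN [4]) defines the reward as a scaled negative distance between the robot's current observation embedding and the embedding of the human demonstration frame at the same time step. A hedged sketch, reusing the hypothetical `UnifiedEncoder` above with illustrative constants:

```python
import torch

def imitation_reward(encoder, robot_frame, demo_frame,
                     alpha=1.0, beta=0.1, eps=1e-8):
    """Reward shaping for RL: the closer the robot's observation embedding
    is to the demonstration embedding at the same time step, the higher
    the reward. alpha, beta, and eps are illustrative constants."""
    with torch.no_grad():
        d = (encoder(robot_frame) - encoder(demo_frame)).pow(2).sum(dim=1)
    return -alpha * d - beta * torch.sqrt(d + eps)
```

Any standard RL algorithm could then maximize this reward while the encoder stays frozen, so the policy learns the mapping from representation to motor commands without ground-truth actions.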

The contributions of this work are twofold:

(1) Unifying the benefits of TCN [4] and DSA [5] to learn representations robust to embodiment difference, domain shift, and viewpoint difference, in addition to being sample efficient. The representations are capable of supporting tasks on objects not seen during training.

(2) Using this approach together with RL to solve an imitation learning problem that is typically solved using IRL or behavioral cloning. Although IRL or behavioral cloning might appear more appropriate for such applications, those strategies require either learning a reward function (in the case of IRL) or obtaining expert demonstrations in the form of state–action pairs, almost certainly in the same embodiment as the learner (in the case of behavioral cloning).

The paper is organized into four sections. Section 2 presents the methodology, including the background of concepts related to the proposed framework and the proposed algorithm in detail, with its components and their training. Section 3 presents the experimental setup employed for validation and demonstration of the proposed method, along with the results of the experiments and comparisons against various baselines. Lastly, Section 4 discusses the conclusion, limitations, and future scope of this work.

Section snippets

Methods

To generate dependable representations leading to better behavior generation, a unified spatial and temporal features representation approach, ShivNet, is proposed. ShivNet comprises two primary components: a spatial component and a temporal component.

The spatial component finds the features that describe the environment for the task under consideration, e.g., the positions of objects irrespective of the viewpoint.

The temporal component discovers aspects that remain unchanged across viewpoints but change across time.
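One way to realize this, sketched here under the assumption of synchronized two-view recordings (nothing below is prescribed by the paper), is to sample triplets so that the positive shares the anchor's time step but not its view, while the negative shares the view but not the time step:

```python
import random

def sample_triplet_indices(num_frames: int, margin: int = 10):
    """Pick frame indices for one triplet from a synchronized two-view clip.
    Anchor: view 1 at time t; positive: view 2 at the same t;
    negative: view 1 at a time step at least `margin` frames from t."""
    assert num_frames > 2 * margin, "clip too short for the chosen margin"
    t = random.randrange(num_frames)
    t_neg = random.choice([i for i in range(num_frames) if abs(i - t) >= margin])
    return t, t_neg  # use view1[t] (anchor), view2[t] (positive), view1[t_neg] (negative)
```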

Task setup

In this section, the approach is employed to train robots to acquire manipulation skills, i.e., lifting and grasping blocks and opening and closing drawers, directly from a third-person-view video demonstration of a human (with an entirely different embodiment) performing the task. The human demonstrations were collected in the real world, while the robot performed the tasks in simulation, showing that the unified representation approach focuses more on the task and less on the environment in which it is performed.

Conclusion

In this work, a framework for self-supervised and multi-view representation learning is proposed. We show that these representations can provide signals to train robots to learn manipulation skills such as lifting and grasping objects of different shapes, sizes, and colors, and opening and closing drawers of different colors and types. This approach proved effective in overcoming the issues of embodiment difference, domain shift, and viewpoint difference typically present in imitation learning.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.


References (45)

  • A.J. Ijspeert, J. Nakanishi, S. Schaal, Movement imitation with nonlinear dynamical systems in humanoid robots, 2002, ...
  • N.D. Ratliff et al., Imitation learning for locomotion and manipulation (2007).
  • K. Mülling et al., Learning to select and generalize striking movements in robot table tennis, Int. J. Robot. Res. (2013).
  • P. Abbeel et al., Autonomous helicopter aerobatics through apprenticeship learning, Int. J. Robot. Res. (2010).
  • J. Kober et al., Learning motor primitives for robotics.
  • D.A. Pomerleau, Efficient training of artificial neural networks for autonomous navigation, Neural Comput. (1991).
  • A.Y. Ng, S.J. Russell, Algorithms for inverse reinforcement learning, in: ICML, 2000, pp. ...
  • P. Abbeel et al., Apprenticeship learning via inverse reinforcement learning.
  • S. Levine et al., Nonlinear inverse reinforcement learning with Gaussian processes.
  • B.D. Ziebart et al., Maximum entropy inverse reinforcement learning.
  • N.D. Ratliff et al., Maximum margin planning.
  • D. Ramachandran et al., Bayesian inverse reinforcement learning.

Rahul Choudhary is currently an undergraduate student at IIT Kharagpur, India. His area of research is Reinforcement Learning and Sequential Decision Making for various application domains, especially robotics.

Rahee Walambe received her M.Phil. and Ph.D. degrees from Lancaster University, UK, in 2008. Her area of research is applied Deep Learning and AI in the fields of Robotics and Healthcare. She is a recipient of a number of international research grants in the areas of robotics and AI.

Ketan Kotecha received his Ph.D. and M.Tech. from IIT Bombay and currently holds the positions of Head of the Symbiosis Centre for Applied AI (SCAAI), Director of the Symbiosis Institute of Technology, and Dean of the Faculty of Engineering at Symbiosis International (Deemed University). He is an expert in AI and Deep Learning. He has published widely in peer-reviewed journals on topics ranging from cutting-edge AI to education policies, teaching–learning practices, and AI for all. He is a recipient of multiple international research grants and awards.
