Spatial and temporal features unified self-supervised representation learning networks
Introduction
Applications of reinforcement learning span path planning [1], gaming [2], and guiding agents in unseen situations [2]. When it comes to applying reinforcement learning to robotic tasks [3], however, the need for a suitable state representation becomes a fundamental challenge. A faithful state representation has always been an essential requirement for generating accurate behaviors in robots. Traditionally, state representations are designed manually, but hand-designing them for complex manipulation tasks is troublesome. When a robot learns directly from humans, the fundamental embodiment difference between human experts and robots makes representation design difficult. The differences between the real world and simulated environments make hand-designing representations harder still, and the problem is compounded when the demonstrations of human experts and the robot's observations come from different points of view. Thus, there are three challenges to address: embodiment difference, domain shift, and viewpoint difference. For a robot to imitate directly from third-person-view expert demonstrations, the ideal representation must encode two aspects: (1) the relevant parameters capturing the interplay between the human and objects, extracted solely from visual inputs of the demonstration, and (2) a mapping of this interplay from the human expert onto the robot.
Time-Contrastive Networks (TCN) [4] use temporal features to represent the state, while Deep Spatial Autoencoders (DSA) for visuomotor learning [5] use spatial features. TCN [4] also addresses the three issues above, i.e., embodiment difference, domain shift, and viewpoint difference; however, it requires a large amount of training data from human experts. DSA [5] learns state representations in a sample-efficient way but does not address the embodiment-difference, domain-shift, and viewpoint-difference challenges. Moreover, as shown in our experiments section, the success of both methods drops significantly when the objects seen at test time differ (in shape, size, or color) from the objects seen during training.
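To make the temporal signal exploited by TCN [4] concrete, its training objective can be sketched as a triplet loss: the embedding of an anchor frame is pulled toward a simultaneous frame from another viewpoint and pushed away from a temporally distant frame of the same viewpoint. The following numpy sketch is illustrative; the function name and the margin value are our assumptions, not taken from the paper:

```python
import numpy as np

def time_contrastive_loss(anchor, positive, negative, margin=0.2):
    """Triplet-style time-contrastive loss (illustrative sketch).

    anchor:   embedding of a frame at time t, viewpoint 1
    positive: embedding of the frame at the same time t, viewpoint 2
    negative: embedding of a temporally distant frame, viewpoint 1

    The anchor should be closer to the positive than to the negative
    by at least `margin`; otherwise the loss is positive.
    """
    d_pos = float(np.sum((anchor - positive) ** 2))
    d_neg = float(np.sum((anchor - negative) ** 2))
    return max(0.0, d_pos - d_neg + margin)
```

In practice, the embeddings come from a shared encoder applied to synchronized multi-view video, and the loss is minimized over many such triplets, which is why TCN needs a large demonstration corpus.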
The approach proposed here addresses these challenges by unifying the benefits of two previous works, TCN [4] and DSA [5]. The unified representation encodes the task-relevant parameters and a mapping from human to robot, handles objects that differ in shape, size, or color from those seen during training, and remains sample efficient while addressing the three challenges of embodiment difference, domain shift, and viewpoint difference. Our experiments show that unifying TCN and DSA yields representations that outperform both methods and are robust to dissimilarity between train-time and test-time objects. Our main contribution is the proposed framework, Self-Supervised and Multi-Viewpoint Spatial and Temporal Features Unified Representation learning, which can be integrated into any RL algorithm to generate the desired behavior in the robot. Through experiments, the sample efficiency and effectiveness of this approach are demonstrated on various robotic manipulation tasks, such as grasping and lifting objects of different shapes, sizes, and colors, and opening and closing drawers of different colors and types.
Related Work
Imitation Learning (IL) [6] can be framed as the problem of learning a policy from expert demonstrations [7], [8], [9] and has proved powerful in many applications, including helicopter flight [10], placing an object inside a cup and playing table tennis [11], and performing human-like reaching motions [12]. IL can be divided into behavioral cloning and inverse reinforcement learning (IRL). Behavioral cloning treats IL as supervised learning where expert behavior is provided in the form of state–action pairs [13], while the objective of IRL is to recover a reward function from expert demonstrations [14], [15], [16], [17] that can then be used to extract a policy via reinforcement learning [18], [19]. Both families of IL methods require expert demonstrations in the same context (environment, viewpoint, etc.) as the learner. In robotics, this can be accomplished through kinesthetic demonstrations [20] or teleoperation [21], but these strategies call for substantial expertise. They also differ from the way humans and animals imitate: humans and animals learn new skills by watching others, without relying on egocentric observations or ground-truth actions, and the embodiment of actor and learner is rarely the same. Various works have studied imitating an observed demonstration in different settings, such as demonstrations captured from different viewpoints or with a different embodiment such as a human [22], [23], [24]. Liu et al. [25] proposed learning an imitation policy by translating the context from expert demonstrations to the learner and minimizing the distance to the translated demonstrations; however, that work explicitly excludes demonstrations with embodiment differences (e.g., human fingers vs. robot grippers).
Other approaches include predictive modeling [26], [27], context translation [25], [28], learning reward representations [24], meta-learning [29], the use of explicit pose and object detection [30], [31], [32], [33], [34], and resolving the correspondence problem [35] by instrumenting paired data collection or manually matching hand-specified key points. Here, an approach to IL is proposed that relies only on visual inputs. It combines two components: a spatial component and a temporal component. The spatial component is trained to learn features representing the configuration of the environment, including the positions of objects in the scene. The temporal component discovers attributes that do not change across viewpoints but do change across time, while ignoring nuisance variables such as background, motion blur, and lighting. The two components are unified to carry out self-supervised robotic control.
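The spatial side of this picture is commonly realized, as in DSA [5], by summarizing each convolutional activation map with the expected 2-D location of its activation (a spatial soft-argmax), yielding a handful of "feature points" instead of a full image. A minimal numpy sketch of that operation follows; the function name and temperature parameter are ours, used here only to illustrate the idea:

```python
import numpy as np

def spatial_soft_argmax(feature_map, temperature=1.0):
    """Return the expected (x, y) activation location of a 2-D map.

    A softmax over all pixels turns the activation map into a spatial
    probability distribution; the expected pixel coordinates under that
    distribution serve as a low-dimensional feature point, as in deep
    spatial autoencoders.
    """
    h, w = feature_map.shape
    logits = feature_map / temperature
    probs = np.exp(logits - logits.max())  # stable softmax
    probs /= probs.sum()
    ys, xs = np.mgrid[0:h, 0:w]            # per-pixel coordinates
    return float((probs * xs).sum()), float((probs * ys).sum())
```

Applied per channel of a convolutional feature map, this produces a compact state vector (one (x, y) pair per channel) that tracks where things are in the scene regardless of appearance details.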
To summarize, in this work a unified state representation, termed ShivNet, is first learned from observations. An RL algorithm then uses these learned state representations to carry out self-supervised robotic manipulation tasks by learning a mapping between the state representation and the motor commands required to perform the task.
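One common way to couple such a learned representation with RL, consistent with the spirit of the approach described here (though the exact reward used in this work is not specified in this excerpt), is to reward the robot for matching the demonstration in the learned embedding space at each timestep. A hypothetical sketch:

```python
import numpy as np

def imitation_reward(robot_embedding, demo_embedding, alpha=1.0):
    """Hypothetical per-timestep reward: negative squared distance
    between the robot's embedding and the human demonstrator's
    embedding at the same time index. Maximizing this reward drives
    the robot's observations to match the demonstration in the
    learned feature space, without needing ground-truth actions."""
    diff = np.asarray(robot_embedding, dtype=float) - np.asarray(demo_embedding, dtype=float)
    return -alpha * float(np.dot(diff, diff))
```

Because the embedding is viewpoint- and embodiment-invariant by construction, such a reward can be computed even when the demonstrator is a human filmed from a third-person view and the learner is a simulated robot.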
The contributions of this work are twofold:
(1) Unifying the benefits of TCN [4] and DSA [5] to learn representations that are robust to embodiment difference, domain shift, and viewpoint difference while remaining sample efficient. The representations are also capable of performing tasks on objects not seen during training.
(2) Using this approach together with RL to solve an imitation learning problem that is typically tackled with IRL or behavioral cloning. Although IRL or behavioral cloning might appear more appropriate for such applications, those strategies require either learning a reward function (IRL) or obtaining expert demonstrations as state–action pairs, typically in the same embodiment as the learner (behavioral cloning).
The paper is organized into four sections. Section 2 presents the methodology, including background on concepts related to the proposed framework and a detailed description of the proposed algorithm, its components, and their training. Section 3 presents the experimental setup used to validate and demonstrate the proposed method, along with the experimental results and comparisons against various baselines. Section 4 concludes and discusses the limitations and future scope of this work.
Section snippets
Methods
To generate dependable representations leading to better behavior generation, a unified spatial and temporal feature representation approach, ShivNet, is proposed. ShivNet comprises two primary components: a Spatial Component and a Temporal Component.
The Spatial Component discovers features that describe the environment for the task under consideration, e.g., the positions of objects irrespective of the viewpoint.
Temporal Component discovers aspects that remain unchanged across viewpoints but
Task setup
In this section, the approach is employed to train robots to acquire manipulation skills, namely lifting and grasping blocks and opening and closing drawers, directly from a third-person-view video demonstration of a human (which has an entirely different embodiment) performing the task. The human demonstrations were collected in the real world, while the robot performed the tasks in simulation, showing that the unified representation approach focuses more on the task and less on the environment in
Conclusion
In this work, a framework for self-supervised and multi-view representation learning is proposed. We show that these representations could provide signals to train robots to learn manipulation skills like lifting and grasping objects of different shapes, sizes and colors and opening and closing drawers of different colors and types. This approach proved effective in overcoming the issues, namely, embodiment difference, domain shift, and viewpoint difference, typically present in imitation
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References (45)
- et al., A survey of robot learning from demonstration, Robot. Auton. Syst. (2009)
- et al., Learning human arm movements by imitation, Robot. Auton. Syst. (2001)
- et al., A kendama learning robot based on bi-directional theory, Neural Netw. (1996)
- et al., A syntactic approach to robot imitation learning using probabilistic activity grammars, Robot. Auton. Syst. (2013)
- et al., Transferring skills to humanoid robots by extracting semantic representations from observations of human activities, Artificial Intelligence (2017)
- et al., Novel best path selection approach based on hybrid improved A* algorithm and reinforcement learning, Appl. Intell. (2021)
- et al., LS-visiondraughts: Improving the performance of an agent for checkers by integrating computational intelligence, reinforcement learning and a powerful search method, Appl. Intell. (2014)
- et al., XCS-based reinforcement learning algorithm for motion planning of a spherical mobile robot, Appl. Intell. (2016)
- et al., Time-contrastive networks: Self-supervised learning from video
- et al., Deep spatial autoencoders for visuomotor learning
- Imitation learning for locomotion and manipulation
- Learning to select and generalize striking movements in robot table tennis, Int. J. Robot. Res.
- Autonomous helicopter aerobatics through apprenticeship learning, Int. J. Robot. Res.
- Learning motor primitives for robotics
- Efficient training of artificial neural networks for autonomous navigation, Neural Comput.
- Apprenticeship learning via inverse reinforcement learning
- Nonlinear inverse reinforcement learning with Gaussian processes
- Maximum entropy inverse reinforcement learning
- Maximum margin planning
- Bayesian inverse reinforcement learning
Rahul Choudhary is currently an undergraduate student at IIT Kharagpur, India. His area of research is Reinforcement Learning and Sequential Decision Making for various application domains, especially robotics.
Rahee Walambe received her M.Phil. and Ph.D. degrees from Lancaster University, UK, in 2008. Her area of research is applied Deep Learning and AI in the fields of Robotics and Healthcare. She is a recipient of a number of international research grants in the areas of robotics and AI.
Ketan Kotecha received his Ph.D. and M.Tech. from IIT Bombay and currently holds the positions of Head, Symbiosis Centre for Applied AI (SCAAI); Director, Symbiosis Institute of Technology; and Dean, Faculty of Engineering, Symbiosis International (Deemed University). He is an expert in AI and Deep Learning. He has published widely in a number of excellent peer-reviewed journals on topics ranging from cutting-edge AI to education policies, teaching–learning practices, and AI for all. He is a recipient of multiple international research grants and awards.