
1 Introduction

The purpose of this study was to explore a more natural method of manipulator teleoperation to optimize productivity and usability.

This work solves the problem of three-dimensional spatial position recognition between the operator's hand and the operation target, and integrates the spatial positions of the manipulator, the operation target, and the hand action into a unified mixed reality scene. It uses hand gesture recognition technology to convert the operator's hand actions into execution instructions for the manipulator, avoiding the complexity caused by separate translation and posture control with a handle.

According to the characteristics of the hand and the manipulator, a human-in-the-loop teleoperation control model and an immersive virtual operation scene were established, forming a fused virtual teleoperation scene that was used to identify a suitable visual gesture recognition method and teleoperation control model. Three-dimensional scene reconstruction gives the operator a clear understanding of the three-dimensional situation of the whole teleoperation scene and allows real-time perception of the spatial relationship between the operator, the manipulator, and the target, thus improving the execution efficiency of the teleoperation.

2 System Design

The research is based on interaction technology driven by computer vision, including hand motion recognition, the mapping relationship between the hand and the manipulator, the spatial position relationship of the target, and information enhancement in the virtual scene. It uses mature commercial depth cameras (such as the Kinect, Leapmotion, Intel Creative, or ZED camera) as recognition input devices to track hand motion. A multi-joint manipulator is used, with a camera installed on the wrist or another relevant part of the manipulator to obtain a depth image of the operating target. The spatial position of the operating target (a non-cooperative target) is recognized from the depth image and the measurement data output by the manipulator.

Using virtual reality technology, the operator's virtual hand action, the virtual manipulator, and the virtual operation target are generated in a unified virtual scene and presented to the operator stereoscopically. The operator perceives the operation scene visually through a wide-field-of-view 3D head-mounted display or a desktop 3D display device.

Following the technical route above and based on the basic principles of teleoperation, an experimental system based on visual recognition interaction was established. The operator's hand is recognized by computer vision, and a three-dimensional model of the hand action is built using an optimized recognition algorithm. The position of the operation target in space is identified through augmented reality technology and the manipulator's own measurement data. A unified scene is formed by fusing the manipulator, the spatial position of the manipulation target, and the hand movement. The operator perceives the operation scene by observing the virtual scene; the hand is recognized by computer vision, and the manipulator at the remote end is driven to complete the operation on the target object. The principle is shown in Fig. 1.

Fig. 1. Manipulator operating prototype based on machine vision

In the above system, the interactive visual recognition equipment establishes a consistent control representation between the hand and the manipulator, forming a complete man-in-the-loop teleoperation control system.

3 Hand Motion Recognition

There are many devices that provide hand pose data, such as the Intel RealSense, Leapmotion, and Kinect. Due to the high accuracy of the Leapmotion and its compatibility with the Oculus Rift [1], we chose the pose data provided by the Leapmotion for gesture recognition.

3.1 Gesture Recognition Algorithm

As previously reported, although the Leapmotion device provides only a limited set of relevant points and not a complete description of the hand shape [2], the device provides enough relevant points for gesture recognition, avoiding complex computations required when extracting gesture recognition from depth and color data.

The grabbing recognition algorithm proposed in this project supported two types of gestures [3]. The first was the grab gesture, in which the user curled his/her fingers inwards towards the palm center C, as depicted in Fig. 2. The grab value of the grab gesture, gc, was determined as follows:

Fig. 2. The features of the hand (grab)

$$ g_{c} = \frac{\sum_{i = 1}^{5} \left| F_{i} - C \right|}{5} $$
(1)

The other grabbing gesture modeled a pinching pose, in which the user clumped the fingertips together (Fig. 3). The grab value of the pinch gesture, gp, was computed in a similar fashion:

Fig. 3. Pinch gesture (color figure online)

$$ g_{p} = \frac{\sum_{i = 2}^{5} \left| F_{i} - F_{1} \right|}{4} $$
(2)

This algorithm simply calculated the average distance (Fig. 3, blue dashed lines) from the thumb tip to each of the other fingertips. Hence, if the fingertips were spread out, gp would be large; if the fingertips were clumped together, gp would be small.
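Both grab values can be computed directly from the tracked fingertip and palm positions. The Python sketch below implements Eqs. (1) and (2) under the assumption that the five fingertip positions and the palm center are available as 3-D vectors; the array layout is illustrative and is not the Leapmotion SDK's own data structure.

```python
import numpy as np

def grab_value(fingertips, palm_center):
    """Eq. (1): mean distance from the five fingertips F_i to the palm center C."""
    return float(np.mean([np.linalg.norm(f - palm_center) for f in fingertips]))

def pinch_value(fingertips):
    """Eq. (2): mean distance from the thumb tip F_1 to the other four fingertips."""
    thumb, others = fingertips[0], fingertips[1:]
    return float(np.mean([np.linalg.norm(f - thumb) for f in others]))

# Illustrative coordinates in millimetres: an open hand gives large values,
# curling the fingers (grab) or clumping the tips (pinch) drives them down.
palm = np.array([0.0, 200.0, 0.0])
tips = [np.array([x, 200.0, 80.0]) for x in (-40.0, -20.0, 0.0, 20.0, 40.0)]
print(grab_value(tips, palm), pinch_value(tips))
```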

In order for the FSM to transition into the gesture state, the grab value (gc or gp) must fall below a trigger threshold. For this research, the grab threshold (GT) was the same for both the grab gesture and the pinch gesture. It was calculated by multiplying the length of the metacarpal bone of the thumb (m) by a grab threshold constant α:

$$ G_{T} = \alpha \cdot m $$
(3)

The length of the metacarpal bone was extracted directly from the data provided by the Leapmotion sensor. Hence, the grab threshold differed for each user: users with bigger hands had a higher threshold, and vice versa.

The release threshold (RT) dictated when the hand transitioned out of the grabbing state. Like the grab threshold, it was determined by multiplying m by a release threshold constant β:

$$ R_{T} = \beta \cdot m $$
(4)

For this research \( \upbeta \ge \alpha \), because the release threshold needed to be at least as large as the grab threshold, reducing the chance that the user might inadvertently change their hand pose and cause the FSM to transition out of the gesture state (i.e., grab value ≥ RT). If the grab value was greater than GT but less than RT, the FSM remained in the gesture state.
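A minimal sketch of this hysteresis logic is shown below; the state names and the example values of α and β are placeholders, since the paper does not report the constants it used.

```python
def update_grab_state(state, grab_value, metacarpal_len, alpha=0.4, beta=0.6):
    """Two-state hysteresis built from Eqs. (3) and (4).

    G_T = alpha * m triggers the gesture state; the hand leaves it only when
    the grab value reaches R_T = beta * m (beta >= alpha), so small pose
    changes inside [G_T, R_T) do not cause an accidental release.
    """
    g_t = alpha * metacarpal_len   # grab threshold, Eq. (3)
    r_t = beta * metacarpal_len    # release threshold, Eq. (4)
    if state == "open" and grab_value < g_t:
        return "gesture"
    if state == "gesture" and grab_value >= r_t:
        return "open"
    return state
```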

3.2 Noise Suppression

Although, in theory, the tracking accuracy of the Leapmotion is at the sub-millimeter level, environmental noise causes the Leapmotion to incorrectly detect the hand pose state and position. Noise sources include room lighting, shadows, hand jumping [4], and visual blocking and numerical singularities. Prior research [5] has proposed a low-pass filtering method with an adaptive cut-off frequency; this study therefore used the real-time palm speed to remove hand pose states whose velocity is above a certain threshold. As an example, consider a history of all the hand states, where Hi represents the hand state at time frame i in a set H. The hand state includes properties such as hand position Pi, hand rotation Ri, and hand velocity Vi. Vi is determined by taking the absolute difference between the hand position in the current time frame and the average position of the previous five hand frames (including the current one):

$$ V_{i} = \left| P_{i} - \frac{\sum_{n = i - 4}^{i} P_{n}}{5} \right| $$
(5)

The low-pass filtering method filtered out all hand states Hi whose corresponding Vi was above a cutoff threshold Vcutoff. The new set of hand states \( \widehat{\text{H}} \) was rendered and used for gesture recognition:

$$ \widehat{H} = \left\{ {H_{i} |V_{i} < V_{cutoff} } \right\} $$
(6)

The filter cutoff is used to remove any noise detected by the Leapmotion that would cause the hand to jump to a completely different location in the virtual environment. For example, consider a set of five time frames of hand states \( {\text{H}}_{1} ,\,{\text{H}}_{2} ,\,{\text{H}}_{3} ,\,{\text{H}}_{4} ,\,{\text{H}}_{5} \), where all hand states were correctly detected by the Leapmotion except for H5, which was far from the user's actual hand position. V5 will be very high (greater than Vcutoff), and hence the low-pass filtering algorithm will not include H5 in the \( \widehat{\text{H}} \) set rendered for the gesture recognition process.
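A sketch of this velocity-based filter is given below; it follows Eqs. (5) and (6) directly and assumes the palm positions are available per frame as 3-D vectors.

```python
import numpy as np

def filter_hand_states(positions, v_cutoff):
    """Keep only hand states whose velocity estimate is below v_cutoff.

    positions: per-frame 3-D palm positions, oldest first.
    V_i is the distance between P_i and the mean of the last five positions
    (including P_i itself), Eq. (5); frames with V_i >= v_cutoff are dropped,
    Eq. (6).
    """
    kept = []
    for i, p in enumerate(positions):
        window = np.asarray(positions[max(0, i - 4):i + 1])
        v_i = np.linalg.norm(np.asarray(p) - window.mean(axis=0))
        if v_i < v_cutoff:
            kept.append(p)
    return kept
```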

3.3 Snapback Method

Implementing separate thresholds for the Gesture State (S1) and the Release State (S2) has its drawbacks. When the subject was in the act of releasing (going from S1 to S2) there was a period of time in which the object inadvertently shifted since RT had not been exceeded yet. During the period required to release the object (exceed RT), the object was still attached to the virtual hand and adhered to the Move State manipulation rules (S3); hence if the hand moved during the act of releasing, the object moved too.

To solve this issue and facilitate object stability during release, a snapback algorithm was designed to return an object to its position when the release was first initiated.

As shown in Figs. 4 and 5, a set of parameters was defined to implement the snapback algorithm: Grab Value (GV), Release Threshold (RT), Grab Threshold (GT), Slope Value (SV), and Slope Threshold (ST). As previously described, when GV fell below GT during grabbing, the program moved into the gesture state. A similar approach was used for releasing, except that a different threshold, RT, was used. SV measured how fast the grab value changed in any single frame. ST was used as a breakpoint in the snapback algorithm. Throughout the duration of the program, a history kept a list of the previous N hand states; call this H. Hence, \( {\text{H}}_{{{\text{N}} - 1}} \) is the hand state one frame before the current hand state, because it is the last index in the list, and H0 is the hand state N frames before. During the act of releasing, at the moment the grab value of the current hand frame rose above the release threshold, the program triggered a snapback, which retraced movement back through the previous N hand states stored in H (the index counter starts at N − 1 and decrements) until one of the following conditions was met:

Fig. 4. Snapback graphical representation. The top dashed line represents \( R_{T} \), the bottom dashed line represents \( G_{T} \), and the two vertical lines represent the start and end of the releasing phase.

Fig. 5. Snapback graphical representation of the slope. In general, \( S_{V} \) will be above \( S_{T} \) during the act of releasing.

$$ H_{index} \left[ {G_{V} } \right] < G_{T} $$
$$ H_{index} \left[ {S_{V} } \right] < S_{T} $$
$$ index = 0 $$

Condition one assessed whether the grab value of the hand state at that specific index was less than the grab threshold. Condition two assessed whether the slope value was less than the slope threshold. Lastly, condition three assessed whether the program had reached the end of the history (it cannot retrace further back). When any of these three conditions was met, say at \( {\text{index}} = {\text{T}} \), the object was returned (snapped back) to the position and rotation state at the time frame associated with \( {\text{H}}_{\text{T}} \). This was done by calculating the deviation in position (dP) and the deviation in angle (dA) between the current hand state (Current) and the hand state it was snapped back to (HT):

$$ dP = H_{T} \left[ {position} \right] - Current[position] $$
(7)
$$ dA = Quaternion.inverse\left( {H_{T} \left[ {rotation} \right], Current\left[ {rotation} \right]} \right) $$
(8)

where dP and dA were used to rotate and translate the object back to the state corresponding to HT:

$$ Object_{new} \left[ {position} \right] = Object_{old} \left[ {position} \right] + dP $$
(9)
$$ Object_{new} \left[ {rotation} \right] = Quaternion.rotate(Object_{old} \left[ {rotation} \right], dA) $$
(10)

For example, in Fig. 4, \( {\text{R}}_{\text{T}} = 0.13 \), \( {\text{G}}_{\text{T}} = 0.03 \), and \( {\text{S}}_{\text{T}} = 0.001 \). The program retraced back through the time frames (Fig. 4, section between the two vertical lines) until \( {\text{G}}_{\text{V}}\,<\,{\text{G}}_{\text{T}} \) at time frame 248, and the object being grabbed snapped back to the position and rotation values corresponding to that frame. If condition two had been met before condition one, the breakpoint would instead have occurred when \( {\text{S}}_{\text{V}} < 0.001 \).
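The sketch below combines the retrace loop with the pose restoration of Eqs. (7)-(10). Hand states are plain dictionaries and rotations are scipy Rotation objects; these stand in for the history container and quaternion type of the original system, which are not specified in the paper.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def snapback_index(history, g_t, s_t):
    """Walk back through the last N hand states until G_V < G_T, S_V < S_T,
    or the start of the history is reached (the three stop conditions)."""
    for index in range(len(history) - 1, -1, -1):
        h = history[index]
        if h["grab_value"] < g_t or h["slope_value"] < s_t:
            return index
    return 0

def apply_snapback(obj_pos, obj_rot, current, snapback_state):
    """Eqs. (7)-(10): move the object back to the pose it had at the snapback frame."""
    d_p = snapback_state["position"] - current["position"]           # Eq. (7)
    d_a = snapback_state["rotation"] * current["rotation"].inv()     # Eq. (8), relative rotation
    return obj_pos + d_p, d_a * obj_rot                              # Eqs. (9) and (10)
```

In the Fig. 4 example, snapback_index would stop at the frame where the grab value first drops below G_T (time frame 248), and apply_snapback would then restore the object's pose at that frame.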

4 Grasp Rules

When grasping objects in a virtual system, grasping rules must be established to ensure natural grasping operations. When the grasping rules are satisfied between the virtual hand and a virtual object, the virtual hand is considered to have grasped the object. From then on, the virtual hand and the virtual object maintain the grasping relationship; that is, the base coordinate system of the object is attached to the virtual hand, so that the movement of the virtual hand and other operations control the movement or other states of the virtual object. When the operation is completed, the state of the virtual hand is switched, for example by loosening the fingers, so that the virtual hand and the virtual object are detached. The grasping rules are then no longer satisfied and the grasping relationship is released.

4.1 Grasp Rules Based on Normal Vector of Point Contact Plane

According to the relationship between grasping posture and grasping stability, combined with the geometric and material characteristics of the object, this paper formulates an improved virtual hand grasping rule based on the normal vectors of the point-contact planes. This rule guarantees the correctness and naturalness of grasping and consists of the following two parts:

(1) Position Judgment of the Point Contact Method. Two or more fingers, including the thumb, must be in contact with the object; or three or more fingers must be in contact with the object, with at least three contact points not lying on a straight line.

(2) Normal Vector Judgment of the Point Contact Method. The angle between the normal vectors of the contact planes must be greater than a threshold angle, tentatively defined as 90°.

As shown in Fig. 6 below, three fingers (thumb, index finger, and middle finger) are in contact with the cube (satisfying Rule 1). The three contact-plane normal vectors are N1, N2, and N3 (N3, the normal vector at the edge of the object, is assumed to pass through the centroid G), and the three pairwise angles \( {\text{q}}_{12} \), \( {\text{q}}_{13} \), and \( {\text{q}}_{23} \) between them are all greater than 90° (satisfying Rule 2), so the object is considered to be grasped.

Fig. 6. Grasp rules based on geometry
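A sketch of the two-part check is shown below, assuming each contact reports which finger made it, its contact point, and the unit normal of its contact plane; this contact representation is illustrative rather than taken from the paper.

```python
import numpy as np

def rule1_satisfied(contact_fingers, contact_points):
    """Rule 1: either >= 2 contacting fingers including the thumb, or >= 3
    contacting fingers whose contact points are not all on one line."""
    if "thumb" in contact_fingers and len(contact_fingers) >= 2:
        return True
    if len(contact_fingers) >= 3:
        p = np.asarray(contact_points)
        # Points are non-collinear if they span a plane (rank >= 2 after centering).
        return np.linalg.matrix_rank(p - p[0], tol=1e-6) >= 2
    return False

def rule2_satisfied(normals, threshold_deg=90.0):
    """Rule 2: every pairwise angle between contact-plane normals exceeds the threshold."""
    units = [np.asarray(n) / np.linalg.norm(n) for n in normals]
    for i in range(len(units)):
        for j in range(i + 1, len(units)):
            angle = np.degrees(np.arccos(np.clip(np.dot(units[i], units[j]), -1.0, 1.0)))
            if angle <= threshold_deg:
                return False
    return True

# Three contact normals roughly 120 degrees apart, as in the Fig. 6 example:
normals = [np.array([1.0, 0.0, 0.0]),
           np.array([-0.5, 0.866, 0.0]),
           np.array([-0.5, -0.866, 0.0])]
print(rule2_satisfied(normals))  # True: all pairwise angles are 120 degrees
```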

Grasp rules based on geometry can basically guarantee natural interaction and meet the immersion requirements of a virtual reality system. The virtual hand grasping rules based on the normal vectors of the point-contact planes have the following characteristics:

(1) Misoperations can be effectively avoided. The virtual hand inevitably contacts other objects during motion, but an object can be grasped only when Rules 1 and 2 are satisfied at the same time, and Rule 2 filters out accidental contacts.

(2) Computational complexity is reduced. Rule 2 is evaluated only when Rule 1 is satisfied, so the normal vectors and angles of the contact planes do not need to be computed in real time throughout the whole process.

(3) Grasping stability is improved. When the virtual hand grasps an object, the hand inevitably shakes slightly. The judgment rules tolerate this slight shaking, avoiding unintended release.

(4) The threshold can be set according to the material of the object, improving the realism of grasping.

(5) Rule 2 only deals with the normal vectors and angles of the contact planes, so the rule applies not only to grasping simple objects but also to grasping complex objects.

4.2 Research on Normal Vector Threshold of Point Contact Plane

First, the concept of anti-interference stable grasping is proposed. As shown in Fig. 7, taking a two-finger grasp as an example, the force provided by each fingertip is represented by a friction cone, which is the range of the resultant of pressure and friction. \( F_{e} \), the resultant of the external forces exerted on the object other than those applied by the hand, is the interference force. The maximum components of the fingertip contact forces along the direction opposite to \( F_{e} \) are recorded as anti-interference forces \( f_{1} \) and \( f_{2} \), respectively. If \( f_{1} + f_{2} \ge F_{e} \) is satisfied, the object can be grasped steadily with two fingers; otherwise it cannot. In an m-finger grasp, the anti-interference forces \( f_{1} ,\,f_{2} , \ldots ,\,f_{m} \) are the maximum components of the contact forces along the direction opposite to the interference force \( F_{e} \). If \( f_{1} + f_{2} + \cdots + f_{m} \ge F_{e} \) is satisfied, anti-interference stable grasping is achieved; otherwise, it is not.

Fig. 7. A simple judgment of grasp force
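The condition can be evaluated numerically once a contact model is fixed. The sketch below assumes Coulomb friction with coefficient mu and a bounded normal force at each fingertip; these modelling choices are assumptions of the sketch, not parameters given in the paper.

```python
import numpy as np

def anti_interference_force(contact_normal, normal_force, mu, f_e):
    """Maximum component of one fingertip's friction-cone force opposite to F_e.

    contact_normal: unit normal of the contact (pointing into the object);
    normal_force: magnitude of the normal force the fingertip can apply;
    mu: Coulomb friction coefficient; f_e: interference force vector F_e.
    """
    u = -np.asarray(f_e) / np.linalg.norm(f_e)       # direction the grasp must resist
    cos_t = float(np.clip(np.dot(contact_normal, u), -1.0, 1.0))
    sin_t = np.sqrt(1.0 - cos_t ** 2)
    # Best achievable component along u from a force inside the friction cone.
    return normal_force * (cos_t + mu * sin_t)

def anti_interference_stable(normals, normal_forces, mu, f_e):
    """m-finger condition: f_1 + f_2 + ... + f_m >= |F_e|."""
    total = sum(anti_interference_force(n, nf, mu, f_e)
                for n, nf in zip(normals, normal_forces))
    return total >= np.linalg.norm(f_e)
```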

5 Manipulator Mapping

An attitude estimation method is used to calculate the hand attitude when interacting with the manipulator. The motion of the manipulator arm is controlled by the palm information (position and posture), and the motion of the robot hand is controlled by the finger posture. After the relative coordinates of the hand are obtained, a coordinate transformation is carried out and the hand coordinates are mapped to the manipulator. The coordinate systems satisfy the right-hand rule. The incremental information of the hand motion is used to control the end motion of the manipulator, so only the coordinate rotation is needed, without concern for the initial offset. The coordinate transformation relationship is shown in Fig. 8.

Fig. 8. Mapping diagram of the two coordinate systems

Define the point coordinates in the Leapmotion coordinate system as (x, y, z) and the corresponding point coordinates in the manipulator coordinate system as (x1, y1, z1), satisfying the following transformation relation:

$$ \begin{aligned} \left[ {\begin{array}{*{20}c} {x1} \\ {y1} \\ {z1} \\ \end{array} } \right] & = \left[ {\begin{array}{*{20}c} 1 & 0 & 0 \\ 0 & {\cos ( - \pi /2)} & {\sin ( - \pi /2)} \\ 0 & { - \sin ( - \pi /2)} & {\cos ( - \pi /2)} \\ \end{array} } \right] \times \left[ {\begin{array}{*{20}c} {\cos ( - \pi /2)} & 0 & { - \sin ( - \pi /2)} \\ 0 & 1 & 0 \\ {\sin ( - \pi /2)} & 0 & {\cos ( - \pi /2)} \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} x \\ y \\ z \\ \end{array} } \right] \\ & = \left[ {\begin{array}{*{20}c} 1 & 0 & 0 \\ 0 & 0 & { - 1} \\ 0 & 1 & 0 \\ \end{array} } \right] \times \left[ {\begin{array}{*{20}c} 0 & 0 & 1 \\ 0 & 1 & 0 \\ { - 1} & 0 & 0 \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} x \\ y \\ z \\ \end{array} } \right] = \left[ {\begin{array}{*{20}c} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} x \\ y \\ z \\ \end{array} } \right] = \left[ {\begin{array}{*{20}c} z \\ x \\ y \\ \end{array} } \right] \end{aligned} $$
(11)
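The composed rotation in Eq. (11) reduces to a fixed axis permutation, so the mapping can be applied as a single matrix multiplication; the sketch below simply encodes that constant matrix.

```python
import numpy as np

# Net rotation from Eq. (11): Leapmotion coordinates map to the manipulator
# frame by the axis permutation (x, y, z) -> (z, x, y).
R_LEAP_TO_ARM = np.array([[0.0, 0.0, 1.0],
                          [1.0, 0.0, 0.0],
                          [0.0, 1.0, 0.0]])

def leap_to_manipulator(point):
    """Map a point (or position increment) from the Leapmotion frame to the manipulator frame."""
    return R_LEAP_TO_ARM @ np.asarray(point)

print(leap_to_manipulator([10.0, 20.0, 30.0]))  # -> [30. 10. 20.]
```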

6 Scene Fusion

This paper addresses three-dimensional rendering based on a video stream. The discrete left-eye and right-eye frame images in the video captured by the camera are processed and sent to the left and right display screens of the 3D helmet respectively, forming a stereoscopic image.
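A minimal sketch of preparing the per-eye images from the video stream is given below, assuming the stereo camera delivers the left and right views packed side by side in one frame; the actual capture and helmet display APIs used in the system are not specified here.

```python
import numpy as np

def split_stereo_frame(frame):
    """Split a side-by-side stereo frame (H x 2W x 3) into left- and right-eye images."""
    height, width, _ = frame.shape
    half = width // 2
    return frame[:, :half], frame[:, half:]

# Hypothetical side-by-side frame with two 1280x720 views packed horizontally.
frame = np.zeros((720, 2560, 3), dtype=np.uint8)
left_eye, right_eye = split_stereo_frame(frame)
print(left_eye.shape, right_eye.shape)  # (720, 1280, 3) twice
```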

When rendering, the projection axes must be parallel to each other and the left and right views are completely independent. In addition to the special settings required for the position relationship between the camera and the eye, the other parameters of the camera are very similar to those used in conventional non-stereo rendering.

During rendering, it was found that full-screen rendering at high resolution may drop the frame rate below 60 fps, which seriously degrades the user experience. This paper improves rendering performance by reducing the FOV and by queuing ahead of time.

Reducing the FOV to improve performance shrinks the field of view and may reduce immersion, but an insufficient frame rate may lead to vertigo and other problems. This paper therefore reduces the FOV to lower the fill rate and improve performance: for a fixed pixel density on the retina, a smaller FOV covers fewer pixels. When fewer objects are visible in each frame, there are fewer animations, fewer state changes, and fewer draw calls.

In order to improve the parallelism of the CPU and GPU and enhance the GPU's frame processing capacity, a queue-ahead method is used in the rendering process. When queuing ahead is disabled, the CPU starts processing the next frame immediately after the last frame is displayed; if the GPU cannot finish it in time, the previous frame is displayed again, which causes the picture to jitter. When queuing ahead of time is enabled, the CPU can start earlier, which gives the GPU more time for frame processing and makes the scene display more smoothly (Fig. 9).

Fig. 9. Three-dimensional rendering based on video stream

7 Experiments

7.1 Subjects

A total of 15 subjects participated in the experiment, including 10 males, with an average age of 32 years. Two of them had experience in using virtual reality systems.

7.2 Task and Experimental Setup

The experiment tests the usability and effectiveness of manipulator teleoperation based on visual gestures.

The experimental system uses two modes:

(1) Gesture mode: the Leapmotion is used for hand data acquisition and the Oculus CV1 for scene display.

(2) Handle mode: a handle is used for control input, with the scene displayed by the helmet or a flat-panel display.

The composition of the experimental system is shown in the following Fig. 10.

Fig. 10. Schematic diagram of the experimental system

The subjects controlled the manipulator by teleoperation to shoot at a fixed paper target. The shooting time and accuracy were recorded to evaluate the accuracy and effectiveness of the operation process. The teleoperation mode based on visual gestures was compared with that based on the traditional handle. The experimental scenario is shown in Fig. 11.

Fig. 11. Experimental scenario

Standard target images consisting of 10 rings were used in the experiment. The center of the target was a solid black circle with a diameter of 4 mm, and the radius of each successive outer ring increased by 2 mm.
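Under the stated geometry, the ring value of a shot can be recovered from its radial offset from the target center. The sketch below assumes the conventional scoring where the 4 mm central circle counts as 10 rings and each successive 2 mm band counts one ring less; the paper does not state the scoring rule explicitly, so this is an assumption.

```python
import math

def ring_score(offset_mm):
    """Rings scored for a shot landing offset_mm from the target center.

    Assumed geometry: central circle of radius 2 mm scores 10, each further
    2 mm-wide ring scores one less, and anything beyond 20 mm scores 0.
    """
    band = math.ceil(offset_mm / 2.0) if offset_mm > 0 else 1
    return max(0, 11 - band)

print(ring_score(1.5), ring_score(5.0), ring_score(25.0))  # 10 8 0
```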

The experiment was divided into two parts: (1) using the handle and helmet to control the manipulator to shoot, and (2) using gestures and the helmet to control the manipulator to shoot. The experimental results are recorded in Table 1.

Table 1. Test result record table template

Each participant completed two groups of experiments: gesture operation and handle operation, each group operated 10 times.

7.3 Results

The results of the two experiments are given in Tables 2 and 3.

Table 2. The result of gesture group
Table 3. The result of handle group

As shown in Fig. 12, the task completion time for both the handle operation and the gesture operation decreases and stabilizes as the number of operations increases; the gesture operation is consistently about 13 s faster than the handle operation. As shown in Fig. 13, the number of rings hit increases and stabilizes as the number of trials increases; the rings hit with the handle operation are consistently higher than those with the gesture operation. As shown in Table 4, the gesture operation involves greater difficulty and operation load.

Fig. 12. Operation time comparison diagram

Fig. 13. Rings comparison diagram

Table 4. Comparison of operation load and experimental difficulty

8 Discussion and Conclusion

This paper recorded the subjective feelings of 15 participants, including:

(1) Compared with the gesture operation, the handle operation has a more obvious delay.

(2) The gesture operation is more intuitive than the handle operation.

(3) There is a certain difference between the image seen in the HMD and the scene observed with the naked eye in reality.

(4) The handle operation can control the movement of the manipulator more accurately.

The analysis shows that in the handle operation mode the rotation angle of the manipulator requires a mapping calculation, whereas in the gesture operation mode the hand can be mapped to the manipulator directly, so the handle operation has a relatively obvious delay. The manipulator can be precisely controlled with the keys in the handle operation mode, so the accuracy of the handle operation is clearly better than that of the gesture operation. The visual difference in the HMD is caused by the discrepancy between the spacing of the two cameras of the stereo camera and the actual distance between the human eyes.

Experiments show that the accuracy of the gesture operation is slightly lower than that of the handle operation, but it is easier to operate, and the operation is natural and smooth, which accords with natural human-computer interaction habits. For scenes with low precision requirements, the advantages of gesture operation are obvious, and it can serve as an operation mode for teleoperation. It has wide application prospects in teleoperation systems and tasks such as space stations, extraterrestrial exploration, robots, and UAVs.