1 Introduction

Recently, virtual bands and orchestras have become a hot research topic in the field of virtual reality. The Mendelsohn museum's virtual orchestra adopts Leap Motion [1] to recognize gestures. The dynamic matrix sound system (DMS system) [2], developed by the Key Laboratory of Media Audio & Video, Ministry of Education, is a multi-channel input and output sound system based on source synthesis and Huygens' principle. Gesture recognition-based conducting is one of the most important foundations of a virtual band. Kinect and Leap Motion are commonly used for hand gesture recognition, but Leap Motion's motion capture range is limited, and Kinect cannot accurately identify hand gestures. Besides, both devices adapt poorly because development must go through their dedicated SDKs. Nowadays, most mobile terminal devices such as iPhones and tablets are equipped with a monocular camera, which makes them more portable than Kinect or Leap Motion.

In the area of hand gesture recognition and target tracking, Angelov et al. proposed the Recursive Density Estimation technique and used SIFT features to find key points in a rectangular box area, achieving object tracking and gesture recognition [3]. JongShill Lee et al. segmented hand gesture information from video sequences with complex backgrounds and achieved a recognition rate of 95% [4]. The gesture library of OpenCV can realize gesture recognition based on Hu moments and stereo vision. Jin-Yinn et al. identified and extracted features with B-spline curves and adopted the Bayes rule for target classification and gesture recognition [5]. Xiuhui Wang et al. collected hand gesture information from multiple cameras and combined it with hand model features for database matching and training [6]. Haibing Ren et al. completed skin color segmentation under complex backgrounds [7]. Jiyu Zhu et al. constructed two-dimensional hand gestures by extracting the palm center and palm centroid information [8]. Zhiyong Xiao et al. accomplished remote computer control through hand gesture recognition [9].

Inspired by gesture recognition and VR technology, we design a virtual orchestra studio to present music. To increase the portability of the system, we use mobile terminal devices with a monocular camera to capture gestures. Based on the Xcode platform and dynamic hand gesture recognition technology, we developed a virtual music control system. This paper is organized as follows: Sect. 1 introduces the background of our study and related work on gesture recognition. Section 2 presents the virtual music control system in detail. Section 3 introduces the algorithms we adopt. Section 4 describes our experiments. Lastly, Sect. 5 gives conclusions and future prospects.

2 Pipeline of the Virtual Music Control System

There are three modules in the virtual music control system: the control terminal, the client terminal, and the server. Firstly, the control terminal recognizes hand gestures from the images captured by the mobile device and generates control instructions. Secondly, the control instructions are sent to the client terminal via the server; the server is indispensable for ensuring real-time communication. Thirdly, the client terminal receives and responds to the instructions. Figure 1 shows the pipeline of the virtual music control system.

Fig. 1. Structure of the virtual music control system.

2.1 Control Terminal

The control terminal is responsible for analyzing hand gestures. In this section, we design a gesture recognition APP. After capturing a hand gesture, the control terminal performs image preprocessing, gesture segmentation, dynamic gesture tracking, contour feature extraction, gesture recognition, and gesture instruction definition and transmission. Based on this process, the control terminal is designed as shown in Fig. 2.

Fig. 2. Control terminal.

Our virtual music control system realizes many music control functions, such as play, pause, rhythm change, and song switching. With the number of fingers and the direction of the hand trajectory, we can define different instructions. When gesture recognition is completed, we begin to detect the number of fingers. If 30 consecutive frames are all determined to contain five fingers, the music is paused. If the number of fingers is two, songs are switched: a leftward trajectory plays the last song, and a rightward trajectory plays the next song. If the number of fingers is four, the music rhythm is changed: a leftward trajectory speeds the rhythm up, and a rightward trajectory slows it down. The instructions for music control are defined as shown in Fig. 3, and a code sketch of this mapping follows the figure.

Fig. 3. Gesture instructions for music.
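To make this mapping concrete, the following is a minimal Python sketch of the instruction logic. The command strings and the function name are hypothetical; only the 30-frame agreement rule and the finger/direction cases come from the text above.

```python
# Hypothetical mapping from (finger count, trajectory direction) to music
# commands; only the 30-frame rule and the cases come from the paper.
FRAMES_REQUIRED = 30  # all 30 frames must agree before a command fires

def to_command(finger_counts, direction):
    """finger_counts: per-frame finger counts; direction: 'left' or 'right'."""
    if len(finger_counts) < FRAMES_REQUIRED or len(set(finger_counts)) != 1:
        return None                       # gesture not yet stable
    fingers = finger_counts[0]
    if fingers == 5:
        return "PAUSE"
    if fingers == 2:                      # switch songs
        return "PREV_SONG" if direction == "left" else "NEXT_SONG"
    if fingers == 4:                      # change rhythm
        return "SPEED_UP" if direction == "left" else "SLOW_DOWN"
    return None                           # unrecognized gesture
```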

2.2 Client Terminal

The virtual music control system realizes music control by receiving instructions from the control terminal. To make the client run well anywhere in the LAN, we build a web version of the music player, which ensures good compatibility. It is easy for users to operate because they only need to open a web page instead of downloading a particular program: opening a browser on any device in the same local area network (LAN) and entering the client address starts our web music player.

2.3 Server

The server module is responsible for information transmission and real-time communication between the control terminal and the client terminal. The WebSocket protocol is used in the construction of the web page, as it satisfies the demand for real-time communication. In traditional real-time communication methods such as Ajax polling, the browser must constantly send requests to the server and hold long-lived links, which consumes a lot of bandwidth and server resources. The WebSocket protocol, introduced with HTML5, establishes a long-lived TCP connection between browser and server, lets the server actively push data to the client, and realizes bidirectional real-time communication. What's more, we can directly use JavaScript to achieve communication in any browser that supports WebSocket.
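As an illustration of this relay pattern (not the Jetty implementation used in Sect. 4.4), here is a minimal WebSocket relay sketch in Python using the third-party websockets package; the port number and the plain-string message format are assumptions.

```python
# Minimal WebSocket relay sketch (assumes: pip install "websockets>=10");
# the port number and plain-string commands are illustrative.
import asyncio
import websockets

clients = set()  # all currently connected endpoints

async def handler(ws):
    clients.add(ws)
    try:
        async for message in ws:  # e.g. "PAUSE" from the control terminal
            # push the command to every other endpoint (the web players)
            await asyncio.gather(*(c.send(message)
                                   for c in clients if c is not ws))
    finally:
        clients.discard(ws)

async def main():
    async with websockets.serve(handler, "0.0.0.0", 8080):
        await asyncio.Future()  # serve forever

asyncio.run(main())
```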

3 Algorithm Description

In this section, we introduce the hand gesture algorithms used in our system. Following the order of the gesture processing pipeline, Part A introduces the main methods of image preprocessing and two gesture segmentation methods. Part B introduces the dynamic gesture tracking method and proposes our improved algorithm. Part C introduces contour-based feature extraction together with fingertip calibration and detection. Lastly, Part D introduces the gesture recognition methods and the related gesture commands.

3.1 Hand Gesture Segmentation

Hand gesture segmentation based on skin color uses the skin color of the hand as a significant characteristic, because the characterization of human skin color is not influenced by different capture devices [10]. However, this method alone cannot fully realize exact segmentation and contour extraction. We therefore improve the segmentation procedure as follows.

Firstly, the color images captured by the camera are preprocessed, and a smoothing filter is adopted to eliminate noise interference [11]. Secondly, we segment the skin color region as fully as possible and generate the binary image D1. Thirdly, we separate the dynamic gesture from the static background to generate the dynamic gesture binary image D2. Finally, D1 and D2 are merged to obtain the complete binary gesture image. Figure 4 shows the gesture segmentation process in detail, and a code sketch follows the figure.

Fig. 4. Gesture segmentation process.
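The following Python/OpenCV sketch illustrates this two-branch segmentation; the YCrCb skin range, the difference threshold, and the bitwise-OR merge rule are assumptions, since the paper does not specify them.

```python
# Two-branch gesture segmentation sketch; the skin range, threshold,
# and OR merge are assumptions not fixed by the paper.
import cv2

def segment_gesture(frame, background_gray):
    smoothed = cv2.GaussianBlur(frame, (5, 5), 0)   # smoothing filter

    # Branch 1: skin-color segmentation in YCrCb space -> binary image D1
    ycrcb = cv2.cvtColor(smoothed, cv2.COLOR_BGR2YCrCb)
    d1 = cv2.inRange(ycrcb, (0, 133, 77), (255, 173, 127))

    # Branch 2: difference against the static background -> binary image D2
    gray = cv2.cvtColor(smoothed, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(gray, background_gray)
    _, d2 = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)

    # Merge D1 and D2 into the complete binary gesture image
    return cv2.bitwise_or(d1, d2)
```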

3.2 Dynamic Hand Gesture Tracking

The flexibility of the human hand leads to large gesture transformations, and an interactive recognition system places high demands on real-time performance. As the time interval between adjacent frames is very short, the difference in hand gesture between adjacent frames is very small, so the motion between two adjacent frames can be treated as uniform motion. We adopt a dynamic hand gesture tracking algorithm based on the Cam-shift tracking algorithm and Kalman filter prediction. The detailed algorithm process follows.

Firstly, we initialize the tracking area and run the Cam-shift tracking algorithm. If tracking fails or the hand is occluded in the process, we retain the search window for a second round of tracking and searching. Secondly, we adopt a Kalman filter for real-time prediction and update the search window alongside Cam-shift tracking. Thirdly, we keep updating and following up until the hand gesture is tracked. The adaptability of the plain Cam-shift algorithm is limited, and our gesture tracking process mitigates this weakness; a code sketch of one tracking step is given below.
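A minimal sketch of one step of this Cam-shift + Kalman loop follows. The constant-velocity state model matches the uniform-motion assumption above, while the window handling and noise settings are illustrative choices.

```python
# Cam-shift + Kalman tracking sketch; noise settings and the
# lost-track recovery policy are illustrative assumptions.
import cv2
import numpy as np

kalman = cv2.KalmanFilter(4, 2)          # state (x, y, vx, vy), measure (x, y)
kalman.transitionMatrix = np.array([[1, 0, 1, 0],
                                    [0, 1, 0, 1],
                                    [0, 0, 1, 0],
                                    [0, 0, 0, 1]], np.float32)
kalman.measurementMatrix = np.array([[1, 0, 0, 0],
                                     [0, 1, 0, 0]], np.float32)
kalman.processNoiseCov = 1e-3 * np.eye(4, dtype=np.float32)

term = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)

def track_step(back_proj, window):
    """One step; back_proj is the hue back-projection, window is (x, y, w, h)."""
    pred = kalman.predict()                          # predicted hand center
    rot_rect, new_window = cv2.CamShift(back_proj, window, term)
    if rot_rect[1][0] > 0:                           # Cam-shift succeeded
        cx = new_window[0] + new_window[2] / 2.0
        cy = new_window[1] + new_window[3] / 2.0
        kalman.correct(np.array([[cx], [cy]], np.float32))
        return new_window
    # lost or occluded: retain the window size, recenter on the prediction
    x = int(pred[0, 0]) - window[2] // 2
    y = int(pred[1, 0]) - window[3] // 2
    return (x, y, window[2], window[3])
```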

3.3 Feature Extraction Based on Contour

In this part, we conduct feature extraction based on the gesture contour, including smoothing, convex point detection, and fingertip calibration. We use open-source library functions, built on Canny's computational theory of edge detection, to detect the convex hull and convexity defects of the finger contours. Next, we smooth the finger contour curve to eliminate contour defects. Figure 5 shows the effect of smoothing.

Fig. 5. Left: before finger contour smoothing; right: after finger contour smoothing.

The curves at the fingertips and finger seams are more sharply bent than the rest of the outer contour. We set a threshold to detect fingertips and finger seams, and then distinguish fingertips from finger seams. We use the image transmission method [12] to detect the number of fingers. Figure 6 shows the image transmission method applied to the hand; a finger counting sketch follows the figure.

Fig. 6. Image transmission method applied to the hand.
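Since the image transmission method itself is detailed in [12], the sketch below instead counts fingers with OpenCV's convexity-defect API, a common alternative that follows the same fingertip/finger-seam idea; the depth and angle thresholds are assumptions.

```python
# Finger counting via convexity defects (an alternative to [12]);
# the depth and angle thresholds are illustrative assumptions.
import cv2
import numpy as np

def count_fingers(mask):
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return 0
    hand = max(contours, key=cv2.contourArea)   # largest blob = hand
    hand = cv2.approxPolyDP(hand, 3, True)      # smooth the contour
    hull = cv2.convexHull(hand, returnPoints=False)
    defects = cv2.convexityDefects(hand, hull)
    if defects is None:
        return 0
    seams = 0
    for s, e, f, depth in defects[:, 0]:
        start, end, far = hand[s][0], hand[e][0], hand[f][0]
        a, b = start - far, end - far
        cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-6)
        # deep, sharply bent defects are finger seams
        if depth > 10000 and cos > 0:           # angle below 90 degrees
            seams += 1
    return seams + 1 if seams else 0            # n seams -> n + 1 fingers
```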

3.4 Gesture Recognition

In our experiment, gesture recognition is based on contour recognition, which is achieved through template matching. We sum over all points on the contour to obtain the contour moments. The contour moment of order \( \left( {\upalpha ,\;\upbeta} \right) \) is defined as follows:

$$ m = m_{\alpha ,\beta } = \sum\nolimits_{i = 1}^{n} {I(x_{i} ,y_{i} )\,x_{i}^{\alpha } y_{i}^{\beta } } $$
(1)

Here α is the moment order along the x axis and β the order along the y axis, and the sum in (1) runs over all n contour points \( (x_{i}, y_{i}) \) with pixel values \( I(x_{i}, y_{i}) \). When \( \upalpha\; = \;0 \) and \( \upbeta\; = \;0 \), \( m \) simply counts the points on the boundary of the gesture contour.

The Hu moments are a kind of contour moment with the characteristic of geometric invariance. Using normalized central moments, the Hu moments define several second- and third-order invariants with rotation, translation, and scaling invariance. The following two invariants, built from the normalized central moments \( h_{\alpha \beta} \), work well for 2D gesture contours:

$$ M_{1} = h_{20} + h_{02} $$
(2)
$$ M_{2} = \left( {h_{20} - h_{02} } \right)^{2} + 4h_{11}^{2} $$
(3)
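As a sanity check, these two invariants can be computed directly from OpenCV's normalized central moments (nu20, nu11, nu02; our \( h_{\alpha \beta} \)) and compared against cv2.HuMoments; the rectangle contour below is only an illustrative input.

```python
# Verifying Eqs. (2) and (3) against OpenCV's Hu moments;
# the rectangle contour is an arbitrary illustrative shape.
import cv2
import numpy as np

contour = np.array([[0, 0], [40, 0], [40, 60], [0, 60]],
                   np.int32).reshape(-1, 1, 2)
m = cv2.moments(contour)                 # spatial, central, normalized moments
M1 = m["nu20"] + m["nu02"]                                 # Eq. (2)
M2 = (m["nu20"] - m["nu02"]) ** 2 + 4 * m["nu11"] ** 2     # Eq. (3)
hu = cv2.HuMoments(m)                    # OpenCV's seven Hu invariants
assert np.isclose(hu[0, 0], M1) and np.isclose(hu[1, 0], M2)
```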

After acquiring the characteristic data of the gesture image, we invoke the cvMatchShapes function in OpenCV to compute the similarity of two profiles. The smaller the result, the more similar the two contours. Our hand gesture templates are shown in Fig. 7, and a matching sketch follows the figure.

Fig. 7. Hand gesture template.
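A sketch of the template matching step is given below; cv2.matchShapes is the modern binding of the cvMatchShapes routine, and CONTOURS_MATCH_I1 compares Hu-moment based invariants, so a smaller score means more similar contours. The function name best_template is hypothetical.

```python
# Template matching sketch; lower score = more similar contours.
import cv2

def best_template(contour, templates):
    """Return the index of the template contour most similar to `contour`."""
    scores = [cv2.matchShapes(contour, t, cv2.CONTOURS_MATCH_I1, 0.0)
              for t in templates]
    return min(range(len(scores)), key=scores.__getitem__)
```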

4 Experiments

4.1 Motion Tracking Algorithm

To test the recognition performance of our dynamic hand gesture tracking algorithm, we set up a database composed of 100 hand movement scenes. Table 1 shows the contrast in gesture recognition rate between the traditional Cam-shift algorithm and the improved algorithm. It clearly shows that the accuracy of the improved Cam-shift algorithm is higher than that of the traditional Cam-shift algorithm.

Table 1. Comparison of the algorithms' recognition rates.

4.2 Control Terminal

In the experiments, the Xcode platform and an iTouch 5 are used. We write a gesture recognition program in Xcode and adopt the iTouch 5 to capture dynamic hand gestures. Once the Xcode program compiles successfully, the iTouch 5 installs a gesture recognition APP named 'Gesture Recognizer'. After a series of gesture analyses, the Gesture Recognizer APP can distinguish hand gestures. According to the different gestures, the program generates different instructions and sends them to the client terminal to control the music. Figure 8 shows the Gesture Recognizer APP on the iTouch and the startup interface of the APP. Figure 9 shows the recognition results for the number of open fingers.

Fig. 8. The Gesture Recognizer APP on the iTouch and its startup interface.

Fig. 9. Recognition results of the number of open fingers.

4.3 Client Terminal

The virtual music control system realizes music control by receiving the gesture-derived instructions. We use JavaScript, CSS, and HTML5 to implement the web music player. JavaScript implements the interaction module of the player, CSS creates its buttons and background image, and the HTML5 <audio> element provides powerful audio functions such as play, playback, seeking, and buffering. Figures 10, 11, and 12 show how the interface of the web music player changes with the songs.

Fig. 10. Interface of song 1 of the web music player.

Fig. 11. Interface of song 2 of the web music player.

Fig. 12. Interface of song 3 of the web music player.

4.4 Server

To set up the server, note that Jetty is essentially a servlet container: an open-source Java server that provides the operating environment for JSP and servlets, supporting web development and deployment.

In the experiment, we set up a WebSocket server (websocket-server) on Jetty to receive and process the gesture image commands.

Through this websocket-server, we ultimately realize the functionality of getting data from the control terminal and forwarding it from the server to the client (websocket-client).

In the experiment, we use the Jetty package and start the server from a terminal window: entering the command java -jar start.jar on the command line launches the server. Within the same LAN, the recognition APP and the client terminal can then communicate and transmit information.

5 Conclusion and Future Work

In this paper we develop a virtual music control system based on gesture recognition. The paper introduces in detail how the control terminal completes gesture recognition, including image preprocessing, gesture segmentation, feature extraction, trajectory tracking, gesture recognition, and instruction definition. Our virtual music control system has strong practical significance: the control terminal APP is highly portable on cellphones, the client terminal's web music player is compatible with many browsers, and the server module realizes real-time communication.

As the next step, we will research hand rotation to increase the diversity of gestures and define more instructions. We will also keep optimizing the related algorithms to improve the accuracy of gesture recognition and shorten the recognition time. What's more, we will add more features to the web player display to make it more attractive.