
1 Introduction

Patients with amyotrophic lateral sclerosis (ALS), multiple system atrophy (MSA), or muscular dystrophy find it difficult to communicate their intentions because of severe speech and motor dysfunction. Since eye movements often remain functional until the end-stage, our research project focuses on developing an eye movement-based communication support system. In this paper, we develop a system, built around a wearable camera, that can be used especially at night.

A (nurse) calling system is a tool for calling a nurse or a caregiver at a hospital or nursing care facility, or a family member at home. It is indispensable for patients when a physical abnormality occurs or when they have questions about their daily life. The patient pushes a call sensor or button to call a remote nurse or family member. However, for ALS patients who have difficulty moving their muscles, an input device must be prepared according to their residual function. The eyeSwitch [1] is an operation support switch that can be turned ON and OFF by eye movements. Through call devices, environmental control devices, and communication devices, the user can make calls, operate home appliances, and communicate. The eyeSwitch can be used at night, but it requires large eye movements and cannot detect slight ones. It also needs to be fixed near the bed with an arm, and its position must be corrected each time the patient moves.

In this paper, we introduce an image-based method that detects the pupil center with high accuracy at night using a wearable camera, for use in a calling system. We also develop a prototype calling system that can be used at night and evaluate its performance through subject experiments.

2 Related Research

This section briefly introduces nurse call systems and eye movement analysis research related to this paper.

Ongenae et al. developed an ontology-based Nurse Call System [11], which assesses the priority of a call based on the current context and assigns the most appropriate caregiver to it. Traditional push-button/flashing-lamp call systems are not integrated with other hospital automation systems. Unluturk et al. developed a system integrating Nurse Call System Software, Wireless Phone System Software, Location System Software, and a communication protocol [13]. With this system, both the nurse and the patient know that the next available nurse will be assigned if the primary nurse is not available. Klemets and Toussaint proposed a nurse call system [8] that lets nurses discern the reason behind a call, which allows them to make more accurate decisions and relieves stress. Most studies on nurse call systems thus focus on improving the whole system rather than individual devices such as switches.

Images and electromyography are available as means of analyzing eye movements. Since the latter uses contact sensors, this paper targets the former, which is non-contact. As for image-based eye movement analysis, some products that estimate the gaze point rather than the movement of the eyes have already been released. The devices used can be roughly classified into two types: non-wearable and wearable devices. The former are screen-based eye trackers that attach to a display, for example, Tobii Pro Nano [5] and Tobii Pro Fusion [3]. The latter mount a small camera on an eyeglass frame, for example, Tobii Pro Glass 3 [4] and Gazo GPE3 [2]. Research on eye movement analysis is likewise divided into two types: studies that use an existing eye tracker to analyze the gaze [10, 14] and studies that propose methods for detecting the eye or the pupil center point [6, 7, 15].

Fig. 1. Overview of the proposed system.

3 Night Time Calling System

3.1 Overview

The proposed system consists of a wearable camera, computer, relay controller, and nurse call, as shown in Fig. 1.

Our system is intended for use at night, so a standard color camera is not suitable. Visible light illumination cannot be used either, because it would disturb the user's sleep. Thus, a near-infrared LED (IR-LED) and a near-infrared camera (IR camera) are used. A wearable camera is not affected by the movement of the user's head and can always capture stable eye images. Although wearing a camera during sleep places a burden on the user, we decided to adopt it after discussing the matter with a physical therapist. The wearable camera attached to the mannequin on the right of Fig. 1 is the device used in our system.

The computer processes all eye images taken by the wearable camera. A large-scale, high-performance computer would be desirable; however, our system is assumed to be installed near the user's bed. In addition, at the facility's request, the system avoids both wired and wireless networks. For these reasons, we adopted a small computer with a GPU.

If our system sent a continuous signal directly from the computer to the nurse call, the nurse call would ring every time it received the signal. Our system therefore uses a relay controller to prevent nurse calls caused by malfunction.

3.2 Pupil Center Detection

Our system uses the CNN-based pupil center detection method proposed by Chinsatit and Saitoh [6]. The method uses two CNN models, as shown in Fig. 2. The first CNN model is used to classify the eye state, and the second is used to estimate the pupil center position.

Fig. 2. Two-part CNN model.

The architecture of the classification model is based on AlexNet [9]. The model has two output classes: closed eye and non-closed eye. Since the pupil center cannot be detected in a closed eye, unnecessary processing is skipped in that case.

The second CNN model is based on the pose regression ConvNet [12]. The output of this model is the pupil center position \((P_x, P_y)\).
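
As a concrete illustration, the following is a minimal PyTorch sketch of this two-stage inference flow; the layer configurations and the names EyeStateNet, PupilRegressor, and detect_pupil are illustrative placeholders, not the exact architectures of [9] and [12].

```python
import torch
import torch.nn as nn

class EyeStateNet(nn.Module):
    """Illustrative classifier: closed eye vs. non-closed eye (AlexNet-like role)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.classifier = nn.Linear(64 * 4 * 4, 2)  # two classes: closed / non-closed

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

class PupilRegressor(nn.Module):
    """Illustrative regressor: outputs the pupil center (Px, Py)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.regressor = nn.Linear(64 * 4 * 4, 2)  # (Px, Py)

    def forward(self, x):
        return self.regressor(self.features(x).flatten(1))

def detect_pupil(eye_img, state_net, pupil_net):
    """Two-stage pipeline on a 1x1xHxW grayscale eye-image tensor."""
    with torch.no_grad():
        if state_net(eye_img).argmax(dim=1).item() == 0:  # assume class 0 = closed eye
            return None                                   # skip regression for closed eyes
        return pupil_net(eye_img).squeeze(0).tolist()     # [Px, Py]
```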

3.3 ROI Extraction

The eye image is taken with a wearable camera. However, the background sometimes appears in the eye image. In this case, since the pupil detection accuracy may decrease, the region of interest (ROI) is first extracted instead of directly inputting the captured image to the CNN.

An ROI is extracted based on the intensity difference between two consecutive frames. Pixels whose difference value is equal to or larger than a threshold are accumulated over a fixed time, and the maximum region in the accumulated image is extracted. Next, two types of ROI are extracted from this region: one without considering the aspect ratio (named ROI1), and the other a rectangle with a fixed aspect ratio of 3:2 (named ROI2).

Our system uses a wearable camera. Therefore, pixels with no motion, such as the background, have low difference values, whereas the difference values of the eye and the surrounding skin become large due to blinking and eye movements. This makes it possible to crop the ROI around the eyes.
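
The following is a minimal OpenCV/NumPy sketch of this motion-based ROI extraction, assuming grayscale frames; the threshold values and the omission of clipping the box to the image bounds are simplifications, not the exact settings of our system.

```python
import cv2
import numpy as np

def accumulate_motion(frames, diff_thresh=15):
    """Count, per pixel, how often the inter-frame intensity difference is large."""
    acc = np.zeros(frames[0].shape, dtype=np.uint16)
    for prev, cur in zip(frames[:-1], frames[1:]):
        diff = cv2.absdiff(cur, prev)
        acc += (diff >= diff_thresh).astype(np.uint16)
    return acc

def extract_roi(acc, min_count=3, aspect=None):
    """Bounding box of the largest accumulated-motion region.
    aspect=None gives ROI1; aspect=(3, 2) gives the fixed 3:2 box (ROI2)."""
    mask = (acc >= min_count).astype(np.uint8)
    n, _, stats, _ = cv2.connectedComponentsWithStats(mask)
    if n < 2:
        return None                     # no moving region found
    largest = 1 + int(np.argmax(stats[1:, cv2.CC_STAT_AREA]))
    x, y, w, h = stats[largest, :4]
    if aspect is not None:              # ROI2: expand to the fixed aspect ratio
        aw, ah = aspect
        cx, cy = x + w / 2, y + h / 2
        if w / h < aw / ah:
            w = int(h * aw / ah)
        else:
            h = int(w * ah / aw)
        x, y = int(cx - w / 2), int(cy - h / 2)
    return x, y, w, h
```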

3.4 Calling Mechanism

In our system, the user's intention is read from the detected pupil center, and a signal is output from the computer to the relay controller for calling. The target users in this study are patients with intractable neurological diseases, but the progression of symptoms varies between individuals; for example, the direction and amount of eye movement differ. Therefore, it is desirable that the parameters can be adjusted for each user, and our system adopts a policy of manual adjustment.

In our system, when the user wants to call a person, he or she moves his/her eyes by a certain amount in the up, down, left, or right directions. In other words, four thresholds (upper, lower, left, and right) for the eye position are set in advance, and when the eye position exceeds any of the thresholds, it is determined that the user intends to call. Upon detecting this movement, the system sends a signal to the relay controller.
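
A minimal sketch of this threshold check is shown below; the function name and the numeric thresholds are hypothetical, and the per-user values are set manually as described above.

```python
def check_call(px, py, thresholds):
    """Return the crossed direction ('up', 'down', 'left', 'right') or None.
    Image coordinates are assumed, so y increases downward."""
    if py < thresholds["upper"]:
        return "up"
    if py > thresholds["lower"]:
        return "down"
    if px < thresholds["left"]:
        return "left"
    if px > thresholds["right"]:
        return "right"
    return None

# Hypothetical per-user setting for a 120x80 eye image.
thresholds = {"upper": 25, "lower": 55, "left": 35, "right": 85}
if check_call(px=90, py=40, thresholds=thresholds):
    print("call signal -> relay controller")
```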

Figure 3 shows four eye images on which the detected pupil center and the four thresholds are drawn. The green circle is the detected pupil center point, and the rectangle around the pupil represents the four thresholds. In the leftmost image, the eye faces the front and the pupil center is inside the rectangle. The second and third images from the left are examples of exceeding the right and lower thresholds, respectively; red bands on the right and lower sides of the image visually indicate the direction in which the threshold is exceeded. The rightmost image is an example with the eye closed.

Fig. 3. Eye images with thresholds.

3.5 Implementation

As described in Sect. 3.1, our system needs to operate standalone without using a network. Therefore, we constructed a CNN server with Flask, a web application framework, on the computer; the client software sends the eye image acquired from the wearable camera to the server and receives the pupil center position output by the CNN. Figure 4 shows the process flow of our system.
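
A minimal sketch of this client-server arrangement with Flask is shown below; the endpoint name, the payload format, and the run_cnn stub are illustrative assumptions rather than the actual implementation.

```python
# Server side: receives a JPEG-encoded eye image, returns the CNN output as JSON.
import numpy as np
import cv2
from flask import Flask, request, jsonify

app = Flask(__name__)

def run_cnn(img):
    """Placeholder for the two-stage CNN of Sect. 3.2 (stubbed for illustration)."""
    return "non-closed", (60.0, 40.0)

@app.route("/pupil", methods=["POST"])
def pupil():
    img = cv2.imdecode(np.frombuffer(request.data, np.uint8), cv2.IMREAD_GRAYSCALE)
    state, center = run_cnn(img)
    if state == "closed":
        return jsonify({"closed": True})
    return jsonify({"closed": False, "px": center[0], "py": center[1]})

# Client side (illustrative):
#   ok, buf = cv2.imencode(".jpg", eye_frame)
#   r = requests.post("http://127.0.0.1:5000/pupil", data=buf.tobytes())

if __name__ == "__main__":
    app.run(host="127.0.0.1", port=5000)   # standalone: bound to localhost, no external network
```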

Fig. 4. Process flow of our system.

Some users can move their eyes quickly, while others can only move them slowly. In the former case, the computer transmits a signal to the relay controller immediately after the eye position exceeds the threshold; even if a signal is not transmitted because a pupil center detection error keeps the position below the threshold, the user can simply move the eye again, which serves as a countermeasure against misdetection. In this case, the single-pulse transmission mode is used, in which one signal is transmitted each time the threshold is exceeded. In the latter case, the continuous-pulse transmission mode is adopted, in which signals are transmitted continuously while the threshold remains exceeded. The user can switch between these two modes in the settings.
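
The two transmission modes can be summarized by the following sketch, where send_pulse, the frame interval, and the boolean input stream are illustrative stand-ins for the actual relay-controller interface.

```python
import time

def send_pulse():
    """Placeholder for driving the relay controller output."""
    print("pulse -> relay controller")

def transmit(exceeded_stream, mode="single", interval=0.1):
    """mode='single': one pulse per threshold crossing (for users who move their eyes quickly).
    mode='continuous': keep pulsing while the threshold stays exceeded (for slow eye movers)."""
    was_exceeded = False
    for exceeded in exceeded_stream:            # one boolean per processed frame
        if mode == "single" and exceeded and not was_exceeded:
            send_pulse()                        # rising edge only
        elif mode == "continuous" and exceeded:
            send_pulse()
        was_exceeded = exceeded
        time.sleep(interval)                    # illustrative frame interval
```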

Our system requires manual settings such as thresholds and transmission modes, which allows it to accommodate users with various symptoms.

Figure 5 is an image captured on the computer monitor during the experiment. The upper left shows an eye image overlaid with information. The upper right shows the temporal transition of the vertical position of the pupil center, and the lower left shows that of the horizontal position; both are visualized in real time.

4 Evaluation Experiments

In this research, two experiments were conducted to evaluate the proposed system: verification of the accuracy of pupil center detection and quantitative evaluation of call success.

Fig. 5. Main windows of our system.

4.1 Pupil Center Detection

Dataset. We collected eye images from seven people, four healthy staff members and three patients, using the proposed system. Table 1 shows the subject information and the number of collected images. During collection, the subjects moved their eyes in five directions: front, up, down, left, and right. Since the number of collected images differs between subjects, as shown in Table 1, we used 400 eye images from each subject, for a total of 2,800 eye images in this experiment.

Table 1. Subject information and the number of collected images.

The size of the eye image taken by the wearable camera is \(1280 \times 720\) [pixels]. However, it was resized to \(120 \times 80\) [pixels] to reduce the processing time. To train and evaluate the CNN models, ground truth for the eye state and the pupil center must be given for each eye image; this labeling was done manually by visual inspection. Regarding the eye state, the label "Non-closed eye" was given when 50% or more of the pupil was visible, and the label "Closed eye" was given otherwise.

Experimental Conditions. This paper proposes two ROI extraction methods, ROI1 and ROI2, and both were compared in this experiment.

It is desirable to prepare a large amount of data for training the CNN models. However, we could not collect enough data in this experiment, so we introduce two approaches: data augmentation (DA) and fine-tuning (FT). For DA, we generated four images for each eye image by combining scaling, translation, rotation, and brightness correction. For FT, 1,980 images were collected from each of nine healthy persons (six males and three females) using a different wearable camera, for a total of 17,820 images; the weights of the CNN models trained on this dataset were used as the initial values for FT.
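
The DA step can be sketched as follows; the parameter ranges are illustrative assumptions, and when augmenting data for the regression model the ground-truth pupil coordinates must be transformed with the same affine matrix.

```python
import cv2
import numpy as np

def augment(img, n_variants=4, seed=0):
    """Generate augmented eye images by random scaling, translation,
    rotation, and brightness correction (illustrative parameter ranges)."""
    rng = np.random.default_rng(seed)
    h, w = img.shape[:2]
    variants = []
    for _ in range(n_variants):
        scale = rng.uniform(0.9, 1.1)
        angle = rng.uniform(-10.0, 10.0)                    # degrees
        tx, ty = rng.uniform(-5.0, 5.0, size=2)             # pixels
        m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, scale)
        m[:, 2] += (tx, ty)                                 # add the translation
        warped = cv2.warpAffine(img, m, (w, h), borderMode=cv2.BORDER_REPLICATE)
        shifted = warped.astype(np.int16) + int(rng.integers(-20, 21))
        variants.append(np.clip(shifted, 0, 255).astype(np.uint8))
    return variants
```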

Eye-state recognition and pupil center detection were performed under eight conditions combining the two ROI types with whether DA and FT were applied.

A person-independent task was conducted: the test data came from one patient, and the training data came from the remaining six subjects (the other two patients and the four healthy staff members). The experiment was conducted with the leave-one-patient-out method.
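
This protocol corresponds to the following loop (the subject identifiers are illustrative):

```python
subjects = {"P1": "patient", "P2": "patient", "P3": "patient",
            "H1": "healthy", "H2": "healthy", "H3": "healthy", "H4": "healthy"}

for test_id, role in subjects.items():
    if role != "patient":
        continue                                    # only patients are used as test data
    train_ids = [s for s in subjects if s != test_id]
    print(f"test: {test_id}, train: {train_ids}")   # train the CNNs on train_ids, evaluate on test_id
```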

Result and Discussion. Experimental results are shown in Table 2. In the table, \(E_p\) means the error between the ground truth and the detection result of the pupil center.

Regarding eye-state recognition, the recognition accuracy for non-closed eyes is higher than that for closed eyes. This is presumed to be due to the small number of closed-eye training samples. The highest recognition accuracy of 82.1% was obtained when ROI1 was used without applying DA or FT.

Regarding the pupil center detection task, the average error was at most 2.3 pixels, although it varied depending on the conditions. The minimum average error of 1.17 pixels was obtained when DA and FT were applied with ROI1. Figure 6 shows eye images in which the ground truth (green point) and the detection result (red point) are plotted. The errors of Figs. 6(a), (b), (c), and (d) were 2.63, 2.90, 8.73, and 8.82 pixels, respectively. From the plots, we judge that detection succeeded in Figs. 6(a) and (b) and failed in Figs. 6(c) and (d).

Table 2. Eye state recognition and pupil center detection results.
Fig. 6. Pupil center detection results.

4.2 Calling Experiment

Experimental Protocols. A calling experiment was conducted with five healthy subjects. The experiment took about five minutes per subject. Each subject performed the experiment while lying on a bed, as shown on the left side of Fig. 7. To avoid eye movements other than calling, the subject gazed at an image on a wall-mounted monitor so as to look at the front, as shown on the right side of Fig. 7.

To simulate a call, we prepared voice stimuli indicating one of the four directions: up, down, left, or right. During the experiment, the subject was instructed to perform the corresponding eye movement after each voice stimulus. The timing and direction of the voice stimuli were random. Twelve stimuli were given in one experiment; that is, the subject made 12 calls by eye movement.

Fig. 7. Experimental scenes of the calling experiment.

Result and Discussion. The experiment was conducted at night with the lights turned off. The brightness during the experiment, measured with an illuminometer, had a minimum of 3.92 lx, a maximum of 16.7 lx, and an average of 9.9 lx.

A correct call in response to a voice stimulus was regarded as successful. The numbers of true positives (TP), false negatives (FN), and false positives (FP) were counted over all experiments. We also calculated the precision P, recall R, and F-measure F by the following equations: \(P = TP/(TP + FP)\), \(R = TP/(TP + FN)\), \(F = 2PR/(P + R)\).
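
These measures can be computed directly from the counts; the counts in the example below are hypothetical.

```python
def prf(tp, fp, fn):
    """Precision, recall, and F-measure from the counted call outcomes."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

print(prf(tp=60, fp=12, fn=0))   # hypothetical counts -> (0.833..., 1.0, 0.909...)
```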

Fig. 8. Transition of pupil center (Subject S4).

Table 3 shows the results. The precision, recall, and F-measure over all subjects were 0.833, 1.000, and 0.909, respectively. The recall of 1.000 means that no call was missed. On the other hand, the precision was 0.833; this is because incorrect pupil positions were detected when subjects S2 and S5 closed their eyes while blinking.

Table 3. Experimental result of calling experiment.

Figure 8 is a graph showing the temporal transition of the pupil center coordinates in the experiment of subject S4. The horizontal axis is the frame number, which corresponds to time, and the vertical axis is the x or y coordinate. The red curves are the coordinates of the detected pupil center. The green and blue horizontal lines indicate the left/right or upper/lower thresholds. The pink vertical strips indicate frames in which the eyes are closed. From these graphs, it can be confirmed that the pupil position exceeded the upper or lower threshold for all 12 calls.

5 Conclusion

In this research, we developed a system that allows patients to make calls without stress at night using eye movements. Two experiments, pupil center detection and calling, were conducted to evaluate the effectiveness of the developed system. As a result, a high detection accuracy with an average error of 1.17 pixels was obtained for the pupil center. The calling experiment was conducted with healthy subjects rather than patients, and a high call success rate was obtained.

The intended users of our system are patients, but we have not yet been able to conduct a calling experiment with patient cooperation; we will work on this in the future. In pupil center detection, failures due to blinking remain, and we will also address this problem.