
1 Introduction

The past years have seen increasingly rapid advances in the development of technical assistance systems that aim at supporting elderly persons at home. This application field is referred to as ambient assisted living (AAL). Together with medical partners, we are developing an AAL system that aims at helping elderly people at an early stage of dementia to live longer in their familiar environment instead of moving to a nursing home. In order to optimise the caring process, we firstly propose to support the elderly by reminding them to perform the daily routines which they tend to forget. Secondly, meta data about the patients’ performed daily activities can be accessed through an intuitive web interface by their caregivers. By integrating this information into their patients’ individual caring plans, caregivers shall be supported during their daily work. Moreover, the individual needs of every patient can be assessed and attended to in a more appropriate fashion.

To this end, we recognise activities of daily living (ADLs) by integrating so-called smart sensors into the living environment of the elderly. These sensors are composed of a stereo camera and an internal processing unit. They are mounted on the ceiling of the living environment and monitor ADLs without releasing raw image data; only meta data, e.g. the detected activities, leave the sensor.

The meta data is logged in real time to a database, which can be accessed by the caring personnel via a web interface. In this way, caring personnel can obtain information that has been inaccessible so far because elderly people with dementia are often incapable of communicating the daily activities they have already performed. Caregivers could, as another example, better interpret a patient’s uncooperative behaviour during the morning visit when the system informs them that the patient was very active during the night and slept restlessly. This type of information allows caregivers to better understand and interpret their patients’ behaviour. As a consequence, the caring process can be adapted to the individual needs of the elderly. Another contribution of our system is that caregivers can react promptly to sudden changes in their patients’ behaviour and directly adapt the necessary care to the actual circumstances. Elderly people at an early stage of dementia can benefit from the individualised caring plan. At the same time, they can be supported in maintaining their daily routines with the help of reminding messages provided by the system.

The first part of the presented study focusses on the high-level recognition of hygiene-related activities in a bathroom using pose and proximity information. This choice of room was mainly motivated by medical reasons: the frequency of toileting, for example, provides relevant information for diagnosing and treating incontinence. The second part presents the detection of hygiene-related activities using a machine learning based approach that evaluates skeleton information, i.e. 3-D joint positions.

The paper is structured as follows: we present related work in Sect. 2, focussing on different approaches for ADL recognition. Section 3 describes the modules related to the high-level reasoning based activity recognition. The employed person detection algorithm and the pose estimation algorithm are briefly summarised in Sects. 3.1 and 3.2, respectively. In Sect. 3.3, the high-level reasoning evaluating proximity to objects and pose information is described in detail. Section 3.4 presents the analysis method for the reasoning algorithm, and the results are presented and discussed in Sect. 3.5. Section 4 explains and evaluates the skeleton-based algorithm, including the aspects of sequence duration and frame skipping, the description of the feature vectors as well as the obtained results accompanied by a discussion. Section 5 draws conclusions about the accuracy of both the high-level reasoning and the skeleton-based algorithm. In addition, an outlook on further developments is given.

2 Related Work

A number of studies have employed different types of sensors, such as motion sensors [15] or body-worn sensors [11], to analyse daily activities or to detect emergencies. Pirsiavash and Ramanan [6] reported that ADLs can be detected by processing the first-person view acquired by a wearable camera. Several previous studies have attempted to monitor home activities, including bathroom-related activities, using acoustics. Fogarty et al. [4] installed low-cost sensors in the water distribution structure of a home to measure water usage patterns and deduce activities performed especially in the kitchen and in the bathroom. The sensors, attached to the outside of the water pipes in the basement, consisted of microphones providing audio signals indicating, for example, that the toilet was flushed, the shower was used or the sink was active. The evaluation of these audio signals allowed the recognition of activities connected to water consumption. Chen et al. [3] focussed on activity monitoring in bathrooms using omni-directional microphones. The recognised activities included washing hands, teeth brushing, flushing the toilet and urination.

In our project, we decided to apply non-wearable optical sensors: wide-angle stereo cameras designed to be mounted on the ceiling of a room. We came to this decision because people with dementia are apt to remove wearable sensors and often forget to put them on again. Moreover, compared to motion sensors or acoustic data, image data can provide information with a higher level of detail.

The room a person stays in represents a meaningful indicator for ADLs [10, 15]. Richter et al. [10] gave evidence that several ADLs can be detected based on the chronological order of the rooms a person entered. To this end, they introduced a person detection algorithm that derived the person’s position in a room using a stereo camera and then assigned the corresponding room the person stayed in. If the person was localised in the sleeping room in the morning and afterwards in the bathroom for several minutes, they deduced that the person had attended to personal hygiene. In a similar way, other activities, such as sleeping behaviour or preparing food, can be detected.

However, for a more detailed prediction of daily activities, the evaluation of the room alone is insufficient. For this reason, Richter et al. [8] proposed a concept that deduces ADLs by analysing the proximity of a person to certain objects in the room. Moreover, pose information obtained by a machine learning based approach [9] was included. This work focussed on three objects in the bathroom: the shower, the sink and the toilet. In order to deduce whether a person is close to an object, the person has to be localised within the room he or she currently stays in. For this purpose, a person detection algorithm has to be applied. Several studies have introduced person detection algorithms that work on image data obtained by stereo sensors, such as Harville and Li [5], Yous et al. [19] and Richter et al. [10]. The studies of Harville and Li and of Yous et al. have revealed shortcomings of their stereo vision-based algorithms. Harville and Li [5] determined a point-wise mean positional error of 160 mm with respect to reference data. Yous et al. [19] evaluated their algorithm empirically by marking detected persons with a cuboid; moreover, false positive and false negative rates were determined. However, there has been little quantitative analysis of spatial accuracy in the mentioned studies. When utilising a person detection algorithm, it is essential to know how reliable the obtained person’s position is. If the person is detected close to the toilet, for example, it is necessary to know how reliable this information is. In this way, uncertainties can be reduced and misinterpretations avoided.

In order to reason about activities performed in a bathroom, Richter et al. [8] applied the person detection algorithm described in [10]. The authors aimed at refining the room assignment to an assignment of objects the person is probably occupied with in a certain room. The accuracy analysis of this person detection algorithm showed that persons can be localised very accurately: during the evaluation, a mean error ranging from 74 mm to 87 mm was determined [8]. Thus, the obtained position can be relied on even if the specific objects are very close to each other. On the basis of this finding, they designed a high-level reasoning algorithm that recognises bathroom-related activities.

In addition to the high-level reasoning algorithm described in [8], we introduce a skeleton-based algorithm that recognises actions related to the bathroom and that forms the basis for further reasoning about bathroom-related activities. The presented skeleton-based activity recognition was inspired by the work of Raptis and Sigal [7] and Beaudry et al. [1]. Raptis and Sigal [7] demonstrated that actions can be accurately classified by considering only local discriminative key frames of a sequence. Similar findings were reported in earlier publications, such as the works of Carlsson and Sullivan [2] and of Schindler and Van Gool [12]. Beaudry et al. [1] transformed trajectories of relevant points from the optical flow into the frequency domain. In their experiments on the KTH dataset [13], they showed that their classification rates are comparable to the results of the best state-of-the-art algorithms.

Since the launch of the Kinect, researchers have been able to access human skeleton data easily. Existing approaches, such as the work of Beaudry et al. [1], can now be adapted to skeleton joint coordinates instead of detected relevant points. Recent work has already highlighted that the integration of joint data leads to improved classification results. Yao et al. [17] compared the performance of low-level appearance features with pose-based features derived from joint coordinates of successive frames. They demonstrated that their pose-based features outperform appearance-based features under the same circumstances, i.e. the same classifier and the same dataset. Wang et al. [16] calculated potential and kinetic energy features of skeletal joints using key frames. For more details regarding recent work in this field, the reader is referred to the survey of Ye et al. [18].

Our investigation focusses on the evaluation of different feature vectors, e.g. features obtained by a Fourier transformation of skeleton joint trajectories. The impact of different sequence durations as well as the utilisation of key frames is a further part of this study. Moreover, the influence of changes in the frequency resolution was investigated.

3 Activity Recognition Based on Proximity to Objects and Pose Information

In this section, we present the approach introduced by Richter et al. [8] by explaining the algorithms that allow the detection of bathroom-related activities. This approach is the basis for our further developments.

3.1 Person Localisation

The person detection algorithm applied in the study of Richter et al. locates a person in a 3-D point cloud. This point cloud is derived from an image pair of a stereo sensor, which is equipped with high-quality wide field of view lenses with a focal length of 3.5 mm; the imager has a resolution of \(1360\times 1024\) pixels. The coordinates are calculated with respect to a world coordinate system whose x-y plane is aligned with the floor of the flat. The z axis represents the height above the floor and is not relevant for the localisation step. After a foreground-background segmentation [21], all points belonging to the foreground are projected onto the x-y plane. In a subsequent step, blobs are detected in the resulting projection image. The centres of blobs exceeding a certain size are regarded as the centres of detected persons and are denoted as \(p_{\mathrm {s}} = \left( x_{\mathrm {s}}, y_{\mathrm {s}}\right) \). This centre \(p_{\mathrm {s}}\) is used for proximity determination in a later processing step. For better legibility, we dispense with vector signs in this paper.
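To make the projection step concrete, the following minimal Python sketch illustrates how person centres \(p_{\mathrm {s}}\) could be obtained from a segmented foreground point cloud. It is an illustration of the idea, not the authors’ implementation; the grid resolution and the blob-size threshold are assumed values.

```python
import numpy as np
from scipy import ndimage

def localise_persons(foreground_points, cell_size=0.05, min_blob_cells=40):
    """Project foreground points onto the floor (x-y) plane, build an
    occupancy grid and return the centres of sufficiently large blobs
    as person positions p_s. Parameters are assumptions, not the
    values of the original system."""
    xy = foreground_points[:, :2]                     # drop the z component
    origin = xy.min(axis=0)
    cells = ((xy - origin) / cell_size).astype(int)   # grid cell per point
    grid = np.zeros(cells.max(axis=0) + 1, dtype=float)
    np.add.at(grid, (cells[:, 0], cells[:, 1]), 1.0)  # projection image
    labels, num_blobs = ndimage.label(grid > 0)       # connected blobs
    centres = []
    for blob in range(1, num_blobs + 1):
        if (labels == blob).sum() >= min_blob_cells:  # blobs of a certain size
            cx, cy = ndimage.center_of_mass(grid, labels, blob)
            centres.append(origin + np.array([cx, cy]) * cell_size)
    return centres                                    # one p_s per person
```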

3.2 Pose Information

In order to derive information about the general pose for the high-level reasoning in [8], Richter et al. apply an algorithm that trains a linear classifier, i.e. a support vector machine [9]. The feature vector is a histogram that represents the distribution of the points belonging to a person’s surface according to their z components. After training, the classifier can distinguish between standing, sitting and lying; in the work of Richter et al. [8], only sitting and standing are relevant.
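A minimal sketch of this classification step is given below, assuming a normalised height histogram as the feature and scikit-learn’s linear SVM as the classifier; the bin count and the height range are assumptions, not values from [9].

```python
import numpy as np
from sklearn.svm import SVC

def z_histogram(points, n_bins=20, max_height=2.2):
    """Normalised histogram of the points' z components (height above
    the floor). Bin count and height range are assumed values."""
    hist, _ = np.histogram(points[:, 2], bins=n_bins, range=(0.0, max_height))
    return hist / max(hist.sum(), 1)

# training on labelled point clouds (labels: "standing", "sitting", "lying"):
# X = np.vstack([z_histogram(p) for p in training_clouds])
# clf = SVC(kernel="linear").fit(X, y)
# pose = clf.predict(z_histogram(current_cloud).reshape(1, -1))[0]
```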

3.3 High-Level Reasoning Using Position and Pose Information

The high-level reasoning algorithm [8] is able to detect the activities “showering”, “using the toilet” and activities normally performed in front of a sink, such as “washing hands, combing and teeth brushing”. An overview of this algorithm is presented in Fig. 1. The algorithm unites the person’s and the objects’ position data as well as information about the person’s general pose. First of all, the algorithm determines whether a person is close to a certain object. In their study, Richter et al. used the three objects “shower”, “toilet” and “sink”. For a comparison of the respective position data, stereo sensors are distributed in a test flat and extrinsically calibrated so that they share the same world coordinate system. All determined positions are specified with respect to the origin of this coordinate system. The N objects’ positions, i.e. their centres and their expansions, are stored in a look-up table (LUT). As a result, there are N entries in the LUT, where the expansions are used as thresholds \(thresh_{n}\) for determining whether a person is close to a certain object.

Fig. 1. Overview of the high-level reasoning algorithm that evaluates a person’s proximity to objects and pose information.

In order to decide whether a person is close to an object, the distance \(dist_{\mathrm {n}}\) between the person’s position \(p_{\mathrm {s}} =\left( x_{\mathrm {s}}, y_{\mathrm {s}}\right) \) and each object’s position \(p_{\mathrm {o,n}} = \left( x_{\mathrm {o,n}}, y_{\mathrm {o,n}}\right) \) is compared with the corresponding expansion value \(thresh_{\mathrm {n}}\) in the LUT, where \( n \in \{1, 2, ..., N\}\). Here, N denotes the number of objects and n the index of a specific object. A person is considered to be close to an object if the distance between their centres is smaller than the threshold value stored in the LUT. In this case, the boolean variable proximity is 1, corresponding to close; otherwise it is 0, corresponding to far. If there are N objects in the room, the following equations are applied N times to check the proximity criterion.

$$\begin{aligned} dist_{\mathrm {n}} = \left\| p_{\mathrm {o,n}} - p_{\mathrm {s}} \right\| , \, n \in \{1, ..., N\} \, , \end{aligned}$$
(1)
$$\begin{aligned} proximity= {\left\{ \begin{array}{ll} 1,&{} \text {if } dist_{\mathrm {n}} < thresh_{\mathrm {n}}\\ 0, &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$
(2)

In this context, it can be stated that the smaller the room and the closer the objects are to each other, the more challenging the described assignment will be.

In a second step, the integration of pose information, i.e. standing and sitting, allows conclusions to be drawn about the following ADLs:

  • Activities that are performed when standing in front of a sink, such as washing hands, combing and teeth brushing, if the person is close to the sink and standing.

  • Using the toilet if the person is close to the toilet and sitting.

  • Taking a shower if the person is close to the shower and standing.

  • Other activities if none of the above mentioned activities are detected.

The listed activities are written to a first-in first-out (FIFO) buffer, and only after a certain number of consistent detections within the FIFO is the activity forwarded as a valid system output. If none of the previously mentioned scenarios occurs, we assume that the person is performing another action; the whole reasoning chain is sketched below.
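The following minimal sketch illustrates this reasoning chain, i.e. the proximity test of Eqs. (1) and (2) against a LUT, the pose rules listed above and the FIFO smoothing. The object coordinates, thresholds and window length are assumed values, not those of the original system.

```python
from collections import deque
import numpy as np

# Hypothetical LUT: object centre (x, y) in world coordinates and the
# expansion used as proximity threshold; all values are assumptions.
LUT = {
    "sink":   {"centre": np.array([1.10, 0.40]), "thresh": 0.50},
    "toilet": {"centre": np.array([1.90, 0.35]), "thresh": 0.45},
    "shower": {"centre": np.array([0.40, 1.80]), "thresh": 0.60},
}

fifo = deque(maxlen=15)  # low-pass window; the length is an assumption

def reason(p_s, pose):
    """One reasoning step: Eqs. (1)-(2), pose rules, FIFO filtering."""
    activity = "other"
    for name, obj in LUT.items():
        if np.linalg.norm(obj["centre"] - p_s) < obj["thresh"]:  # Eqs. (1), (2)
            if name == "sink" and pose == "standing":
                activity = "sink activities"
            elif name == "toilet" and pose == "sitting":
                activity = "using the toilet"
            elif name == "shower" and pose == "standing":
                activity = "showering"
    fifo.append(activity)
    # forward the activity only once it dominates a full FIFO window
    if len(fifo) == fifo.maxlen and fifo.count(activity) > len(fifo) // 2:
        return activity
    return None  # no valid system output yet
```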

Fig. 2. A scenario in the bathroom of a testing flat, where a person is attending to her personal hygiene.

Figure 2 illustrates a scenario in the bathroom of a testing flat, where a person is attending to his or her personal hygiene. In this point cloud, the person is detected to be very close to the toilet. As the person is also determined to be sitting, we conclude that the person is probably using the toilet. This view was generated during experiments. As high attention is devoted to privacy aspects, this view is not available outside of the smart sensor in the final implementation.

3.4 Analysis Method

Richter et al. analysed their algorithm by recording video sequences with three persons in the bathroom of their testing flat. The probands performed a typical morning routine: after entering the bathroom, they attended to their personal hygiene, including activities such as using the toilet, washing hands and showering. Simultaneously, the system determined the performed ADLs using the high-level reasoning algorithm. Both the recorded frames and the output of the algorithm were accompanied by timestamps and saved together. Each frame of the recorded sequence was labelled with the actual ADL afterwards. In a final step, the actual ADL as well as the system output were plotted over time to allow a comparison. In this way, it is also possible to analyse the system latency.
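A comparison plot of this kind could be produced as in the following sketch. The activity names and their integer encoding are assumptions, and time is drawn on the abscissa here, whereas the figures below arrange the axes the other way around.

```python
import matplotlib.pyplot as plt

ACTIVITIES = ["other", "sink activities", "using the toilet", "showering"]

def plot_comparison(t, labelled, detected):
    """t: timestamps in seconds; labelled/detected: activity indices."""
    plt.step(t, labelled, where="post", linewidth=6, alpha=0.4, label="labelled")
    plt.step(t, detected, where="post", linewidth=1.5, label="detected")
    plt.yticks(range(len(ACTIVITIES)), ACTIVITIES)
    plt.xlabel("time in s")
    plt.legend()
    plt.show()
```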

3.5 Results and Discussion

Figures 3, 4 and 5 present the results for each sequence recorded in the bathroom of the testing flat. The filled bars represent the real, i.e. the manually labelled, activity, whereas the activity detected by the high-level reasoning algorithm is marked with a line. The ordinate shows the time in seconds, while the abscissa shows the bathroom activities.

The reasoning algorithm shows results of high quality even for the small bathroom in the testing flat. From the charts, it can be seen that only a few minor false detections occurred, while all performed activities were correctly detected. Additionally, a comparatively small delay can be observed, which is due to the Kalman filter applied in the person detection algorithm as well as the low-pass behaviour induced by the FIFO of the reasoning algorithm. In view of the present application, however, these delays do not affect the system functionality. In summary, the high-level reasoning algorithm reliably detected most of the activities performed in the recordings.

Fig. 3. Comparison between real and determined activity for subject 1.

Fig. 4. Comparison between real and determined activity for subject 2.

Fig. 5. Comparison between real and determined activity for subject 3.

Thus far, we have presented a reasoning algorithm based on proximity to objects and pose information to predict typical bathroom activities. We expect that the availability of more detailed information about patients’ activities will contribute to a better reasoning about and understanding of the patients’ behaviour. Therefore, we further enhanced our reasoning by evaluating human skeleton joints.

4 System Enhancement with Skeleton-Based Activity Recognition

In this section, we present a Fourier transformation based method where we evaluate human skeleton joints to enhance the presented system.

The Kinect device was chosen for this study because it provides skeleton joint information. The device obtains skeleton joints by processing depth data with the algorithm introduced by Shotton et al. [14]. In this study, these data form the basis for a proof of concept using Kinect skeleton sequences for off-line classification. This study therefore provides information about the classification accuracy of the designed off-line activity recognition approach.

The presented approach focusses on recognising actions performed in a bathroom by applying a Fourier transformation to skeleton joints, using only samples of a sequence instead of all captured frames.

4.1 Actions

In the presented approach, the aim is to classify the following actions, labelled A1 to A6, which contribute to ADL recognition in a bathroom scenario:

  • A1: Moving a hand to the mouth

  • A2: Teeth brushing

  • A3: Walking

  • A4: Standing up

  • A5: Sitting down

  • A6: Idle

Moving a hand to the mouth can indicate that the person is rinsing the mouth with water after teeth brushing or is washing the face. The action teeth brushing is recognised when a person moves the hand repetitively in front of the mouth, as is characteristic of teeth brushing. Both sitting down and standing up can be a strong clue that the toilet has been used. Walking can be considered as moving from one place in the bathroom to another, whereas the person is considered to be idle when none of the other actions is performed.

4.2 Sequence Duration and Frame Skipping

For this study, we recorded sequences containing skeleton data using the Kinect sensor. Each sequence shows a person performing one of the previously defined actions. The final dataset is a combination of our own recordings and the dataset provided by Yu et al. [20]. Each action was performed twice by 28 different persons, except action A1, which was performed twice by 56 persons. This results in 56 examples per action, of which 26 were used for training and the remaining 30 for testing. For action A1, 52 examples were used for training and 60 examples for testing.

The sequences were recorded at 30 fps with a total number \(num_{\mathrm {BF}}\) of 256, 128 or 64 frames per sequence. This corresponds to sequence durations \(d_{\mathrm {m}}\) of approximately 8.53 s (\(d_1\)), 4.27 s (\(d_2\)) and 2.13 s (\(d_3\)), respectively; see Eq. 3. These different settings result from the following observation: when we measured the number of frames that the different actions actually consumed, we realised that actions such as sitting down and standing up were performed in half the time we had originally set with 256 frames. Therefore, we investigated the influence of decreasing the duration of the recordings from 8.53 s to 4.27 s (128 frames) and to 2.13 s (64 frames).

$$\begin{aligned} d_{\mathrm {m}} = \frac{num_{\mathrm {BF}}}{30\,\mathrm {fps}}, \quad num_{\mathrm {BF}} \in \{ 256, 128, 64 \,\mathrm {frames}\}, \quad m \in \{ 1,2,3\} \; . \end{aligned}$$
(3)

Instead of finding key frames in the style of Raptis and Sigal [7], we implemented key frame generation by frame skipping in order to reduce the number of frames for classification according to Eq. 4. Consider a sequence of 256 frames, i.e. 8.53 s. In the first case, we keep all frames (\(frame_{\mathrm {keep}} = 1\)), which results in 256 key frames. In the second case, when we keep every third frame (\(frame_{\mathrm {keep}} = 3\)), we obtain 85 key frames, which corresponds to a frame rate of 10 fps. When we skip five frames between two successive key frames, i.e. we keep every sixth frame (\(frame_{\mathrm {keep}} = 6\)), we obtain 42 key frames, which corresponds to a frame rate of 5 fps. For the durations of 4.27 s and 2.13 s, we performed the frame skipping in the same way as described above. This results in 128, 42 and 21 key frames for a duration of 4.27 s, and in 64, 21 and 10 key frames for a duration of 2.13 s.

$$\begin{aligned} num_{\mathrm {KF}} = \left\lfloor \frac{num_{\mathrm {BF}}}{frame_{\mathrm {keep}}}\right\rfloor \; . \end{aligned}$$
(4)
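A direct translation of Eq. (4) into Python slicing could look as follows; starting at index \(frame_{\mathrm {keep}}-1\) yields exactly \(\lfloor num_{\mathrm {BF}}/frame_{\mathrm {keep}}\rfloor \) key frames.

```python
def key_frames(sequence, frame_keep):
    """Keep every frame_keep-th frame of a sequence, cf. Eq. (4)."""
    return sequence[frame_keep - 1::frame_keep]

# 256 frames at 30 fps (8.53 s) with frame_keep = 6:
# len(key_frames(list(range(256)), 6)) == 42 key frames, i.e. 5 fps
```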

4.3 Feature Vectors

In this study, we investigated the performance of different feature vectors that serve as input to a linear one-versus-one multi-class support vector machine. The calculations include all joints provided by the Kinect software development kit except the feet and ankle joints. In this paper, the joints are numbered with indices k ranging from 1 to K, where \(k=1\) denotes the hip centre \(hip\_center\) and K is the overall number of joints.

In the following, the feature vectors are presented with their denotation and the corresponding explanations.

xyz : For this feature vector, the Cartesian joint coordinates w.r.t. the sensor are transformed and then concatenated into a feature vector. For the \(hip\_center\), the following transformation is applied:

$$\begin{aligned} hip\_center\_trans_{\mathrm {t}} = hip\_center_{\mathrm {t}}-hip\_center_{1}, \quad t \in \{1, ..., N\} \; . \end{aligned}$$
(5)

For all N frames of a recorded sequence, the hip centre position of the first frame \(hip\_center_{1}\) is subtracted from the hip centre positions \(hip\_center_{\mathrm {t}}\) of all following frames. The idea is to detect translations of the whole skeleton when the person walks, as well as when the person sits down or stands up. For the remaining joints \(joint_{\mathrm {k}}, \, k \in \{2, ..., K\}\), the following equation holds:

$$\begin{aligned} joint\_trans_{\mathrm {t, k}} = joint_{\mathrm {t, k}}-shoulder\_center_{\mathrm {t}}, \quad t \in \{1, ..., N\}, \; k \in \{2, ..., K\} \; . \end{aligned}$$
(6)

For every frame, the joints \(joint_{\mathrm {t, k}}\) are thereby translated into a coordinate system with the \(shoulder\_center\) as origin. The main reason is to detect movements relative to the body, such as the moving hand joint while brushing teeth. The feature vector is constructed as follows: the x components of the transformed hip coordinates \(hip\_center\_trans_{\mathrm {t}}\) are arranged frame by frame. In the same way, the y and z components of \(hip\_center\_trans_{\mathrm {t}}\) are listed and then appended to the list of x components. In the same manner, the time series of the x, y and z components of the other joints are created using \(joint\_trans_{\mathrm {t, k}}\). Afterwards, these lists are appended joint by joint.
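Assuming the joints of a sequence are stored as an array of shape (number of frames, K, 3) with joint index 0 as the hip centre and index 1 as the shoulder centre (this index mapping is an assumption), the construction of the xyz feature vector could be sketched as follows.

```python
import numpy as np

def xyz_feature(joints, shoulder=1):
    """joints: array of shape (T, K, 3). Returns the concatenated
    time series of Eqs. (5) and (6), joint by joint."""
    hip = joints[:, 0, :] - joints[0, 0, :]             # Eq. (5)
    rest = joints[:, 1:, :] - joints[:, [shoulder], :]  # Eq. (6)
    parts = [hip.T.ravel()]                             # x-, then y-, then z-series
    for k in range(rest.shape[1]):
        parts.append(rest[:, k, :].T.ravel())           # per-joint time series
    return np.concatenate(parts)
```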

\(\varvec{\mathcal {F}({xyz})}\) : Each coordinate of the transformed Cartesian joint coordinates of the hip \(hip\_center\_trans_{\mathrm {t}}\) and of the remaining joints \(joint\_trans_{\mathrm {t, k}}, \, k \in \{2, ..., K\}\) can be treated as a one-dimensional time-varying signal. We applied the fast Fourier transform (FFT) to each of these signals. The amplitude spectra of the three coordinates are assembled per joint. The feature vector is formed by concatenating these spectra joint by joint.

xyz, \(\varvec{\mathcal {F}}\)(xyz): The feature vectors xyz and \(\mathcal {F}(xyz)\) are assembled into one feature vector by concatenating them.

\(\varvec{\rho \theta \phi }\) : This feature vector has the same structure as the feature vector xyz. However, the Cartesian coordinates are transformed to spherical coordinates.

\(\varvec{\mathcal {F}(\rho \theta \phi )}\) : Analogous to the feature vector \(\mathcal {F}(xyz)\), we applied the FFT to the individual spherical coordinates and assembled the amplitude spectra joint by joint.

\(\varvec{\rho \theta \phi ,\mathcal {F}(\rho \theta \phi )}\) : The feature vectors \(\rho \theta \phi \) and \(\mathcal {F}(\rho \theta \phi )\) are assembled into one feature vector by concatenating them.
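Under the same assumptions as above, the Fourier-based variants and the classification can be sketched as follows. The spherical-coordinate convention is an assumption, as is the choice of applying the translation before the conversion; scikit-learn’s SVC with a linear kernel decomposes multi-class problems one-versus-one internally, matching the classifier described above, but this is a sketch rather than the authors’ implementation.

```python
import numpy as np
from sklearn.svm import SVC

def to_spherical(xyz):
    """Cartesian -> (rho, theta, phi); the convention is an assumption."""
    x, y, z = xyz[..., 0], xyz[..., 1], xyz[..., 2]
    rho = np.sqrt(x**2 + y**2 + z**2)
    ratio = np.divide(z, rho, out=np.zeros_like(z), where=rho > 0)
    theta = np.arccos(np.clip(ratio, -1.0, 1.0))
    phi = np.arctan2(y, x)
    return np.stack([rho, theta, phi], axis=-1)

def fft_feature(time_feature, num_kf):
    """Amplitude spectra of each per-coordinate signal, kept in the
    same order as the time-domain feature vector."""
    signals = time_feature.reshape(-1, num_kf)   # one row per coordinate signal
    return np.abs(np.fft.fft(signals, axis=1)).ravel()

# e.g. the combined feature rho-theta-phi, F(rho-theta-phi), reusing the
# xyz_feature sketch above on converted coordinates:
# sph = xyz_feature(to_spherical(joints))
# feat = np.concatenate([sph, fft_feature(sph, num_kf)])
# clf = SVC(kernel="linear")      # linear one-versus-one multi-class SVM
# clf.fit(train_features, train_labels); clf.predict(test_features)
```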

4.4 Results and Discussion

From Table 1, the following relationships can be deduced: the highest classification rates were achieved with feature vectors that combine Cartesian or spherical coordinates, respectively, with their corresponding Fourier transformation, i.e. \(xyz,\mathcal {F}(xyz)\) and \(\rho \theta \phi ,\mathcal {F}(\rho \theta \phi )\). The best classification rate is reached using the feature \(\rho \theta \phi ,\mathcal {F}(\rho \theta \phi )\) with a duration of 8.53 s and 42 KF, which corresponds to a sampling rate of 5 fps.

Table 1. Overall classification rates in percent using defined features for different numbers of key frames (KF) and different frame rates. The three highest classification rates are highlighted within each of the three blocks. A block represents the three columns that belong to one duration.
Table 2. Overall classification rates in percent using zero padding for the durations 4.27 s and 2.13 s. By doing this, the same frequency resolution is achieved for the different durations.

With regard to Cartesian and spherical coordinates, we can deduce that neither of the two representations shows a significant improvement over the other.

The number of key frames does not show a strong effect on the classification rates. However, we can state that the classification results do not decrease with a smaller number of frames; in some cases, the classification rates even improved slightly. Consequently, in order to save processing power, the algorithm should preferably be used with only a small number of key frames.

Moreover, it is obvious that the classification rates decrease with a smaller duration of the captured sequence (8.53 s to 2.13 s) for the same number of key frames. This could be due to two reasons: firstly, with a longer duration, more information about the activity can be obtained, whereas for shorter durations, less information is available for processing. Secondly, with shorter durations and for the set number of key frames and sampling rate, the frequency resolution decreases, i.e. the distance between coefficients in the spectrum is larger. This can cause a loss of information about the frequency behaviour of the activity. In order to explore these aspects, we padded the sequences with durations of 4.27 s and 2.13 s with zeros so that a length of 256 frames is reached, and performed the Fourier transformation again. In this way, we achieve the same frequency resolution for the different sequence durations. The results in Table 2 reveal that, with the same frequency resolution for all three duration scenarios, the classification rates do not change substantially: several classification rates decrease whereas others increase slightly. This indicates that the decrease of classification rates for shorter durations shown in Table 1 is not caused by the lower frequency resolution, but rather by the smaller extract of the recorded action.
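For reference, this zero padding only needs the length argument of numpy’s FFT, which pads the signal with zeros up to n samples; a minimal sketch with a synthetic signal:

```python
import numpy as np

signal = np.random.rand(64)                   # e.g. a 2.13 s coordinate signal
spectrum = np.abs(np.fft.fft(signal, n=256))  # zero-padded to 256 samples
# the spectrum now has the same frequency resolution as an 8.53 s sequence
```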

For the best feature vector, i.e. \(\rho \theta \phi ,\mathcal {F}(\rho \theta \phi )\), we calculated the confusion matrix, see Table 3. The evaluation confirms that all actions could be classified accurately.

Table 3. Confusion matrix for feature vector \(\rho \theta \phi ,\mathcal {F}(\rho \theta \phi )\) with 256 frames (8.53 s) and 42 key frames. The overall classification rate is 96.66%.

5 Conclusions and Future Work

The main goal of this study was to investigate an activity recognition approach with the aim of refining ADL recognition for activities typically performed in a bathroom. We therefore presented an existing high-level reasoning algorithm that determines ADLs based on proximity to objects and pose information. In addition, we introduced a skeleton data-based action recognition algorithm that, once integrated into the existing system, can enhance the performance of this AAL system. In contrast to the high-level reasoning algorithm, this new approach employs machine learning techniques.

The evaluation of the reasoning algorithm gave evidence that activities normally performed in front of a sink, such as “washing hands, combing, teeth brushing, etc.”, “showering” and “using the toilet” could be accurately detected. Since these tests had been conducted in a small room, it is likely that the algorithm will show good results in larger rooms as well. This aspect will be evaluated in future tests. Further research will enhance the algorithm by including more objects and other rooms with the aim of recognising further ADLs, such as “preparing food”, “washing up” or “cooking”. Besides detecting the proximity to locally fixed objects, an extension to moving objects by using object detection algorithms is sensible as well. Further work needs to be done to integrate the presented skeleton-based action recognition into the current high-level reasoning system. This implies, inter alia, converting the off-line classification of pre-recorded skeleton sequences with known length into an on-line version. This study showed that basic actions, such as “teeth brushing” or “raising an object to the mouth”, can be reliably recognised using spherical coordinates of skeleton joints and their Fourier transformation as a feature vector. Thus, we added new activities that can be recognised by our AAL system. Future work will focus on the evaluation of the order of performed actions in order to draw further conclusions about bathroom activities. For example, sitting down and standing up could be used to better judge the likelihood that the person used the toilet.

In this study, we used the algorithm of Shotton et al. [14] in combination with the Kinect to obtain skeletal data. For our AAL application, however, we plan to adapt this algorithm to work with data derived from stereo sensors, so that it can be integrated into our existing stereo sensor system.

In addition to the above mentioned work, we intend to install the designed system in real living environments. To achieve this, we will continue working together with local housing associations and our partners from care facilities. Ensuring a high quality of care for people with dementia should be a priority for our society. In view of the lack of caring personnel and the increasing number of elderly people, technical support systems could contribute to this by reminding patients and by providing care-related information to the caring staff. However, although these modern developments can be beneficial for all involved persons, we should not forget that such technologies can and shall never replace human closeness and care.