
Passive and Context-Aware In-Home Vital Signs Monitoring Using Co-Located UWB-Depth Sensor Fusion

Published: 03 November 2022

Abstract

Basic vital signs such as heart and respiratory rates (HR and RR) are essential bio-indicators. Their longitudinal in-home collection enables prediction and detection of disease onset and change, providing for earlier health intervention. In this article, we propose a robust, non-touch vital signs monitoring system using a pair of co-located Ultra-Wide Band (UWB) and depth sensors. By extensive manual examination, we identify four typical temporal and spectral signal patterns and their suitable vital sign estimators. We devise a probabilistic weighted framework (PWF) that quantifies evidence of these patterns to update the weighted combination of estimator output to track the vital signs robustly. We also design a “heatmap”-based signal quality detector to exclude the disturbed signal from inadvertent motions. To monitor multiple co-habiting subjects in-home, we build a two-branch long short-term memory (LSTM) neural network to distinguish between individuals and their activities, providing activity context crucial to disambiguating critical from normal vital sign variability. To achieve reliable context annotation, we carefully devise the feature set of the consecutive skeletal poses from the depth data, and develop a probabilistic tracking model to tackle non-line-of-sight (NLOS) cases. Our experimental results demonstrate the robustness and superior performance of the individual modules as well as the end-to-end system for passive and context-aware vital sign monitoring.

1 Introduction

Basic vital signs, including respiration and heart rates, are predictors for assessing overall changes in health status [11] and myriad medical issues, including respiratory, cardiac, and sleep conditions [45, 51].
Continuous vital sign data collected in individuals’ home environments can be analyzed to monitor disease onset/progression/resolution and the impact of new or changed medications. Such in-home assessment can have tremendous benefits for anyone living with a chronic health condition, especially for older adults who face multiple chronic diseases and health conditions.
Longitudinal in-home monitoring requires low-cost, robust, and passive sensing. Traditional hospital tests, such as electrocardiograms (EKGs), are expensive, are not designed for continuous in-home data collection, and require well-trained medical personnel to set up equipment and monitor the output. Despite their popularity, wearables (e.g., Apple Watch, Fitbit) have inconvenient and restraining daily maintenance overheads (e.g., charge, wear), especially difficult among physically and cognitively challenged older adults.
Recent radio-based passive-sensing solutions [29] exemplified by Wi-Fi [45, 58, 79], frequency-modulated continuous wave (FMCW) [5], and ultra-wide band (UWB) [53, 64] hold promise for longitudinal in-home monitoring. Temporal and spectral methods extracting vital signs against multipath [79], cluttered environments [39] are proposed.
However, robustness against harmonics and intermodulation has not received sufficient attention. Because neither heartbeat nor respiratory signals are purely sinusoidal and the phase of RF signals for vital sign extraction exhibits non-linearity, high-order harmonics and intermodulations (i.e., linear combinations of heart and respiration rates) exist and frequently carry energy stronger than the fundamental heart rate. They produce high spectral peaks within the normal heart rate range, for example, 50–150 beats per minute (bpm). Thus, spectral methods that simply pick the highest peaks to identify the heartbeat signal easily fail (\(1/3\) of the time in our experiments). Making matters worse, we observe that their frequencies and magnitudes are time-varying, and the pattern keeps changing over time, defying simple predictions. These issues are not described or tackled in recent radio sensing work. The electrical engineering community has conducted some related studies [21, 64]; however, in-depth validation and conclusive comparison are still lacking.
Similarly, robustness against signal corruption remains elusive. Due to inevitable significant body motion, the signal might be corrupted beyond recognition by even well-trained humans. Such signals must be detected and excluded to avoid producing erroneous results. Existing methods [5, 29] rely on spectral energy or temporal waveform assumptions that are susceptible to dynamic changes; thus, they are not reliable enough.
Longitudinal in-home monitoring also needs to address two other issues: (1) differentiating multiple co-habiting subjects in proximity, reliably extracting and associating vital sign data to each of them; and (2) identifying the physical activity context to disambiguate pathological and normal changes in vital signs (e.g., an increased heart rate after exercise or a meal is normal, but during sleep apnea [45, 51] it is abnormal).
In this article, we propose VitalHub (as shown in Figure 1), a robust vital sign monitoring system meeting the requirements of longitudinal in-home monitoring using a pair of co-located UWB and depth sensors. Based on a manual examination of over 6,000 data samples, we identify four typical temporal and spectral patterns (present in \(98.55\%\) of the data) and a suitable heart/respiration rate estimator for each. To handle harmonics and intermodulation, we devise a probabilistic weighted framework (PWF) that quantifies the cumulative evidence of these patterns to adaptively update the weighted combination of estimator outputs to track the vital signs robustly. To detect corrupted signals, we generate a two-dimensional (2D) “heatmap” representing the likelihood of different heart/respiration rate estimates and train a ResNet [28] model to produce a confidence value of how likely the signal is corrupted.
Fig. 1.
Fig. 1. VitalHub leverages a depth camera for body detection and context annotation. The location of the detected human body is used to segment the UWB signal corresponding to the chest wall of the interested subject, and the vital signal is extracted for vital signs estimation.
The depth sensor provides multiple functions: (1) recognizes human bodies and their relative depths, which enables us to identify which “range bins” (i.e., signals reflected from objects within certain depth ranges) correspond to the human bodies for the segmentation and extraction of vital signals as illustrated in Figure 1; (2) differentiates co-habiting subjects from skeletal walking patterns [70] for proper data association; and (3) recognizes body poses to produce the context of physical activities. Unlike RGB cameras, the depth sensor does not provide fine visual details; thus, it is more privacy friendly to subjects.
We have implemented a VitalHub prototype and conducted extensive experiments. We collected data from 8 volunteers in 56 sessions (2–10 min per session) in both stationary (e.g., sitting still) and non-stationary (e.g., natural upper body swaying) poses at 3 different distances/angles. We spent over 72 man-hours manually labeling more than 40,000 30-second time-windowed signals as to whether they were corrupted beyond human recognition, to provide training data.
Our PWF aided by the detector achieves a 1.5/3.2 bpm error at the 80-percentile for respiration and heart rates, even though individual estimators may produce 10–20 bpm errors in heart rates.1 These are very close to the 1.2/1.5 bpm errors of an idealistic oracle that always knows whether the signal is corrupted, as well as the best instantaneous range bin and estimator (none of which is practically feasible).
We make the following contributions in this work:
We systematically describe the challenges of non-touch vital sign monitoring using a low-cost COTS UWB sensor in realistic in-home scenarios. The challenges manifest in the form of dynamic signal patterns with harmonics and intermodulation interference, making robust heart rate estimation difficult, for which we do not find sufficient description or treatment in the literature.
We design a probabilistic weighted framework (PWF) that adaptively adjusts the weights in combining the outputs of four estimators based on quantitative evidence of respective patterns to address the challenges caused by the harmonics and intermodulation interference for heart rate estimation. Extensive experiments show that our proposed system achieves errors within 0.3/1.7 bpm of the upper limit set by an idealistic oracle, demonstrating the robustness of the PWF.
To better understand the contributions of our vital sign monitoring pipeline, we conduct a comparative evaluation with related work. Specifically, we compare VitalHub with three representative methods dealing with harmonics and intermodulation issues. We show that VitalHub achieves a \(\le\)5 bpm error 98.5% of the time in heart rate estimation, while others achieve only 51.2–81.3%, and we share insights on how the assumptions they rely on may not hold in reality. We also compare our heatmap-based signal quality detector against 4 other common methods and find that it achieves near-human performance at \(96\%\) for both precision and recall in detecting and excluding corrupted signals caused by inadvertent motions, whereas others at best achieve 89%/83%.
We develop a context annotation module, including a two-branch long short-term memory (LSTM) recurrent neural network and a probabilistic model, for identity tracking and activity recognition with a feature set derived from the skeleton data, using a depth camera independent of the ambient lighting conditions. Experiments show that the LSTM model outperforms baseline classifiers and achieves 90% median accuracy for differentiating 8 people from skeletal walking patterns, and 96% F1-score for recognizing 6 common daily activities.
VitalHub produces robust measurements for respiration and heart rates against harmonics and intermodulation. It reliably differentiates co-habiting subjects to associate data, and generates activity contexts, thus offering a suitable solution for longitudinal in-home vital sign data collection valuable for future customized health analytics (e.g., the detection of anomalous deviation from a user’s normal patterns of vital signs during certain daily activities).

2 Design Considerations

2.1 Design Goals

To achieve longitudinal in-home vital sign monitoring, we identify several goals as follows:
Robustness. Vital signs should be robustly extracted against dynamic changes, including corruptions, strong harmonics and intermodulation, and in the presence of multiple co-habiting people.
Passive, non-touch sensing. The monitoring should not require active user efforts, such as charging batteries or wearing devices. This is critical for older adults who would benefit the most from longitudinal monitoring but who are too physically and cognitively challenged for active efforts.
Context awareness. Co-habiting subjects must be distinguished from each other to properly associate the data to the corresponding individuals. The activity context of the subjects is needed to help disambiguate abnormal changes (physical or mental) from normal ones.
Privacy friendly. We have conducted 5 discussion groups, each with 10 to 15 participants (a total of 57 older adults), to learn their perceptions of technologies for in-home health monitoring. They come from both urban and suburban areas, with diversity in gender, age (60 to 80+ years), and socioeconomic status. At the beginning of each session, the participants were asked to complete a paper-based questionnaire so that their opinions were not influenced or biased by our later presentation. In the questionnaires, we found that people did not want to be monitored by regular RGB cameras, which show fine-grained images/videos. In discussions, they overwhelmingly expressed strong disinterest and privacy concerns, no matter who is watching, even close family members such as adult children living separately. However, they were more receptive to coarse-grained silhouettes from depth sensors because no visual details of facial expression or clothing were available.

2.2 Hardware Choices

To achieve the goals presented earlier, we choose a co-located UWB and depth sensor pair as the hardware platform. The UWB signal is sensitive to tiny displacements of the chest wall due to the heartbeat and respiration. The UWB system is found to be highly immune to multipath effects, with energy spanning a wide frequency bandwidth [40], and requires less complexity in the architecture than FMCW systems [8, 76] for wireless sensing.2 Less complexity in the architecture usually implies a less expensive COTS solution for low-cost deployment. Thus, we choose a UWB sensor [7] as the RF frontend for non-touch vital sign monitoring.
To balance the requirements of context awareness and privacy friendliness, we adopt a depth camera to (1) detect and locate human bodies to help select the range bins in the received UWB signals corresponding to the chest walls; (2) distinguish multiple people simultaneously present in the Field-of-View (FoV) for data-identity association; and (3) produce context information without using intrusive RGB images. Notably, the depth camera in our implementation uses an infrared (IR)–based time-of-flight (TOF) method for depth sensing, and the human body detection [70] is based on the depth image without RGB data. Therefore, the features of VitalHub enabled by the depth camera are independent of the ambient lighting conditions. Thus, it works well both day and night.

2.3 Background of UWB-Based Vital Sign Extraction

This section provides necessary background regarding vital sign extraction via UWB signal modeling. Heart and respiratory rates are two of the five vital signs collected at each physical examination [11]. Chest wall displacement is the combined result of heartbeat and respiration. The instantaneous distance (\(d(t)\)) of the chest wall away from the UWB sensor can be measured with high sensitivity for vital sign extraction, and can ideally be expressed [63] as:
\begin{equation} \begin{aligned}d(t) &=d_{0}+D(t) \\ &=d_{0}+d_{r} \sin \left(2 \pi f_{r} t\right)+d_{h} \sin \left(2 \pi f_{h} t\right), \end{aligned} \end{equation}
(1)
where \(d_{0}\) is the nominal distance between the UWB sensor and the targeted chest wall (i.e., provided by the depth sensor to select the proper “range bin”), \(D(t)\) is the chest wall displacement; \(d_{r}\) and \(d_{h}\) are the displacement amplitudes, and \(f_{r}\) and \(f_{h}\) the rates of respiration and heartbeat, respectively.
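As a concrete illustration, the following Python snippet synthesizes the displacement model \(D(t)\) over one 30-second window. All numeric values are illustrative assumptions drawn from the typical amplitude and rate ranges cited later in Section 4.3, not measured parameters.

```python
import numpy as np

# Toy synthesis of the chest-displacement model D(t) in Eq. (1).
# Amplitudes/rates are illustrative values within the typical ranges
# discussed in Section 4.3 (respiration 4-12 mm, heartbeat 0.2-0.5 mm).
fs = 10.0                                   # UWB frame rate (Hz)
t = np.arange(0.0, 30.0, 1.0 / fs)          # one 30 s analysis window
d_r, f_r = 6e-3, 0.25                       # respiration: 6 mm at 15 bpm
d_h, f_h = 0.4e-3, 1.1                      # heartbeat: 0.4 mm at 66 bpm
D = d_r * np.sin(2 * np.pi * f_r * t) + d_h * np.sin(2 * np.pi * f_h * t)
```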
The instantaneous channel response \(h(t, \tau)\) at time t with a short delay \(\tau\) can be formulated as:
\begin{equation} \begin{aligned}h(t, \tau) = \alpha _{D} \delta \left(\tau -\tau _{D}(t)\right) + \sum _{i=1} \alpha _{i} \delta \left(\tau -\tau _{i}\right), \end{aligned} \end{equation}
(2)
where \(\alpha _{D}\) and \(\alpha _{i}\) represent the magnitudes of the channel response of the target and other static objects, and \(\tau _{D}(t)=2 d(t) / c\) and \(\tau _{i}\) are corresponding delays (c being the speed of light). It indicates that the channel response of the target can be spatially distinguished from the clutter according to the range/TOF. The received signal \(s(t, \tau)\) can be derived as a convolution of the channel response and the transmitted impulse \(p(\tau)\) as
\begin{equation} \begin{aligned}s(t, \tau) &=p(\tau) * h(t, \tau) \\ &= \alpha _{D} p\left(\tau -\tau _{D}(t)\right) + \sum _{i=1} \alpha _{i} p\left(\tau -\tau _{i}\right). \end{aligned} \end{equation}
(3)
Therefore, the segment of received signal \(\alpha _{D} p\left(\tau -\tau _{D}(t)\right)\) shifts in the phase according to the two-way echo delay \(\tau _{D}(t)=2 d(t) / c\) due to the chest displacement, and two vital signs (heart and respiratory rates) can be extracted from the phase. The phase can be modeled as
\begin{equation} \begin{aligned}\phi (t)=\phi _{0}+\phi _{D}(t), \end{aligned} \end{equation}
(4)
where \(\phi _{0}\) is the initial phase of the received signal at the nominal distance \(d_0\), \(\phi _{D}(t)=2\kappa D(t)\) is the phase modulated by the physiological movements, and \(\kappa =2\pi /\lambda\) denotes the angular wave number, determined by the wavelength \(\lambda\) of the carrier wave.

2.4 Robustness Challenges

Robust vital sign extraction based on the derived phase model in (4) is challenging due to the following issues. First, the perceived phase can be noisy due to imperfect hardware. As the UWB signal is sampled at extremely high frequencies (23.328 GHz in our case), imperfect synchronization between the transmitter and receiver would result in a sampling time offset (STO), thus, a time-variant phase drift \(\phi _{STO}(t)\). Therefore, the phase model (4) needs to be updated as
\begin{equation} \begin{aligned}\phi (t)=\phi _{0}+\phi _{D}(t)+\phi _{STO}(t), \end{aligned} \end{equation}
(5)
which makes direct extraction difficult, especially when the phase drift from desynchronization becomes larger than that (\(\phi _{D}(t)\)) from physiological motion.
Second, the chest wall movements due to either heartbeat or respiration are not purely sinusoidal. Thus, harmonic components exist for both. As normal heart rates span a wide range (e.g., 50–150 bpm), the higher-order harmonics of respiration can co-exist in the same range. Larger respiration motions also produce strong harmonics, making it difficult to decide the correct fundamental frequency of heart rate. To address this issue, we introduce a probabilistic weighting framework (in Section 4.3.2) that adaptively reduces the interference of respiration harmonics to the heart rate estimation.
Third, in realistic scenarios, the phase of the received signal is more complex than linearly proportional to the chest wall displacement \(\phi _{D}(t)=2\kappa D(t)\). We start our reasoning with the scattering model of the human body for vital signs extraction [50, 82, 95]. The human body is more complex than a point scatterer. Rather, it has a three-dimensional (3D) shape and should be modeled as a collection of point scatterers at different depths [50]. Therefore, the received signal is a superposition of the scattered signals from different body parts, which may interfere with each other constructively or destructively. If we categorize the scattered signals into two sets, one modulated by physiological movements (denoted by M) and the other static (denoted by N), the resulting signal can be expressed as
\begin{equation} \begin{aligned}s(t)&=\sum _{m\in M}\alpha _{m} e^{-j(\phi _m + 2\kappa D(t))} + \sum _{n\in N}\alpha _{n} e^{-j\phi _n}\\ &= \alpha _{\bar{m}} e^{-j\phi _{\bar{m}}} e^{-j 2\kappa D(t)} +\alpha _{\bar{n}} e^{-j \phi _{\bar{n}}}, \end{aligned} \end{equation}
(6)
where the first term varies over time according to the physiological movements and the second term is static, thus a DC component. The subscripts m and n denote the elements from the set M and the set N, respectively, \(\bar{m}\) and \(\bar{n}\) denote their resulting summed terms, and \(\phi\) denotes the phase offset from the relative distances between point scatterers. Then, the phase of the resulting signal \(s(t)\) can be obtained from the in-phase signal (\(I(t)=\Re \lbrace s(t)\rbrace\)) and quadrature signal (\(Q(t)=\Im \lbrace s(t)\rbrace\)) with the arctangent demodulation method:
\begin{equation} \begin{aligned}\phi _{D}(t) &= \arctan \frac{Q(t)}{I(t)} = \arctan \frac{\alpha _{\bar{m}} \sin ({2\kappa D(t)+\phi _{\bar{m}}}) +\alpha _{\bar{n}} \sin (\phi _{\bar{n}}) }{\alpha _{\bar{m}} \cos ({2\kappa D(t)+\phi _{\bar{m}}}) +\alpha _{\bar{n}} \cos (\phi _{\bar{n}}) }. \end{aligned} \end{equation}
(7)
Therefore, the resulting phase of the received signal \(s(t)\) can be expressed as a nonlinear function in terms of \(D(t)\), which can be approximated by its Taylor series as follows:
\begin{equation} \begin{aligned}\phi _{D}(t) &=\sum _{i=1}^{\infty } a_{i} D^{i}(t)=(a_{1}D(t)+a_{2}D^{2}(t)+a_{3}D^{3}(t)+\cdots), \end{aligned} \end{equation}
(8)
where \(a_i\) is the coefficient of the i-th order term. The higher-order terms in the Taylor series of the nonlinear signal result in the intermodulation products between heartbeat and respiratory signals [39]. Take the second-order term, for example:
\begin{equation} D^{2}(t) = d_{r}^{2} \sin (2 \pi f_{r} t)^2+d_{h}^{2} \sin (2 \pi f_{h} t)^2 +2d_{r}d_{h} \sin (2 \pi f_{r} t) \sin (2 \pi f_{h} t). \end{equation}
(9)
According to the product-to-sum formulas, the trigonometric functions in (9) can be expressed as:
\begin{equation} \begin{aligned}\sin (2 \pi f_{r} t)^2 &= \frac{1}{2}-\frac{1}{2}\cos (\underline{4 \pi f_{r} t}), \\ \sin (2 \pi f_{h} t)^2 &= \frac{1}{2}-\frac{1}{2}\cos (\underline{4 \pi f_{h} t}), \\ \sin (2 \pi f_{h} t) \sin (2 \pi f_{r} t) &= \frac{1}{2}\cos (\underline{2 \pi (f_{h}-f_{r}) t}) -\frac{1}{2}\cos (\underline{2 \pi (f_{h}+f_{r}) t}). \\ \end{aligned} \end{equation}
(10)
As indicated in the underlined items in (10), intermodulation components manifest as spectral components at frequencies that are linear combinations of heart and respiratory rates (i.e., {\(mf_{h}\pm nf_{r}|m,n\in \mathbb {N}_{0}\)}). Such components could exist in the normal heart rate range, making it even more difficult to determine the correct frequency component of the heartbeat. The coefficients \(a_i\) in (8) are time variant, resulting in unpredictable and dynamic magnitudes of intermodulation components. Therefore, a method that relies on certain assumptions on signal patterns may fail under different patterns.

3 VitalHub Overview

Figure 2 illustrates the overall framework of VitalHub, which fuses inputs from a pair of co-located UWB and depth sensors to tackle challenges described in Section 2.4 for robust and context-aware vital sign monitoring. Thus, there are two streams of data (i.e., UWB and depth data) processed in parallel and combined organically.
Fig. 2.
Fig. 2. The overall framework of VitalHub. Two parallel data streams from the UWB and depth sensors are complementary and combined for passive and context-aware vital sign monitoring. The preprocessing module segments vital signals out of the UWB signal reflected from the cluttered environment according to the corresponding location of the target chest wall from the depth data stream. The signal quality detection module suspends vital sign estimation upon corrupted signals caused by disturbances from random body movements. The vital sign estimation module addresses the robust challenges and the vital sign data will be automatically annotated with the corresponding identification and activity context for future customized analytics.
To process the UWB data stream for vital sign extraction, we first develop a preprocessing pipeline (detailed in Section 4.1) to deal with STO issues and extract the vital signal (i.e., the phase change in the UWB signal due to physiological motions). Then, we introduce a signal quality detector (detailed in Section 4.2) to distinguish signals in which vital signs are “available” for estimation from those corrupted by inadvertent movements, even in the presence of harmonics and intermodulation. Moreover, we propose a PWF for vital sign estimation (detailed in Section 4.3) that specifically deals with the challenges of robust heart rate estimation in the presence of dynamic signal patterns.
While the UWB sensor is sensitive to minute movements for vital signs estimation, it is relatively “blind” to the context information (e.g., where the subject of interest is located, which subjects are present, and what activities each subject is conducting). We leverage the complementary characteristics of the depth sensor to achieve context awareness.
We process the depth data to support unambiguous monitoring in cohabiting scenarios to correctly associate respective context (e.g., identities and activities) to the UWB echo pulses from different subjects. To be specific, we leverage a human pose recognition model [70] to detect the body parts—thus, the poses of the subjects—present in the FoV of the depth sensor. It outputs 3D positions of body joints as a representation of skeletal pose. We use the predicted position of the torso to help locate the segments of UWB signals corresponding to the chest wall for vital signal extraction (explained in Section 4.1). We further leverage spatial and temporal features of consecutive skeletal poses to generate the context information (detailed in Section 5). Context is needed for the completeness of the passive and context-aware monitoring system.

4 Vital Sign Monitoring

In this section, we describe the vital sign monitoring module, which consists of three stages: (1) signal preprocessing to extract vital signals from received noisy UWB echoes; (2) signal quality detector; and (3) vital sign estimation to robustly measure heart/respiration rates in the presence of unpredictable and dynamic signal patterns.

4.1 UWB Signal Preprocessing

We design a UWB signal preprocessing pipeline to extract vital signals (i.e., phase changes due to physiological movements) from the reflected UWB pulses.
Signal Segmentation. This step locates the segments of received UWB signals corresponding to the target (i.e., chest walls). The pulses reflected from different distances are received at different arrival times. Thus, we segment signals into range bins, each corresponding to a different 5 cm depth range.3 Our UWB sensor has a range of 10 m, leading to about 200 range bins. As illustrated in Figure 1, we leverage the human body distance measurement from the context annotation module (in Section 5) to decide which range bin corresponds to which identified human body, and then further process the signals in those bins.
Vital Signal Sanitization. Next, we remove the time-variant phase drift \(\phi _{STO}(t)\) due to STO (analyzed in Section 2.4). Because \(\phi _{STO}(t)\) is caused by unknown jitters in the sampling system, it is impossible to describe with a mathematical model. Fortunately, the same jitters exist in signals from all range bins and the direct path (i.e., the signal received from the transmitter, without reflection from any object). The direct path signal can be expressed as \(\phi _{r}(t)=\phi _{0}^r+\phi _{STO}(t)\), where \(\phi _{0}^r\) is the initial phase of the direct path signal and is static. Therefore, we can simply use \(\phi _{r}(t)\) as a reference to cancel out \(\phi _{STO}(t)\) as follows to obtain sanitized vital signals in the form of relative phases:
\begin{equation} \begin{aligned}\phi ^{\prime }(t)=\phi (t)-\phi _{r}(t) =\phi _{D}(t)+\phi _{0}-\phi _{0}^{r}, \end{aligned} \end{equation}
(11)
where \(\phi _{0}\) and \(\phi _{0}^{r}\) are both static, and \(\phi _{D}(t)\) is the phase modulated by the physiological movements from which we estimate vital signs.
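A minimal sketch of this sanitization step, assuming the raw data arrive as complex baseband samples per range bin and the direct-path signal is available as a reference (array shapes are illustrative):

```python
import numpy as np

def sanitize_phase(rx_bins, direct_path):
    """Cancel the STO-induced phase drift as in Eq. (11).

    rx_bins: complex samples per range bin over time, shape (T, n_bins).
    direct_path: complex direct-path samples, shape (T,).
    The sampling jitter phi_STO(t) is common to all range bins, so
    subtracting the direct-path phase leaves phi_D(t) plus a constant offset.
    """
    phase = np.unwrap(np.angle(rx_bins), axis=0)   # per-bin phase over time
    ref = np.unwrap(np.angle(direct_path))         # reference phase
    return phase - ref[:, None]                    # sanitized vital signals
```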

4.2 Signal Quality Detector

Next, we describe how to detect whether the signal is corrupted beyond recognition or whether vital signs are still “available.” Large body motions (e.g., swaying) cause severe disruptions in the signal. Such “unavailable” signals must be detected and excluded to avoid producing erroneous results. Motion detection methods [5, 45, 90] based on periodicity in the time domain and/or condensed energy in the frequency domain have been proposed. However, strong respiration harmonics and intermodulation can dominate and mingle with such features from the much weaker heartbeat, and thresholding-based detectors cannot reliably tell them apart.
We propose a 2D heatmap-based detector that incorporates the spectral amplitudes at different frequencies. The heatmap \(HM(f_{r},\ f_{h})\) borrows the concept of “joint probability distribution” and the value of each pixel is defined at the respiration/heart rate candidate pair \(\lbrace f_{r},\ f_{h}\rbrace\):
\begin{equation} \begin{aligned}HM(f_{r},\ f_{h}) &= \sum _{z\in \mathbb {Z}(f_{r},\ f_{h})} A(z), \end{aligned} \end{equation}
(12)
where \(A(z)\) denotes the spectral amplitude of the signal at the frequency of z and \(\mathbb {Z}(f_{r}, f_{h})\) is a set of potential harmonic and intermodulation frequencies, which can be expressed as {\(mf_{h}\pm nf_{r}|m,n\in \mathbb {N}_{0}\)}. When the signal is not corrupted much, harmonic and intermodulation frequencies of (\(f_{r}, f_{h}\)) close to the true respiratory and heart rates would have significant energy. Thus, \(HM(f_{r}, f_{h})\) would gain relatively large values of \(A(z)\). This will visually appear as vertical and horizontal lines of large HM values in the heatmap. We show three representative samples in Figures 3, 4, and 5 for “available,” partially “available,” and “unavailable” signals. In Figure 3, the ground truth RR and HR are 21 and 60 bpm, respectively. The heatmap shows a horizontal line near 20 bpm on RR and a vertical line near 60 bpm on HR with red (i.e., larger values). Such visual patterns are used to detect whether a signal is “available.”4
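A minimal sketch of the heatmap computation in Equation (12); the grid resolution, harmonic order, and frequency-matching tolerance here are assumed values for illustration:

```python
import numpy as np

def vital_heatmap(freqs, amps, rr_grid, hr_grid, order=3, tol=0.5):
    """Heatmap HM(f_r, f_h) of Eq. (12): sum the spectral amplitudes found
    near the harmonic/intermodulation frequencies m*f_h +/- n*f_r.

    freqs, amps: spectrum of the vital signal (frequencies in bpm).
    rr_grid, hr_grid: candidate respiration / heart rates (bpm).
    """
    hm = np.zeros((len(rr_grid), len(hr_grid)))
    for i, fr in enumerate(rr_grid):
        for j, fh in enumerate(hr_grid):
            total = 0.0
            for m in range(order + 1):
                for n in range(order + 1):
                    if m == 0 and n == 0:
                        continue
                    for z in {m * fh + n * fr, m * fh - n * fr}:
                        if z <= 0:
                            continue
                        k = int(np.argmin(np.abs(freqs - z)))
                        if abs(freqs[k] - z) <= tol:  # nearest bin within tol
                            total += amps[k]
            hm[i, j] = total
    return hm
```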
Fig. 3.
Fig. 3. An “available” sample: the heatmap shows an obvious red horizontal line near 20 bpm on RR and a red vertical line near 60 bpm on HR. It matches the ground truth.
Fig. 4.
Fig. 4. A partially “available” sample: the heatmap shows a red horizontal line near the 20 bpm ground truth RR but no strong vertical line around 75 bpm ground truth HR.
Fig. 5.
Fig. 5. An “unavailable” sample: the heatmap is noisy and shows no obvious lines around ground truth RR and HR of 17 and 85 bpm.
To learn the spatial-invariant features from the 2D heatmap, we adopt the ResNet-18 model as the detector. The ResNet [28] model was initially proposed for image recognition, and takes 3-channel image data (i.e., RGB images) as input. We modify the first convolutional layer to process the heatmap, which is in the format of a 1-channel gray-valued image. We also adjust the final layer (i.e., fc layer with softmax) to output a vector of two numbers \((\alpha , \beta)\), both within \([0,\ 1]\), indicating the normalized probabilities of availability and unavailability. The larger one determines the binary classification result of signal availability. Therefore, the probability of availability \(\alpha\) can be used to indicate the signal quality.
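The two modifications can be sketched in PyTorch as follows; the input heatmap size and the weight initialization are assumptions for illustration:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

def build_hm_detector():
    """ResNet-18 adapted to 1-channel heatmaps with a 2-way output."""
    model = resnet18(weights=None)
    # Replace the first conv: 1-channel heatmap instead of 3-channel RGB.
    model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
    # Replace the final fc: two logits for (available, unavailable).
    model.fc = nn.Linear(model.fc.in_features, 2)
    return model

detector = build_hm_detector()
heatmap = torch.randn(1, 1, 64, 64)                       # toy 1-channel heatmap
alpha, beta = torch.softmax(detector(heatmap), dim=1)[0]  # availability probs
```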
The method of training and validation of the signal quality detector is described in Section 7.1. Signals detected as “available” are passed for vital sign estimation.
However, signals from the range bin that was directly located by the depth camera may not be suitable for vital sign estimation due to the offset error of the depth measure and the imperfect placement between UWB and depth sensors. Based on our preliminary experiments as described in Section 7.1 and especially in Figure 10, we note that adjacent range bins need to be considered to measure vital signs with better signal quality. To be specific, we flag a period as “available” when at least one range bin among 7 adjacent range bins (i.e., within \(\pm 15\) cm range) is classified as “available.” With \(\alpha\) as the signal quality indicator, we select a range bin with the largest \(\alpha\) among adjacent range bins for vital sign estimation during the “available” period.
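The bin-selection rule then reduces to a few lines. In this sketch, `alphas` is assumed to hold the detector's availability probability for each range bin, with availability decided by the larger softmax output (i.e., \(\alpha \gt 0.5\)):

```python
import numpy as np

def select_range_bin(alphas, center_bin, half_width=3):
    """Pick the best of the 7 adjacent range bins (within +/-15 cm).

    Returns (best_bin, available): the bin with the largest availability
    score alpha, and whether any bin in the window is classified available.
    """
    lo = max(center_bin - half_width, 0)
    window = np.asarray(alphas[lo:center_bin + half_width + 1])
    best = lo + int(np.argmax(window))
    return best, bool(window.max() > 0.5)
```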

4.3 Vital Sign Estimation

As the phase of the UWB signal reflected from the chest wall changes corresponding to the physiological motions, we are able to extract the respiration and heartbeat. Given the 8.75 \(\text{GHz}\) center frequency of the UWB signal in our design, a 0.2- to 0.5-mm displacement caused by a heartbeat [68] translates to a \(2.1^{\circ }\)–\(5.3^{\circ }\) change in phase, while a 4- to 12-mm displacement caused by respiration [19] translates to a \(42.0^{\circ }\)–\(126.0^{\circ }\) change in phase [38]. The heartbeat signal is orders of magnitude weaker than, and totally buried by, the respiration signal in the time domain. The difference in the typical frequency ranges between respiration (\(\sim\)6–18 bpm) and heartbeat (\(\sim\)50–150 bpm) allows them to be extracted separately. Figure 6 shows that with fine-tuned bandpass filters applied to the phase signal, the respiration and heartbeat can be easily recognized from the FFT spectrum. While this example looks straightforward, robustly measuring vital signs remains an open challenge. We introduce estimation methods for both respiration and heart rates below.
Fig. 6.
Fig. 6. An example segment of vital signal for spectral analysis.

4.3.1 Respiration Rate Estimation.

As the respiration frequency is usually from 0.1 to 0.3 \(\text{Hz}\), we use a second-order Butterworth bandpass filter with a pass band of 0.1–0.8 \(\text{Hz}\) to remove the DC component and high-frequency noise. Since the whole chest moves with respiration, it has a larger radar cross section (RCS) and displacement. Thus, the phase signals are stable enough that we can easily estimate the respiration rate by counting the peaks. We use a time window of 30 s (which usually contains 5–8 breathing cycles) and calculate the time intervals between adjacent peaks. Then, we average the intervals to obtain the respiration rate \(f_r\).
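A minimal SciPy sketch of this estimator; the minimum peak spacing is an assumed heuristic:

```python
import numpy as np
from scipy.signal import butter, filtfilt, find_peaks

def respiration_rate(phase, fs=10.0):
    """Peak-counting RR estimator over a 30 s window (Section 4.3.1 sketch)."""
    # Second-order Butterworth bandpass (0.1-0.8 Hz): removes DC and HF noise.
    b, a = butter(2, [0.1, 0.8], btype="bandpass", fs=fs)
    resp = filtfilt(b, a, phase)
    # Assumed heuristic: adjacent breaths are at least ~2 s apart.
    peaks, _ = find_peaks(resp, distance=int(2 * fs))
    if len(peaks) < 2:
        return None                        # too few cycles to estimate
    period = np.mean(np.diff(peaks)) / fs  # mean peak interval in seconds
    return 60.0 / period                   # respiration rate in bpm
```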

4.3.2 Heart Rate Estimation.

Extracting the heart rate is more challenging due to its much smaller RCS and displacements, thus, much weaker magnitudes in both temporal and spectral domains. As explained in Section 2.4, harmonics and intermodulation from respiration can easily dominate the heartbeat signal and their patterns are dynamic.
To robustly measure heart rate, we propose a PWF that (1) incorporates four heart rate estimators, each suitable to one of four identified temporal and spectral patterns; (2) adaptively combines heart rate candidates generated by the estimators with the quantified cumulative evidence of each pattern; and (3) leverages limits in heart rate temporal changes to smooth continuous measures.
Heartbeat Signal Extraction. In this step, we filter out noise and the respiration signal, and enhance the heartbeat signal for estimation. While the heartbeat signal presents periodic changes, the noise behaves randomly and can be modeled as Gaussian. We use auto-correlation [69] to suppress the noise and enhance the periodic pattern of heartbeats. We observe that, because of its higher frequency, the heartbeat causes larger changes between adjacent sampling points than respiration. We use the second-order difference to make the heartbeat more prominent.
Then, we use the Discrete Wavelet Transform (DWT) as the filter bank [79] to extract heartbeat signals because the DWT can retain the inherently irregular shape of the vital signals whereas the conventional filters (e.g., Butterworth filter [34]) would smooth the shape and result in loss of information for temporal analysis. We progressively split the signal into approximation coefficients (from the low-pass filter) and detail coefficients (from the high-pass filter) with the previously decomposed coefficients and reconstruct the signal with the coefficients in the interested frequency range (0.625–5 Hz, which covers both fundamental and second-order harmonics). With L iterations (corresponding to L scales), an approximation coefficient \(\gamma ^{(L)}\) and a sequence of detail coefficients \(\upsilon ^{(1)}, \upsilon ^{(2)}, \ldots , \upsilon ^{(L)}\) are calculated in (13).
\begin{equation} \left\lbrace \begin{array}{@{}ll} {\gamma _{k}^{(L)}=\sum _{n\in \mathbb {Z}}{s}[{n}] \varphi _{2^{L} n - k}^{(L)}, L \in \mathbb {Z}}, \\ {\upsilon _{k}^{(l)}=\sum _{n\in \mathbb {Z}}{s}[{n}] \psi _{2^{l} n - k}^{l}, l \in \lbrace 1, \ldots , L\rbrace }, \end{array}\right. \end{equation}
(13)
where \(\varphi\) denotes the scaling function and \(\psi\) the wavelet. The heartbeat signal can be reconstructed using the inverse DWT:
\begin{equation} s[n]=\sum _{k \in \mathbb {Z}} \gamma _{k}^{(L)} \varphi _{2^{L}n- k}^{(L)}+\sum _{l=1}^{L} \sum _{k \in \mathbb {Z}} \upsilon _{k}^{(l)} \psi _{2^{l}n- k}^{l}. \end{equation}
(14)
In VitalHub, we select the Daubechies(db4) wavelet as the mother wavelet [22] and split the signal into 4 levels. The detail coefficients \(\upsilon ^{(3)} + \upsilon ^{(4)}\) (ranging from 0.625 Hz to 2.5 Hz) are used to reconstruct the heartbeat signal. The coefficients \(\upsilon ^{(4)}+\gamma ^{(4)}\) (ranging from 1.25 Hz to 5 Hz) are used to reconstruct the second-order harmonic component of the heartbeat signal.
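A sketch using PyWavelets follows. Note that `pywt.wavedec` orders coefficients from coarse to fine, so the sub-bands are selected here by their nominal frequency ranges at the 10 Hz frame rate (0.625–2.5 Hz for the fundamental, 1.25–5 Hz for the second harmonic); this mapping of level indices to bands is an interpretation of the description above.

```python
import numpy as np
import pywt

def heartbeat_bands(phase, wavelet="db4"):
    """4-level DWT filter bank for heartbeat extraction (Section 4.3.2 sketch).

    At a 10 Hz frame rate the nominal detail sub-bands are
    cD1: 2.5-5 Hz, cD2: 1.25-2.5 Hz, cD3: 0.625-1.25 Hz, cD4: 0.3125-0.625 Hz.
    """
    cA4, cD4, cD3, cD2, cD1 = pywt.wavedec(phase, wavelet, level=4)
    zeros = [np.zeros_like(c) for c in (cA4, cD4, cD3, cD2, cD1)]
    # Fundamental heartbeat band (0.625-2.5 Hz): keep cD3 + cD2.
    fundamental = pywt.waverec([zeros[0], zeros[1], cD3, cD2, zeros[4]], wavelet)
    # Second-order harmonic band (1.25-5 Hz): keep cD2 + cD1.
    second_harmonic = pywt.waverec([zeros[0], zeros[1], zeros[2], cD2, cD1], wavelet)
    return fundamental, second_harmonic
```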
Ensemble of Heart Rate Estimators. Based on manual examination of over 6,000 data samples, we identify four typical temporal/spectral patterns (present in \(98.55\%\) of the data) and identify a suitable estimator of the fundamental heart rate based on each domain pattern, including (1) zero-crossing (ZC), (2) peak interval (PK), (3) local maximum detection in the spectrum of the heart rate range (LMD), and (4) spectral peak detection in the range of the heartbeat signals’ second-order harmonics (SOH).
The first two handle temporal patterns. ZC estimates the heart rate by counting the number of zero-crossings in a time window, dealing with a periodic pattern of temporal changes between negative and positive values. Higher-order harmonics of respiration may cause more negative-to-positive transitions, thus a falsely higher heart rate. PK measures the average interval between adjacent local maxima in a time window, thus the heart rate. It is relatively immune to signals of larger energy but sensitive to high-frequency jitters.
The latter two handle spectral patterns. When the fundamental spectral peak of the heartbeat has significant energy [5], LMD detects such high peaks in the heart rate range (50–150 bpm). When higher-order harmonics or intermodulation products of respiration have strong energy, they may overwhelm the heartbeat peak in this range. SOH selects spectral peaks in the range of the second-order harmonics of the heartbeat (100–300 bpm), then halves them as estimates. We observe that respiration harmonics and intermodulation have much weaker energy in this range [64]. Due to partial overlap with the heartbeat fundamental frequency range, respiration may still produce significant peaks on occasion, resulting in erroneous heart rate estimation.
Using a sliding window, we produce a heart rate candidate set \(C_{t}\) at time t, consisting of \(C_{t}^F\) (two estimates, one each from ZC and PK, plus the 3 largest peaks from LMD) and \(C_{t}^S\) (the 3 largest peaks from SOH). Unless explicitly stated, a candidate \(c_{t}^m\) is chosen from the combined set \(C_{t}=C_{t}^F \cup C_{t}^S\).
Probabilistic Heart Rate Tracking. We formulate the continuous heart rate estimation as tracking the “trend” of changes, with the state update equation as follows:
\begin{equation} \hat{x}_{t} = x_{t-1} + \dot{x}_{t-1}\triangle t + \varepsilon _p, \end{equation}
(15)
where \(x_{t-1}\) is the state (i.e., heart rate) we have estimated at time \(t-1\), \(\hat{x}_{t}\) is the heart rate predicted at time t, \(\triangle t\) is the estimation interval (set to 1 second in our configuration), and \(\varepsilon _p \sim \mathcal {N}(0, \sigma _p^2)\) is the process noise. Because errors accumulate over time, the predictions must be calibrated using evidence from observations.
The four temporal/spectral patterns are present most of the time (\(\gt\)98%); thus, the heart rate candidate set \(C_{t}\) very likely includes the correct one. The key is to determine which one. We quantify the evidence of each candidate \(c_t^m\) to determine its weight and calibrate predictions.
Respiration Harmonics. Assume that the fundamental respiration frequency is \(f^r_t\). Then, its harmonics are represented as \(H^r_t=\lbrace f^r_t, 2f^r_t, \ldots ,Nf^r_t\rbrace\), where N is empirically limited at 5 because those beyond the 5th are negligible [64]. The closer a candidate is to any respiration harmonic, the less likely it is true, which can be formulated in the following weight:
\begin{equation} P_r({c_t^m}) = 1 - g_{r}(\min _{n} (abs(c_t^m - n\cdot f^r_t))), \end{equation}
(16)
where \(n \in \lbrace 1,2, \ldots ,N\rbrace\), \(g_{r}(\cdot) \sim N(0, \sigma _r^2)\) is a Gaussian distribution and \(\sigma _r\) is empirically set to 2.
Heartbeat Harmonics. The heartbeat signal also has harmonics, whereas random noise does not. Thus, the existence of high-order harmonics can be used as evidence of the heartbeat fundamental frequency \(f_h\). As the heartbeat signal is relatively weak, we consider only its SOH. This weight can be calculated as follows:
\begin{equation} P_h({c_t^m}) = g_{h}(\min _{n} (abs(c_t^m - c_t^n))), \end{equation}
(17)
\begin{equation} P_h({c_t^n}) = g_{h}(\min _{m} (abs(c_t^m - c_t^n))), \end{equation}
(18)
where \(c_t^m \in C_{t}^F\), \(c_t^n \in C_{t}^S\), \(g_{h}(\cdot) \sim N(0, \sigma _h^2)\) is another Gaussian and \(\sigma _h\) is empirically set to 2.
Peak Prominence. We observe that real peaks are usually “sharp” (i.e., higher prominence), even though the amplitude may be small. While we have estimations from both the time domain (i.e., ZC, PK) and the frequency domain (i.e., LMD, SOH), we use the prominence of the spectral peaks of the heartbeat signal reconstructed according to (14) at the corresponding (estimated) frequencies to regulate their weights, because the spectral pattern (i.e., the distribution of the peak prominence) is resilient to noise and can serve as a reliable indicator for selecting the vital sign candidates estimated from either temporal or spectral methods. We use an exponential distribution to represent this weight:
\begin{equation} P_p({c_t^m}) = 1 - e^{-\alpha \cdot p(c_t^m)}, \end{equation}
(19)
where \(p(c_t^m)\) is the peak prominence that quantifies how much the candidate \(c_t^m\) peak stands out due to its height and location relative to other nearby peaks, and the scale factor \(\alpha\) is empirically set to 1.
Temporal locality. The heart rate is not likely to change abruptly in a short time (e.g., 1 s), and the next heart rate is usually close to the current one. Therefore, we quantify how close a candidate is to the previous estimation as:
\begin{equation} P_l({c_t^m}) = g_{l}(abs(c_t^m - x_{t-1})), \end{equation}
(20)
where \(g_{l}(\cdot) \sim N(0, \sigma _l^2)\) is another Gaussian and \(\sigma _l^2\) is the variance of the heart rate trend.
We define the likelihood of a candidate to be the heart rate as the cumulative evidence in a product form:
\begin{align} \mathcal {L}_{t}^m &= P_r({c_t^m}) \cdot P_h({c_t^m})\cdot P_p({c_t^m})\cdot P_l({c_t^m}). \end{align}
(21)
The normalized weight for a candidate is expressed as
\begin{equation} \omega _t^m = \frac{\mathcal {L}_t^m}{\sum _{j=1}^{M_t}\mathcal {L}_t^j}, m = 1,2, \ldots ,M_t. \end{equation}
(22)
Then, we take the weighted average of all of the candidates as a new measurement:
\begin{equation} \bar{c}_{t} = \sum _{c_{t}^n\in C_{t}}\omega ^n_{t}\cdot c_{t}^n. \end{equation}
(23)
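Putting Equations (16)–(23) together, the following is a minimal sketch of the candidate weighting. For brevity, it simplifies the heartbeat-harmonic evidence by matching each candidate against its nearest other candidate rather than strictly across the \(C_t^F\)/\(C_t^S\) split, and it treats \(\sigma_l\) as a constant; both are assumptions of this sketch.

```python
import numpy as np

def g(x, sigma):
    """Unnormalized zero-mean Gaussian kernel used as an evidence weight."""
    return np.exp(-x ** 2 / (2 * sigma ** 2))

def weighted_measurement(cands, prominence, f_resp, x_prev,
                         sigma_r=2.0, sigma_h=2.0, sigma_l=2.0,
                         alpha=1.0, n_harm=5):
    """Cumulative-evidence weighting of heart rate candidates, Eqs. (16)-(23).

    cands: candidate heart rates (bpm) from ZC/PK/LMD/SOH;
    prominence: spectral peak prominence per candidate;
    f_resp: estimated respiration rate (bpm); x_prev: previous HR estimate.
    """
    cands = np.asarray(cands, dtype=float)
    harm = f_resp * np.arange(1, n_harm + 1)
    # P_r (Eq. 16): penalize closeness to any respiration harmonic.
    p_r = 1.0 - g(np.abs(cands[:, None] - harm).min(axis=1), sigma_r)
    # P_h (Eqs. 17-18): reward mutual support between candidates
    # (simplified: nearest *other* candidate instead of the F/S split).
    d = np.abs(cands[:, None] - cands[None, :]) + np.eye(len(cands)) * 1e9
    p_h = g(d.min(axis=1), sigma_h)
    # P_p (Eq. 19): sharper spectral peaks earn larger weights.
    p_p = 1.0 - np.exp(-alpha * np.asarray(prominence))
    # P_l (Eq. 20): prefer candidates close to the previous estimate.
    p_l = g(np.abs(cands - x_prev), sigma_l)
    lik = p_r * p_h * p_p * p_l            # Eq. (21)
    w = lik / lik.sum()                    # Eq. (22)
    return float(w @ cands)                # Eq. (23): weighted measurement
```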
We observe that the error of the weighted measurement can be considered zero-mean Gaussian (the Kolmogorov-Smirnov statistic is 0.036, below the 0.05 threshold under which two distributions are considered the same [25]). Therefore, we apply the Kalman Filter to iteratively repeat the following steps to update the heart rate at discrete timesteps upon each new candidate set:
\begin{equation} \begin{array}{c}{K_{t}=\frac{\sigma _{t-1}^{2}}{\sigma _{M}^{2}+\sigma _{t-1}^{2}}}, \\ {\sigma _{t}^{2}=\left(1-K_{t}\right)\sigma _{t-1}^{2}}, \\ {{x}_{t}=\hat{x}_{t} + K_{t} (\bar{c}_{t} - \hat{x}_{t}}), \end{array} \end{equation}
(24)
where \(K_t\) is the Kalman Gain, \(\sigma _M^{2}\) is the variance of the measurement noise (from \(\bar{c}_{t}\)), and \(\sigma _t^{2}\) is the estimate variance, initialized with the process noise variance \(\sigma _p^{2}\).
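One update step of Equation (24) can be written as the following sketch:

```python
def kalman_update(x_pred, var_prev, c_bar, var_meas):
    """Fuse the predicted heart rate with the weighted measurement, Eq. (24).

    x_pred: prediction x_hat_t from Eq. (15); var_prev: current variance
    (initialized with sigma_p^2); c_bar: weighted measurement from Eq. (23);
    var_meas: measurement noise variance sigma_M^2.
    """
    k_gain = var_prev / (var_meas + var_prev)    # Kalman gain K_t
    var_new = (1.0 - k_gain) * var_prev          # updated variance sigma_t^2
    x_new = x_pred + k_gain * (c_bar - x_pred)   # calibrated estimate x_t
    return x_new, var_new
```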

5 Context Annotation

To enable context-aware monitoring of cohabiting subjects, subject identities and activities must be labeled with the detected vital signs. Context annotation serves two purposes for health analytics over continuous vital sign measurements: (1) the recorded vital sign data can be used for meaningful customized analytics only when correctly associated with the corresponding identities; and (2) the annotated activity context helps detect anomalous deviations from a user’s normal distribution (e.g., a user’s vital signs rise during exercise but stabilize during sleep) and reduces false alarms. We present an effective, privacy-friendly user identification approach based on human walking skeleton data and a probabilistic model for continuous user identity tracking under occlusions. Leveraging the same set of features, we recognize the physical activities as context information for each individual.

5.1 User Identification

To capture both spatial and temporal features while a user is walking, we leverage the skeleton data from a short period (a few steps’ walking when the user enters the monitoring zone) as input for recognition rather than individual frames.
Features. We leverage the dynamic body joint locations tracked from depth sensors as features for user identification. We calculate the vectors between adjacent body joints to obtain the features \(\mathbb {V}=\lbrace \vec{v}_0, \vec{v}_1, \ldots ,\vec{v}_{N-1}\rbrace\), where \(\vec{v}_i=[x_i-x_j, y_i-y_j, z_i-z_j]\) is the vector between joints i and j, and N is the number of pairs of joints we use. Accordingly, we can also calculate the lengths of all of the vectors as \(\mathbb {L}=\lbrace l_0,l_1, \ldots ,l_{N-1}\rbrace\), where \(l_i=|\vec{v}_i|\), and the angles at major joints whose changes indicate specific activities (e.g., certain changes in neck angle indicate nodding): \(\mathbb {A}=\lbrace a_0, a_1, \ldots ,a_M\rbrace\), where \(a_i=cos^{-1}(\frac{\vec{v}_i\cdot \vec{v}_j}{| \vec{v}_i | \cdot |\vec{v}_j|})\). We choose the combination of \(\lbrace \mathbb {V}, \mathbb {L}\rbrace\) as features because it shows the best performance in the evaluation. We choose a time window of 2 seconds (i.e., 120 frames at 60 fps) to balance recognition latency and accuracy, and each training sample has a dimension of \((120\times (3+1)\times N)\). To reduce noise in the feature data, we use a Savitzky-Golay filter [20] to smooth the data and filter out jitters among adjacent frames.
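A sketch of the per-frame \(\lbrace \mathbb{V}, \mathbb{L}\rbrace\) feature construction; the joint-pair list and the Savitzky-Golay window parameters are assumptions for illustration:

```python
import numpy as np
from scipy.signal import savgol_filter

def skeleton_features(joints, pairs):
    """Build the {V, L} features of Section 5.1 from tracked 3D joints.

    joints: (T, J, 3) joint coordinates over T frames (120 at 60 fps);
    pairs: N (i, j) index pairs of adjacent joints.
    Returns (T, N, 4): the 3D vector plus its length per joint pair,
    matching the (3 + 1) x N per-frame feature dimension.
    """
    vecs = np.stack([joints[:, i] - joints[:, j] for i, j in pairs], axis=1)
    lens = np.linalg.norm(vecs, axis=-1, keepdims=True)
    feats = np.concatenate([vecs, lens], axis=-1)
    # Savitzky-Golay smoothing along time to suppress inter-frame jitter
    # (window length and polynomial order are assumed values).
    return savgol_filter(feats, window_length=9, polyorder=3, axis=0)
```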
Identification Model. We design a deep recurrent model with stacked-LSTM layers [30], each with 128 hidden units, and a fully connected layer with Softmax activation to output prediction results. Figure 7 shows the framework of the deep learning model for context annotation (i.e., user identification and activity recognition). We choose to use stacked LSTM layers because they have been found to have better capacity to learn useful features from abstract sequential data [10]. We empirically tune and finalize our model with two stacked LSTM layers to balance the trade-off between model capacity and computing complexity based on our preliminary experiments. While the LSTM [30] itself is robust to the vanishing gradient issue by its nature, we apply batch normalization [65] in the training phase to further prevent and eliminate the vanishing gradient issue. We train the model using cross-entropy loss [18] and the Adam optimizer [37]. With the input of a series of feature sets \(\lbrace \mathbb {V}, \mathbb {L}\rbrace\), both static features (e.g., limb lengths) and dynamic features (e.g., the walking pattern) can be effectively used for reliable user identification. A PyTorch sketch of this two-branch architecture follows Figure 7.
Fig. 7.
Fig. 7. Data labeling with skeleton data: the two stacked LSTM layers take sequential data for feature extraction, followed by two parallel fully connected layers (fc 1 & fc 2) for user identification and activity recognition.
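The sketch below follows the text (two stacked LSTM layers with 128 hidden units and parallel heads); the head dimensions (8 users, 6 activities) match the evaluation setup, while the input dimension and the use of the last hidden state as the sequence summary are assumptions:

```python
import torch
import torch.nn as nn

class ContextAnnotator(nn.Module):
    """Two stacked LSTM layers with parallel heads (fc 1 & fc 2)."""

    def __init__(self, input_dim, n_users=8, n_activities=6, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden, num_layers=2, batch_first=True)
        self.fc_id = nn.Linear(hidden, n_users)        # fc 1: identification
        self.fc_act = nn.Linear(hidden, n_activities)  # fc 2: activity

    def forward(self, x):              # x: (batch, 120 frames, input_dim)
        out, _ = self.lstm(x)
        h = out[:, -1]                 # last hidden state summarizes the window
        return self.fc_id(h), self.fc_act(h)

model = ContextAnnotator(input_dim=4 * 20)   # assuming N = 20 joint pairs
id_logits, act_logits = model(torch.randn(1, 120, 80))
```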

5.2 Probabilistic Identity Tracking

The identities of each user’s skeleton data must be tracked continuously. Naïvely running LSTM inference continuously is inefficient and error prone. The situation becomes even worse when non-line-of-sight (NLOS) happens due to occlusions. We propose a lightweight probabilistic identity tracking algorithm that keeps tracking identities of each user from earlier LSTM inferences to reduce complexity and improve accuracy.
The basic intuition is that human movements are continuous; thus, we can predict the trajectories based on previous locations and directions upon transient occlusions. Given the depth sensor’s 60 fps frame rate, the moving distances between two adjacent frames are small enough that the skeletons can be easily tracked. We need to recover the identities only when occlusions happen (e.g., one user blocks the line of sight of another).
As shown in Figure 8, when there is only one user A in the FoV, the identity can be tracked easily by tracking the skeleton. When both A and B are present, the identity of B may be lost if occluded by A when B is walking. The problem becomes more complex when more users are present. We formulate this into a probabilistic estimation problem leveraging the user movement trajectories: we predict the “next appear” locations of the users when they are occluded and estimate the identity probabilities based on how close the predicted “next appear” locations are to where the lost skeletons reappear in the FoV. This works in three steps, as follows.
Fig. 8.
Fig. 8. An example of user identity tracking.
1. User Movement Prediction. We choose the user’s head location coordinate \(x_{t}^{k}\) to represent the k-th user’s location at time t. Given the user’s location \(x_{t-1}^k\) at time \(t-1\) and the control signal \(u_{t}^k=(v_t^k,\omega _t^k),\) where \(v_t^k\) is the moving speed and \(\omega _t^k\) is the heading direction (estimated from the existing trajectory), the current predicted location \(\hat{x}_t^{k}\) of the k-th user is estimated as a displacement plus a Gaussian noise:
\begin{equation} \hat{x}_t^{k} = x_{t-1}^{k} + u_{t}^k\triangle t + \varepsilon . \end{equation}
(25)
2. Identity Probability Update. When users are occluded, we keep predicting the “next appear” locations using Equation (25). Note that the longer the user disappears, the more predictions we make and the lower the accuracy. As shown in Figure 8, users B and C walk in opposite directions and are both occluded by user A.5 \(\hat{B}\) and \(\hat{C}\) are the predicted locations for B and C, respectively. Suppose that one skeleton S appears again at location \(x_t^S\) after the occlusion; we need to recover the identity for S. In this case, we denote the probability for S to be user B or C as \(P_{S\hat{B}}\) and \(P_{S\hat{C}}\), respectively, which can be estimated as follows:
\begin{equation} \begin{split}& P_{S\hat{B}}=g(\hat{x}^B_t)\cdot e^{-\alpha \cdot \tau }, \end{split} \end{equation}
(26)
where \(g(.) \sim N(x_t^S, \sigma ^2)\) is a Gaussian distribution, \(e^{-\alpha \tau }\) is a time decaying factor, and \(\tau\) is the elapsed time of the occlusion. Similarly, we can estimate \(P_{S\hat{C}}\).
3. User Identity Recovery. After we get \(P_{S\hat{B}}\) and \(P_{S\hat{C}}\), we first sort the probabilities, find the maximum one \(P_{max}\), and compare it with a threshold \(\epsilon\), which is the minimum probability we need to recover S. If \(P_{max} \lt \epsilon\), we do not have enough confidence to recover S to either of the occluded users. In such cases, we need to run an LSTM model to obtain the identity from skeleton features. In general, a larger \(\epsilon\) makes identity recovery more robust but more computationally expensive. If \(P_{max} \ge \epsilon\), we further compare the difference between \(P_{max}\) and the second highest probability \(P_{second}\). The user’s identity is recovered only when \(P_{max}-P_{second} \ge \eta\), where \(\eta\) is another threshold used to ensure a sufficient difference in probabilities to avoid ambiguity among multiple identities. A larger \(\eta\) makes identity recovery more robust but incurs more computation to reach the threshold. We set \(\epsilon =0.8\) and \(\eta =0.5\) empirically to balance robustness and computation. In practice, once the users are recognized by initial LSTM inferences, the probabilistic tracking algorithm can track the identities efficiently, invoking compute-heavy LSTM inferences only occasionally.
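The recovery logic can be sketched as follows; the Gaussian spread and the decay rate are assumed values, and the Gaussian kernel is left unnormalized so that the probabilities stay in \([0, 1]\):

```python
import numpy as np

def recover_identity(x_seen, predicted, tau, sigma=0.5, decay=0.1,
                     eps=0.8, eta=0.5):
    """Match a reappearing skeleton to occluded users (Steps 2-3 sketch).

    x_seen: 3D location where skeleton S reappears;
    predicted: {user_id: x_hat} "next appear" locations from Eq. (25);
    tau: elapsed occlusion time in seconds.
    Returns the recovered user id, or None to fall back to LSTM inference.
    """
    probs = {}
    for user, x_hat in predicted.items():
        g = np.exp(-np.sum((x_seen - x_hat) ** 2) / (2 * sigma ** 2))  # Eq. (26)
        probs[user] = g * np.exp(-decay * tau)        # time-decayed confidence
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    p_max = ranked[0][1]
    p_second = ranked[1][1] if len(ranked) > 1 else 0.0
    if p_max >= eps and p_max - p_second >= eta:
        return ranked[0][0]
    return None   # insufficient confidence: rerun the LSTM identification
```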

5.3 Activity Context Recognition

Knowing the concurrent activity (e.g., exercising or sleeping) is critical to detect health anomalies given the same vital sign changes (e.g., increased heart rates). We apply transfer learning with the LSTM model trained in Section 5.1 to recognize the activity. Since the pretrained LSTM model has learned sophisticated features from sequential skeleton data, we feed them to a new fully connected layer (classifier) to recognize different activity categories. As shown in Figure 7, the sequential skeleton data feed the shared forward path of the LSTM model, and the extracted feature vector generates predictions of both user identity and activity.

6 Testbed

In this section, we describe the implementation of our testbed and experimental setup for evaluation.

6.1 Implementation

VitalHub adopts a COTS IR-UWB sensor, the XeThru X4M03 [48], as its frontend for wireless sensing. The transmitted pulse is configured to be within the frequency band 7.25 to 10.2 GHz, centered at 8.75 GHz, and the sampling frequency is 23.328 GHz. The frame rate of the UWB sensor is configured to be 10 frames per second (fps). Each frame includes samples of the echo pulses reflected from objects within a range of 10 m. A Kinect Xbox One sensor serves as the depth sensor in VitalHub. Its software development kit (SDK) incorporates the human body pose recognition model [70] to detect human bodies present in the field of view (FoV) at 60 fps. Both modalities stream data to the same backend PC via serial port. For computations of the whole pipeline, we use an ASUS ROG Strix GL704 laptop as the backend PC, which has an Intel i7-8750 2.2 GHz CPU, 16 GB RAM, and an NVIDIA RTX 2060 GPU. We implement the deep learning models with PyTorch and run them on the laptop’s GPU.

6.2 Experimental Setup

Figure 9(b) shows the hardware setup of a Kinect Xbox One sensor with the RGB camera covered and a co-located UWB sensor. We conduct experiments in a room with a size of \(4.5\times 9\) \(m^2\) (shown in Figure 9(a)).
Fig. 9.
Fig. 9. The experiment environment and hardware setup.
We invited 8 students as participants for data collection (heights 156–192 cm, weights 49–108 kg), following a preestablished protocol that protected the anonymity of the students. We use two medical devices approved by the US Food and Drug Administration, the Nonin LifeSense II [55] and the Masimo Pulse Oximeter [2], which measure instantaneous heart and respiratory rates, as the ground truth. We use timestamps to align the instantaneous vital sign estimates with the corresponding ground truth over time. Although results for each module are presented separately, VitalHub inherently integrates and produces data in a holistic pipeline concurrently.

7 Microbenchmarks

Before we delve into the end-to-end evaluation of VitalHub for vital sign monitoring, we start with a few microbenchmarks to demonstrate the performance of signal quality detector and context annotation. Then, to evaluate the end-to-end performance in Section 8, the data are retrieved according to the recognized identities based on context annotation; the pretrained signal quality detector is used to filter in the time domain (i.e., sliding windows) and in the space domain (i.e., range bins) for robust vital sign estimation against inadvertent motions.

7.1 Signal Quality Detector

We first evaluate the signal quality detector for the classification of signal availability. Then, we demonstrate how the detector can boost the performance of vital sign monitoring by reducing erroneous results from corrupted signals.
Classification. We compare the heatmap-based detector (HM) against 4 existing detectors based on moving average (MABD) [90], moving variance (MVBD) [90], average variance energy (AVE) [45], and flat spectrum (FSD) [5].
We build a balanced dataset consisting of 20,000 data samples, with equal numbers of “available” and “unavailable” samples randomly selected from 40,782 manually labeled ones. Each data sample is the vital signal in a 30-second time window from one of 7 adjacent range bins centered at the depth sensor–reported human body distance. We label a data sample as “available” if a well-trained human observer identifies sufficient temporal periodicity and/or spectral peaks for both respiration and heartbeat, even under strong noise; otherwise, it is “unavailable.” Therefore, for an identified “available” data sample, we know for sure that the vital sign information exists. Thus, failure to extract accurate readings indicates limitations of estimation algorithms.
We use precision (P), recall (R), and F-score (\(= 2 \frac{P \cdot R}{P + R}\)) as metrics. Precision is the fraction of true positives among all identified positives, defined as \(P = \frac{TP}{TP + FP}\); recall is the fraction of identified positives among all true positives, defined as \(R = \frac{TP}{TP + FN}\). A high precision means that unavailable data is unlikely to be falsely identified as “available”; a high recall means that the available data can be correctly identified and, thus, can be utilized for monitoring. F-score quantifies the balance between precision and recall.
We apply 5-fold cross-validation. In each iteration, we take \(80\%\) of the dataset for training our detector or searching the thresholds of the other detectors, and the remaining \(20\%\) for testing. For fair comparison, each threshold is selected to maximize the respective F-score. The HM detector is trained with the Adam optimizer [37], minimizing the cross-entropy loss [18], which measures the discrepancy between predicted and actual labels.
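To make the training setup concrete, the following is a minimal PyTorch sketch of one optimization step for the binary availability classifier; the toy architecture and tensor shapes are placeholders, not the paper's exact network:

```python
import torch
import torch.nn as nn

# Placeholder CNN over 2D "heatmap" inputs; the paper's exact architecture differs.
model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 2),                      # two classes: "available" / "unavailable"
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()          # discrepancy between predicted and true labels

heatmaps = torch.randn(32, 1, 64, 64)      # hypothetical batch of heatmap features
labels = torch.randint(0, 2, (32,))        # 1 = available, 0 = unavailable

optimizer.zero_grad()
loss = criterion(model(heatmaps), labels)
loss.backward()
optimizer.step()
```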
Table 1 shows that the time-domain methods MABD, MVBD, and AVE have relatively low recall. This is because the temporal signal shape is dominated by the respiration signal and is sensitive to noise (e.g., from the environment and body movements); achieving high precision requires “strict” selection and, thus, low recall and loss of available data. The frequency-domain method FSD performs better but is still about \(10\%\) worse than our HM detector. It assumes that spectral peak sharpness (i.e., how condensed the energy is) indicates the availability of both respiration and heartbeat signals; however, this is not always the case. In addition, respiration harmonics and intermodulation can reduce the sharpness even when both respiration and heartbeat signals are available.
Detector   Precision (%)   Recall (%)   F-score (%)
MABD       93.11           47.80        63.17
MVBD       80.08           49.17        60.93
AVE        97.82           50.90        66.59
FSD        89.37           83.39        86.28
HM         96.41           96.29        96.35

Table 1. Precision, Recall, and F-score of Signal Quality Detectors
The HM detector requires more computation. The generation of and inference on the heatmap take \(72.49 \pm 6.99\) ms and \(7.26 \pm 6.84\) ms, respectively, short enough for real-time measurements updated every 1 s.
Range Bin Selection. We observe that the distance reported by the depth sensor may not give the range bin with the best signal quality. To test the depth sensor’s accuracy, we select a set of locations within 4.5 m of the depth camera (Figure 10(a)). We ask the testing volunteer to stand at each location, and we measure the ground truth distance to the head, chest, left arm, and left leg using a laser measurement tool.
Fig. 10.
Fig. 10. Kinect key body joints measurement accuracy.
The CDF curves (Figure 10(b)) show that the depth sensor's errors for the four body parts are within 10 cm at the 80-percentile, about twice the size of a range bin (\(5.14\ cm\)). Thus, we search the 7 adjacent range bins (\(\pm 15\) cm) centered at the bin reported by the depth camera and select the one with the highest signal quality indicator \(\alpha\) (provided by the trained HM detector).
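A minimal sketch of this bin-selection step, where quality_score is a stand-in for the trained HM detector's availability confidence \(\alpha\):

```python
import numpy as np

def select_range_bin(frames: np.ndarray, reported_bin: int, quality_score) -> int:
    """Pick the bin with the highest signal-quality indicator alpha among
    the 7 bins (+-3 bins, i.e., +-15 cm) around the depth-reported one."""
    candidates = range(max(reported_bin - 3, 0),
                       min(reported_bin + 4, frames.shape[1]))
    # frames: (time, range_bins) slow-time signal; quality_score maps one
    # bin's time series to the HM detector's confidence alpha.
    return max(candidates, key=lambda b: quality_score(frames[:, b]))
```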
Ablation Study. To study the effectiveness of the signal quality detector on the end-to-end system, we compare the vital sign estimation performance under progressive ablation of range bin selection (Sel.) and availability classification (Clf.) against an impractical “Oracle” that always knows whether the signal is available, which range bin is best, and which estimator (among the four used in PWF) is best at each moment.
Figure 11 shows obvious performance degradation each time bin selection or classification is removed. With both of them, VitalHub achieves end-to-end respiratory and heart rate estimation with 1.5/3.2 bpm errors at the 80-percentile, very close to the 1.2/1.5 bpm errors of the idealistic oracle. This shows the necessity of the detector, which enables VitalHub to approach the oracle's “ceiling.”
Fig. 11.
Fig. 11. Boosted performance with signal quality detector.

7.2 Context Annotation

We first compare the accuracy of different features and machine learning models (Section 7.2.1), then the effectiveness of identity tracking under occlusions (Section 7.2.2), and, finally, the accuracy of activity recognition (Section 7.2.3).

7.2.1 User Identification.

Considering that VitalHub targets in-home deployment, we invite 8 volunteers (used as the maximum number of family members) to collect walking skeleton data for evaluation of user identification. Each contributes 5 min of data, i.e., 18,000 frames per person at 60 fps. To generate the training data, we use a time window of 2 s with a step of 0.1 s; thus, the total training data size is around \(\frac{5\times 60\,s}{0.1\,s}\times 8=24{,}000\) samples. Another 2 min of data from each volunteer are collected separately as testing data.
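A minimal sketch of the sliding-window slicing used to generate samples, assuming skeleton sequences at 60 fps (feature extraction from each window is omitted):

```python
import numpy as np

FPS = 60
WIN_S, STEP_S = 2.0, 0.1                   # 2-s window, 0.1-s step
win, step = int(WIN_S * FPS), int(STEP_S * FPS)

def sliding_windows(frames: np.ndarray) -> np.ndarray:
    """frames: (n_frames, n_joints, 3) skeleton sequence -> stacked windows."""
    starts = range(0, len(frames) - win + 1, step)
    return np.stack([frames[s:s + win] for s in starts])

# 5 min per volunteer -> (5*60 - 2)/0.1 + 1 = 2,981 windows each,
# i.e., roughly 3,000 per person and ~24,000 across the 8 volunteers.
```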
Precision, Recall, F-score. We adopt the same metrics as in Section 7.1 to evaluate the identification model. Table 2 shows the results of different feature combinations (described in Section 5) under our LSTM model. The combination \(\lbrace \mathbb {V},\mathbb {L}\rbrace\) outperforms all others; thus, it is selected as the final feature set.
Features                                               Precision (%)   Recall (%)   F-score (%)
\(\lbrace \mathbb {V}\rbrace\)                         85.60           85.12        85.01
\(\lbrace \mathbb {V},\mathbb {A}\rbrace\)             63.24           57.80        56.19
\(\lbrace \mathbb {V},\mathbb {L}\rbrace\)             86.98           85.49        85.37
\(\lbrace \mathbb {V},\mathbb {L},\mathbb {A}\rbrace\) 80.47           77.97        78.14

Table 2. Precision, Recall, and F-score of Different Features
Different Classifier Models. We compare the performance of different classifiers on the same test dataset using the \(\lbrace \mathbb {V},\mathbb {L}\rbrace\) features. Linear Discriminant Analysis (LDA), K-Nearest Neighbors (KNN), Decision Tree (DT), Naïve Bayes (NB), and Support Vector Machine (SVM) serve as baseline classifiers for comparison with our LSTM-based model. All models are carefully examined and configured with optimal settings for their respective parameters of interest: LDA uses a threshold of 0.0002 for its Singular Value Decomposition solver; KNN uses the Minkowski distance metric with 5 neighbors; DT uses Gini impurity as the split criterion; NB assumes Gaussian distributions; SVM uses a regularization parameter C of 1.0 and a stopping tolerance of 0.001; and the LSTM uses 2 stacked layers. Figure 12 shows the accuracy of each model. Our LSTM-based model significantly outperforms the baselines, with a median accuracy of around 90%. The only outlier in the LSTM result is caused by excessive movements during data collection, which produce large variations in the raw data.
Fig. 12.
Fig. 12. Different classifiers.
Different Number of Trials. To avoid “pollution” from false identifications, the model makes a decision only when N successive trials produce the same result, at the cost of lower recall. Figure 13 shows the precision, recall, and F-score for different numbers of trials. We choose 3 trials, as the precision is then close to 100% and the recall remains high enough to generate sufficient data for longitudinal monitoring. In this setting, 3 LSTM prediction cycles take about 6 s to recognize a user.
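A minimal sketch of this decision rule, where predictions is a stream of per-window identity outputs from the LSTM (names are hypothetical):

```python
from collections import deque

N_TRIALS = 3   # identity is committed only after 3 consecutive agreeing predictions

def confirm_identity(predictions):
    """Yield a confirmed identity only when N_TRIALS successive per-window
    predictions agree; otherwise yield None (withhold a decision)."""
    recent = deque(maxlen=N_TRIALS)
    for pred in predictions:
        recent.append(pred)
        if len(recent) == N_TRIALS and len(set(recent)) == 1:
            yield pred          # high-precision decision
        else:
            yield None          # trade recall for precision
```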
Fig. 13.
Fig. 13. Performance with different numbers of trials.

7.2.2 Identity Tracking and Maintenance.

To evaluate the robustness of our probabilistic identity tracking, we asked volunteers to simulate the occlusion cases discussed in Section 5.2. Results show that our algorithm resumes identities instantly when users reappear after temporary occlusions (\(\sim\)2 s). Figure 14 shows one example in which two participants walk across the FoV of the depth camera; their respective traces are tracked correctly and continuously even when they occlude each other. LSTM reidentification after long occlusions requires multiple cycles and thus more time (e.g., 3 cycles take \(\sim\)6 s), and it succeeds in all cases that we test. These results show that our identity tracking algorithm is effective and robust.
Fig. 14.
Fig. 14. Traces of identities being tracked in the presence of occlusions.

7.2.3 Activity Context Recognition.

We collect 32,828 training and 19,478 testing samples, in the same format as in user identification, for the following activities: eating (ET), lying (LY), running (RN), sitting (ST), standing (SD), and walking (WK). Each participant is asked to perform each category of activity repeatedly during data collection. The results are shown in Figure 15. Overall, the model achieves \(96.27\%\) precision, \(96.09\%\) recall, and \(96.10\%\) F1-score. A small portion of running samples are falsely recognized as standing or walking, owing to the transitions in between. This performance is sufficient for longitudinal in-home monitoring.
Fig. 15.
Fig. 15. Activity recognition confusion matrix.

8 End-to-end Vital Sign Monitoring

In this section, we compare different methods in estimating vital signs and dealing with non-linearity issues (e.g., harmonics, intermodulation, and dynamic signal patterns, as described in Section 2.4). We then study the impact of user and environment factors on end-to-end vital sign monitoring performance.

8.1 Estimators and Non-linearity Study

8.1.1 Individual Estimators.

We evaluate the effectiveness of all 4 estimators for heart rate estimation (Figure 16). We define a “working ratio” metric as the fraction of time in which an estimator has \(\lt 5\) bpm error, an acceptable error range for long-term monitoring. The SOH estimator has the highest working ratio because the SOH of the heart rate is spectrally free of the high-order harmonics of respiration. The temporal methods ZC and PK have relatively low working ratios due to their sensitivity to noise and interference; however, they produce more gradual changes in output than the spectral methods LMD and SOH, helping avoid large jumps between spectral peaks and enabling smooth tracking. VitalHub combines all of them and achieves an over \(98\%\) working ratio, demonstrating the effectiveness of the PWF in combining less reliable estimators into a more robust estimate.
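The working-ratio metric itself is simple to compute; a minimal sketch, assuming time-aligned estimates and ground truth:

```python
import numpy as np

def working_ratio(estimates: np.ndarray, ground_truth: np.ndarray,
                  tol_bpm: float = 5.0) -> float:
    """Fraction of time windows in which the estimator stays within
    tol_bpm of ground truth (the acceptable long-term error range)."""
    return float(np.mean(np.abs(estimates - ground_truth) < tol_bpm))
```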
Fig. 16.
Fig. 16. Performance of individual estimators.

8.1.2 Robustness against Harmonics and Intermodulation.

We identify two representative methods dealing with harmonics and intermodulation, SHAPA [53] and HMLD [93], and compare against them. Both leverage the frequency relations between harmonics: SHAPA tries to find three spectral peaks (a “harmonic path”) with a \(1:2:3\) frequency ratio and magnitudes larger than a preset threshold, whereas HMLD tries to find a pair of stable spectral peaks with a \(1:2\) frequency ratio. Figure 18 shows that whereas VitalHub reaches a \(98\%\) working ratio, SHAPA and HMLD deliver only \(51.2\%\) and \(76.3\%\), respectively. Both methods rely on presumed signal patterns, which may not always hold in reality.
SHAPA is very sensitive to the signal-to-noise ratio (SNR). The preset threshold is supposed to filter out most noise peaks while keeping those from the fundamental and harmonics of the heartbeat. However, when the SNR is low, even with a well-tuned threshold (one just below all harmonic peaks), noise can easily cause incorrect estimation. Figure 17(a) shows the threshold set at the minimum magnitude of all harmonic peaks; many noise peaks still exist above it, and some present a better \(1:2:3\) relation than the real heart rate and its harmonics. Thus, the algorithm chooses an incorrect harmonic path (62, 124, 186) instead of the true one (56, 110, 164). HMLD has similar problems: Figure 17(b) shows a wrong estimate (94 bpm), which is the intermodulation of the heartbeat (56 bpm) and respiration (19 bpm). This shows that designs relying on presumed signal patterns are not robust enough.
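To illustrate why such presumed patterns are fragile, here is a minimal sketch of a SHAPA-style \(1:2:3\) harmonic-path search over thresholded spectral peaks (a simplification of the published algorithm, with a hypothetical frequency tolerance). With low SNR, noise peaks above the threshold can form a better-aligned triple than the true heartbeat harmonics, which is exactly the failure in Figure 17(a):

```python
import numpy as np
from scipy.signal import find_peaks

def harmonic_path(freqs_bpm, spectrum, threshold, tol_bpm=3.0):
    """Return the fundamental (bpm) of the strongest peak triple with a
    ~1:2:3 frequency ratio among peaks above `threshold`, or None."""
    idx, _ = find_peaks(spectrum, height=threshold)
    if len(idx) < 3:
        return None
    peaks, heights = np.asarray(freqs_bpm)[idx], np.asarray(spectrum)[idx]
    best, best_height = None, -np.inf
    for f, h in zip(peaks, heights):
        # Require peaks near 2f and 3f within the tolerance.
        if (np.abs(peaks - 2 * f).min() < tol_bpm
                and np.abs(peaks - 3 * f).min() < tol_bpm
                and h > best_height):
            best, best_height = f, h
    return best
```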
Fig. 17.
Fig. 17. Typical spectrum in which SHAPA and HMLD both fail.
Fig. 18.
Fig. 18. VitalHub, SHAPA, and HMLD on HR estimation.

8.1.3 Dealing with Unpredictable and Dynamic Signal Patterns.

We implement WiBreathe [61] for comparison, as it is the most related work that identifies and addresses unpredictable and dynamic vital signal patterns. We caution that WiBreathe was designed for respiration only; thus, the comparison serves not to criticize but rather to shed light on how applicable its techniques are for heart rate. WiBreathe adaptively combines several estimators' outputs under the assumption that the majority of them produce correct estimations. For fairness, we compare only the strategies for combining estimator outputs, while all other components, such as preprocessing pipelines, are the same. Figure 19 shows that the working ratio of WiBreathe reaches at most \(81.3\%\), much lower than VitalHub's \(98\%\). We find that the majority of the HR candidates from the estimators can be incorrect, causing WiBreathe to fail to make the correct estimation. Our PWF strategy uses cumulative evidence; thus, it can still select the correct candidate even if it is not in the majority, as long as it possesses stronger evidence, dealing with dynamic signal patterns more effectively.
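A minimal sketch contrasting the two combination strategies; the evidence values are placeholders for PWF's quantified pattern evidence:

```python
import numpy as np

def wibreathe_style_pick(candidates, prev_estimate):
    """Select the candidate closest to the previous estimate, assuming the
    majority of estimators are roughly correct. This fails when most
    candidates are wrong (e.g., locked onto harmonics or intermodulation)."""
    return min(candidates, key=lambda c: abs(c - prev_estimate))

def pwf_style_pick(candidates, evidence):
    """Weight each estimator's candidate by the cumulative evidence for its
    suitable signal pattern; a minority candidate with strong evidence
    can still dominate the weighted combination."""
    w = np.asarray(evidence, dtype=float)
    w /= w.sum()
    return float(np.dot(w, candidates))
```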
Fig. 19.
Fig. 19. VitalHub and WiBreathe on HR estimation.

8.2 User and Environment Factors

8.2.1 Impact of Distances.

We vary the distance from 1.5 to 4.5 m in steps of 1 m while keeping the subject's orientation at 0 degrees (facing frontally). The results are shown in Figure 20. Respiratory and heart rate estimations remain very stable even at 4.5 m, the maximum range of the depth sensor.
Fig. 20.
Fig. 20. Vital sign estimation under different distances.

8.2.2 Impact of Orientations.

We vary the orientation of the subject among 0, 45, and 90 degrees while keeping a 2.5-m distance (shown in Figure 21). Interestingly, we observe that HR accuracy is not much affected by orientation. However, the RR error at 90 degrees more than triples. This is because the breathing chest movement along the mediolateral axis (i.e., from the side) is only around 0.6 to 1.1 mm [77], much smaller than that from the front; it is therefore more susceptible to errors.
Fig. 21.
Fig. 21. Vital sign estimation under different orientations.

8.2.3 Impact of Ambient RF Sources.

To evaluate the impact of ambient RF sources, we compare the performance in two settings: (1) low Wi-Fi traffic, in which Wi-Fi signals come from nearby buildings but no Wi-Fi device runs indoors; and (2) intense Wi-Fi traffic, in which 3 Raspberry Pis, 4 laptops, 4 smartphones, and 2 Wi-Fi routers keep streaming data indoors. Figure 22 shows a negligible decrease in respiration and heart rate measurement accuracy. This is because UWB spreads its energy over a wide bandwidth; narrow-band Wi-Fi signals at 2.4/5 GHz thus do not present severe interference.
Fig. 22.
Fig. 22. Impact of different ambient RF environments.

8.2.4 Multi-user Minimum Resolvable Distance.

The minimum resolvable distance, that is, how close adjacent subjects can be without interfering with each other's measurements, is a critical factor for co-habiting scenarios. We invite 3 volunteers, initially separated by 1 m, and gradually decrease the distance between them in 10-cm intervals until any two of them appear identical in measurement, meaning that they are too close for the system to differentiate. We find that all 3 volunteers can be reliably monitored even when separated by only 20 cm, with performance comparable to the single-user setting.

9 Discussion

Extensible Framework. In our vital sign monitoring module, we combine several estimators’ output and leverage prior knowledge about vital signs to produce a weighted sum estimation to deal with the challenges from harmonics and intermodulation. Our framework can easily accommodate more estimation methods and other prior knowledge to improve performance as advances are made in these fields. Evidence for signal patterns suitable for such methods will be quantified to update respective weights.
Body Motions. VitalHub measures vital signs only in quasi-stationary settings (e.g., talking, typing, watching TV), where the displacement of the chest wall due to physiological movements (i.e., heartbeat and respiration) is the main source of distance and, thus, phase changes. In non-stationary settings, large body movements (e.g., swaying of the body) can overwhelm physiological movements, causing severe disturbances to phase changes and making it impossible to directly extract vital signs. We adopt a signal quality detector to flag the periods when the signal is “unavailable” and exclude corrupted signals that would degrade the accuracy of vital sign measurements. In longitudinal monitoring, even after discarding corrupted data, the continuous, long-term measurements still leave a sufficient amount of data to detect health status changes.
Recently, regular RGB cameras or a pair of radio sensors before and behind the body have been used to measure and cancel random body movements [27, 44]. In our surveys, people have expressed serious concerns regarding regular cameras due to privacy issues. Using a pair of radio sensors may constrain the space where users can stay, but the general approach of combining signals from multiple sensors seems promising when dealing with large movements. We plan to investigate how to allow unconstrained movements in-home while dealing with large motions.
Trade-offs between Precision and Recall. We observe that for small fractions of time (less than 4%), the signal quality detector may fail. When this happens, none of the estimators can produce a correct heart rate estimate. If this continues long enough, the PWF may fail to smooth out this erroneous output and converge to provide an incorrect heart rate estimate. This problem can be alleviated by combining signals in consecutive time windows but at the cost of reduced recall of the data. We will study this in the future to see how to achieve a proper balance.
Sensing Range and Orientation. The effective range of the current prototype is limited by the UWB and depth sensors we use. The depth sensor has an LOS range of 4.5 m, whereas the UWB sensor has a range of nearly 10 m (with some penetration capability); the overall range is limited by the lesser of the two. Nevertheless, it is still sufficient for room-sized in-home monitoring with the system mounted on the wall, covering most of the area in LOS. Our identity tracking algorithm can also help mitigate blockage due to cross-walking. In future work, we will explore depth sensors with longer ranges and radio-only solutions for context annotation.
The current prototype also produces higher errors in respiration rate when the orientation of the subject is near 90 degrees. This is mainly because the chest movement in the mediolateral dimension (0.6–1.1 mm) is comparable to the displacement attributed to the heartbeat [77]. We will study multi-sensor deployment so that the 90-degree orientation of the subject can be avoided by at least one sensor.
Azimuth and Elevation Separation. The current prototype can only separate signals at different distances; two people at the same distance but different orientations are not separated. This is because the UWB sensor we use has only a single transmitter and receiver and therefore cannot differentiate azimuth and elevation. Proper placement of the sensor kit is needed to reduce the chances of failure when subjects are at similar distances. Some research [91] has studied how to extract vital signs from entangled signals with blind signal separation [17] (e.g., Independent Component Analysis [16]). We will explore the robustness of such approaches, as well as MIMO and multi-sensor configurations, for azimuth and elevation separation.
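As a pointer to what such blind separation looks like in practice, a minimal sketch applies FastICA from scikit-learn to two hypothetical entangled slow-time signals; whether this is robust on real UWB returns is precisely what we plan to explore:

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.arange(0, 30, 0.1)                      # 30 s of slow time at 10 fps
s1 = np.sin(2 * np.pi * (70 / 60) * t)         # subject 1: 70-bpm heartbeat component
s2 = np.sin(2 * np.pi * (15 / 60) * t)         # subject 2: 15-bpm respiration component
mixed = np.c_[s1 + 0.6 * s2, 0.5 * s1 + s2]    # two entangled observations

sources = FastICA(n_components=2, random_state=0).fit_transform(mixed)
# `sources` recovers the two components up to scale and permutation.
```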
More Subjects and Real Patients. The diversity of our 8-subject pool is somewhat limited. We deployed VitalHub in our university hospital for a pilot test; however, due to the COVID-19 pandemic, we were unable to obtain sufficient data. We will resume the study with larger subject pools and real patients once the pandemic eases sufficiently.

10 Related Work

In this section, we describe the main work in vital sign monitoring and context information recognition, and how VitalHub compares.
Vital Sign Monitoring. Traditional medical equipment, such as electrocardiography (ECG) and echocardiography [26], is accurate but too expensive and difficult to operate in the home. VitalMon [35] uses geophones to sense bed vibrations from the ballistic forces of sleeping subjects. Wearables using photoplethysmogram (PPG) sensors (e.g., pulse oximeters [2], watches [1], and wrist bands [23, 33]) require continuous skin contact and regular recharging. Some inertial sensor-based methods [6, 49] place smartphones on the chest. Such touch-based sensing restrains users' free roaming [32] and presents cognitive and physical challenges for older adults living alone; thus, it is not suitable for longitudinal passive in-home monitoring.
Researchers have proposed four types of methods for contactless ubiquitous vital sign monitoring: remote PPG [32, 62], acoustic [51, 60, 75, 87], WiFi [45, 46, 58, 79], and other RF-based methods [15, 52, 72, 89]. Remote PPG measures the heart rate from an RGB video of the user's face [32]; it has privacy issues and may be impacted by skin color, make-up, and lighting. Both passive [75, 87] and active [51] acoustic methods leveraging smart speakers or smartphones have been studied for respiration rate monitoring. Acousticcardiography (ACG) [60] uses an FMCW sonar frontend on smartphones to monitor both heart and respiratory rates; it targets near-field monitoring at a maximum distance of 30 cm.
WiFi-based methods leverage channel state information (CSI) [45, 46, 79] or received signal strength (RSS) [3, 31, 58]. RF methods have exploited techniques including mmWave [15, 89], Doppler radar [52, 72], FMCW radar [5, 82], and UWB radar [43, 71, 84, 88]. Some of these methods monitor only respiration [3, 31, 52, 58, 72]. We observe that respiration harmonics and intermodulation disrupt spectral-based heart rate estimation severely, yet we do not see this described or treated sufficiently in the related work mentioned here. We use a UWB sensor, which has extremely short pulses and wide bandwidth, making it inherently more robust to interference and multipath [40].
The challenge of harmonics and intermodulation is analyzed in [39]. Work in the electrical engineering community [21, 36, 53, 64, 93] has proposed methods based on certain assumptions about the signal's temporal and spectral patterns (e.g., magnitudes between fundamental and harmonic components, gradual changes in the heart rate). We observe that such patterns are far from stable; thus, these methods (e.g., finding pairs of stable spectral peaks with a \(1:2\) frequency ratio) often fail. WiBreathe [61] adaptively selects the output from multiple respiration estimators closest to the previous estimate. It assumes that at least one estimator gives a good estimation, which we find does not hold for heartbeats due to their much weaker spectral energy; heartbeats are easily dominated by respiration harmonics and intermodulation. We combine multiple estimators by quantifying the respective evidence of their suitable patterns in a probabilistic framework to enable robust heart rate tracking.
Most radio sensing–based work targets quasi-stationary settings, whereas non-stationary settings remain an open issue [83, 86]. To detect large motions, many methods use a fixed threshold for phase change or spectrum sharpness [5, 29, 45, 51, 87, 90]. Motion-related sensors [57] and RGBD cameras [54] are used as well. These additional sensors add to system cost and complexity, and fixed thresholds cannot handle complex signal dynamics. Our heatmap feature incorporates the full spectral characteristics of the vital signs and uses a deep neural network to achieve near-human performance.
There are also efforts to directly measure and cancel motion disruptions using accelerometers [9, 59], or regular RGB cameras and radio sensor pairs placed before/behind the body [27, 44]. These require wearable sensors or raise privacy concerns and constrain the user's free roaming; thus, they are not preferred for long-term in-home monitoring. We plan to explore multi-sensor collaboration to keep measurements robust under motion while retaining passive, low-cost sensing.
Context Information Recognition. Non-touch body pose and activity recognition has been an active research area. Much work leverages visual data, including RGB images, videos, and depth [12, 47]. OpenPose [12] tracks body poses with a single RGB image or reconstructs 3D skeletal poses with multiple cameras. FaceNet [66] extracts facial features [97, 98] from RGB images for identification. Caesar [47] detects complex activities with multiple non-overlapping cameras. Despite the maturity of the technology, many people are uncomfortable living with cameras in their homes due to privacy concerns.
Inertial sensors have been used for tracking activities, including dancing, smoking, and exercise [13, 24, 56]. Visible light has been used for skeleton reconstruction [41] with photodiodes or LED panels on the ceiling or floor, which may not be convenient in the home.
WiFi [73, 74, 78, 80, 81, 92, 96] and other RF signals [4, 14, 42, 94] have been studied for human activity and motion sensing. E-eyes [80] and mD-Track [81] use WiFi signals to recognize and track indoor human activities. CrossSense [92] applies transfer learning to effectively reuse learned knowledge across different sites. CARM [78] correlates WiFi CSI dynamics with human activities for recognition. Widar3.0 [96] estimates velocity profiles of gestures for cross-domain gesture recognition. WiMU [73] recognizes the gestures of multiple users simultaneously, and WiAG [74] recognizes gestures irrespective of the user's position or orientation. Among RF-based methods, EAR [14] uses ambient RF signals for human activity sensing. Li et al. [42] feed RFID data into a convolutional neural network for activity recognition. RF-Pose3D [94] and RF-Capture [4] track the 3D positions of a person's skeleton even under full occlusion from the RF sensor.
We choose a depth camera for the current prototype because it provides mature and robust skeletal pose tracking [67] and is easily accepted by users due to its lack of fine-grained visual features. It offers distance reporting, skeleton tracking, user identification, and activity recognition simultaneously. RF-Pose3D [94] and RF-Capture [4] achieve such goals using customized FMCW radios with antenna arrays. We plan to study how to extend our UWB-based system for robust and low-cost context recognition.

11 Conclusion

We present VitalHub, a robust, non-touch, passive sensing system for longitudinal in-home vital sign monitoring that leverages UWB and depth sensors. We describe how respiration harmonics and intermodulation cause strong disturbances to robust heart rate monitoring. We propose a probabilistic weighted framework that adaptively combines an ensemble of estimators based on the quantified cumulative evidence of their suitable temporal and spectral signal patterns. In addition, we introduce an LSTM-based neural network and a probabilistic tracking model to provide reliable context annotation (i.e., identification and activity recognition) using privacy-preserving skeleton features. Extensive experiments show that VitalHub achieves \(1.5/3.2\) breaths/beats per minute (“bpm”) errors at the 80-percentile for RR/HR, approaching the \(1.2/1.5\) bpm error “ceiling” of an idealistic but impractical oracle. We also share insights on why existing methods do not handle harmonics and intermodulation well. In addition, our context annotation module achieves 90% median accuracy in differentiating 8 subjects (the number of cohabiting persons in one home is usually fewer than 8) based on skeletal walking patterns, and above 96% precision in classifying among 6 common daily activities. With automatic context annotation, the vital sign records can be valuable for future customized analytics (e.g., detecting anomalous changes in vital signs relative to a user's normal distribution over daily routines/activities). We believe that VitalHub offers a suitable solution for longitudinal in-home vital sign monitoring.

Footnotes

1. The respiration rate is more accurate due to the stronger energy.
2. To produce the range profile, the UWB-based system needs only a down-conversion mixer. In contrast, the FMCW-based system needs a 1D FFT in addition to the down-conversion mixer, because its time of flight (TOF) has to be linearly translated from the frequency shift.
3. The \(5\ cm\) size is decided based on the amplitude of motion, the penetration effects of signals, and errors in distance measurement.
4. Horizontal lines near 10 bpm on RR exist because the true 20 bpm respiration peak could be interpreted as a second-order harmonic. Still, such incorrect lines have weaker supporting evidence, thus smaller values and fainter colors.
5. Cases with more concurrent user occlusions are handled similarly; thus, we do not discuss them here.

References

[1]
[n. d.]. Apple Watch. Retrieved August 6, 2022 from https://www.apple.com/watch/.
[2]
[n. d.]. Masimo—MightySat Rx. Retrieved August 6, 2022 from https://www.masimo.com/products/monitors/spot-check/mightysatrx/.
[3]
Heba Abdelnasser, Khaled A. Harras, and Moustafa Youssef. 2015. UbiBreathe: A ubiquitous non-invasive WiFi-based breathing estimator. In Proceedings of the 16th ACM International Symposium on Mobile Ad Hoc Networking and Computing. 277–286.
[4]
Fadel Adib, Chen-Yu Hsu, Hongzi Mao, Dina Katabi, and Frédo Durand. 2015. Capturing the human figure through a wall. ACM Transactions on Graphics 34, 6 (2015), 1–13.
[5]
Fadel Adib, Hongzi Mao, Zachary Kabelac, Dina Katabi, and Robert C. Miller. 2015. Smart homes that monitor breathing and heart rate. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems. ACM, 837–846.
[6]
Heba Aly and Moustafa Youssef. 2016. Zephyr: Ubiquitous accurate multi-sensor fusion-based respiratory rate estimation using smartphones. In IEEE INFOCOM 2016—The 35th Annual IEEE International Conference on Computer Communications. IEEE, 1–9.
[7]
Nikolaj Andersen, Kristian Granhaug, Jørgen Andreas Michaelsen, Sumit Bagga, Håkon A. Hjortland, Mats Risopatron Knutsen, Tor Sverre Lande, and Dag T. Wisland. 2017. A 118-mW pulse-based radar SoC in 55-nm CMOS for non-contact human vital signs detection. IEEE Journal of Solid-State Circuits 52, 12 (2017), 3421–3433.
[8]
Etienne Antide, Mykhailo Zarudniev, Olivier Michel, and Michael Pelissier. 2020. Comparative study of radar architectures for human vital signs measurement. In 2020 IEEE Radar Conference (RadarConf’20). IEEE, 1–6.
[9]
H. Harry Asada, Hong-Hui Jiang, and Peter Gibbs. 2004. Active noise cancellation using MEMS accelerometers for motion-tolerant wearable bio-sensors. In 26th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Vol. 1. IEEE, 2157–2160.
[10]
Rana Azzam, Yusra Alkendi, Tarek Taha, Shoudong Huang, and Yahya Zweiri. 2020. A stacked LSTM-based approach for reducing semantic pose estimation error. IEEE Transactions on Instrumentation and Measurement 70 (2020), 1–14.
[11]
Lynn Bickley and Peter G. Szilagyi. 2012. Bates’ Guide to Physical Examination and History-taking. Lippincott Williams & Wilkins.
[12]
Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. 2017. Realtime multi-person 2D pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7291–7299.
[13]
Keng-Hao Chang, Mike Y. Chen, and John Canny. 2007. Tracking free-weight exercises. In International Conference on Ubiquitous Computing. Springer, 19–37.
[14]
Zicheng Chi, Yao Yao, Tiantian Xie, Xin Liu, Zhichuan Huang, Wei Wang, and Ting Zhu. 2018. EAR: Exploiting uncontrollable ambient RF signals in heterogeneous networks for gesture recognition. In Proceedings of the 16th ACM Conference on Embedded Networked Sensor Systems. 237–249.
[15]
Huey-Ru Chuang, Hsin-Chih Kuo, Fu-Ling Lin, Tzuen-Hsi Huang, Chi-Shin Kuo, and Ya-Wen Ou. 2012. 60-GHz millimeter-wave life detection system (MLDS) for noncontact human vital-signal monitoring. IEEE Sensors Journal 12, 3 (2012), 602–609.
[16]
Pierre Comon. 1994. Independent component analysis, a new concept? Signal Processing 36, 3 (1994), 287–314.
[17]
Pierre Comon and Christian Jutten. 2010. Handbook of Blind Source Separation: Independent Component Analysis and Applications. Academic Press.
[18]
Pieter-Tjerk De Boer, Dirk P. Kroese, Shie Mannor, and Reuven Y. Rubinstein. 2005. A tutorial on the cross-entropy method. Annals of Operations Research 134, 1 (2005), 19–67.
[19]
Anne De Groote, Muriel Wantier, Guy Chéron, Marc Estenne, and Manuel Paiva. 1997. Chest wall motion during tidal breathing. Journal of Applied Physiology 83, 5 (1997), 1531–1537.
[20]
Yong Du, Wei Wang, and Liang Wang. 2016. Hierarchical recurrent neural network for skeleton based action recognition. In Proceedings of the 22nd Annual International Conference on Mobile Computing and Networking. ACM, 95–108.
[21]
Raghed El-Bardan, Dhaval Malaviya, and Albert Di Rienzo. 2017. On the estimation of respiration and heart rates via an IR-UWB radar: An algorithmic perspective. In 2017 IEEE International Conference on Microwaves, Antennas, Communications and Electronic Systems (COMCAS). IEEE, 1–5.
[22]
Ergun Ercelebi. 2004. Electrocardiogram signals de-noising using lifting-based discrete wavelet transform. Computers in Biology and Medicine 34, 6 (2004), 479–493.
[23]
Biyi Fang, Nicholas D. Lane, Mi Zhang, Aidan Boran, and Fahim Kawsar. 2016. BodyScan: Enabling radio-based sensing on wearable devices for contactless activity and vital sign monitoring. In Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services. ACM, 97–110.
[24]
Abu Zaher Md Faridee, Sreenivasan Ramasamy Ramamurthy, H. M. Hossain, and Nirmalya Roy. 2018. HappyFeet: Recognizing and assessing dance on the floor. In Proceedings of the 19th International Workshop on Mobile Computing Systems & Applications. ACM, 49–54.
[25]
Asghar Ghasemi and Saleh Zahediasl. 2012. Normality tests for statistical analysis: A guide for non-statisticians. International Journal of Endocrinology and Metabolism 10, 2 (2012), 486.
[26]
Nira A. Goldstein, Nancy Sculerati, Joyce A. Walsleben, Nasima Bhatia, Deborah M. Friedman, and David M. Rapoport. 1994. Clinical diagnosis of pediatric obstructive sleep apnea validated by polysomnography. Otolaryngology—Head and Neck Surgery 111, 5 (1994), 611–617.
[27]
Changzhan Gu, Guochao Wang, Yiran Li, Takao Inoue, and Changzhi Li. 2013. A hybrid radar-camera sensing system with phase compensation for random body movement cancellation in Doppler vital sign detection. IEEE Transactions on Microwave Theory and Techniques 61, 12 (2013), 4678–4688.
[28]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[29]
Peter Hillyard, Anh Luong, Alemayehu Solomon Abrar, Neal Patwari, Krishna Sundar, Robert Farney, Jason Burch, Christina Porucznik, and Sarah Pollard. 2018. Experience: Cross-technology radio respiratory monitoring performance study. In Proceedings of the 24th Annual International Conference on Mobile Computing and Networking. ACM, 487–496.
[30]
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[31]
Roland Hostettler, Ossi Kaltiokallio, Hüseyin Yiğitler, Simo Särkkä, and Riku Jäntti. 2017. RSS-based respiratory rate monitoring using periodic Gaussian processes and Kalman filtering. In 25th European Signal Processing Conference (EUSIPCO’17). IEEE, 256–260.
[32]
Sinh Huynh, Rajesh Krishna Balan, JeongGil Ko, and Youngki Lee. 2019. VitaMon: Measuring heart rate variability using smartphone front camera. In Proceedings of the 17th Conference on Embedded Networked Sensor Systems. 1–14.
[33]
Fitbit Inc. [n. d.]. Fitbit Charge 3 Advanced Fitness Tracker. Retrieved August 6, 2022 from https://www.fitbit.com/shop/charge3.
[34]
Sonal K. Jagtap and M. D. Uplane. 2012. The impact of digital filtering to ECG analysis: Butterworth filter application. In 2012 International Conference on Communication, Information & Computing Technology (ICCICT’12). IEEE, 1–6.
[35]
Zhenhua Jia, Amelie Bonde, Sugang Li, Chenren Xu, Jingxian Wang, Yanyong Zhang, Richard E. Howard, and Pei Zhang. 2017. Monitoring a person’s heart rate and respiratory rate on a shared bed using geophones. In Proceedings of the 15th ACM Conference on Embedded Network Sensor Systems. 1–14.
[36]
Faheem Khan and Sung Ho Cho. 2017. A detailed algorithm for vital sign monitoring of a stationary/non-stationary human through IR-UWB radar. Sensors 17, 2 (2017), 290.
[37]
Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[38]
T. Kondo, T. Uhlig, P. Pemberton, and P. D. Sly. 1997. Laser monitoring of chest wall displacement. European Respiratory Journal 10, 8 (1997), 1865–1869.
[39]
Antonio Lazaro, David Girbau, and Ramon Villarino. 2010. Analysis of vital signs monitoring using an IR-UWB radar. Progress in Electromagnetics Research 100 (2010), 265–284.
[40]
Changzhi Li and Jenshan Lin. 2010. Recent advances in Doppler radar sensors for pervasive healthcare monitoring. In 2010 Asia-Pacific Microwave Conference. IEEE, 283–290.
[41]
Tianxing Li, Qiang Liu, and Xia Zhou. 2016. Practical human sensing in the light. In Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services. 71–84.
[42]
Xinyu Li, Yanyi Zhang, Ivan Marsic, Aleksandra Sarcevic, and Randall S. Burd. 2016. Deep learning for RFID-based activity recognition. In Proceedings of the 14th ACM Conference on Embedded Network Sensor Systems CD-ROM. 164–175.
[43]
Xiaolin Liang, Hao Zhang, Shengbo Ye, Guangyou Fang, and T. Aaron Gulliver. 2018. Improved denoising method for through-wall vital sign detection using UWB impulse radar. Digital Signal Processing 74 (2018), 72–93.
[44]
Feng Lin, Chen Song, Yan Zhuang, Wenyao Xu, Changzhi Li, and Kui Ren. 2017. Cardiac scan: A non-contact and continuous heart-based user authentication system. In Proceedings of the 23rd Annual International Conference on Mobile Computing and Networking. 315–328.
[45]
Jian Liu, Yan Wang, Yingying Chen, Jie Yang, Xu Chen, and Jerry Cheng. 2015. Tracking vital signs during sleep leveraging off-the-shelf WiFi. In Proceedings of the 16th ACM International Symposium on Mobile Ad Hoc Networking and Computing. ACM, 267–276.
[46]
Xuefeng Liu, Jiannong Cao, Shaojie Tang, Jiaqi Wen, and Peng Guo. 2015. Contactless respiration monitoring via off-the-shelf WiFi devices. IEEE Transactions on Mobile Computing 15, 10 (2015), 2466–2479.
[47]
Xiaochen Liu, Pradipta Ghosh, Oytun Ulutan, B. S. Manjunath, Kevin Chan, and Ramesh Govindan. 2019. Caesar: Cross-camera complex activity recognition. In Proceedings of the 17th Conference on Embedded Networked Sensor Systems. 232–244.
[48]
Novelda LLC. [n. d.]. X4M03 Radar Development Kit. Retrieved August 6, 2022 from https://www.xethru.com/xethru-development-platform.html.
[49]
Reham Mohamed and Moustafa Youssef. 2017. Heartsense: Ubiquitous accurate multi-modal fusion-based heart rate estimation using smartphones. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 1, 3 (2017), 1–18.
[50]
Sabikun Nahar, Tuan Phan, Farhan Quaiyum, Lingyun Ren, Aly E. Fathy, and Ozlem Kilic. 2018. An electromagnetic model of human vital signs detection and its experimental validation. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 8, 2 (2018), 338–349.
[51]
Rajalakshmi Nandakumar, Shyamnath Gollakota, and Nathaniel Watson. 2015. Contactless sleep apnea detection on smartphones. In Proceedings of the 13th Annual International Conference on Mobile Systems, Applications, and Services. ACM, 45–57.
[52]
Phuc Nguyen, Xinyu Zhang, Ann Halbower, and Tam Vu. 2016. Continuous and fine-grained breathing volume monitoring from afar using wireless signals. In 35th Annual IEEE International Conference on Computer Communications (INFOCOM’16). IEEE, 1–9.
[53]
Van Nguyen, Abdul Q. Javaid, and Mary Ann Weitnauer. 2014. Spectrum-averaged harmonic path (SHAPA) algorithm for non-contact vital sign monitoring with ultra-wideband (UWB) radar. In 36th Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE, 2241–2244.
[54]
Bingbing Ni, Chi Dat Nguyen, and Pierre Moulin. 2012. RGBD-camera based get-up event detection for hospital fall prevention. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’12). IEEE, 1405–1408.
[55]
NONIN. [n. d.]. LifeSense II WIDESCREEN Capnograph and Pulse Oximeter. Retrieved August 6, 2022 from https://www.nonin.com/products/lifesense2/.
[56]
Abhinav Parate, Meng-Chieh Chiu, Chaniel Chadowitz, Deepak Ganesan, and Evangelos Kalogerakis. 2014. Risq: Recognizing smoking gestures with inertial sensors on a wristband. In Proceedings of the 12th Annual International Conference on Mobile Systems, Applications, and Services. 149–161.
[57]
Jaeyeon Park, Woojin Nam, Jaewon Choi, Taeyeong Kim, Dukyong Yoon, Sukhoon Lee, Jeongyeup Paek, and JeongGil Ko. 2017. Glasses for the third eye: Improving the quality of clinical data analysis with motion sensor-based data filtering. In Proceedings of the 15th ACM Conference on Embedded Network Sensor Systems. 1–14.
[58]
Neal Patwari, Lara Brewer, Quinn Tate, Ossi Kaltiokallio, and Maurizio Bocca. 2014. Breathfinding: A wireless network that monitors and locates breathing in a home. IEEE Journal of Selected Topics in Signal Processing 8, 1 (2014), 30–42.
[59]
Ming-Zher Poh, Nicholas C. Swenson, and Rosalind W. Picard. 2010. Motion-tolerant magnetic earring sensor and wireless earpiece for wearable photoplethysmography. IEEE Transactions on Information Technology in Biomedicine 14, 3 (2010), 786–794.
[60]
Kun Qian, Chenshu Wu, Fu Xiao, Yue Zheng, Yi Zhang, Zheng Yang, and Yunhao Liu. 2018. Acousticcardiogram: Monitoring heartbeats using acoustic signals on smart devices. In IEEE Conference on Computer Communications (INFOCOM’18). IEEE, 1574–1582.
[61]
Ruth Ravichandran, Elliot Saba, Ke-Yu Chen, Mayank Goel, Sidhant Gupta, and Shwetak N. Patel. 2015. WiBreathe: Estimating respiration rate using wireless signals in natural settings in the home. In 2015 IEEE International Conference on Pervasive Computing and Communications (PerCom’15). IEEE, 131–139.
[62]
Angel Melchor Rodríguez and J. Ramos-Castro. 2018. Video pulse rate variability analysis in stationary and motion conditions. Biomedical Engineering Online 17, 1 (2018), 11.
[63]
Budiman P. A. Rohman, Manjunath Thindlu Rudrappa, Maksim Shargorodskyy, Reinhold Herschel, and Masahiko Nishimoto. 2021. Moving human respiration sign detection using mm-wave radar via motion path reconstruction. In 2021 International Conference on Radar, Antenna, Microwave, Electronics, and Telecommunications (ICRAMET’21). IEEE, 196–200.
[64]
Yu Rong and Daniel W. Bliss. 2018. Harmonics-based multiple heartbeat detection at equal distance using UWB impulse radar. In 2018 IEEE Radar Conference (RadarConf18). IEEE, 1101–1105.
[65]
Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. 2018. How does batch normalization help optimization? Advances in Neural Information Processing Systems 31 (2018).
[66]
Florian Schroff, Dmitry Kalenichenko, and James Philbin. 2015. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 815–823.
[67]
Loren Arthur Schwarz, Artashes Mkhitaryan, Diana Mateus, and Nassir Navab. 2012. Human skeleton tracking from depth data using geodesic distances and optical flow. Image and Vision Computing 30, 3 (2012), 217–226.
[68]
Ghufran Shafiq and Kalyana C. Veluvolu. 2014. Surface chest motion decomposition for cardiovascular monitoring. Scientific Reports 4, 1 (2014), 1–9.
[69]
Tetsuya Shimamura and Hajime Kobayashi. 2001. Weighted autocorrelation for pitch extraction of noisy speech. IEEE Transactions on Speech and Audio Processing 9, 7 (2001), 727–730.
[70]
Jamie Shotton, Andrew Fitzgibbon, Mat Cook, Toby Sharp, Mark Finocchio, Richard Moore, Alex Kipman, and Andrew Blake. 2011. Real-time human pose recognition in parts from single depth images. In CVPR 2011. IEEE, 1297–1304.
[71]
Kuo-Kai Shyu, Luan-Jiau Chiu, Po-Lei Lee, Tzu-Han Tung, and Shun-Han Yang. 2018. Detection of breathing and heart rates in UWB radar sensor data using FVPIEF-based two-layer EEMD. IEEE Sensors Journal 19, 2 (2018), 774–784.
[72]
Jianxuan Tu, Taesong Hwang, and Jenshan Lin. 2016. Respiration rate measurement under 1-D body motion using single continuous-wave Doppler radar vital sign detection system. IEEE Transactions on Microwave Theory and Techniques 64, 6 (2016), 1937–1946.
[73]
Raghav H. Venkatnarayan, Griffin Page, and Muhammad Shahzad. 2018. Multi-user gesture recognition using WiFi. In Proceedings of the 16th Annual International Conference on Mobile Systems, Applications, and Services. 401–413.
[74]
Aditya Virmani and Muhammad Shahzad. 2017. Position and orientation agnostic gesture recognition using WiFi. In Proceedings of the 15th Annual International Conference on Mobile Systems, Applications, and Services. 252–264.
[75]
Anran Wang, Jacob E. Sunshine, and Shyamnath Gollakota. 2019. Contactless infant monitoring using white noise. In 25th Annual International Conference on Mobile Computing and Networking. 1–16.
[76]
Dingyang Wang, Sungwon Yoo, and Sung Ho Cho. 2020. Experimental comparison of IR-UWB radar and FMCW radar for vital signs. Sensors 20, 22 (2020), 6695.
[77]
Hao Wang, Daqing Zhang, Junyi Ma, Yasha Wang, Yuxiang Wang, Dan Wu, Tao Gu, and Bing Xie. 2016. Human respiration detection with commodity WiFi devices: Do user location and body orientation matter?. In Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing. 25–36.
[78]
Wei Wang, Alex X. Liu, Muhammad Shahzad, Kang Ling, and Sanglu Lu. 2015. Understanding and modeling of WiFi signal based human activity recognition. In Proceedings of the 21st Annual International Conference on Mobile Computing and Networking. ACM, 65–76.
[79]
Xuyu Wang, Chao Yang, and Shiwen Mao. 2017. PhaseBeat: Exploiting CSI phase data for vital sign monitoring with commodity WiFi devices. In 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS’17). IEEE, 1230–1239.
[80]
Yan Wang, Jian Liu, Yingying Chen, Marco Gruteser, Jie Yang, and Hongbo Liu. 2014. E-eyes: Device-free location-oriented activity identification using fine-grained WiFi signatures. In Proceedings of the 20th Annual International Conference on Mobile Computing and Networking. 617–628.
[81]
Yaxiong Xie, Jie Xiong, Mo Li, and Kyle Jamieson. 2019. mD-Track: Leveraging multi-dimensionality for passive indoor Wi-Fi tracking. In 25th Annual International Conference on Mobile Computing and Networking. ACM, 1–16.
[82]
Zongxing Xie, Yindong Hua, and Fan Ye. 2022. A measurement study of FMCW radar configurations for non-contact vital signs monitoring. In 2022 IEEE Radar Conference (RadarConf22). IEEE, 1–6.
[83]
Zongxing Xie, Hanrui Wang, Song Han, Elinor Schoenfeld, and Fan Ye. 2022. DeepVS: A deep learning approach for RF-based vital signs sensing. In 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (BCB’22). ACM.
[84]
Zongxing Xie, Bing Zhou, Xi Cheng, Elinor Schoenfeld, and Fan Ye. 2021. Fusing UWB and depth sensors for passive and context-aware vital signs monitoring. In 2021 IEEE/ACM Conference on Connected Health: Applications, Systems and Engineering Technologies (CHASE’21). IEEE, 119–120.
[85]
Zongxing Xie, Bing Zhou, Xi Cheng, Elinor Schoenfeld, and Fan Ye. 2021. VitalHub: Robust, non-touch multi-user vital signs monitoring using depth camera-aided UWB. In 9th IEEE International Conference on Healthcare Informatics. IEEE.
[86]
Zongxing Xie, Bing Zhou, and Fan Ye. 2021. Signal quality detection towards practical non-touch vital sign monitoring. In 12th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (BCB’21). ACM.
[87]
Xiangyu Xu, Jiadi Yu, Yingying Chen, Yanmin Zhu, Linghe Kong, and Minglu Li. 2019. Breathlistener: Fine-grained breathing monitoring in driving environments utilizing acoustic signals. In Proceedings of the 17th Annual International Conference on Mobile Systems, Applications, and Services. 54–66.
[88]
Jiaming Yan, Hong Hong, Heng Zhao, Yusheng Li, Chen Gu, and Xiaohua Zhu. 2016. Through-wall multiple targets vital signs tracking based on VMD algorithm. Sensors 16, 8 (2016), 1293.
[89]
Zhicheng Yang, Parth H. Pathak, Yunze Zeng, Xixi Liran, and Prasant Mohapatra. 2016. Monitoring vital signs using millimeter wave. In Proceedings of the 17th ACM International Symposium on Mobile Ad Hoc Networking and Computing. ACM, 211–220.
[90]
Moustafa Youssef, Matthew Mah, and Ashok Agrawala. 2007. Challenges: Device-free passive localization for wireless environments. In Proceedings of the 13th Annual ACM International Conference on Mobile Computing and Networking. 222–229.
[91]
Shichao Yue, Hao He, Hao Wang, Hariharan Rahul, and Dina Katabi. 2018. Extracting multi-person respiration from entangled RF signals. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 2, 2 (2018), 1–22.
[92]
Jie Zhang, Zhanyong Tang, Meng Li, Dingyi Fang, Petteri Nurmi, and Zheng Wang. 2018. CrossSense: Towards cross-site and large-scale WiFi sensing. In Proceedings of the 24th Annual International Conference on Mobile Computing and Networking. 305–320.
[93]
Yi Zhang, Xiuping Li, Rui Qi, Zihang Qi, and Hua Zhu. 2020. Harmonic multiple loop detection (HMLD) algorithm for not-contact vital sign monitoring based on ultra-wideband (UWB) radar. IEEE Access 8 (2020), 38786–38793.
[94]
Mingmin Zhao, Yonglong Tian, Hang Zhao, Mohammad Abu Alsheikh, Tianhong Li, Rumen Hristov, Zachary Kabelac, Dina Katabi, and Antonio Torralba. 2018. RF-based 3D skeletons. In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication. ACM, 267–281.
[95]
Tianyue Zheng, Zhe Chen, Shujie Zhang, Chao Cai, and Jun Luo. 2021. MoRe-Fi: Motion-robust and fine-grained respiration monitoring via deep-learning UWB radar. In Proceedings of the 19th ACM Conference on Embedded Networked Sensor Systems. 111–124.
[96]
Yue Zheng, Yi Zhang, Kun Qian, Guidong Zhang, Yunhao Liu, Chenshu Wu, and Zheng Yang. 2019. Zero-effort cross-domain gesture recognition with Wi-Fi. In Proceedings of the 17th Annual International Conference on Mobile Systems, Applications, and Services. 313–325.
[97]
Bing Zhou, Zongxing Xie, and Fan Ye. 2019. Multi-modal face authentication using deep visual and acoustic features. In 2019 IEEE International Conference on Communications (ICC’19). IEEE, 1–6.
[98]
Bing Zhou, Zongxing Xie, Yinuo Zhang, Jay Lohokare, Ruipeng Gao, and Fan Ye. 2022. Robust human face authentication leveraging acoustic sensing on smartphones. IEEE Transactions on Mobile Computing 21, 8 (2022), 3009–3023.
