1 Introduction

As robotics technology improves, robots in the home and workplace will become much more common. In particular, we are already seeing advances in remote presence technologies, such as the Double 2 [3], VGo [13], and Beam [12]. These systems go beyond basic phone and video calling to allow someone not just to see, hear, and talk, but also to move around and explore a remote environment as if they were there. While early applications have focused on telecommuting, education, and health care, teleoperated robots are now being developed for many different purposes, including package delivery, construction surveying, and janitorial tasks.

Along with exciting new capabilities, these systems also raise new privacy concerns. In traditional voice and video calling, the local user has some inherent control over their own privacy because the remote person can only see what the local user points their camera at. In remote presence systems, the remote person is in control of the camera, can move the robot around, and, depending on the capabilities of the robot, manipulate the environment. Many of the potential benefits of having a remotely operated robot are lost if the local user must be present to make sure that the robot only sees and touches the things that it is supposed to see and touch.

In this paper we explore a “privacy by design” [2] approach where we restrict the viewing capabilities of the remote user at the local end. In contrast to previous work that relies on object detection and tracking (which may fail) to obscure selected objects, our goal is to record video in only as much detail as is necessary to accomplish the task, and no more.

One potential drawback to this approach is that our efforts to protect privacy might diminish the utility of our robot. There is a natural trade-off between privacy and utility [1]. At one extreme we could blindfold our robot and not allow it to move. This would maximize privacy protection, but the robot would be prevented from accomplishing many useful tasks. On the other hand, we could permit the robot to freely move and share data from all of its sensors with the robot operator. This may maximize utility—but would offer no privacy protections.

Fig. 1. Four visual conditions, from left to right: No filter, Depth image, standard Sobel, Combined (Combo) Sobel. Note the features of the book and ease of determining where the box is in each case.

To explore the trade-offs of this approach, we present a user study that examines the effects of data filtering on both teleoperated robot utility and privacy protections afforded by such filters. Our specific contributions are: (1) A user study evaluating the effects of data filtering on both privacy protection and the ability to perform a navigation task; and (2) An examination of the privacy-protecting effects of cognitive load for drivers of remotely-operated robots.

2 Related Work

The construct of privacy has been examined by a number of different disciplines, ranging from anthropology to sociology [8]. No universal definition or operationalization of privacy exists. For this paper, we refer exclusively to informational privacy, as outlined in the privacy taxonomy proposed by Rueben et al. [11]. We are concerned only with obscuring text, images, and similar sensitive information from the remote operator of the robot. Certain filters may additionally protect social privacy by preventing identification of nearby individuals, thereby providing a degree of anonymity, but we do not focus on this effect.

There is a small, but growing, body of work on implementing and evaluating tools for protecting visual privacy by obscuring things in the robot’s video feed. Jana et al. [6] present a privacy tool for simplifying videos to only the necessary information. Although no user study is conducted, the tool is shown to preserve utility and protect privacy through a simple analysis. Raval et al. [9] present two more video privacy tools. One uses markers to specify private areas, and the other uses hand gestures for the same purpose. Rueben et al. [10] report on a user study comparing three different interfaces for specifying visual privacy preferences to a robot. Butler et al. [1] coin the term “privacy-utility tradeoff” and test a pick-and-place task with the PR2 robot; Hubers et al. [5] did similar tests for a patrol surveillance task with the Turtlebot 2 robot. Both studies found it feasible to complete the tasks with effective privacy filters in place.

One of the difficulties with many of the previous approaches is that the privacy protections that they provide rely on accurate object detection and tracking, and on robot localization. Failure to detect an object or correctly estimate its size results in insufficient (or wholly absent) privacy protections. Further, drawing on the recording concerns noted by Lee et al. [7], the robot’s owner may assume that the robot is always recording. In such a case, even a single-frame object detection or localization error would reveal the entirety of the private object.

In contrast, our approach applies data filtering by default to the entire image rather than a selected object. Butler et al. [1] also filter the entire transmitted image but only evaluate it on a simple remote manipulation task with a stationary robot. The study presented in this paper is the first to evaluate the effects of live, real-time data filtering on a remote operator’s ability to perform a navigation task with a mobile robot.

We are also interested in exploring the effects of cognitive load on privacy protection. Wickens et al. [14] indicate that, as an individual’s cognitive load increases beyond a certain point, their performance on the tasks making up that load decreases. It follows that the difficulty of driving an unfamiliar robot, with or without visual modifications, may have some effect on the driver’s ability to notice or identify objects near the robot.

3 Experiment

The focus of this study is on the influence of various visual conditions on the remote operator’s ability to perform a navigation task while noticing (and potentially identifying) items in their surroundings.

3.1 Hypotheses

We make the following hypotheses:

  • H1: Data filters can be applied to the video feed of a robot in real time while still allowing the operator to complete an assigned navigation task.

  • H2: Data filters are capable of protecting the privacy of the robot owner’s local area by reducing the operator’s ability to notice or identify objects in that area.

  • H3: The increased cognitive load of driving the robot will protect privacy by rendering the operator “too busy” to notice or identify objects in the area.

3.2 Task and Environment

Participants completed a navigation task that asked them to: (1) locate two landmarks (beach balls) in an unfamiliar environment; (2) remotely drive a robot to each of these positions in turn; and (3) return to the starting position. The experimental space was set up as an “obstacle course”, cluttered with objects such as toys and computer accessories. While navigating through the course, the participant was asked to talk about what they were seeing and thinking. Immediately afterward, the participant was shown a recording of their most recent drive and again asked to comment in the same manner. This was repeated for a total of four drive-review pairs, each with one of the four visual conditions described in Sect. 3.3.

Four layouts were created to help mitigate any learning effect. Each measured approximately 3 m by 5 m and contained several cardboard boxes to act as obstacles. The boxes were placed such that the drivable space was an “X” or “H” shape, ensuring that the distance driven was approximately equal across all layouts. Two beach balls were placed in opposite corners within each layout such that the robot did not have line-of-sight to either of them from its starting position in a third corner.

As a substitute for sensitive information such as credit cards or legal documents, we chose to place books throughout the space. Five books were used, placed in each layout such that the robot was nearly certain to see all five while completing the task.

3.3 Visual Conditions

We evaluated three visual filters, along with an unfiltered control, illustrated in Fig. 1 and in the supplementary video. The filters were chosen for their simplicity and for their potential to hide details while still providing spatial context; a minimal implementation sketch follows the list.

  1. Unfiltered (Control): A grayscale video stream. This condition represents the maximum-utility side of the privacy-utility tradeoff. Grayscale was used instead of color to match the other conditions, which lack color.

  2. Depth image (Depth): The Kinect’s depth values, mapped to grayscale so that closer objects appear darker. This preserves spatial structure but hides texture, such as text.

  3. Sobel filter (Sobel): A standard Sobel operator applied to the video feed. Strong edges appear as dark lines while flat areas appear white. This preserves both spatial structure and texture, reducing its privacy protections.

  4. Combined Sobel (Combo): The Sobel operator combined with the Depth filter, keeping the lighter pixel of the two. This refines the Sobel filter by largely removing texture from the video feed.
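
As a concrete illustration, the sketch below shows one way to produce the three filtered views from a grayscale camera frame and a registered depth frame using OpenCV. The function name, the 5 m depth clipping range, and the normalization details are our own assumptions for illustration, not the exact parameters used in the study.

```cpp
#include <opencv2/opencv.hpp>

// Sketch of the three filtered views (names and the 5 m depth range are assumptions).
//   gray:  8-bit single-channel camera image (the Control view).
//   depth: 16-bit Kinect depth image in millimeters, registered to the camera.
void makeFilteredViews(const cv::Mat& gray, const cv::Mat& depth,
                       cv::Mat& depthView, cv::Mat& sobelView, cv::Mat& comboView)
{
    // Depth condition: map depth to grayscale so that closer objects appear darker.
    depth.convertTo(depthView, CV_8U, 255.0 / 5000.0);   // 0-5 m -> 0-255

    // Sobel condition: gradient magnitude, inverted so strong edges appear as
    // dark lines on a white background.
    cv::Mat gx, gy, mag;
    cv::Sobel(gray, gx, CV_32F, 1, 0);
    cv::Sobel(gray, gy, CV_32F, 0, 1);
    cv::magnitude(gx, gy, mag);
    cv::normalize(mag, mag, 0, 255, cv::NORM_MINMAX);
    mag.convertTo(sobelView, CV_8U);
    cv::bitwise_not(sobelView, sobelView);

    // Combo condition: keep the lighter (larger) pixel of the Sobel and Depth views,
    // which suppresses fine texture such as text while keeping scene structure.
    cv::max(sobelView, depthView, comboView);
}
```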

3.4 Apparatus and Implementation

A Turtlebot 2 robot running the Robot Operating System (ROS) [4] was used for this study. The video filters were implemented as standalone ROS nodes, written in C++ using the OpenCV library. During the study, the user was provided with a second laptop connected to the Turtlebot’s computer via a private Wi-Fi network.
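
The following is a minimal sketch of what one such node might look like, here for the Depth condition; the node name, topic names, and use of image_transport are illustrative assumptions rather than the study’s exact configuration.

```cpp
#include <ros/ros.h>
#include <image_transport/image_transport.h>
#include <cv_bridge/cv_bridge.h>
#include <sensor_msgs/image_encodings.h>
#include <opencv2/opencv.hpp>

// Hypothetical standalone filter node for the Depth condition: subscribe to the
// Kinect depth stream, map it to grayscale (closer = darker), and republish it
// as the operator's video feed.
image_transport::Publisher pub;

void depthCallback(const sensor_msgs::ImageConstPtr& msg)
{
    cv_bridge::CvImagePtr cv_img =
        cv_bridge::toCvCopy(msg, sensor_msgs::image_encodings::TYPE_16UC1);
    cv::Mat view;
    cv_img->image.convertTo(view, CV_8U, 255.0 / 5000.0);  // assumed 5 m clip range
    pub.publish(cv_bridge::CvImage(msg->header, "mono8", view).toImageMsg());
}

int main(int argc, char** argv)
{
    ros::init(argc, argv, "depth_filter_node");
    ros::NodeHandle nh;
    image_transport::ImageTransport it(nh);
    // Topic names are assumptions; the Turtlebot's camera topics may differ.
    pub = it.advertise("operator_view", 1);
    image_transport::Subscriber sub =
        it.subscribe("camera/depth/image_raw", 1, depthCallback);
    ros::spin();
    return 0;
}
```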

3.5 Design

Each participant experienced all four layouts and all four visual conditions. The order in which the conditions were presented was counterbalanced, with each possible permutation of the four conditions assigned at most once. The order of layouts was counterbalanced in the same manner. For each participant, the specific condition and layout permutations were selected independently and at random from the pool of those not yet used.
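
For illustration, the sketch below reconstructs this assignment scheme: all 24 orderings of the four conditions are generated, shuffled independently for filters and layouts, and handed out one per participant so that no ordering repeats. The names and structure are our own assumptions, not the actual script used in the study.

```cpp
#include <algorithm>
#include <iostream>
#include <random>
#include <string>
#include <vector>

// Build all 24 orderings of four conditions and shuffle them, so that drawing
// them in sequence is equivalent to sampling without replacement.
std::vector<std::vector<std::string>> shuffledOrders(std::vector<std::string> items,
                                                     std::mt19937& rng)
{
    std::sort(items.begin(), items.end());
    std::vector<std::vector<std::string>> orders;
    do {
        orders.push_back(items);
    } while (std::next_permutation(items.begin(), items.end()));
    std::shuffle(orders.begin(), orders.end(), rng);
    return orders;
}

int main()
{
    std::mt19937 rng(std::random_device{}());
    // Filter and layout orders are drawn from separate, independent pools.
    auto filterOrders = shuffledOrders({"Control", "Depth", "Sobel", "Combo"}, rng);
    auto layoutOrders = shuffledOrders({"A", "B", "C", "D"}, rng);
    const int participants = 21;   // 24 permutations, so each is used at most once
    for (int p = 0; p < participants; ++p) {
        std::cout << "Participant " << p + 1 << ": filters ";
        for (const auto& f : filterOrders[p]) std::cout << f << ' ';
        std::cout << "| layouts ";
        for (const auto& l : layoutOrders[p]) std::cout << l << ' ';
        std::cout << '\n';
    }
    return 0;
}
```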

3.6 Procedure

Participants were greeted in the hallway outside the testing space. After the consent process, they completed an optional demographic survey, which included two questions about frequency of video game playing and prior experience with remote-controlled robots. While the participant completed the survey, the experimenter entered the testing room to fetch the robot for a short driving practice. At this point, the experimenter verbally explained that driving a robot was the primary activity of the study and that the video from each drive would be recorded.

Next, the participant was given the opportunity to examine the robot and drive it around in the hallway to practice. Users were additionally asked to drive into the testing room and run into a beach ball. This was intended to improve spatial awareness of the robot and its surroundings prior to actual testing. After colliding with the beach ball, the user was allowed to drive about until they felt confident enough to proceed. Most users spent approximately 5 min on this practice (which used the Control visual condition).

Following the test drive, a short video of the various visual conditions was shown, along with an explanation of the basic features of each. The participant was then given a brief introduction to the talk-aloud protocol used during the study and shown a short example video of what a typical trial might look like. In particular, we asked participants to comment on everything they saw. The participant was also informed that the robot always begins a trial facing a corner, and that their task is to locate two beach balls, bump into them, and return to their starting position.

The participant was then given control of the robot and asked to begin the first trial. If the participant stopped talking for approximately five seconds, the experimenter prompted them to resume with “What do you see?” or “Tell me about what’s happening.”

Upon finishing the first trial, the participant was shown a recording of their drive and asked to again talk aloud as they watched it. This was repeated for the remaining three trials.

3.7 Measures

For the driving component, we recorded whether or not the participant completed the task successfully, how long it took them to complete it, how many times they collided with other objects, and how many books they noticed or could identify. The user was considered to have succeeded at the task if they bumped into both beach balls and navigated back to within 0.5 m of their starting position. During the review phase, we recorded how many books the participant was able to notice or identify.

For the purposes of this study, we define “notice” to mean that the user commented on the object and recognized it as a member of a broad class of objects, such as books or toys. We define “identify” to mean that the user commented on the object and correctly labeled it as some specific object or entity, such as naming a toy elephant or reading a book’s title. We consider identification of a book to be a breach of privacy, since the data filters were unable to prevent the driver from reading potentially sensitive text. On the other hand, we do not consider simply noticing that an object is a book to be a breach of privacy, as awareness of an object’s general type and location is likely to be helpful in preventing collisions and completing the navigation task.

3.8 Participants

We recruited 21 participants from the Mount Vernon, Iowa area via recruitment fliers in public spaces, postings in a local Facebook group, and the Cornell College email newsletter. Participants were compensated $10. The mean age was 40 years (S.D. 16). 62% of respondents identified as female and 62% identified as male. With regard to familiarity with robots and other remotely-operated devices, 57% reported “Not at all familiar,” while 38% reported “slightly familiar” and only one reported “very familiar.” In reporting the frequency with which they played video games, 76% reported “hardly ever.” Of the remainder, two reported “a few times per week,” two reported “a few times per month,” and only one reported “daily.” We found no evidence that these factors correlated with any of our recorded metrics.

Fig. 2. Time required for task completion. The increase in time taken over the control condition is only statistically significant for the Combo filter.

4 Results

4.1 H1: Utility

Overall, 85.5% of users successfully completed the task, with no statistically significant difference between the control and any of the filtered visual conditions (as measured by a 1-sample t-test against the control mean). Collision rates also did not differ significantly across the three filter conditions, averaging 1.3 collisions per drive. The average time to complete the trial in the control condition was 198 s. In the three filter conditions, completion took longer, as shown in Fig. 2, but this difference was only statistically significant in the case of the Combo filter, at an average of 238 s per trial (d = .656, p = .023).

4.2 H2: Privacy

In the control condition, participants noticed an average of 3.29 books and identified an average of 1.10 (see Fig. 3). Participants noticed and identified fewer books in all filter conditions, averaging 1.61 fewer books noticed and 0.93 fewer identified. This effect was more pronounced during the review-video talk-aloud: in the control condition participants noticed an average of 3.80 books and identified 1.20 (presumably due to the decreased cognitive load of not having to drive the robot), while in the filter conditions they noticed an average of 1.75 fewer books and identified 1.07 fewer. All of these differences were statistically significant (all d > 1.1, all p < .001).

Evaluating the individual visual conditions, the Depth image condition had the most dramatic effect, with 2.38 fewer objects noticed on average during the drive and 2.60 fewer noticed upon review. The Combo condition resulted in 1.75 fewer objects noticed on average during the drive and 1.84 fewer upon review. Both showed similarly reduced identification rates, with 1.1 fewer identifications on average during the drive and 1.2 fewer during review (all p < .001). The Sobel filter provided the weakest privacy protection in all areas, with an average of 0.91 fewer notices during the drive (p = .057) and 0.80 fewer notices upon review (p = .028). Identification rates were similarly lower, at 0.62 fewer during driving (p = .023) and 0.75 fewer upon review (p = .005).

All results in this section were obtained using a paired t-test comparing, on a per-participant basis, notice and identification rates during the drives and reviews of each condition against the corresponding rates in the control trial.
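
Concretely, for each participant i we compute a difference d_i between a filtered condition and the control; writing n for the number of participants, the paired t statistic, and one common definition of Cohen's d for paired data, take the standard forms

\[
\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i, \qquad
s_d = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(d_i - \bar{d}\right)^2}, \qquad
t = \frac{\bar{d}}{s_d/\sqrt{n}}, \qquad
d = \frac{\bar{d}}{s_d},
\]

with n - 1 degrees of freedom for the t test.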

Fig. 3. Privacy results. Note the effectiveness of the Depth image across all metrics. Also note the differences between the Noticed drive and review, and the similarity between the Identification drive and review.

4.3 H3: Cognitive Load

Averaged across all conditions, the cognitive load imposed by having to drive the robot resulted in only 0.5 fewer objects being noticed during the drive compared to the review video (p < .001). Identification rates were virtually identical, with only 0.01 additional objects identified, on average, during review.

With respect to individual visual conditions, the greatest cognitive load effect was seen in the control condition, with 0.55 additional books noticed during the review compared to the drive (p = .012). The only other significant effect was seen in the Combo condition, an increase of 0.4 books (p = .028). The differences in notice rates with respect to the other two visual conditions were not statistically significant, nor were any effects on the identification rates from any individual filter.

These results were calculated using a paired t-test comparing each user’s notice and identification rates during each drive against the rates for the corresponding review.

5 Discussion

Our first finding is that using visual filters did not affect task success or collision rates, but did increase task completion time. This effect was most pronounced for the Combo filter, possibly because it was unfamiliar. The filters were effective at reducing privacy violations, with the depth image performing best.

Threats to Generalizability: It is possible that our task and/or environment were too “simple” to tease out task-performance differences. We tested a single navigation task, useful for determining the operator’s ability to construct a mental map of the space and move through it without collisions. This is an essential task for mobile teleoperated robots, but there are many others, e.g., object manipulation. Our study took place in a constructed space with fixed layouts. While we attempted to make the space more “naturally chaotic” by adding various clutter objects, we cannot be certain that our results would extend to a real living room or office. Similarly, we used only textbooks as a stand-in for private information. While we can assume that certain objects will usually be privacy-sensitive, such as credit cards, books, or legal documents, there are many other objects that could be considered sensitive, such as valuable items or prototypes, that may not be as well protected by these filters. The space of potential filters is also large (for example, a blurred color image combined with a depth edge enhancer), with some combinations likely being more effective for the task while still eliding detail. We chose to test simple filters that have the advantages of being low-cost, easy to implement, and light on battery usage, but we expect that many additional filters exist or could be developed that would be more effective for particular tasks.

We observed a difference in task completion time between each user’s first trial and all subsequent trials, indicating that our initial training was insufficient (first: 252 s, second: 214 s, p = .003). The third and fourth trial times were not statistically different from the second.

There is a small, but noticeable, decrease in detection rates while driving the robot versus reviewing the video. This has implications for conducting tasks via telepresence; we hypothesize that the cognitive load of driving the robot reduced the participant’s ability to observe items not related to the driving task.

We only considered benign operators and “accidental” breaches of privacy. Further work—and a different study design—would be needed to determine if these filters are sufficient to protect against “snooping” during the task.

6 Conclusion

We conducted a user study to examine the effects of data filtering on a remote operator’s ability to perform a navigation task and discern details about the robot’s surroundings. We found that, when the user’s view is filtered, they were slightly slower to complete their assigned task but no more likely to collide with objects or fail the task. The applied filters all reduced the operator’s ability to notice nearby objects or correctly identify them when compared to an unfiltered drive. Of the three visual conditions, the depth image filter best protected privacy. Additionally, when users watched recordings of their own previous driving trials, they noticed slightly more objects but were no more or less likely to correctly identify them.