Estimating and using absolute and relative viewing distance in interactive systems

https://doi.org/10.1016/j.pmcj.2012.06.009

Abstract

In this paper we explore and validate the merits of using absolute and relative viewing distances from the screen as complementary input modalities for interactive systems. We motivate the use of viewing distance as a complementary modality by first mapping out its design space and then proposing several new applications that could benefit from it. We demonstrate that both absolute and relative viewing distance can be reliably estimated under controlled circumstances for both desktop and mobile devices using low-cost cameras and readily available computer vision algorithms. In our evaluations we find that viewing distance is a promising complementary input modality that can be reliably estimated using computer vision in environments with constant lighting. For environments with heterogeneous lighting conditions several challenges still exist when designing practical systems. To aid practitioners and researchers we conclude by highlighting several design implications for future systems.

Introduction

Researchers have overcome many of the robustness, initialisation, speed and usability issues that previously blocked vision-based techniques from evolving into mainstream user interfaces. Sensors such as the Microsoft Kinect are used for body and gesture recognition, eye-tracking systems are used to create new accessible interfaces, motion tracking is integral to the film industry, and face detection and iris recognition are rapidly becoming commonplace in biometric security systems. Recognition of gaze and facial expressions has also been researched extensively; see [1] for a comprehensive review.

In this paper we explore how we can use the viewing distance between the user and the display as a complementary interaction modality. Specifically, we make the following contributions:

  • We map out the design space for techniques using viewing distance.

  • We present an algorithm that can reliably estimate absolute and relative viewing distances (90% accuracy) for both mobile and desktop devices under constant lighting conditions.

  • We achieve these results with a markerless approach using readily available computer vision algorithms and commodity cameras.

  • We investigate the accuracy of absolute and relative viewing distance estimation for both indoor and outdoor environments.

  • We highlight several design implications for future systems.

Researchers have previously explored a range of computer vision techniques for identifying and tracking facial features, including pupils, the eye area, nostrils, lips, lip corners and pose (e.g.  [2], [3], [4]). Once detected and tracked, such features can form parts of a multimodal interface (for an extensive survey see  [1]).

Face-tracking has been used to realise “perceptual interfaces” that allow face movement to control games [5] and 3D graphics interfaces [6]. Head movements can be translated into control variables, which can then be used to control a mouse pointer. However, under certain circumstances face-tracking using computer vision algorithms is insufficient to accurately control viewpoint movements. To overcome this limitation, researchers have proposed fusing multiple modalities. For example, fused inputs from gaze, speech, mouse and keyboard have been explored to adapt the interfaces of office applications [7].

Researchers have also investigated how to detect and interpret users’ gaze. Vertegaal et al. [8] use a custom EyeContact sensor coupled with a mobile phone to detect whether the user is engaged in a conversation. Later, Dickie et al. [9] propose a range of scenarios for using the same EyeContact sensor to adapt mobile applications depending on whether or not the user is looking at the mobile phone display. A related technique is to use gaze gestures for mobile phone interaction [10].

In addition, several systems exist that measure users’ proximity to their displays. Harrison and Dey [11] present a system that detects when users lean towards their laptop and then automatically zooms into the user interface. This system requires the user to wear markers in order to estimate the user’s distance to the display. In their study of ambient public displays, Vogel and Balakrishnan [12] use another marker-based approach, the Vicon motion-tracking system. They also define four zones of interaction: ambient, implicit, subtle and personal. This is a refinement of the three interaction zones (ambient, notification and cell interaction) defined by Prante et al. [13], who used a combination of RFID and WiFi sensing to determine distance. Ballendat et al. [14] expand on this work by refining the interaction zones for proximity-aware interaction; they also use the marker-based Vicon motion-tracking system.

There are a number of different technologies used for distance detection in the referenced work. We list the available technologies (or their current alternatives where the product has been superseded by a newer version) in Table 1. We observe three distinct groups of sensors.

  • High accuracy, long range sensors that require user augmentation.

  • Sensors with lower accuracy and range requiring no user augmentation.

  • Other systems.

The high-range, high-accuracy group includes two marker-based systems (Vicon, OptiTrack) that use a number of cameras in the environment and passive markers placed on the object to be tracked. There is also an ultrasonic system (InterSense), which uses tracker devices worn or held by the user in addition to sensors placed in the environment. The Vicon system has been used in several of the referenced works [12], [14], and the InterSense system was used, for example, by Nacenta et al. in their study of perspective-aware multi-display interfaces [16].

The main advantage of the second group of sensors is that they require no augmentation of the user. This group includes our system and the Kinect sensor. Whereas our system uses standard computer vision algorithms, the Kinect has a dedicated camera capable of capturing depth information. The maximum detection distance of our system has not yet been established, as it depends on the resolution of the camera available to the system. This also means that as camera sensor technology improves, the maximum detection distance of our system will improve as well.
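
To illustrate why the maximum detection distance scales with camera resolution, the following Python sketch bounds the detection range under a simple pinhole-camera model. The field of view, the minimum eye-pair width the detector is assumed to need, and the average physical eye-pair width are illustrative assumptions rather than values measured in our evaluations.

    import math

    def max_detection_distance_mm(horizontal_resolution_px: float,
                                  horizontal_fov_deg: float = 60.0,
                                  min_eyepair_width_px: float = 20.0,
                                  eyepair_width_mm: float = 63.0) -> float:
        """Rough upper bound on detection distance for a pinhole camera.

        Assumes the eye-pair detector needs at least `min_eyepair_width_px`
        pixels across an eye pair that is `eyepair_width_mm` wide in reality.
        """
        focal_px = (horizontal_resolution_px / 2) / math.tan(
            math.radians(horizontal_fov_deg) / 2)
        return focal_px * eyepair_width_mm / min_eyepair_width_px

    print(max_detection_distance_mm(640))   # VGA webcam: roughly 1.7 m
    print(max_detection_distance_mm(1920))  # full-HD webcam: roughly 5.2 m

Under these assumptions the detection range grows linearly with horizontal resolution, which is why improvements in camera sensors directly extend the range of the system.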

The third group consists of sensor systems that do not fit into the groups above; both need to augment the user in order to work. The infrastructure-based system used by Prante et al. [13] is only loosely specified in their paper, but it includes a Pocket PC held by the user, together with a WiFi device and an RFID reader present in the environment. The system described by Harrison and Dey [11] uses an approach similar to ours, but it relies on markers on the user’s head to determine the distance.

In this paper we propose viewing distance as a complementary modality. Such a modality can be useful for a variety of applications. For example, it can enable mobile systems to detect whether the user is looking at the screen or not (e.g.  [9]). It can also drive a system that automatically zooms in when users are leaning towards the screen (e.g.  [11]). In addition, this modality can be useful to automatically dismiss notifications when the user is looking at their mobile phone or desktop screen.

In Fig. 1 we map out the design space of viewing distance in two dimensions: effective resolution and human detection level. In the figure, the human detection level ranges from undetectable to disruptive. Effective resolution is a combination of the viewing distance (and thus the viewing angle) of the user, the pixel resolution of a display, and the physical dimensions of the display. We give examples of three potential applications that map to the three most common types of content presented on displays: images, text and visualisations.

One design concept is to use viewing distance to create a system for adaptive viewing of images. Such a system would determine the ideal resolution of an image based on the distance between the user’s eyes and the screen. This interface design would lie between undetectable and subtle as it would be hard for the user to notice that the resolution of downloaded images depended on the distance between their eyes and the screen. This is because there would always be enough detail for the pixels of the displayed image to be slightly beyond the resolving capability of the user.
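
As a concrete illustration of this concept, the Python sketch below computes the smallest image width, in pixels, that keeps individual pixels just beyond the user’s resolving ability. The visual-acuity value of roughly one arcminute per pixel and the example display width are assumptions used for illustration, not parameters of our system.

    import math

    def required_image_width_px(viewing_distance_mm: float,
                                display_width_mm: float,
                                acuity_arcmin: float = 1.0) -> int:
        """Smallest image width (in pixels) whose pixels remain just beyond
        the user's resolving ability at the given viewing distance."""
        # Visual angle subtended by the display, in degrees.
        display_angle_deg = 2 * math.degrees(
            math.atan(display_width_mm / (2 * viewing_distance_mm)))
        # Pixels needed so each pixel subtends at most `acuity_arcmin` arcminutes.
        pixels_per_degree = 60.0 / acuity_arcmin
        return math.ceil(display_angle_deg * pixels_per_degree)

    # Example: a 300 mm wide display viewed from 500 mm and from 1000 mm.
    print(required_image_width_px(500, 300))   # closer: more pixels needed
    print(required_image_width_px(1000, 300))  # farther: fewer pixels suffice

An adaptive image viewer could request the next larger available resolution above this bound, so the user never perceives a loss of detail while bandwidth is saved at larger viewing distances.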

Another example is an adaptive mapping system that would display varying amounts of detail in the maps depending on the viewing distance of the user. This system is on the boundary between subtle and intrusive. On the one hand, the changes in the amount of detail shown would be noticeable. On the other hand, the interface would appear less cluttered.

The third design idea is to dynamically adapt textual information. Consider a resolution-independent e-book reader. No matter how far away the user is from the display, the system can ensure the text is shown at a constant size from the user’s point of view. In this design the user can select the most comfortable font size for the text, and the reader application can automatically adjust the amount of text shown based on the available display area (which is proportional to the viewing angle). We hypothesise that this system would reside in the subtle range. Although the changes in the amount of text shown would be noticeable, they would probably not distract the user much from their reading task.
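
A minimal sketch of the underlying calculation is given below in Python: it converts a desired angular text height into a font height in pixels for a given viewing distance and pixel pitch. The target angle and pixel pitch are illustrative assumptions, not values taken from our prototypes.

    import math

    def font_height_px(viewing_distance_mm: float,
                       target_angle_deg: float,
                       pixel_pitch_mm: float) -> int:
        """Font height in pixels that keeps text at a constant visual angle.

        target_angle_deg: desired angular height of the text (the 0.4 degrees
        used below is an assumed comfortable reading size).
        pixel_pitch_mm: physical size of one display pixel.
        """
        physical_height_mm = 2 * viewing_distance_mm * math.tan(
            math.radians(target_angle_deg) / 2)
        return max(1, round(physical_height_mm / pixel_pitch_mm))

    # On a display with 0.25 mm pixel pitch the font grows as the reader moves away.
    for d in (400, 700, 1000):
        print(d, font_height_px(d, target_angle_deg=0.4, pixel_pitch_mm=0.25))

Because the rendered font grows with distance, the amount of text that fits on screen shrinks accordingly, which is the adjustment the reader application would make automatically.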

Fig. 1 also maps out three other existing systems. Bradski’s CAMSHIFT head-tracking system [6] is used as a control interface for games and spatial navigation in 3D environments. Since head movement directly controls all movement within the interface, the changes in the user interface are intrusive, bordering on disruptive. Harrison and Dey [11] present a system that magnifies content on screen as the user leans in towards the display. Although the changes are still very noticeable, they are somewhat less pronounced than in the CAMSHIFT system. The third system is the Gaze-X system [7]. While it does not use viewing distance to alter the user interface, some of the described adaptations of its affective user interface could be motivated and directly enhanced by using viewing distance as an input modality. Gaze-X covers the widest variety of potential changes to the user interface, ranging from disruptive to almost subtle, giving it more flexibility than the other systems above.

Estimating viewing distance

In order to use viewing distance as a complementary modality we need to be able to measure it accurately. Preferably we can do so using inexpensive commodity hardware. Previously, the most closely related systems for measuring viewing distance, or detecting whether the user is looking at the screen, have relied on custom hardware  [9] or markers  [11]. In contrast, we explore a system designed to work with commodity hardware.

Our system uses the OpenCV computer vision library.
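
The snippet below is a minimal Python sketch of one plausible markerless pipeline of this kind, built from OpenCV’s stock Haar cascades and the pinhole-camera relation in which the pixel distance between the detected eye centres scales inversely with viewing distance. The focal length and average interpupillary distance are assumptions that would require per-device calibration, and the sketch is intended only to illustrate the approach rather than reproduce the exact algorithm evaluated in this paper.

    import cv2

    # Illustrative calibration constants -- not values from our evaluations.
    FOCAL_LENGTH_PX = 600.0  # camera focal length in pixels (per-device calibration)
    AVG_IPD_MM = 63.0        # assumed average adult interpupillary distance

    face_cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    eye_cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_eye.xml")

    def estimate_viewing_distance_mm(frame_bgr):
        """Return an absolute viewing-distance estimate in mm, or None.

        Pinhole relation: distance = focal_px * real_ipd_mm / ipd_px, where
        ipd_px is the pixel distance between the detected eye centres.
        """
        gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
        faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        for (fx, fy, fw, fh) in faces:
            roi = gray[fy:fy + fh, fx:fx + fw]
            eyes = eye_cascade.detectMultiScale(roi, scaleFactor=1.1, minNeighbors=5)
            if len(eyes) < 2:
                continue
            # Order detections left to right and take the two outermost as the eye pair.
            eyes = sorted(eyes, key=lambda e: e[0])
            (x1, y1, w1, h1), (x2, y2, w2, h2) = eyes[0], eyes[-1]
            ipd_px = abs((x2 + w2 / 2) - (x1 + w1 / 2))
            if ipd_px > 0:
                return FOCAL_LENGTH_PX * AVG_IPD_MM / ipd_px
        return None

    # Usage: grab a single frame from the default camera and print the estimate.
    cap = cv2.VideoCapture(0)
    ok, frame = cap.read()
    cap.release()
    if ok:
        print(estimate_viewing_distance_mm(frame))

Relative viewing distance can be read off the same signal: the ratio of the current eye-pair width in pixels to a reference width indicates whether the user has moved closer to or farther from the screen, without requiring any calibration constants.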

Evaluation 1: gaze detection and gaze direction detection

In the first evaluation we tested the system’s performance in a controlled laboratory setting. We examined two input devices: a desktop computer equipped with a screen with a built-in web camera (Desktop) and a mobile phone with a front-facing camera (Mobile).

Evaluation 2: real life eye-pair detection

The first evaluation examined our ability to detect the presence of gaze in a controlled environment by detecting the participant’s eye-pair. In this second evaluation we wanted to test how well our classifiers would work under more real-life conditions through a task involving a mobile phone in a combination of outdoor and indoor environments.

Evaluation 3: estimating absolute viewing distance

The third evaluation investigates how well we can estimate absolute viewing distance using the algorithm presented earlier in this paper.

Discussion

Our evaluations revealed that it is indeed possible to detect viewing distance using built-in and consumer-grade cameras on desktops and mobile phones. The first evaluation showed that in a controlled environment, relative distance can be accurately estimated using computer vision. The third experiment further demonstrated that in a controlled environment with constant lighting conditions we can estimate users’ absolute viewing distance with high accuracy.

However, the second evaluation showed that environments with heterogeneous lighting conditions, such as the mix of outdoor and indoor settings tested, still pose several challenges for reliable detection in practical systems.

Conclusions

In this paper we explored how we can use absolute and relative viewing distance between the user and the display as a complementary interaction modality. We first mapped out the design space for techniques using viewing distance. Thereafter we presented markerless techniques for reliably estimating absolute and relative viewing distance (90% accuracy) for both mobile and desktop devices under constant lighting conditions using readily available computer vision algorithms and commodity cameras.

References (18)

  • A. Jaimes et al., Multimodal human–computer interaction: a survey, Computer Vision and Image Understanding (2007)

  • E. Murphy-Chutorian et al., Head pose estimation in computer vision: a survey, IEEE Transactions on Pattern Analysis and Machine Intelligence (2009)

  • J. Yang et al., Visual tracking for multimodal human computer interaction

  • M.-H. Yang et al., Detecting faces in images: a survey, IEEE Transactions on Pattern Analysis and Machine Intelligence (2002)

  • S. Wang et al., Face-tracking as an augmented input in video games: enhancing presence, role-playing and control

  • G.R. Bradski, Computer vision face tracking for use in a perceptual user interface, Intel Technology Journal (1998)

  • L. Maat et al., Gaze-X: adaptive, affective, multimodal interface for single-user office scenarios, Lecture Notes in Artificial Intelligence (2007)

  • R. Vertegaal, Designing attentive interfaces

  • C. Dickie et al., EyeLook: using attention to facilitate mobile media consumption