Information Sciences

Volume 420, December 2017, Pages 417-430

Visual attention analysis and prediction on human faces

https://doi.org/10.1016/j.ins.2017.08.040

Abstract

Human faces almost always attract visual attention because of the rich semantic information they carry. Although some visual attention models that incorporate face cues do perform better on images containing faces, there is no systematic analysis of how visual attention is deployed on human faces in the context of visual attention modelling, nor is there any attention model designed specifically for face images. On faces, many high-level factors influence visual attention. To investigate visual attention on human faces, we first construct a Visual Attention database for Faces (VAF database), composed of 481 face images along with eye-tracking data from 22 viewers. Statistics of the eye-movement data show that high-level factors such as face size, facial features and face pose affect visual attention. We therefore propose to build visual attention models specifically for face images by combining low-level saliency, calculated by traditional saliency models, with high-level facial features. The effectiveness of the built models is verified on the VAF database: when combined with high-level facial features, most saliency models achieve better performance.

Introduction

We can handle massive amounts of visual information efficiently because of our remarkable ability to focus on salient events. This ability of the human visual system (HVS) is known as visual attention. Visual attention has been widely investigated and applied in numerous signal processing applications (e.g., image quality assessment [18], [22], [31], [32], [36], [45], [46], automatic contrast enhancement [20], [21], image retargeting [16]) and computer vision applications (e.g., salient object detection [40], scene classification [42], image retrieval [47]). Over the long-term research on visual attention, various computational models have been proposed with encouraging results [4], [5]. Traditional visual saliency models generally take full advantage of low-level visual features [4], [5], [9], [19], [30], which is reasonable since visual attention can be influenced by these low-level features [10], [33]. These models perform well on stimuli that are well represented by low-level features, but they are less effective in some “semantic” situations, especially in social scenes.

Birmingham et al. [2] pointed out that saliency does not account for fixations within social scenes, where observers’ fixations are mainly driven by social information. Birmingham et al. [3] reviewed human social attention; an important conclusion is that people tend to look at faces, especially at the eyes. Beyond social scenes, some research has specifically investigated gaze allocation during face exploration. Võ et al. [44] observed that gaze is dynamically directed to the eyes, nose, or mouth according to the currently depicted event. Moreover, auditory speech plays an important role: attention focuses on the mouth region when the face is speaking, but fixations in the mouth region decrease when the speech signal is removed. Besides auditory speech, general audio information also influences visual attention [34], [37], [38]. Min et al. [37], [38] fused audio and visual information to predict eye fixations; their method shows superiority in scenes containing moving, sounding objects. Eisenbarth and Alpers [13] found that subjects fixate on mouth regions for a longer time for happy expressions, while the eyes receive more attention for sad or angry expressions. This is reasonable since facial expressions convey social information, and certain facial regions are most characteristic of specific emotions.

Since many factors concerning human faces have been shown to influence visual attention, researchers have started to incorporate face cues into visual attention modeling. The most common way of considering face cues is to combine low-level saliency with a face detector [8], [11], [27], [28]. By emphasizing face regions, such models perform better in scenes containing human faces. Cerf et al. [8] were the first to incorporate a face detector. Judd et al. [28] learned a saliency model based on low-, middle- and high-level image features; in their work, the high-level features consist of face, person and car detectors. Jiang et al. [27] took crowd information into account and proposed a visual attention model for crowd scenes, in which low-level image features and high-level crowd features are integrated through multiple kernel learning (MKL); the crowd features are mainly face-related cues such as face size, face density and face pose. Coutrot and Guyader [11] assessed the impact of faces and speech in conversations and proposed an audio-visual saliency model for natural conversation scenes, based on the observation that speaking faces are generally much more salient than other faces.

Although plenty of visual attention models take faces into consideration [8], [11], [27], [28], their way of incorporating face cues is not comprehensive: they simply detect faces and then emphasize the face regions. As described in the second paragraph of this section, many high-level factors influence fixation distribution on faces [2], [3], [13], [44], but few researchers have applied these psychological findings to visual attention modeling, and little work has been done to build a visual attention computational model for faces. In practice, there are many visual communication systems in which faces dominate the scene, such as video calls. In such systems, the influence of those high-level factors is significant and face-optimized visual attention models are needed. In this paper, we investigate visual attention allocation on faces and build visual attention models specifically for faces. The contributions of this study are three-fold:

  • First, to support research on visual attention analysis and prediction on human faces, we perform eye-tracking experiments and construct a Visual Attention database for Faces (VAF database). The VAF database is composed of 481 images containing faces of various sizes, poses, ages, genders, etc. Eye-tracking data, face detection results and facial landmark localization results are also provided with the database. The database can facilitate both the analysis of visual attention distribution on faces and the building or evaluation of visual attention models on face images.

  • Second, based on the VAF database, we analyze visual attention allocation on faces. High-level factors such as face size, facial features and face pose are verified to influence visual attention. Results suggest that in images containing small faces, the face is generally viewed as a whole, whereas for large faces, participants tend to focus on specific facial features such as the eyes, nose and mouth. Face pose also affects the attention distribution on faces.

  • Third, the verified high-level factors are incorporated into visual attention modeling. We build visual attention models specifically for face images: we extract features related to the high-level factors and combine them with low-level saliency calculated by traditional bottom-up saliency models. We also evaluate state-of-the-art saliency models on the VAF database. Experimental results show that, when combined with these high-level features, most saliency models achieve better fixation prediction performance on face images.

The rest of this paper is organized as follows. In Section 2, we introduce the eye-tracking experiments and the specifications of the VAF database. Based on the VAF database, fixation distribution on faces is analyzed in Section 3, where the high-level factors that influence visual attention are also identified and verified. In Section 4, we extract high-level facial features and combine them, through learning, with low-level saliency computed by state-of-the-art saliency models. The effectiveness of the improved models is demonstrated on the VAF database. Section 5 concludes this paper.


Stimuli

We collect 481 source images from Flickr, all available under Creative Commons (CC) licenses. The collected images are cropped to resolutions of 1280 × 960, 1024 × 1024 or 768 × 1024 (width × height). All test images contain human faces, and most include only a single face. The collected images contain faces of various sizes, poses, ages, genders, etc. More details are discussed in Section 2.4, where some sample images and statistics of the test stimuli are also presented.

Visual attention analysis on human faces

In this section, we analyze the collected eye-tracking data. Several factors are found to influence visual attention on human faces significantly. Details are discussed below.
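One of the effects examined here is that fixations spread out more on larger faces. As a minimal illustration of how such dispersion can be quantified (this is our own sketch, not the paper's analysis code; the function and parameter names are hypothetical), the spread of fixation points can be normalized by the face size so that values are comparable across faces:

```python
import numpy as np

def fixation_dispersion(fixations, face_box):
    """Spread of fixation points, normalized by face size.

    fixations: (N, 2) array-like of (x, y) fixation coordinates.
    face_box:  (x, y, w, h) bounding box from a face detector.

    Returns the mean distance of fixations from their centroid,
    divided by the face diagonal, so the value is scale-free.
    """
    pts = np.asarray(fixations, dtype=float)
    centroid = pts.mean(axis=0)                       # center of the fixation cloud
    spread = np.linalg.norm(pts - centroid, axis=1).mean()
    diag = np.hypot(face_box[2], face_box[3])         # face diagonal length
    return spread / diag
```

A tightly clustered fixation pattern (e.g., all fixations near one eye) yields a small value, while fixations scattered over the whole face yield a value closer to the cloud radius divided by the face diagonal.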

Visual attention prediction on human faces

As described in previous sections, traditional saliency models may not work well on face images, so we propose to build visual attention models specifically for face images. Fig. 7 illustrates the flowchart of the proposed method. We first detect low-level saliency using traditional saliency models. Then we compute high-level facial features based on the visual attention analyses given in Section 3. Finally, the low-level saliency and high-level facial features are fused adaptively through learning.
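As a rough sketch of this fusion step (our own simplification, assuming plain per-pixel linear weighting fitted by least squares rather than the learning machinery actually used in the paper; all names are illustrative), a low-level saliency map and high-level facial feature maps can be combined with weights fitted against an empirical fixation density map:

```python
import numpy as np

def learn_weights(maps, fixation_density):
    """Least-squares fit of linear fusion weights against an
    empirical fixation density map (a stand-in for the learned
    fusion in the paper). All maps share the same 2-D shape."""
    X = np.stack([m.ravel() for m in maps], axis=1)   # one column per map
    y = fixation_density.ravel()
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def fuse_maps(low_level, feature_maps, weights):
    """Weighted sum of a low-level saliency map with high-level
    facial feature maps (eyes, nose, mouth, ...), rescaled to [0, 1]."""
    maps = [low_level] + list(feature_maps)
    fused = sum(w * m for w, m in zip(weights, maps))
    lo, hi = fused.min(), fused.max()
    return (fused - lo) / (hi - lo + 1e-12)           # avoid division by zero
```

With ground-truth fixation maps from an eye-tracking database, `learn_weights` recovers how strongly each feature map should contribute; `fuse_maps` then produces the final saliency prediction for a new face image.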

Conclusion

In this paper, we focus on the problem of visual attention analysis and prediction on human faces. We perform eye-tracking experiments on faces of various sizes and find that visual attention is distributed more dispersedly on larger faces. A visual attention database for faces, named the VAF database, is constructed, and eye fixation distribution on faces is analyzed. Subjects generally fixate on particular areas, e.g., the eyes, mouth and nose. Moreover, fixations within each area are distributed with a bias.

Acknowledgements

This work was supported in part by National Natural Science Foundation of China under Grants 61422112, 61371146, 61521062, and 61527804.

References (50)

  • A. Borji et al.

    Quantitative analysis of human-model agreement in visual saliency modeling: a comparative study

    IEEE Trans. Image Process.

    (2013)
  • N. Bruce et al.

    Saliency based on information maximization

    Proceedings of the Advances in Neural Information Processing Systems

    (2005)
  • M. Cerf et al.

    Predicting human gaze using low-level saliency combined with face detection

    Proceedings of the Advances in Neural Information Processing Systems

    (2008)
  • Z. Che et al.

    A hierarchical saliency detection approach for bokeh images

    Proceedings of the IEEE International Workshop on Multimedia Signal Processing

    (2015)
  • Z. Che et al.

    Influence of spatial resolution on state-of-the-art saliency models

    Proceedings of the Pacific-Rim Conference on Multimedia

    (2015)
  • A. Coutrot et al.

    How saliency, faces, and sound influence gaze in dynamic social scenes

    J. Vis.

    (2014)
  • L. Duan et al.

    Visual saliency detection by spatially weighted dissimilarity

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2011)
  • H. Eisenbarth et al.

    Happy mouth and sad eyes: scanning emotional facial expressions

    Emotion

    (2011)
  • E. Erdem et al.

    Visual saliency estimation by nonlinearly integrating features using region covariances

    J. Vis.

    (2013)
  • R.-E. Fan et al.

    LIBLINEAR: a library for large linear classification

    J. Mach. Learn. Res.

    (2008)
  • Y. Fang et al.

    Saliency-based stereoscopic image retargeting

    Inf. Sci.

    (2016)
  • K. Gu et al.

    A fast reliable image quality predictor by fusing micro- and macro-structures

    IEEE Trans. Ind. Electron.

    (2017)
  • K. Gu et al.

    Visual saliency detection with free energy theory

    IEEE Signal Process. Lett.

    (2015)
  • K. Gu et al.

    Brightness preserving video contrast enhancement using s-shaped transfer function

    Proceedings of the IEEE Visual Communications and Image Processing

    (2013)
  • K. Gu et al.

    Automatic contrast enhancement technology with saliency preservation

    IEEE Trans. Circuits Syst. Video Technol.

    (2015)