Information Sciences

Volume 420, December 2017, Pages 417-430

Visual attention analysis and prediction on human faces

https://doi.org/10.1016/j.ins.2017.08.040

Abstract

Human faces almost always attract visual attention because of the rich semantic information they carry. Although some visual attention models that incorporate face cues do perform better on images containing faces, there is no systematic analysis of how visual attention is deployed on human faces in the context of visual attention modelling, nor is there any attention model designed specifically for face images. On faces, many high-level factors influence visual attention. To investigate visual attention on human faces, we first construct a Visual Attention database for Faces (VAF database), composed of 481 face images along with eye-tracking data from 22 viewers. Statistics of the eye-movement data show that high-level factors such as face size, facial features and face pose affect visual attention. We therefore propose to build visual attention models specifically for face images by combining low-level saliency, calculated by traditional saliency models, with high-level facial features. The effectiveness of the built models is verified on the VAF database: when combined with high-level facial features, most saliency models achieve better performance.

Introduction

We can handle massive amounts of visual information efficiently because of our remarkable ability to focus on salient events. This ability of the human visual system (HVS) is known as visual attention. Visual attention has been widely investigated and applied in numerous signal processing applications (e.g., image quality assessment [18], [22], [31], [32], [36], [45], [46], automatic contrast enhancement [20], [21], image retargeting [16]) and computer vision applications (e.g., salient object detection [40], scene classification [42], image retrieval [47]). Over the long-term research on visual attention, various computational models have been proposed with encouraging results [4], [5]. Traditional visual saliency models generally take full advantage of low-level visual features [4], [5], [9], [19], [30], which is reasonable since visual attention can be influenced by these low-level features [10], [33]. These models perform well on stimuli that are well represented by low-level features, but they are less effective in some “semantic” situations, especially in social scenes.

Birmingham et al. [2] pointed out that saliency does not account for fixations within social scenes, where observers’ fixations are mainly driven by social information. Birmingham et al. [3] reviewed human social attention; an important conclusion is that people tend to look at faces, especially at the eyes. Beyond social scenes, some research has specifically investigated gaze allocation during face exploration. Võ et al. [44] observed that gaze is dynamically directed to the eyes, nose, or mouth according to the currently depicted event. Moreover, auditory speech plays an important role: attention focuses on the mouth region when the face is speaking, but fixations in the mouth region decrease when the speech signal is removed. Besides auditory speech, general audio information also influences visual attention [34], [37], [38]. Min et al. [37], [38] fused audio and visual information to predict eye fixations; their method shows superiority in scenes containing moving, sounding objects. Eisenbarth and Alpers [13] found that subjects fixate on mouth regions for a longer time for happy expressions, while the eyes receive more attention for sad or angry expressions. This is reasonable since facial expressions convey social information, and certain facial regions are most characteristic of specific emotions.

Since many factors concerning human faces have been shown to influence visual attention, researchers have started to incorporate face cues into visual attention modeling. The most common way of considering face cues is to combine low-level saliency with a face detector [8], [11], [27], [28]. By emphasizing face regions, such models perform better in scenes containing human faces. Cerf et al. [8] were the first to incorporate a face detector. Judd et al. [28] learned a saliency model based on low-, middle- and high-level image features; in their work, the high-level features consist of face, person and car detectors. Jiang et al. [27] took crowd information into account and proposed a visual attention model for crowd scenes, in which low-level image features and high-level crowd features are integrated through multiple kernel learning (MKL); the crowd features are mainly face-related cues such as face size, face density and face pose. Coutrot and Guyader [11] assessed the impact of faces and speech in conversations and proposed an audio-visual saliency model for natural conversation scenes, based on the observation that speaking faces are generally much more salient than other faces.

Although plenty of visual attention models take faces into consideration [8], [11], [27], [28], their way of incorporating face cues is not comprehensive: they simply detect faces and then emphasize the face regions. As described in the second paragraph of this section, many high-level factors influence fixation distribution on faces [2], [3], [13], [44], but few researchers have applied these psychological findings to visual attention modeling, and little work has been done to build a visual attention computational model for faces. In practice, there are many visual communication systems in which faces dominate the scene, such as video calls. In such systems, the influence of those high-level factors is significant and face-optimized visual attention models are needed. In this paper, we investigate visual attention allocation on faces and build visual attention models specifically for faces. The contributions of this study are three-fold:

  • First, to support research on visual attention analysis and prediction on human faces, we perform eye-tracking experiments and construct a Visual Attention database for Faces (VAF database). The VAF database is composed of 481 images containing faces of various sizes, poses, ages, genders, etc. Eye-tracking data, face detection results and facial landmark localization results are also provided with the database. The database can facilitate both the analysis of visual attention distribution on faces and the building or evaluation of visual attention models on face images.

  • Second, based on the VAF database, we analyze visual attention allocation on faces. High-level factors such as face size, facial features and face pose are verified to influence visual attention. Results suggest that in images containing small faces, the face is generally viewed as a whole, whereas for large faces, participants tend to focus on specific facial features such as the eyes, nose and mouth. Face pose also affects the attention distribution on faces.

  • Third, the verified high-level factors are incorporated into visual attention modeling. We build visual attention models specifically for face images: we extract features related to the high-level factors and combine them with low-level saliency calculated by traditional bottom-up saliency models. We also evaluate state-of-the-art saliency models on the VAF database. Experimental results show that, when combined with these high-level features, most saliency models achieve better fixation prediction performance on face images.

The rest of this paper is organized as follows. In Section 2, we introduce the eye-tracking experiments and the specifications of the VAF database. Based on the VAF database, fixation distribution on faces is analyzed in Section 3, where the high-level factors that influence visual attention are also identified and verified. In Section 4, we extract high-level facial features and combine them, through learning, with low-level saliency computed by state-of-the-art saliency models. The effectiveness of the improved models is demonstrated on the VAF database. Section 5 concludes this paper.


Stimuli

We collect 481 source images from Flickr, all available under Creative Commons (CC) licenses. The collected images are cropped to resolutions of 1280 × 960, 1024 × 1024 or 768 × 1024 (width × height). All test images contain human faces, and most include only a single face. The collected images contain faces of various sizes, poses, ages, genders, etc. More details are discussed in Section 2.4, where some sample images and statistics of the test stimuli are also presented.

Visual attention analysis on human faces

In this section, we analyze the collected eye-tracking data. Several factors are found to influence visual attention on human faces significantly. Details are discussed below.
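One of the effects examined here is that fixations spread out more on larger faces. As a minimal illustration of how such dispersion can be quantified (this is our own sketch, not the paper's analysis code; the function and parameter names are hypothetical), the spread of fixation points can be normalized by the face size so that values are comparable across faces:

```python
import numpy as np

def fixation_dispersion(fixations, face_box):
    """Spread of fixation points, normalized by face size.

    fixations: (N, 2) array-like of (x, y) fixation coordinates.
    face_box:  (x, y, w, h) bounding box from a face detector.

    Returns the mean distance of fixations from their centroid,
    divided by the face diagonal, so the value is scale-free.
    """
    pts = np.asarray(fixations, dtype=float)
    centroid = pts.mean(axis=0)                       # center of the fixation cloud
    spread = np.linalg.norm(pts - centroid, axis=1).mean()
    diag = np.hypot(face_box[2], face_box[3])         # face diagonal length
    return spread / diag
```

A tightly clustered fixation pattern (e.g., all fixations near one eye) yields a small value, while fixations scattered over the whole face yield a value closer to the cloud radius divided by the face diagonal.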

Visual attention prediction on human faces

As described in previous sections, traditional saliency models may not work well on face images, so we propose to build visual attention models specifically for face images. Fig. 7 illustrates the flowchart of the proposed method. We first detect low-level saliency using traditional saliency models. Then we compute high-level facial features based on the visual attention analyses given in Section 3. Finally, the low-level saliency and high-level facial features are fused adaptively through learning.
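As a rough sketch of this fusion step (our own simplification, assuming plain per-pixel linear weighting fitted by least squares rather than the learning machinery actually used in the paper; all names are illustrative), a low-level saliency map and high-level facial feature maps can be combined with weights fitted against an empirical fixation density map:

```python
import numpy as np

def learn_weights(maps, fixation_density):
    """Least-squares fit of linear fusion weights against an
    empirical fixation density map (a stand-in for the learned
    fusion in the paper). All maps share the same 2-D shape."""
    X = np.stack([m.ravel() for m in maps], axis=1)   # one column per map
    y = fixation_density.ravel()
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def fuse_maps(low_level, feature_maps, weights):
    """Weighted sum of a low-level saliency map with high-level
    facial feature maps (eyes, nose, mouth, ...), rescaled to [0, 1]."""
    maps = [low_level] + list(feature_maps)
    fused = sum(w * m for w, m in zip(weights, maps))
    lo, hi = fused.min(), fused.max()
    return (fused - lo) / (hi - lo + 1e-12)           # avoid division by zero
```

With ground-truth fixation maps from an eye-tracking database, `learn_weights` recovers how strongly each feature map should contribute; `fuse_maps` then produces the final saliency prediction for a new face image.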

Conclusion

In this paper, we focus on the problem of visual attention analysis and prediction on human faces. We perform eye-tracking experiments on faces of various sizes and find that visual attention is distributed more dispersedly on larger faces. A visual attention database for faces, named the VAF database, is constructed, and eye fixation distribution on faces is analyzed. Subjects generally fixate on particular areas, e.g., the eyes, mouth and nose. Moreover, fixations within each area are distributed with a bias.

Acknowledgements

This work was supported in part by National Natural Science Foundation of China under Grants 61422112, 61371146, 61521062, and 61527804.

References (50)

  • A. Borji et al.

    Quantitative analysis of human-model agreement in visual saliency modeling: a comparative study

    IEEE Trans. Image Process.

    (2013)
  • N. Bruce et al.

    Saliency based on information maximization

    Proceedings of the Advances in Neural Information Processing Systems

    (2005)
  • M. Cerf et al.

    Predicting human gaze using low-level saliency combined with face detection

    Proceedings of the Advances in Neural Information Processing Systems

    (2008)
  • Z. Che et al.

    A hierarchical saliency detection approach for bokeh images

    Proceedings of the IEEE International Workshop on Multimedia Signal Processing

    (2015)
  • Z. Che et al.

    Influence of spatial resolution on state-of-the-art saliency models

    Proceedings of the Pacific-Rim Conference on Multimedia

    (2015)
  • A. Coutrot et al.

    How saliency, faces, and sound influence gaze in dynamic social scenes

    J. Vis.

    (2014)
  • L. Duan et al.

    Visual saliency detection by spatially weighted dissimilarity

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2011)
  • H. Eisenbarth et al.

    Happy mouth and sad eyes: scanning emotional facial expressions

    Emotion

    (2011)
  • E. Erdem et al.

    Visual saliency estimation by nonlinearly integrating features using region covariances

    J. Vis.

    (2013)
  • R.-E. Fan et al.

    LIBLINEAR: a library for large linear classification

    J. Mach. Learn. Res.

    (2008)
  • Y. Fang et al.

    Saliency-based stereoscopic image retargeting

    Inf. Sci.

    (2016)
  • K. Gu et al.

    A fast reliable image quality predictor by fusing micro- and macro-structures

    IEEE Trans. Ind. Electron.

    (2017)
  • K. Gu et al.

    Visual saliency detection with free energy theory

    IEEE Signal Process. Lett.

    (2015)
  • K. Gu et al.

    Brightness preserving video contrast enhancement using s-shaped transfer function

    Proceedings of the IEEE Visual Communications and Image Processing

    (2013)
  • K. Gu et al.

    Automatic contrast enhancement technology with saliency preservation

    IEEE Trans. Circuits Syst. Video Technol.

    (2015)