
Pattern Recognition Letters

Volume 112, 1 September 2018, Pages 241-248

Extracting discriminative features using task-oriented gaze maps measured from observers for personal attribute classification

https://doi.org/10.1016/j.patrec.2018.08.001

Highlights

  • Human gaze locations tend to concentrate on informative regions of the human body.

  • We represent the informative region as a task-oriented gaze map.

  • The task-oriented gaze map assigns large weights to frequently viewed regions when extracting discriminative features.

  • Our gaze-based features yield highly accurate personal attribute classification.

Abstract

We discuss how to reveal and use the gaze locations of observers who view pedestrian images for personal attribute classification. Observers look at informative regions when attempting to classify the attributes of pedestrians in images. We therefore hypothesize that the regions in which observers’ gaze locations are clustered contain discriminative features for classifiers of personal attributes. Our method acquires the distribution of gaze locations from several observers while they perform the task of manually classifying each personal attribute. We term this distribution a task-oriented gaze map. To extract discriminative features, we assign large weights to regions with clusters of gaze locations in the task-oriented gaze map. In our experiments, observers looked mainly at different body regions when classifying each personal attribute. Furthermore, our experiments show that the gaze-based feature extraction method significantly improved the performance of personal attribute classification when combined with a convolutional neural network or a metric learning technique.

Introduction

Personal attributes such as gender, clothing, and carried items, which are of interest in the field of soft biometrics [6], [7], [27], [32], facilitate the collection of statistical data about people in public spaces. Furthermore, personal attributes have many potential applications, such as video surveillance and consumer behavior analysis. In general, pedestrians captured on video or in still images are used for personal attribute classification. Researchers have proposed several methods for automatically classifying personal attributes in pedestrian images; for example, techniques involving convolutional neural networks (CNNs) [22], [25], [29], [30] and metric learning [21], [41] have been proposed. These methods can extract discriminative features for personal attribute classification and achieve high accuracy when many training samples containing diverse pedestrian images are acquired in advance. However, collecting a sufficient number of training samples is very time-consuming, and the performance of the existing methods decreases when the number of training samples is small.

People classify personal attributes quickly and accurately. We believe that people have the visual ability to extract informative features from an individual. For instance, people correctly classify gender from facial images [3], [4]. In the field of cognitive science, Yarbus [38] reported that human observers can recognize personal attributes in a scene image with high accuracy when given different tasks, such as remembering the clothes worn by the individuals or estimating their ages. Notably, he observed that the observers paid attention to different regions of the scene when tackling different tasks, even though they viewed the same image. Recently, researchers have analyzed the role of the task in various applications [13], [14], [19]. Based on these observations, we hypothesize that people pay attention to different informative regions in pedestrian images depending on the personal attribute classification task at hand.

If human visual abilities could be reproduced algorithmically, a computer might achieve classification performance equivalent to that of humans even with a small number of training samples. With respect to object recognition, several methods for mimicking human visual abilities have been proposed [12], [33], [40]. These methods exploit a saliency map computed from low-level features in a given image using techniques such as those described in [17], [39], [42]. However, a saliency map computed from low-level features does not sufficiently represent human visual abilities, which also rely on deeper mechanisms of human vision.

An increasing number of pattern recognition studies, specifically those attempting to mimic human visual ability, have measured the gaze locations of observers [11], [18], [31], [36], [37]. These gaze locations have great potential as a source of informative features for various recognition tasks. Very recently, state-of-the-art techniques [26], [28] have demonstrated that gaze locations can help to extract informative features for attribute classification of fashion clothing and face images. However, these methods do not consider the case in which observers tackle different tasks involving body attributes in the same pedestrian image. We believe that the informative body region differs significantly for each task of personal attribute classification.

In this paper, we consider the challenging case in which participants in an experiment are given different tasks of personal attribute classification while viewing the same pedestrian images. We examine whether participants look at different regions when tackling each task, and whether the gaze locations measured from the participants play an important role in personal attribute classification. To this end, we generate a task-oriented gaze map from the distribution of gaze locations recorded while participants view images to manually classify each personal attribute. High values in a task-oriented gaze map correspond to regions that are frequently viewed by participants. We assume that these regions contain discriminative features for each personal attribute classifier because they appear to be useful to the participants when tackling the corresponding classification task. When extracting features to learn a classifier, larger weights are given to the regions of the pedestrian images that correspond to the attention regions of the task-oriented gaze maps. The experimental results indicate that this gaze-guided feature extraction improves the accuracy of personal attribute classification with a CNN or metric learning technique when only a small number of training samples is available.
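The paper does not include an implementation, but the construction of a task-oriented gaze map from recorded fixations can be sketched as follows. In this minimal Python sketch, each fixation contributes a count at its pixel location, the counts are smoothed with a Gaussian kernel, and the result is normalized to [0, 1]; the function name, the Gaussian smoothing, and the sigma value are illustrative assumptions, not details taken from the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def task_oriented_gaze_map(fixations, image_shape, sigma=15.0):
    """Accumulate fixations recorded from all observers for one attribute
    task into a dense map whose values are normalized to [0, 1].

    fixations   : iterable of (x, y) gaze coordinates in image space
    image_shape : (height, width) of the pedestrian images
    sigma       : spread (in pixels) of the smoothing kernel (assumed value)
    """
    h, w = image_shape
    counts = np.zeros((h, w), dtype=np.float64)
    for x, y in fixations:
        xi, yi = int(round(x)), int(round(y))
        if 0 <= yi < h and 0 <= xi < w:
            counts[yi, xi] += 1.0              # one vote per fixation
    gaze_map = gaussian_filter(counts, sigma=sigma)  # smooth the vote map
    if gaze_map.max() > 0:
        gaze_map /= gaze_map.max()             # normalize so the peak equals 1
    return gaze_map
```

A separate map would be built for each attribute task (e.g., gender or carried items), since the paper reports that observers attend to different body regions for different tasks.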

This paper is organized as follows. Section 2 describes related work, Section 3 describes the generation of task-oriented gaze maps, and Section 4 describes feature extraction using the maps. Our concluding remarks are given in Section 5.

Section snippets

Related work

To mimic human visual ability, existing methods [12], [33], [40] use saliency maps of object images, which represent the regions that draw visual attention. Walther et al. [33] combined a recognition algorithm with a saliency map generated, using the model of [17], from low-level features such as gradients of color and intensity. Other techniques [12], [40] use the object labels of images in addition to the low-level features of objects to generate saliency maps.
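For intuition only, the sketch below computes a crude low-level saliency proxy from gradient magnitudes of an intensity channel and two opponent-color channels. It is not the model of [17], which involves multi-scale center-surround operations; it merely illustrates the kind of low-level color and intensity cues that such saliency maps are built from, and every name in it is hypothetical.

```python
import numpy as np
from scipy import ndimage

def lowlevel_saliency(rgb):
    """Rough saliency proxy: summed gradient magnitude of the intensity
    channel and two opponent-color channels, normalized to [0, 1]."""
    rgb = rgb.astype(np.float64)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    intensity = (r + g + b) / 3.0
    rg = r - g                          # red-green opponent channel
    by = b - (r + g) / 2.0              # blue-yellow opponent channel
    sal = np.zeros_like(intensity)
    for channel in (intensity, rg, by):
        gx = ndimage.sobel(channel, axis=1)
        gy = ndimage.sobel(channel, axis=0)
        sal += np.hypot(gx, gy)         # accumulate gradient magnitude
    sal -= sal.min()
    if sal.max() > 0:
        sal /= sal.max()                # normalize to [0, 1]
    return sal
```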

Gaze locations in personal attribute classification

Here, we consider the regions of pedestrian images that are frequently looked at by observers when manually classifying personal attributes. For instance, Hsiao et al. [15] found that observers looked at a region around the nose when identifying individuals from a facial image. In the case of gender classification, we believe that the human face plays an important role. However, a pedestrian image contains not only a face but also a body. Yarbus [38] found that observers look at different regions of the same image depending on the task they are given.

Overview of our method

Here, we describe our method for extracting features using task-oriented gaze maps. The regions with high values in the maps appear to contain features that are informative for the participants, because these regions received attention while the participants manually classified the personal attribute in the pedestrian images for each task. We assume that these regions also contain discriminative features for the classifiers of personal attributes. Based on this assumption, we extract such features by assigning larger weights to the regions of the pedestrian images that correspond to the attention regions of the task-oriented gaze maps.
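One plausible reading of this weighting step, sketched below in Python, is a pixel-wise rescaling of the pedestrian image by the task-oriented gaze map before the image is passed to a CNN or a metric-learning feature extractor. The function name, the weight floor, and the pixel-wise form are illustrative assumptions rather than the authors' exact formulation.

```python
import numpy as np

def weight_image_by_gaze_map(image, gaze_map, floor=0.2):
    """Emphasize regions that observers looked at by scaling pixel values
    with the task-oriented gaze map before feature extraction.

    image    : (H, W, C) float array in [0, 1]
    gaze_map : (H, W) map in [0, 1] for the current attribute task
    floor    : minimum weight so rarely viewed regions are damped rather
               than erased (an assumed design choice, not from the paper)
    """
    weights = floor + (1.0 - floor) * gaze_map
    return image * weights[..., None]   # broadcast weights over channels
```

The weighted image can then replace the raw pedestrian image as the input to whichever feature extractor (CNN or metric learning) is being trained.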

Conclusions

We hypothesized that gaze locations measured from observers performing a classification task contain informative cues and help to extract discriminative features for classifiers of personal attributes. We demonstrated that the measured gaze locations tended to concentrate on specific regions of the human body according to the manual personal attribute classification task. We represented the informative region as a task-oriented gaze map for each personal attribute classifier. Owing to the task-oriented gaze maps, our gaze-based feature extraction significantly improved the accuracy of personal attribute classification when combined with a CNN or metric learning technique.

Acknowledgments

This work was partially supported by JSPS KAKENHI Grant No. JP17K00238 and MIC SCOPE Grant No. 172308003.

Conflict of Interest

None

References (42)

  • P. Dollar et al., Pedestrian detection: an evaluation of the state of the art, IEEE Trans. Pattern Anal. Mach. Intell. (2012)

  • M. Fairchild, Color Appearance Models (2013)

  • A. Fathi et al., Learning to recognize daily actions using gaze, Proceedings of the 12th European Conference on Computer Vision (2012)

  • D. Gao et al., Discriminant saliency for visual recognition from cluttered scenes, Proceedings of Neural Information Processing Systems (2004)

  • M. Hayhoe et al., Eye movements in natural behavior, Trends Cogn. Sci. (2005)

  • J. Hsiao et al., Two fixations suffice in face recognition, Psychol. Sci. (2008)

  • J. Huang et al., Speed/accuracy trade-offs for modern convolutional object detectors, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)

  • L. Itti et al., A model of saliency-based visual attention for rapid scene analysis, IEEE Trans. Pattern Anal. Mach. Intell. (1998)

  • N. Karessli et al., Gaze embeddings for zero-shot image classification, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)

  • M. Land et al., The roles of vision and eye movements in the control of activities of daily living, Perception (1999)

  • M. Li et al., Head-shoulder based gender recognition, Proceedings of IEEE International Conference on Image Processing (2013)