Crowdsourcing facial expressions for affective-interaction

https://doi.org/10.1016/j.cviu.2016.02.001

Highlights

  • The contribution of this paper is a dataset to foster affective-interaction research and applications.

  • Facial-expression images were captured while users played a game that responded to facial expressions.

  • Statistical consensus techniques were used to merge 229,584 judgments obtained by crowdsourcing to produce high-quality labels for the 42,911 images.

Abstract

Affective-interaction in computer games is a novel area with several new challenges, such as robustly detecting players' facial expressions. Many existing facial-expression datasets are composed of posed face images that were not captured in a realistic affective-interaction setting. The contribution of this paper is an affective-interaction dataset captured while users played a game that reacted to their facial expressions. The dataset is the result of a framework designed to gather affective-interaction data and annotate it with high-quality labels. The first part of the framework is a computer game [15] designed to elicit particular facial expressions that directly control the game outcome; the game thus creates a true and engaging affective-interaction scenario in which facial-expression data were captured. The proposed dataset is composed of sequences of video frames in which faces were detected while users interacted with the game through their facial expressions. The second part of the framework is a crowdsourcing process that asks annotators to identify the facial expression present in a given face image. Each face image was annotated with one of the facial expressions: happy, anger, disgust, contempt, sad, fear, surprise, and neutral. We examined how the annotators' performance was affected by multiple variables, e.g., reward, judgment limits, and golden questions. Once these parameters were tuned, we gathered 229,584 annotations for all 42,911 images. Statistical consensus techniques were then used to merge the annotators' judgments and produce high-quality image labels. Finally, we compared different classifiers trained on ground-truth (expert) labels and on crowdsourced labels: we observed no difference in classification accuracy, which confirms the quality of the produced labels. We thus conclude that the proposed affective-interaction dataset provides a unique set of images of people playing games with their facial expressions, together with labels whose quality is similar to that of expert labels (differences are below 9%).

Introduction

Novel input techniques that enable human–computer interaction with the user's body are becoming increasingly popular. Microsoft Kinect introduced full-body, gesture-based interaction to the mainstream public: users can play games with their body by performing actions such as jumping or boxing. Although body movements and gestures provide a rich source of input, they can be augmented with affective-interaction through the recognition of a player's facial expression, enabling a more natural interaction. Fig. 1a shows the same player performing the same action, a punch, but with different facial expressions: a neutral and an angry expression. An affective-interaction-aware application would consider both the body motion and the facial expression when scoring or recognizing actions, giving a higher score to the punch on the right of Fig. 1a.

Computer games are a primary example of where computational actions can be adjusted to the player's facial expression; however, interest in facial expressions has moved far beyond computer-vision and interaction research, reaching areas like media consumption and health applications [14], [20]. McDuff et al. [14] showed the effectiveness of a smile for rating videos; they developed a video recommendation system in which viewers' facial expressions are used to rate videos, and their experiments were validated with manual affective-feedback data collected through crowdsourcing. These applications require a realistic set of annotated images so that researchers can improve current algorithms in the context of affective-interaction.

To address this challenge, we devised a framework to gather affective-interaction data through a computer game, followed by a crowdsourcing process to acquire high-quality judgments of that data (see Fig. 1b). Crowdsourcing services are increasingly explored in research tasks to generate large volumes of human-annotated knowledge [20]. In this paper, we follow the terminology defined in Table 1. In our case, crowdsourcing enables the collection of an overcomplete set of facial-expression judgments. In step 1, we implemented a game [15] that captures players' faces while they interact with it. In this scenario, players controlled the game-play through their facial expressions, what we call an affective-interaction scenario. Over the span of several game rounds, a large set of unlabeled interaction images was collected. Next, a crowdsourcing process was used to annotate each image with a facial expression. The design of the crowdsourcing job was carefully planned: we obtained several judgments per image (facial expression and corresponding intensity). This is particularly important because facial-expression classification is a multi-class problem, rather than a binary one.

The data was then uploaded to the crowdsourcing site (step 2a), a crowdsourcing job was designed (step 2b), and a set of judgments was obtained for each face image (step 2c). These judgments were later merged with statistical consensus methods to obtain the optimal labels (step 3). This last step is particularly relevant to extract the maximum quality from the crowdsourcing judgments. To improve the quality of the obtained judgments, i.e., the inter-annotator agreement, we need to model the annotators' behavior when judging facial-expression images. We profiled multiple state-of-the-art statistical consensus methods in the facial-expression domain using a dataset annotated with expert labels, CK+ (Cohn–Kanade) [12]. Considering the label estimates provided by each method, several classifiers were then trained to recognize facial expressions, which allowed us to further analyze the quality of the crowdsourcing labels.
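To make the consensus step concrete, the sketch below implements the classic Dawid–Skene estimator (Dawid et al., listed in the references), one family of statistical consensus methods of the kind profiled here. This is a minimal illustration under our own assumptions about the data layout (integer item, annotator, and label ids), not the paper's actual implementation:

```python
import numpy as np

def dawid_skene(judgments, n_classes, n_iter=50):
    """Estimate true labels from redundant annotator judgments via EM
    (Dawid & Skene, 1979). `judgments` is a list of
    (item_id, annotator_id, label) triples with integer ids."""
    items = sorted({i for i, _, _ in judgments})
    annotators = sorted({a for _, a, _ in judgments})
    item_idx = {i: k for k, i in enumerate(items)}
    ann_idx = {a: k for k, a in enumerate(annotators)}
    n_items, n_ann = len(items), len(annotators)

    # counts[i, a, l] = 1 if annotator a gave label l to item i
    counts = np.zeros((n_items, n_ann, n_classes))
    for i, a, l in judgments:
        counts[item_idx[i], ann_idx[a], l] += 1

    # Initialize the posterior label estimates with a majority vote.
    T = counts.sum(axis=1)
    T = T / T.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step: class priors and per-annotator confusion matrices.
        priors = T.mean(axis=0)
        # pi[a, j, l] = P(annotator a says l | true class is j)
        pi = np.einsum('ij,ial->ajl', T, counts)
        pi = pi / pi.sum(axis=2, keepdims=True).clip(min=1e-12)

        # E-step: posterior over true classes given all judgments.
        log_T = np.log(priors.clip(min=1e-12))[None, :].repeat(n_items, 0)
        log_T += np.einsum('ial,ajl->ij', counts, np.log(pi.clip(min=1e-12)))
        log_T -= log_T.max(axis=1, keepdims=True)
        T = np.exp(log_T)
        T = T / T.sum(axis=1, keepdims=True)

    return items, T  # T[i] is the posterior over classes for item i
```

A plain majority vote corresponds to stopping after the initialization; the EM iterations additionally weight each annotator by an estimated confusion matrix, which is what lets such methods down-weight unreliable annotators.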

In summary, the key contribution of this paper is a novel facial-expression dataset to foster the affective-interaction field, in particular computer-game interaction. From this point forward, we refer to the released dataset as the NovaEmotions dataset. Two further contributions confirm the value of the proposed dataset: building on previous work by Sheshadri and Lease [21], we benchmarked several state-of-the-art statistical consensus methods in the domain of affective-interaction, and we compared the performance of facial-expression classifiers trained with expert labels against classifiers trained with crowdsourced labels. These experiments confirmed that, although the crowdsourced labels differ from the expert labels by less than 9%, the facial-expression classifiers present no significant difference. In the end, we obtained a unique facial-expression dataset of users playing an affective-interactive computer game.

The remainder of this paper is organized as follows: Section 2 discusses related work and Section 3 details the acquisition of the 42,911 affective-interaction images. The crowdsourcing process for obtaining the 229,584 facial-expression judgments is presented in Section 4, and the results are discussed in Section 5. Finally, in Section 6, we assess the data quality by comparing different classifiers trained on the proposed dataset and on the widely known CK+ dataset.


Human computation

Although computing power has increased exponentially in recent years, humans still achieve better results in understanding human language, image semantics, and many other tasks. Researchers have been studying ways of using humans as a source of computation [18]. However, unlike computers, humans need motivation and incentives to work and to produce quality results. Many human-computation systems have been proposed [32] to collect the most accurate data from humans, of which gamification was one

Affective-interaction image data

This section describes how the affective-interaction images were captured. Existing facial-expression datasets, like CK+ [13] or BU-4DFE [31], were captured in controlled environments and, in the case of CK+, by people trained to perform a prototype expression. The NovaEmotions face images were captured in a novel and natural setting: players competing in a game where facial expression is the sole controller of the game. Fig. 2 illustrates the sequence of two players that are using their facial
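The paper's own capture pipeline is described in the full section; as a rough, hypothetical illustration of detecting and cropping faces from a webcam stream while a game runs, one could use OpenCV and its bundled Haar cascade (the loop structure and file naming below are our own assumptions):

```python
import cv2

# Hypothetical capture loop: detect faces in webcam frames during game-play
# and store the crops for later crowdsourced annotation.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
camera = cv2.VideoCapture(0)

for frame_id in range(10_000):  # bounded loop instead of the game's own clock
    ok, frame = camera.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in cascade.detectMultiScale(gray, scaleFactor=1.3,
                                                 minNeighbors=5):
        cv2.imwrite(f"face_{frame_id:06d}.png", frame[y:y + h, x:x + w])

camera.release()
```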

Crowdsourcing task design

The design of the crowdsourcing task directly affects how the label data is collected and the quality of the annotator judgments. Therefore, in this section we identify the parameters that affect crowdsourcing results and present our parameter-tuning experiments to determine the most reliable parameter values. We divide these parameters into two groups: annotator selection and job attributes. Fig. 6 shows a screenshot of a job page on the crowdsourcing Web site.

Affective-interaction label data analysis

The previous sections described the image- and label-data acquisition process. This section analyzes the quality of the entire set of collected judgments: the dataset has 42,911 images and 229,584 judgments. We required 5 judgments per image, paid $0.006 per judged image, and did not limit the number of judgments a single annotator is allowed to perform, but limited the number of judgments per page to 20. Gold questions used the images that obtained an agreement of 0.9 in the design jobs. The
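For reference, the final job parameters reported above can be gathered into a single configuration sketch; the field names below are illustrative and do not correspond to the crowdsourcing platform's actual API:

```python
# Final crowdsourcing job parameters as reported in the text
# (field names are illustrative, not the platform's API).
JOB_CONFIG = {
    "judgments_per_image": 5,
    "payment_per_judgment_usd": 0.006,
    "max_judgments_per_annotator": None,        # deliberately unlimited
    "judgments_per_page": 20,
    "gold_question_agreement_threshold": 0.9,   # from the design jobs
    "labels": ["happy", "anger", "disgust", "contempt",
               "sad", "fear", "surprise", "neutral"],
}
```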

Comparison of crowdsourced labels to expert labels

The objective of a crowdsourcing task is to generate labels that are as correct as possible, allowing classifiers to be trained to high accuracies. In the previous sections, we observed that crowdsourcing is indeed a good alternative to ground truth generated by experts, with a difference of less than 9%. In this section, we examine the reliability of crowdsourced labels for training a facial-expression classifier.
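The shape of this experiment can be sketched as follows; the actual features and classifiers are described in the full section, so the linear SVM and the array-based data layout below are placeholder assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def compare_label_sources(features, expert_labels, crowd_labels, seed=0):
    """Train identical classifiers on expert vs. crowdsourced labels and
    score both against held-out expert labels."""
    idx = np.arange(len(features))
    train, test = train_test_split(idx, test_size=0.25, random_state=seed)
    accuracies = {}
    for name, labels in (("expert", expert_labels), ("crowd", crowd_labels)):
        clf = SVC(kernel="linear").fit(features[train], labels[train])
        # Held-out accuracy is always measured against the expert labels.
        accuracies[name] = clf.score(features[test], expert_labels[test])
    return accuracies
```

If the crowdsourced labels are close to expert quality, the two accuracies should be statistically indistinguishable, which matches the outcome reported in the abstract.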

In order to perform this experiment, we used the Cohn and Kanade [12] dataset

Conclusion

This article describes an affective-interaction dataset with both image and label data. Image data was captured while users played a game, and label data was collected through a remote crowdsourcing task. To filter out noisy crowdsourcing labels, we implemented several quality-control measures. As a result, we release a dataset with over 40,000 images of players' facial expressions and multiple judgments per image. Judgments for the full set of images are also provided to

Acknowledgments

We would like to thank the volunteers who tested the game and granted research rights over the collected image data. This work has been partially funded by the Portuguese National Foundation under the project UTA-Est/MAI/0010/2009.

References (33)

  • S. Ahern et al.

    Zonetag: designing context-aware mobile media capture to increase participation

    Proceedings of the Pervasive Image Capture and Sharing, 8th International Conference on Ubiquitous Computing, California

    (2006)
  • B. Borsboom

    Guess Who?: a game to crowdsource the labeling of affective facial expressions is comparable to expert ratings

    (2012)
  • Y.-Y. Chen et al.

    Learning facial attributes by crowdsourcing in social media

    Proceedings of the 20th International Conference Companion on World Wide Web

    (2011)
  • A.P. Dawid et al.

    Maximum likelihood estimation of observer error-rates using the EM algorithm

    Appl. Stat.

    (1979)
  • G. Demartini et al.

    Zencrowd: leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking

    Proceedings of the 21st International Conference on World Wide Web

    (2012)
  • S. Deterding, D. Dixon, From game design elements to gamefulness: defining “Gamification”, 2011, pp....
  • S. Deterding et al.

    Gamification: using game-design elements in non-gaming contexts

    Proceedings of the 2011 Annual Conference Extended Abstracts on Human Factors in Computing Systems, CHI EA’11

    (2011)
  • A. Doan et al.

    Crowdsourcing systems on the world-wide web

    Commun. ACM

    (2011)
  • P. Ekman et al.

    Facial Action Coding System: A Technique for the Measurement of Facial Movement

    (1978)
  • P. Ekman

    Facial expression and emotion

    Am. Psychol.

    (1993)
  • R.E. Jack et al.

    Facial expressions of emotion are not culturally universal

    Proc. Natl. Acad. Sci. USA

    (2012)
  • T. Kanade et al.

    Comprehensive database for facial expression analysis

    Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition

    (2000)
  • P. Lucey et al.

    The extended Cohn–Kanade dataset (CK+): a complete dataset for action unit and emotion-specified expression

    Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)

    (2010)
  • D. Mcduff et al.

    Crowdsourcing facial responses to online videos

    IEEE Trans. Affect. Comput.

    (2012)
  • A. Mourão et al.

    Competitive affective gaming: winning with a smile

    Proceedings of the 21st ACM International Conference on Multimedia

    (2013)
  • A. Mourão et al.

    Facial expression recognition by sparse reconstruction with robust features

    Image Analysis and Recognition

    (2013)