Crowdsourcing facial expressions for affective-interaction
Introduction
Novel input techniques enabling human–computer interaction through the user's body are becoming increasingly popular. Microsoft Kinect introduced full-body gesture-based interaction to the mainstream public: with Kinect, users can play games with their body by performing actions such as jumping or boxing. Although body movements and gestures provide a rich source of input, they can be augmented with affective-interaction through the recognition of a player's facial expression, yielding a more natural interaction. Fig. 1a shows the same player performing the same action, a punch, with different facial expressions: a neutral and an angry expression. An affective-interaction aware application would consider both the body motion and the facial expression when scoring or recognizing actions, giving a higher score to the punch action on the right of Fig. 1a.
Computer games are a prime example of applications whose behavior can be adjusted to the player's facial expression; however, interest in facial expressions has moved far beyond computer vision and interaction research, reaching areas such as media consumption and health applications [14], [20]. McDuff et al. [14] showed the effectiveness of a smile for rating videos: they developed a video recommendation system in which viewers' facial expressions are used to rate videos. Moreover, their experiments were validated with manual affective-feedback data collected through crowdsourcing. Such applications require a realistic set of annotated images so that researchers can improve current algorithms in the context of affective-interaction.
To address this challenge, we devised a framework to gather affective-interaction data through a computer game and a crowdsourcing process to acquire high-quality judgments of that data, see Fig. 1b. Crowdsourcing services are increasingly explored in research tasks to generate large volumes of human-level annotated knowledge [20]. In this paper, we follow the terminology defined in Table 1. In our case, crowdsourcing enables the collection of an overcomplete set of facial expression judgments. In step 1, we implemented a game [15] that captures players' faces while they interact with the game. In this scenario, players controlled the gameplay through their facial expressions – what we call an affective-interaction scenario. Over the span of several game rounds, a large set of unlabeled interaction images was collected. Next, a crowdsourcing process was used to annotate each image with a facial expression. The design of the crowdsourcing job was carefully planned: we obtained several judgments per image (facial expression and corresponding intensity). This is particularly important because facial expression classification is a multi-class problem, rather than a binary one.
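Each collected judgment thus pairs a facial-expression class with an intensity for one captured image. As a minimal sketch of such a record — the field names and the intensity scale here are hypothetical, the released dataset may use different ones:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Judgment:
    """One crowdsourced judgment of one captured face image (hypothetical schema)."""
    image_id: str       # captured interaction frame
    annotator_id: str   # crowd worker who produced the judgment
    expression: str     # one class of a multi-class label set, e.g. "angry"
    intensity: int      # perceived expression strength, e.g. on a 1-5 scale
```

Several such judgments per image make it possible to both estimate a consensus label and measure inter-annotator agreement.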
The data was then uploaded to the crowdsourcing site (step 2a), a crowdsourcing job was designed (step 2b), and a set of judgments was obtained for each face image (step 2c). These judgments were later merged with statistical consensus methods to obtain the optimal labels (step 3). This last step is particularly relevant to extract the maximum quality from the crowdsourcing judgments. To improve the quality of the obtained judgments, i.e. the inter-annotator agreement, we need to model annotators' behavior when judging facial expression images. We profiled multiple state-of-the-art statistical consensus methods in the facial-expression domain using a dataset annotated with expert labels, CK+ (Cohn–Kanade) [12]. Considering the label estimates provided by each method, several classifiers were then trained to recognize facial expressions. This allowed us to further analyze the quality of the crowdsourced labels.
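A classic example of such a consensus method is the Dawid–Skene maximum-likelihood estimator, which models each annotator with a confusion matrix and alternates between estimating per-image label posteriors and per-annotator reliability. The following is a simplified NumPy sketch of its EM iteration (not the paper's exact implementation; it assumes every image receives at least one judgment):

```python
import numpy as np

def dawid_skene(judgments, n_items, n_workers, n_classes, n_iter=50):
    """EM estimation of true labels from redundant, noisy judgments.

    judgments: list of (item, worker, label) triples; every item is
    assumed to have at least one judgment.
    """
    # Initialize the posterior over true labels with majority-vote fractions.
    T = np.zeros((n_items, n_classes))
    for i, w, l in judgments:
        T[i, l] += 1.0
    T /= T.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step: class priors and one confusion matrix per worker.
        priors = T.mean(axis=0)
        pi = np.full((n_workers, n_classes, n_classes), 1e-6)  # smoothing
        for i, w, l in judgments:
            pi[w, :, l] += T[i]
        pi /= pi.sum(axis=2, keepdims=True)

        # E-step: posterior over each item's true label.
        logT = np.tile(np.log(priors + 1e-12), (n_items, 1))
        for i, w, l in judgments:
            logT[i] += np.log(pi[w, :, l])
        T = np.exp(logT - logT.max(axis=1, keepdims=True))
        T /= T.sum(axis=1, keepdims=True)

    return T.argmax(axis=1)
```

Unlike plain majority voting, this estimator can down-weight a consistently unreliable annotator, which is why modeling annotator behavior pays off for multi-class labels.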
In summary, the key contribution of this paper is a novel facial-expression dataset to foster the affective-interaction field, in particular computer-game interaction. From this point forward, we refer to the released dataset as the NovaEmotions dataset. Two further contributions confirm the value of the proposed dataset: building on previous work by Sheshadri and Lease [21], we benchmarked several state-of-the-art statistical consensus methods in the domain of affective-interaction, and compared the performance of facial-expression classifiers trained with expert labels against classifiers trained with crowdsourced labels. These experiments confirmed that, although the crowdsourced labels differ from the expert labels by less than 9%, the resulting facial-expression classifiers show no significant performance difference. In the end, we obtained a unique facial expression dataset of users playing an affective-interactive computer game.
The remainder of this paper is organized as follows: Section 2 discusses related work and Section 3 details the acquisition of the 42,911 affective-interaction images. The crowdsourcing process for obtaining the 229,584 facial-expression judgments is presented in Section 4 and the results are discussed in Section 5. Finally, in Section 6, we assess the data quality by comparing classifiers trained on the proposed dataset and on the widely known CK+ dataset.
Human computation
Although computing power has increased exponentially in recent years, humans still achieve better results at understanding human languages, image semantics and many other tasks. Researchers have been studying ways of using humans as a source of computation [18]. However, unlike computers, humans need motivation and incentives to work and to produce quality results. Many human computation systems have been proposed [32] to collect the most accurate data from humans, of which gamification was one
Affective-interaction image data
This section describes how the affective-interaction images were captured. Existing facial expression datasets such as CK+ [13] or BU-4DFE [31] were captured in controlled environments and, in the case of CK+, by people trained to perform a prototypical expression. The NovaEmotions face images were captured in a novel and natural setting: players competing in a game where facial expression is the sole controller of the game. Fig. 2 illustrates the sequence of two players that are using their facial
Crowdsourcing task design
The design of the crowdsourcing task directly affects how the label data is collected and the quality of the annotator judgments. Therefore, in this section we identify the parameters that affect crowdsourcing results and present our tuning experiments to determine the most reliable parameter values. We divide these parameters into two groups: annotator selection and job attributes. Fig. 6 shows a screenshot of a job page on the crowdsourcing Web site.
Affective-interaction label data analysis
The previous sections described the image and label data acquisition process. This section analyzes the quality of the entire set of collected judgments: the dataset has 42,911 images and 229,584 judgments. We required 5 judgments per image, paid $0.006 per judged image, and did not limit the number of judgments that one annotator is allowed to perform, but limited the number of judgments per page to 20. Gold questions used the images that obtained an agreement of 0.9 in the design jobs. The
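The 0.9 agreement threshold used to select gold questions can be computed per image as the fraction of its judgments matching the modal label. A small sketch of this selection (the helper names are ours, not from the released tooling):

```python
from collections import Counter

def image_agreement(labels):
    """Fraction of an image's judgments that match its modal label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

def gold_candidates(judgments_by_image, threshold=0.9):
    """Images whose judgments agree strongly enough to serve as gold questions."""
    return {image for image, labels in judgments_by_image.items()
            if image_agreement(labels) >= threshold}
```

Gold questions selected this way then act as hidden checks on annotator quality during the main job.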
Comparison of crowdsourced labels to expert labels
The objective of a crowdsourcing task is to generate labels that are as accurate as possible, so that classifiers trained on them achieve high accuracy. In the previous sections, we observed that crowdsourcing is indeed a good alternative to ground truth generated by experts, with a difference of less than 9%. In this section, we examine the reliability of crowdsourced labels for training a facial-expression classifier.
In order to perform this experiment, we used the Cohn and Kanade [12] dataset
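The shape of this experiment — train the same classifier once with expert labels and once with crowdsourced labels, then evaluate both against held-out expert labels — can be sketched as follows. This is only an illustration with a deliberately simple nearest-centroid classifier standing in for the paper's actual facial-expression classifiers; the function names and the split are our own:

```python
import numpy as np

def nearest_centroid_accuracy(X_train, y_train, X_test, y_test):
    """Accuracy of a minimal nearest-centroid classifier."""
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=2)
    pred = classes[dists.argmin(axis=1)]
    return float((pred == y_test).mean())

def compare_label_sources(X, y_expert, y_crowd, test_frac=0.3, seed=0):
    """Train under expert vs. crowdsourced labels; score both on expert labels."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(len(X) * test_frac)
    test, train = idx[:n_test], idx[n_test:]
    acc_expert = nearest_centroid_accuracy(X[train], y_expert[train],
                                           X[test], y_expert[test])
    acc_crowd = nearest_centroid_accuracy(X[train], y_crowd[train],
                                          X[test], y_expert[test])
    return acc_expert, acc_crowd
```

If the crowdsourced labels are only mildly noisy, the two accuracies come out close — which is the pattern the experiments in this section report.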
Conclusion
This article describes an affective-interaction dataset with both image and label data. Image data was captured while users played a game, and label data was collected through a remote crowdsourcing task. To filter out noisy crowdsourced labels, we implemented several quality control measures. As a result, we release a dataset with over 40,000 images of players' facial expressions and multiple judgments per image. Judgments for the full set of images are also provided to
Acknowledgments
We would like to thank the volunteers who tested the game and granted research rights over the collected image data. This work has been partially funded by the Portuguese National Foundation under the project UTA-Est/MAI/0010/2009.
References (33)
- et al., Zonetag: designing context-aware mobile media capture to increase participation, Proceedings of the Pervasive Image Capture and Sharing, 8th International Conference on Ubiquitous Computing, California (2006)
- “Guess Who?: A game to crowdsource the labeling of affective facial expressions is comparable to expert ratings” (2012)
- et al., Learning facial attributes by crowdsourcing in social media, Proceedings of the 20th International Conference Companion on World Wide Web (2011)
- et al., Maximum likelihood estimation of observer error-rates using the EM algorithm, Appl. Stat. (1979)
- et al., Zencrowd: leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking, Proceedings of the 21st International Conference on World Wide Web (2012)
- S. Deterding, D. Dixon, From game design elements to gamefulness: defining “Gamification”, 2011, pp. ...
- et al., Gamification: using game-design elements in non-gaming contexts, Proceedings of the 2011 Annual Conference Extended Abstracts on Human Factors in Computing Systems, CHI EA ’11 (2011)
- et al., Crowdsourcing systems on the world-wide web, Commun. ACM (2011)
- et al., Facial Action Coding System: A Technique for the Measurement of Facial Movement (1978)
- Facial expression and emotion, American Psychologist (1993)
- Facial expressions of emotion are not culturally universal, Proc. Natl. Acad. Sci. USA
- Comprehensive database for facial expression analysis, Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition
- The extended Cohn–Kanade dataset (CK+): a complete dataset for action unit and emotion-specified expression, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
- Crowdsourcing facial responses to online videos, IEEE Transactions on Affective Computing
- Competitive affective gaming: winning with a smile, Proceedings of the 21st ACM International Conference on Multimedia
- Facial expression recognition by sparse reconstruction with robust features, Image Analysis and Recognition