
1 Introduction

As described in our proposal [2] for non-intrusive estimation of a computer user's affective state based on the Circumplex Model of Affect [1], our goal is to build a supervised machine learning system to classify the user's state of affect. The design of the data collection process therefore plays an extremely important role in achieving the right result. To collect the data, we set up an experiment in which human subjects are presented with images from the International Affective Picture System (IAPS) [3] to elicit affective reactions that manifest as involuntary changes in pupil diameter and in facial expressions, while the subjects also report a subjective assessment of their reactions through the Self-Assessment Manikin (SAM) [4]. During the recording sessions a Kinect sensor is used to collect the 3D facial coordinates and the Facial Animation Parameter Units (FAPUs) [5] from the subject's face, as well as an estimate of the illumination level in the area around the subject's eyes. Simultaneously, an Eye Gaze Tracking (EGT) system records the pupil diameter of the subject's eyes. The self-reports of arousal and valence marked by the subject in SAM for each IAPS image are also recorded into the dataset for later use. The last part of this section briefly explains terms used throughout this article. In the following sections, we then discuss in detail the experiment procedure and the process of data acquisition outlined above.

Circumplex Model of Affect: The Circumplex Model of Affect was introduced by Russell [1] as a way to model the affective state of a human. Russell proposed that an affective state consists of two parameters, arousal and valence, whose relationship can be visualized as a four-quadrant graph with valence on the horizontal axis and arousal on the vertical axis. Other studies have shown a similar trend, with models built on two parameters that are referred to by different names [6, 7].

The International Affective Picture System: The International Affective Picture System (IAPS) is a large set of color photographs that elicit shifts in the viewer's arousal and valence. IAPS contains a wide variety of stimulus types, with more than 1,000 exemplars of human experience: joyful, sad, fearful, attractive, and angry content, as well as simple objects, scenery, etc. The idea is to present the subject with visual stimuli that modify his/her affective state while recording his/her reaction. The IAPS has been used worldwide across various fields of study to investigate emotion and attention, and it is well known for its replicability and robustness. Pictures from IAPS are rated with mean values of arousal, pleasure, and dominance, based on reactions from men and women, which makes them suitable as stimuli for this study. More in-depth information about IAPS can be found in [3].

Fig. 1. Self-Assessment Manikin (SAM) (from [3])

Self-Assessment Manikin: The Self-Assessment Manikin (SAM) [4] is a non-verbal, pictorial assessment technique that directly reports the pleasure, arousal, and dominance associated with the subject's affective state while being exposed to a stimulus. We focus mainly on the 2-dimensional Circumplex Model of Affect; therefore, dominance reactions are not considered. As shown in Fig. 1, the SAM figure varies along each scale. On the arousal scale, the left-most figure corresponds to the most stimulated, excited, frenzied, jittery, wide-awake, or aroused state, while the other end of the scale represents a completely relaxed, calm, sluggish, dull, sleepy, or unaroused state. The scale ranges from 1 to 9 to allow fine-grained intermediate ratings. The pleasure (valence) scale works the same way, except that in this case the left-most figure represents a highly happy, pleased, satisfied, contented, hopeful state, while the opposite end represents a very unhappy, annoyed, unsatisfied, melancholic, despairing, bored state.

2 Experiment Setup

The entire data collection process is depicted in the diagram shown in Fig. 2. The diagram describes the process handled by the AffectiveMonitor application [2] and indicates the list of output files for post-hoc data analysis. The Kinect sensor, running on the primary machine, is responsible for obtaining the 3D facial coordinates, while the TM3 Eye-Gaze Tracker, running on a secondary machine, records the pupil diameter signals and sends them over to the primary machine. The desired data are then recorded during the experiment session and written out, frame by frame, to output files. The experiment setup and its environment are shown in Fig. 3a.
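To make the per-frame recording concrete, the sketch below shows one way such a time-stamped record could be appended to an output file. The actual AffectiveMonitor file format is not specified in this paper, so the field names, file name, and example values are hypothetical.

```python
# Illustrative sketch only: field and file names are hypothetical, not the
# actual AffectiveMonitor output format.
import csv
import time
from dataclasses import dataclass, asdict, fields

@dataclass
class FrameRecord:
    timestamp: float          # shared timestamp for all modalities
    pupil_diameter_mm: float  # averaged value received from the TM3 machine
    illuminance: float        # estimated light level around the eye area
    sample_id: int            # IAPS picture currently on screen
    # 3D facial coordinates and FAPs would add further columns per point/index

def append_frame(path: str, record: FrameRecord) -> None:
    """Append one time-stamped frame to the per-session output file."""
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=[fl.name for fl in fields(FrameRecord)])
        if f.tell() == 0:          # write a header only for a fresh file
            writer.writeheader()
        writer.writerow(asdict(record))

append_frame("session_01.csv", FrameRecord(time.time(), 3.8, 120.0, 2050))
```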

Fig. 2. Bird's eye view of the system (data collection process)

2.1 Experiment Procedure

AffectiveMonitor has a separate "Experiment" interface tab (Fig. 3b) for conducting the experiment from start to end. The experiment takes about 35 min. Before the experiment session begins, the subject goes through a calibration process consisting of adjusting the shape of a 3D facial model and adjusting the subject's position for pupil diameter recording. 70 pictures selected from IAPS are then shown to the subject, one after another, until all samples have been presented. For each sample, the subject is asked to look at the picture for 6 s and then immediately rate their affective state via SAM (5 s). In between samples, a gray screen is shown during the resting period. The subject is urged to stay still during the first 6 s, when he/she is first presented with the stimulus, in order to reduce the measurement interference that could occur during the recording process.
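A minimal sketch of this per-trial schedule is shown below. The display and rating callbacks are hypothetical placeholders (not AffectiveMonitor APIs), and the rest-period length is an assumption, since it is not stated above.

```python
# Sketch of the per-trial schedule; callbacks and REST_SECONDS are assumptions.
import time

STIMULUS_SECONDS = 6   # subject views the IAPS picture and stays still
SAM_SECONDS = 5        # subject rates arousal and valence on SAM
REST_SECONDS = 4       # assumed length of the gray-screen rest period

def run_session(iaps_picture_ids, show_picture, collect_sam_rating, show_gray_screen):
    ratings = []
    for picture_id in iaps_picture_ids:
        show_picture(picture_id)
        time.sleep(STIMULUS_SECONDS)          # recording runs in the background
        ratings.append(collect_sam_rating(timeout=SAM_SECONDS))
        show_gray_screen()
        time.sleep(REST_SECONDS)              # rest before the next sample
    return ratings
```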

Fig. 3. The entire setup, including Kinect V2 (on top of the screen) and TM3 (in front of the computer), is shown in (a). (b) shows the experiment interface of the AffectiveMonitor application

2.2 Sample Selection

For the experiment, we selected IAPS pictures on the basis of the means and variances of arousal and valence that accompany each picture in the IAPS repository. Our criterion for selecting the samples is based on the study of the 12-Point Affect Circumplex (12-PAC) model of Core Affect [8], which is itself grounded in the Circumplex Model of Affect. That study hypothetically divides the Circumplex model into twelve segments, forming the 12-Point Affect Circumplex (12-PAC) structure. By correlating many previous studies with their own, the authors report the placement of moods on the 12-PAC structure, as shown in Fig. 4b. Based on this study, we selected the IAPS samples located around the desired angles of those core affects that have more than a 60% likelihood of appearing at that position in the Circumplex Model. Accordingly, we selected 70 samples, as shown in Fig. 4a, and we list them in Table 1 by IAPS picture ID, categorized by core affect description. A sketch of this angle-based selection is given after Table 1.

Fig. 4. (a) A plot of the means of arousal and valence for images in the IAPS repository, overlaid on the Circumplex Model of Affect. The radius of each plotted circle varies according to its variance, and the triangular labels indicate the images chosen for use as samples in this experiment. (b) Thirty mood scales placed within the 12-PAC structure with the CIRCUM-extension method [8]. The length of the solid line from the center can be roughly interpreted as the maximum likelihood of placing a mood at the designated angle

Table 1. Selected samples listed by picture ID from IAPS
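A minimal sketch of the angle-based selection criterion follows, assuming the IAPS means use the 1-9 SAM scale centered at (5, 5) and that the 12-PAC core affects supply a list of target angles. The tolerance and the way targets are matched are illustrative, not the exact procedure behind Table 1.

```python
# Sketch of the selection criterion under stated assumptions; target angles
# and tolerance are illustrative only.
import math

def circumplex_angle(valence_mean: float, arousal_mean: float) -> float:
    """Angle in degrees (0 = pleasant, 90 = activated) of a picture's mean rating."""
    return math.degrees(math.atan2(arousal_mean - 5.0, valence_mean - 5.0)) % 360.0

def select_samples(pictures, target_angles, tolerance_deg=15.0):
    """pictures: iterable of (picture_id, valence_mean, arousal_mean)."""
    selected = []
    for picture_id, v, a in pictures:
        angle = circumplex_angle(v, a)
        if any(min(abs(angle - t), 360 - abs(angle - t)) <= tolerance_deg
               for t in target_angles):
            selected.append(picture_id)
    return selected
```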

3 Data Acquisition

In this section, we explain the methods for obtaining each parameter: the 3D facial coordinates, the pupil diameter, the Facial Animation Parameters (FAPs), and the illumination around the facial area. All of them are recorded with the same timestamps by the AffectiveMonitor application.

3.1 3D Facial Coordinates

Kinect provides a basic software framework called HD Face [9]. The framework can detect the face of the closest person in front of the Kinect sensor and generate that person's 3D facial mesh model in real time. Another interesting aspect of this framework is its ability to reconstruct the person's face shape by 3D scanning, yielding a very accurate characterization of the face. Given all that, we have integrated this framework into our AffectiveMonitor application to benefit from the functionality that Kinect has to offer. The mesh model can also be represented as 3D coordinates, which can be thought of as markers attached to the subject's face: whenever the subject's facial expression changes, the markers move according to the corresponding facial muscle movements. By recording frame by frame, we can observe the changes in the 3D facial coordinates that occur because of the subject's facial expression.
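As a minimal illustration of treating the mesh vertices as markers, the sketch below (assuming hypothetical (N, 3) numpy arrays of vertices) summarizes the frame-to-frame change in expression as each vertex's displacement from the previous frame.

```python
# Per-vertex displacement between consecutive frames; inputs are assumed to be
# (N, 3) arrays of HD Face vertices in the same order each frame.
import numpy as np

def vertex_displacement(prev_frame, curr_frame):
    """Euclidean displacement of each vertex between two consecutive frames."""
    return np.linalg.norm(np.asarray(curr_frame) - np.asarray(prev_frame), axis=1)
```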

One problem that arose during the design of the experiment is the impossibility of restraining the subjects' movement during the experiment. Body shifts can alter the position and orientation of the subject's face, which may complicate later processing. To circumvent this issue, we built a feature into AffectiveMonitor that artificially re-positions and re-orients the subject's face before the values are recorded. Fortunately, Kinect also provides the pivot point as well as the orientation (as a quaternion) of the face. Thus, we can reverse the rotation and transform the point cloud to a neutral position at the origin by applying a change of coordinate frame as described in [10].
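A minimal numpy sketch of this change of coordinate frame is given below: the point cloud is translated so the pivot sits at the origin, and the inverse of the reported face-orientation quaternion is then applied. A (w, x, y, z) quaternion order is assumed; the exact conventions of [10] and of the Kinect SDK may differ.

```python
# Sketch of pose neutralization; quaternion order (w, x, y, z) is assumed.
import numpy as np

def quaternion_to_matrix(q):
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def neutralize_pose(points, pivot, orientation_q):
    """points: (N, 3) facial vertices; returns pose-neutral coordinates at the origin."""
    R = quaternion_to_matrix(np.asarray(orientation_q, dtype=float))
    centered = np.asarray(points) - np.asarray(pivot)   # move the pivot to the origin
    return centered @ R        # row-wise product applies R.T, i.e. the inverse rotation
```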

3.2 Pupil Diameter and Illumination

To acquire the pupil diameter signals, we utilize the TM3 Eye-Gaze Tracker (EGT), which can measure the pupil diameter using the dark-pupil method. We set the sampling interval to 0.33 s and average the samples over a 30-sample window. The pupil diameter signals are then transferred to the primary machine via TCP/IP over an Ethernet cable. AffectiveMonitor has a feature to plot the averaged pupil diameter dynamically, as shown in Fig. 6a.
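The TM3-to-primary-machine wire format is not documented here; the sketch below assumes a simple newline-delimited text stream of diameter values and shows the 30-sample moving average being applied as samples arrive. The host, port, and protocol are assumptions.

```python
# Hedged sketch of the secondary-machine link; the wire protocol is assumed.
import socket
from collections import deque

WINDOW = 30                    # moving-average width, as used in the experiment
HOST, PORT = "0.0.0.0", 5005   # hypothetical listening address on the primary machine

def receive_pupil_diameter():
    window = deque(maxlen=WINDOW)
    with socket.create_server((HOST, PORT)) as server:
        conn, _ = server.accept()
        with conn, conn.makefile("r") as stream:
            for line in stream:                   # one diameter sample per line
                window.append(float(line))
                yield sum(window) / len(window)   # smoothed value for plotting/recording
```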

Many studies have shown that the pupil diameter is under the influence of the Autonomic Nervous System (ANS) and can be used as a marker of arousal level [11]. Unfortunately, pupil diameter is also affected by the amount of light reaching the retina. To address this issue, we plan to perform a post-processing step that eliminates the effect of the pupillary light reflex using an adaptive signal processing technique. To attain that goal, the illumination around the eyes must also be recorded as one of the output parameters. We obtain the illuminance using Kinect's RGB camera by cropping the video around the eye area (Fig. 6b) and computing the illumination from the cropped video. A more detailed explanation of this subject will be reported in a separate article, currently under preparation (Fig. 5).
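The particular adaptive technique is not fixed in the text; as one illustrative possibility, the sketch below uses a standard least-mean-squares (LMS) interference canceller with the recorded illuminance as the reference input, so that the light-driven component of the pupil-diameter signal is estimated and subtracted. The filter length and step size are illustrative, not tuned values.

```python
# Illustrative LMS interference canceller; not necessarily the technique used
# in the planned post-processing step.
import numpy as np

def lms_cancel_light_reflex(pupil, illuminance, taps=8, mu=1e-4):
    """Return the pupil signal with the illumination-correlated part removed."""
    w = np.zeros(taps)                          # adaptive filter weights
    cleaned = np.zeros_like(pupil, dtype=float)
    for n in range(len(pupil)):
        x = illuminance[max(0, n - taps + 1):n + 1][::-1]   # most recent samples first
        x = np.pad(x, (0, taps - len(x)))                   # zero-pad at the start
        estimate = w @ x                        # predicted light-driven component
        error = pupil[n] - estimate             # residual, arousal-related component
        w += 2 * mu * error * x                 # LMS weight update
        cleaned[n] = error
    return cleaned
```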

Table 2. Facial Animation Parameter Unit (FAPU)
Fig. 5. (a) The interface of AffectiveMonitor for mesh construction. (b) The interface of AffectiveMonitor used for resetting the facial point cloud to its neutral position; the interface shows the shift in position and orientation in the Euclidean domain.

Fig. 6. (a) The interface of AffectiveMonitor for dynamic plotting of pupil diameter. (b) The cropped video used for illumination measurement around the eyes.

3.3 Facial Animation Parameter

The Facial Animation Parameter (FAP) is one of the components of the MPEG-4 Face and Body Animation (FBA) International Standard (ISO/IEC 14496-1 & -2) [13], which describes a standard protocol for encoding the virtual representation of human and humanoid movement, specifically around the facial region of the body. FAPs are commonly used to describe basic facial expression actions of a synthetic face, for instance in the CANDIDE model [5]. The ability of FAPs to encode primitive expression information with a small memory footprint makes them an interesting alternative way to preserve the subject's facial expression.

Fig. 7. Facial feature points and Facial Animation Parameter Units (FAPUs) (from [12])

Table 3. FAP measurement with facial feature points (FBA & Kinect)

Facial Animation Parameters (FAPs) are defined as displacements between facial feature points defined by FBA (see Fig. 7) and are measured in Facial Animation Parameter Units (FAPUs). FAPUs are normally calculated from a neutral face and divided by 1024 so that the unit is small enough for FAPs to be represented as integers. The purpose of FAPUs is to allow a consistent interpretation of FAP values for any facial model, regardless of its shape and dimensions. The descriptions of the FAPUs and how to calculate them are listed in Table 2. Out of the total of 68 FAPs [12], we decided to output the 19 FAPs listed in Table 3, which are most actively related to basic facial expressions. Note that in Fig. 7 the numbering of the facial feature points follows FBA, whereas Kinect uses a different indexing scheme; see Table 3 for the correspondence between Kinect's indices and FBA's feature-point numbering.
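A minimal sketch of this computation follows: a FAPU is derived from a neutral-face distance divided by 1024, and a FAP value is the displacement of a feature point expressed as an integer number of FAPUs. The feature points, the FAPU chosen, and the example coordinates are hypothetical and do not reproduce the exact mapping of Table 3.

```python
# Illustrative FAP/FAPU computation; points and values are made up.
import numpy as np

def fapu(neutral_a, neutral_b):
    """FAPU = (distance between two neutral-face feature points) / 1024."""
    return np.linalg.norm(np.asarray(neutral_a) - np.asarray(neutral_b)) / 1024.0

def fap_value(current_point, neutral_point, fapu_value, axis=1):
    """FAP = displacement of a feature point along one axis, in integer FAPU units."""
    displacement = current_point[axis] - neutral_point[axis]
    return int(round(displacement / fapu_value))

# Example: vertical displacement of a mouth-corner vertex relative to the neutral
# face, normalized by a mouth-width-based FAPU (coordinates are hypothetical).
mw_fapu = fapu(neutral_a=(-0.03, -0.05, 0.0), neutral_b=(0.03, -0.05, 0.0))
print(fap_value(current_point=(-0.03, -0.043, 0.0),
                neutral_point=(-0.03, -0.05, 0.0),
                fapu_value=mw_fapu))
```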

4 Discussion and Conclusion

Given all the previous explanations, we would like to re-emphasize that the purpose of this work is to collect data suitable for training a supervised machine learning model to classify the affective state of the subject within the Circumplex Model of Affect. To achieve that, we have to estimate two parameters, arousal and valence, for our model. In the case of arousal, we have found strong evidence supporting the notion that pupil diameter is influenced by the Autonomic Nervous System, which is responsible for the state of arousal. In the case of valence, we decided to estimate this parameter on the basis of the subject's facial expression, since pleasure and displeasure are naturally and directly expressed by the activity of the facial muscles. Two data formats representing facial expression are recorded, 3D facial coordinates and Facial Animation Parameter values, and each has pros and cons: 3D coordinates preserve the complete information contained in the facial expression without any loss, while FAPs are better in terms of memory usage. Other data collected during the experiment, such as the illuminance around the eye area, the distance between the subject's face and the Kinect sensor, and the FAPUs, are necessary for scaling adjustment and calibration. Data are obtained in a time-stamped manner, with pupil diameter, FAPs, 3D facial coordinates, and the other parameters captured simultaneously and recorded together. Additionally, they are recorded in a customized output file format that facilitates the transfer of the data to the analysis phase.