Neurocomputing

Volume 119, 7 November 2013, Pages 101-110

Video classification and recommendation based on affective analysis of viewers

https://doi.org/10.1016/j.neucom.2012.04.042

Abstract

Most previous work on video classification and recommendation has been based only on video content, without considering the affective analysis of viewers. In this paper, we present a novel method to classify and recommend videos based on affective analysis, mainly on facial expression recognition of viewers, by fusing spatio–temporal features. For spatial features, we integrate Haar-like features into compositional ones according to the features' correlation and train a mid classifier. This process is then embedded into an improved AdaBoost learning algorithm to obtain spatial features. For temporal feature fusion, we adopt HDCRFs, which extend HCRFs by introducing a time dimension variable. The spatial features are embedded into HDCRFs to recognize facial expressions. Experiments on the Cohn–Kanade database show that the proposed method achieves promising performance. Viewers' changing facial expressions are then collected frame by frame from a camera while they watch videos. Finally, we draw affective curves that describe how viewers' affective states change over time. Using these curves, we segment each video into affective sections, classify videos into categories, and provide recommendation scores. Experimental results on our collected database show that most subjects are satisfied with the classification and recommendation results.

Introduction

With the widespread use of cameras, videos have become more and more popular and are now an indispensable part of daily life. Meanwhile, on-line services for sharing personal or professional videos, such as YouTube and YouKu [1], are developing rapidly, as shown in Fig. 1. Therefore, how to organize videos into categories, how to segment each video into affective sections, and how to recommend videos sorted by affective semantics and quality have become difficult questions of paramount importance for managing websites and improving user experience. These questions have attracted increasing attention from computer science researchers. Since viewers are interested in affective video scenes that can arouse certain types of emotions [2], affective classification and recommendation of videos are in great demand and have become possible with advances in computer vision techniques.

Most previous and existing works on affective video classification and recommendation focus on detecting video affective content using low-level features. In Ref. [3], HMMs were used to categorize videos into three types of affective content (fear, sadness and joy) based on low-level visual features. Arifin et al. [4] used dynamic Bayesian networks (DBNs) to analyze affective content and its temporal dynamics, also based on low-level features. Xu et al. [5] proposed a hierarchical model to classify five types of affective content (joy, sadness, anger, fear, and neutral) and applied HMMs to calculate the strength of emotions. Irie et al. [2] proposed a latent topic driving model to classify video affective scenes, based on movie topic extraction via latent Dirichlet allocation and on emotion dynamics with reference to Plutchik's emotion theory. In Ref. [6], arousal and valence features were extracted to implement an integrated system for personalized music video affective analysis, visualization, and retrieval. These works focused mainly on learning classifiers with generative models over low-level features. Song et al. [7] presented a large-scale video taxonomic classification system that utilizes the category taxonomy in training and in interpreting the classification results, and that integrates video content-based features with text features. Wang et al. [8] provided an effective way of integrating data from diverse sources and proposed a three-DRF model for fusing models trained on individual data sources, combined in a pairwise fashion with a small amount of manually labeled data. Tang et al. proposed a structure-sensitive anisotropic manifold ranking method based on a structure-sensitive similarity measure for video concept detection [9], as well as two graph-based semi-supervised learning methods for video annotation (one named kernel linear neighborhood propagation [10], and the other considering correlations between concepts [11]).

However, these works did not consider the affective analysis of viewers, even though viewers' facial expressions effectively reflect their attitudes, feelings and evaluations of videos, as well as the videos' affective transitions. Therefore, another type of video analysis, based on the affective states of viewers, has appeared. Rho et al. [12] reasoned about the user's mood and situation using both collaborative filtering and ontology technology and implemented a prototype music recommendation system. Joho et al. [13], [14] proposed a new approach for detecting personal highlights in videos and for video summarization based on the analysis of viewers' facial activities, using a model of pronounced level and expression change rate. They considered that the magnitude of the facial motion vectors represented the degree of a viewer's affective reaction to the video content.

On the other hand, affective computing has received much attention and spread widely since Picard introduced the idea [15], and facial expression is a key means of expressing one's affective state. Facial expression analysis and recognition has attracted much research attention in the psychology and computer science communities for more than 40 years due to its wide range of potential applications. For human beings, a particular facial expression is continuous and transient, usually activated by the associated emotion and generated through a series of muscle motions. These subtleties make designing computer vision and pattern recognition algorithms that recognize expressions automatically a challenging task [16].

Most present work on expression recognition builds on two pioneering works. First, Izard [17] categorized facial expression into six basic expressions (happiness, sadness, anger, disgust, surprise and fear), providing the psychological foundation for computer vision processing. Second, Ekman and Friesen [18] proposed the Facial Action Coding System (FACS), which decomposes each expression into several related action units (AUs). Most existing automatic facial expression recognition works focus on learning discriminative classifiers over the whole face, which classify an input facial image or sequence as one of the six basic emotions [19], [20].

Generally, there are two steps to tackle this problem. The first is to extract geometrical or appearance features. Popular geometrical features are the key points extracted by the active shape model (ASM) [21]. Since ASM is sensitive to illumination, pose, and exaggerated expressions, appearance features such as Gabor features, Haar-like features, and LBPs are becoming more widely used. Shan et al. [22] showed that LBPs are as powerful as Gabor features. The experiments in [23] demonstrated that Haar-like features are comparable with, and sometimes better than, Gabor features for expression recognition. Because AUs are defined by ambiguous semantic descriptions, accurate automatic AU detection is not easy [24]. Although FACS does not clearly define each AU's intensity level, it does point out each AU's location. Based on the AUs' locations, Yang et al. [24] proposed an algorithm for building compositional Haar-like features that avoids explicit AU detection and achieved good performance for facial expression recognition, especially for expressions at low intensity levels.

The second step is to select discriminative features over the whole face to build classifiers, such as SVM [22], AdaBoost [25] and DBN [26]. Chang et al. [16] adopted hidden conditional random fields (HCRFs) to classify facial expressions and extended them to partially-observed HCRFs (PO-HCRFs), among which PO-HCRF9 achieved the highest accuracy. Alternatively, there are also methods that directly analyze and recognize the basic expressions without depending on AUs. Cohen et al. [27] used a tree-augmented-naive Bayesian classifier to learn the dependencies between facial expressions and AUs.

However, most previous works focus only on increasing recognition accuracy on standard facial expression databases, in which expressions are exaggerated and rarely observed in daily life. Moreover, they have not considered spatio–temporal feature fusion, even though it is the combination of spatial and temporal features that generates facial expressions. Although it is widely reported that facial expression recognition has many potential applications, few concrete and practical systems based on it have been developed.

In this paper, we propose a novel method to classify, segment and recommend videos based on affective analysis, mainly on facial expression recognition of viewers, by learning a spatio–temporal feature fusion. For spatial features, we integrate Haar-like features into compositional ones according to the features' correlation and train a mid classifier in the process. This process is then embedded into an improved AdaBoost learning algorithm to obtain spatial features. For temporal feature fusion, we adopt HDCRFs, which extend HCRFs by introducing a time dimension variable. Finally, the obtained spatial features are embedded into HDCRFs to recognize facial expressions.
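Since this overview gives no pseudocode, the sketch below only illustrates, under assumptions, how a "mid" classifier over a compositional Haar-like feature might serve as the weak learner inside a discrete AdaBoost loop. The correlation-based pairing rule, the threshold-based MidClassifier, and all names are hypothetical placeholders rather than the authors' implementation, which is detailed in Section 2.

    import numpy as np

    class MidClassifier:
        """Threshold weak classifier over a 'compositional' response formed by
        summing two correlated Haar-like features. Illustrative placeholder only;
        not the authors' actual mid classifier."""

        def __init__(self, idx_a, idx_b):
            self.idx = (idx_a, idx_b)
            self.theta = 0.0
            self.sign = 1.0

        def response(self, X):
            a, b = self.idx
            return X[:, a] + X[:, b]

        def fit(self, X, y, w):
            # Pick the threshold and polarity with minimum weighted error.
            r = self.response(X)
            best_err = np.inf
            for theta in np.percentile(r, np.arange(5, 100, 5)):
                for sign in (1.0, -1.0):
                    pred = np.where(sign * (r - theta) > 0, 1, -1)
                    err = np.sum(w * (pred != y))
                    if err < best_err:
                        best_err, self.theta, self.sign = err, theta, sign
            return best_err

        def predict(self, X):
            return np.where(self.sign * (self.response(X) - self.theta) > 0, 1, -1)

    def adaboost_compositional(X, y, n_rounds=50):
        """Discrete AdaBoost whose weak learners are mid classifiers built on
        pairs of highly correlated Haar-like responses (rough sketch).

        X: (n_samples, n_haar) matrix of Haar-like responses; y: labels in {-1, +1}.
        Returns a list of (alpha, mid_classifier) pairs forming the strong classifier.
        """
        n, d = X.shape
        w = np.full(n, 1.0 / n)
        corr = np.corrcoef(X, rowvar=False)      # feature-feature correlation
        np.fill_diagonal(corr, -np.inf)
        strong = []

        for _ in range(n_rounds):
            # Pair each Haar feature with its most correlated partner and keep the
            # pair whose mid classifier has the lowest weighted error this round.
            best_err, best_h = np.inf, None
            for a in range(d):
                h = MidClassifier(a, int(np.argmax(corr[a])))
                err = h.fit(X, y, w)
                if err < best_err:
                    best_err, best_h = err, h

            best_err = np.clip(best_err, 1e-10, 1 - 1e-10)
            alpha = 0.5 * np.log((1 - best_err) / best_err)
            pred = best_h.predict(X)
            w *= np.exp(-alpha * y * pred)       # emphasize misclassified samples
            w /= w.sum()
            strong.append((alpha, best_h))
        return strong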

Based on the above facial expression recognition algorithm, we propose and implement a novel affective video classification and recommendation method. The architecture of the proposed method is shown in Fig. 2. Unlike previous methods that use generative models over low-level features, we classify and recommend affective videos according to viewers' expressions using discriminative methods.

Our contributions lie in four aspects:

1. We classify, segment and recommend videos, especially web videos, based on affective analysis, mainly on facial expression recognition of viewers. To the best of our knowledge, this is the first work to employ viewers' affective responses for classifying and recommending videos.
2. We propose an affective curve model in functional form and relate it to a video's affective flow; we also explain the meaning of the curve's inflection points (a rough illustrative sketch follows this list).
3. We analyze and recognize facial expressions from spatial and temporal aspects simultaneously by embedding the process of building compositional Haar-like features into HDCRFs, and we develop a practical expression recognition system to reflect viewers' affective states, which are crucial for evaluating videos' quality.
4. We propose the hidden dynamic conditional random fields (HDCRFs) model, which extends HCRFs by introducing a time dimension variable.
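The affective curve is defined only at a high level in this overview; the following is a minimal sketch under the assumption that the curve is the smoothed per-frame probability of the dominant expression and that affective sections are cut where the curve changes direction. The function names (affective_curve, segment_by_inflections) and the moving-average smoothing are illustrative assumptions, not the authors' exact formulation, which is given in Section 4.

    import numpy as np

    def affective_curve(frame_probs, smooth_window=25):
        """Smoothed affective curve from per-frame expression probabilities.

        frame_probs: (n_frames, n_expressions) per-frame posteriors from the
        expression recognizer. The curve here is simply the smoothed strength of
        the dominant expression; the paper's exact definition may differ.
        """
        dominant = frame_probs.max(axis=1)                  # strongest emotion per frame
        kernel = np.ones(smooth_window) / smooth_window     # moving-average smoothing
        return np.convolve(dominant, kernel, mode="same")

    def segment_by_inflections(curve, min_len=50):
        """Cut the video where the curve changes direction (a crude stand-in for
        the inflection points discussed in the text)."""
        trend = np.sign(np.diff(curve))
        cuts = [0]
        for t in range(1, len(trend)):
            if trend[t] != trend[t - 1] and t - cuts[-1] >= min_len:
                cuts.append(t)
        cuts.append(len(curve))
        return list(zip(cuts[:-1], cuts[1:]))               # affective sections

One could, for instance, aggregate curve values within each section as a rough affective score per segment; the scoring and recommendation scheme actually used is described in Section 4.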

The rest of this paper is organized as follows. Sections 2 and 3 introduce spatial feature extraction and temporal feature fusion (HDCRFs) for facial expression recognition, respectively. We describe how to classify and recommend videos based on viewers' facial expressions in Section 4, followed by the related experiments and results in Section 5. Finally, conclusions and future work are discussed in Section 6.

Section snippets

Spatial features

Feature representation and extraction play an important role in the facial expression recognition task. Because of the simplicity and effectiveness of Haar-like features in facial expression recognition [25], we adopt them to represent expression appearance.
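For readers unfamiliar with the representation, a minimal sketch of evaluating a two-rectangle Haar-like feature with an integral image follows. This is the standard construction; the specific patch layout around AU locations used in the paper is not shown here.

    import numpy as np

    def integral_image(img):
        """Cumulative-sum image, so any rectangle sum costs four lookups."""
        return img.cumsum(axis=0).cumsum(axis=1)

    def rect_sum(ii, x, y, w, h):
        """Sum of pixels in the rectangle with top-left (x, y) and size (w, h)."""
        A = ii[y - 1, x - 1] if x > 0 and y > 0 else 0
        B = ii[y - 1, x + w - 1] if y > 0 else 0
        C = ii[y + h - 1, x - 1] if x > 0 else 0
        D = ii[y + h - 1, x + w - 1]
        return D - B - C + A

    def haar_two_rect_vertical(ii, x, y, w, h):
        """Two-rectangle Haar-like feature: left half minus right half."""
        half = w // 2
        return rect_sum(ii, x, y, half, h) - rect_sum(ii, x + half, y, half, h)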

Temporal feature fusion: HDCRFs

The HCRF model has been widely used and applied to object recognition [28]. In essence, the main idea behind HCRFs is to enrich CRFs by adding hidden states that capture complex dependencies or implicit structures in the training samples. The effect can be enhanced by using more hidden variables or by increasing the number of possible hidden states; either way leads to a graphical model with a large number of hidden-state configurations [16].

In our method, by introducing the time dimension
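Although the definition is truncated in this snippet, the generic HCRF form that HDCRFs build on can be written as below; the exact potential function and the way the time variable enters the HDCRF potentials follow the definitions in the full text, and the reading of the time dimension sketched here is an assumption.

    P(y \mid \mathbf{x};\theta)
      = \sum_{\mathbf{h}} P(y,\mathbf{h} \mid \mathbf{x};\theta)
      = \frac{\sum_{\mathbf{h}} \exp\bigl(\Psi(y,\mathbf{h},\mathbf{x};\theta)\bigr)}
             {\sum_{y'} \sum_{\mathbf{h}} \exp\bigl(\Psi(y',\mathbf{h},\mathbf{x};\theta)\bigr)}

where \mathbf{x} is the observed frame sequence (here, the extracted spatial features), \mathbf{h} = (h_1, \ldots, h_T) are hidden state variables, and y is the expression label. One plausible reading of "introducing a time dimension variable" is that the HDCRF potential additionally depends on the frame index t, i.e. \Psi(y,\mathbf{h},\mathbf{x},t;\theta), so that the hidden states can capture how an expression evolves over the sequence.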

Viewers' facial expression recognition

What most distinguishes this task from previous work, and makes it most difficult, is that videos of viewers' facial expressions usually contain several different categories of expressions. How to segment these videos into image sequences, each of which contains only one prototypical expression, becomes the key question. Here, we adopt the following optimization strategy: we calculate the probability of the sequence from the first frame to the current frame, until the probability reaches its maximum and
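The snippet above is cut off, but the described strategy (extending the current segment frame by frame until the sequence probability of the best expression peaks) could be sketched roughly as follows. The sequence_prob function stands in for HDCRF inference and, like min_len, is a hypothetical placeholder rather than the authors' exact procedure.

    def segment_expression_stream(frames, sequence_prob, expressions, min_len=10):
        """Greedy segmentation of a viewer video into single-expression sequences.

        sequence_prob(subsequence, expression) -> probability that the subsequence
        shows that expression; here it stands in for HDCRF inference (hypothetical).
        A segment is closed once its best probability stops increasing.
        """
        segments, start = [], 0
        while start < len(frames):
            end = min(start + min_len, len(frames))
            best_prob, best_end, best_expr = -1.0, end, None
            while end <= len(frames):
                probs = {e: sequence_prob(frames[start:end], e) for e in expressions}
                expr, p = max(probs.items(), key=lambda kv: kv[1])
                if p >= best_prob:                 # probability still rising
                    best_prob, best_end, best_expr = p, end, expr
                    end += 1
                else:                              # passed the maximum: close segment
                    break
            segments.append((start, best_end, best_expr))
            start = best_end
        return segments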

Experiments

We test our facial expression recognition method with three sets of experiments on the Cohn–Kanade facial expression database [31], and conduct our affective video classification and recommendation experiment on our own videos. Comparisons are made against currently popular methods.

Conclusion and future work

In this paper, we presented a novel method for facial expression recognition by fusing spatio–temporal features. We first build compositional features based on local appearance features extracted from patches divided to cover the AUs' locations, and train a mid classifier. We then embed this process into an improved AdaBoost algorithm to obtain spatial features. Second, we define HDCRFs, which extend HCRFs by introducing the time dimension. Finally, the spatial features are embedded into

Acknowledgment

This work is supported by the National Natural Science Foundation of China (No. 61071180) and the Key Program (No. 61133003).

References (32)

  • YouKu 〈www.youku.com〉, YouTube...
  • G. Irie et al.

    Latent topic driving model for movie affective scene classification

    ACM Multimedia

    (2009)
  • H.B. Kang

    Affective content detection using HMMs

    ACM Multimedia

    (2003)
  • S. Arifin et al.

    Affective level video segmentation by utilizing the pleasure–arousal–dominance information

    IEEE Trans. Multimedia

    (2008)
  • M. Xu et al.

    Hierarchical movie affective content analysis based on arousal and valence features

    ACM Multimedia

    (2008)
  • S. Zhang et al.

    Affective visualization and retrieval for music video

    IEEE Trans. Multimedia

    (2010)
  • Y. Song, M. Zhao, J. Yagnik, X. Wu, Taxonomic classification for web-based videos, in: Proceedings of IEEE Conference...
  • Z. Wang, M. Zhao, et al., YouTubeCat: learning to categorize wild web videos, in: Proceedings of IEEE Conference on...
  • J. Tang et al.

    Structure-sensitive manifold ranking for video concept detection

    ACM Multimedia

    (2007)
  • J. Tang et al.

    Video annotation based on kernel linear neighborhood propagation

    IEEE Trans. Multimedia

    (2008)
  • J. Tang et al.

    Correlative linear neighborhood propagation for video annotation

    IEEE Trans. Syst. Man Cybern. B Cybern.

    (2009)
  • S. Rho et al.

    SVR-based music mood classification and context-based music recommendation

    ACM Multimedia

    (2009)
  • H. Joho et al.

    Looking at the viewer: analysing facial activity to detect personal highlights of multimedia contents

    Multimedia Tools Appl.

    (2011)
  • H. Joho et al.

    Exploiting facial expressions for affective video summarisation

    ACM Int. Conf. Image Video Retrieval (CIVR)

    (2009)
  • R.W. Picard

    Affective Computing

    (1997)
  • K. Chang et al.

    Learning partially-observed hidden conditional random fields for facial expression recognition

    IEEE Conf. Comput. Vision Pattern Recognition (CVPR)

    (2009)

Sicheng Zhao is currently a Ph.D. candidate at the Harbin Institute of Technology. His research interests include image and video affective content analysis, especially focusing on expression recognition and its applications.

Hongxun Yao received the B.S. and M.S. degrees in Computer Science from the Harbin Shipbuilding Engineering Institute, Harbin, China, in 1987 and 1990, respectively, and received the Ph.D. degree in Computer Science from the Harbin Institute of Technology in 2003. Currently, she is a Professor with the School of Computer Science and Technology, Harbin Institute of Technology. Her research interests include computer vision, multimedia computing and human-computer interaction. She has published five books and over 200 scientific papers.

Xiaoshuai Sun is currently a Ph.D. candidate at the Harbin Institute of Technology. His research interests include image and video understanding, especially focusing on saliency analysis.
