ABSTRACT
Given the ever-growing volume of available multimedia data, finding content that matches a user's current mood is a challenging problem. Choosing discriminative features to represent video segments is a key issue in designing video affective content analysis algorithms, where no dominant feature representation has yet emerged. Most existing affective content analysis methods either use low-level audio-visual features directly or generate hand-crafted higher-level representations. In this work, we propose to use deep learning methods, in particular convolutional neural networks (CNNs), to learn mid-level representations from automatically extracted raw features. The current framework exploits only the audio modality: we employ Mel-Frequency Cepstral Coefficient (MFCC) features to build higher-level audio representations, which we then use for the affective classification of music video clips. Multi-class support vector machines (SVMs) classify the clips into affective categories. Preliminary results on a subset of the DEAP dataset show that a significant improvement is obtained when higher-level representations are learned rather than using low-level features directly for video affective content analysis. We plan to extend this work to the visual modality: we will generate mid-level visual representations using CNNs and fuse them with the mid-level audio representations at both feature and decision level.
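The pipeline described above (MFCC patches → CNN-learned mid-level representation → multi-class SVM) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the network architecture, feature dimensions, and the synthetic data standing in for real MFCCs are all hypothetical.

```python
# Sketch of the described audio pipeline:
# MFCC patches -> small CNN -> learned mid-level features -> multi-class SVM.
# Synthetic data stands in for real MFCCs; all sizes are illustrative.
import numpy as np
import torch
import torch.nn as nn
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Assume each video segment yields a 20x64 MFCC patch (20 coefficients, 64 frames).
n_segments, n_mfcc, n_frames, n_classes = 40, 20, 64, 4
X = rng.standard_normal((n_segments, 1, n_mfcc, n_frames)).astype(np.float32)
y = rng.integers(0, n_classes, size=n_segments)

class AudioCNN(nn.Module):
    """Toy convolutional feature learner over MFCC patches (hypothetical sizes)."""
    def __init__(self, feat_dim=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((2, 2)),
        )
        self.fc = nn.Linear(16 * 2 * 2, feat_dim)   # mid-level representation
        self.head = nn.Linear(feat_dim, n_classes)  # used only while training the CNN

    def features(self, x):
        return self.fc(self.conv(x).flatten(1))

    def forward(self, x):
        return self.head(torch.relu(self.features(x)))

model = AudioCNN()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
xb, yb = torch.from_numpy(X), torch.from_numpy(y)
for _ in range(20):  # brief training loop, illustrative only
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(xb), yb)
    loss.backward()
    opt.step()

# Freeze the CNN and use its mid-level features to train a multi-class SVM,
# mirroring the paper's two-stage design (learned representation + SVM classifier).
with torch.no_grad():
    feats = model.features(xb).numpy()
clf = SVC(kernel="rbf").fit(feats, y)
preds = clf.predict(feats)  # one affective class prediction per segment
```

In practice the MFCCs would come from the audio tracks of the music video clips (e.g., via a standard MFCC extractor), and the SVM would be evaluated on held-out clips rather than on the training segments as in this toy sketch.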
Index Terms
- Learning representations for affective video understanding