ABSTRACT
Given the ever-growing volume of available multimedia data, finding content that matches a user's current mood is a challenging problem. Choosing discriminative features to represent video segments is a key issue in designing video affective content analysis algorithms, where no dominant feature representation has yet emerged. Most existing affective content analysis methods either use low-level audio-visual features directly or generate hand-crafted higher-level representations. In this work, we propose to use deep learning methods, in particular convolutional neural networks (CNNs), to learn mid-level representations from automatically extracted raw features. The current framework exploits only the audio modality: we employ Mel-Frequency Cepstral Coefficient (MFCC) features to build higher-level audio representations, which we then use for the affective classification of music video clips. Multi-class support vector machines (SVMs) classify the clips into affective categories. Preliminary results on a subset of the DEAP dataset show that a significant improvement is obtained when higher-level representations are learned rather than using low-level features directly for video affective content analysis. We plan to extend this work to the visual modality: we will generate mid-level visual representations using CNNs and fuse them with the mid-level audio representations at both feature and decision level.
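The pipeline described above (MFCC patches → CNN-learned mid-level representation → multi-class SVM) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the network architecture, feature dimensions, and the synthetic data standing in for real MFCCs are all hypothetical.

```python
# Sketch of the described audio pipeline:
# MFCC patches -> small CNN -> learned mid-level features -> multi-class SVM.
# Synthetic data stands in for real MFCCs; all sizes are illustrative.
import numpy as np
import torch
import torch.nn as nn
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Assume each video segment yields a 20x64 MFCC patch (20 coefficients, 64 frames).
n_segments, n_mfcc, n_frames, n_classes = 40, 20, 64, 4
X = rng.standard_normal((n_segments, 1, n_mfcc, n_frames)).astype(np.float32)
y = rng.integers(0, n_classes, size=n_segments)

class AudioCNN(nn.Module):
    """Toy convolutional feature learner over MFCC patches (hypothetical sizes)."""
    def __init__(self, feat_dim=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((2, 2)),
        )
        self.fc = nn.Linear(16 * 2 * 2, feat_dim)   # mid-level representation
        self.head = nn.Linear(feat_dim, n_classes)  # used only while training the CNN

    def features(self, x):
        return self.fc(self.conv(x).flatten(1))

    def forward(self, x):
        return self.head(torch.relu(self.features(x)))

model = AudioCNN()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
xb, yb = torch.from_numpy(X), torch.from_numpy(y)
for _ in range(20):  # brief training loop, illustrative only
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(xb), yb)
    loss.backward()
    opt.step()

# Freeze the CNN and use its mid-level features to train a multi-class SVM,
# mirroring the paper's two-stage design (learned representation + SVM classifier).
with torch.no_grad():
    feats = model.features(xb).numpy()
clf = SVC(kernel="rbf").fit(feats, y)
preds = clf.predict(feats)  # one affective class prediction per segment
```

In practice the MFCCs would come from the audio tracks of the music video clips (e.g., via a standard MFCC extractor), and the SVM would be evaluated on held-out clips rather than on the training segments as in this toy sketch.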
Index Terms
- Learning representations for affective video understanding