DOI: 10.1145/2567688.2567690

Emotion Recognition from Audio and Visual Data using F-score based Fusion

Published: 21 March 2014

Abstract

Emotion recognition has long been one of the cornerstones of human-computer interaction. Although decades of work have attacked the problem of automatic emotion recognition from either audio or video signals alone, the fusion of the two modalities is more recent. In this paper, we tackle the problem in the setting where audio and video data are available in synchronized form. We address the six basic human emotions: anger, disgust, fear, happiness, sadness, and surprise. We employ an automatic face tracker to extract facial points of interest from a video, and then compute a feature vector for each video frame from the distances and angles between the tracked points. For audio data, we use pitch, energy, and MFCCs to derive feature vectors both for each window and for the entire signal. We use two standard techniques, GMM-based HMMs and SVMs, as base classifiers, and design a novel fusion method based on the F-scores of the base classifiers. We first demonstrate that our fusion approach can increase the accuracy of the base classifiers by as much as 5%. Finally, we show that the resulting bi-modal emotion recognition method achieves an overall accuracy of 54% on a publicly available database, improving upon the current state of the art by 9%.



Published In

CODS '14: Proceedings of the 1st IKDD Conference on Data Sciences
March 2014
73 pages
ISBN:9781450324755
DOI:10.1145/2567688

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Audio-visual data
  2. Emotion recognition
  3. F-score
  4. Multi-class fusion

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

CoDS '14: 1st IKDD Conference on Data Sciences
March 21 - 23, 2014
Delhi, India

Acceptance Rates

CODS '14 paper acceptance rate: 7 of 57 submissions (12%)
Overall acceptance rate: 197 of 680 submissions (29%)


Cited By

  • (2021) Multimodal emotion recognition using SDA-LDA algorithm in video clips. Journal of Ambient Intelligence and Humanized Computing, 14(6):6585-6602. DOI: 10.1007/s12652-021-03529-7
  • (2020) Research on the Phonetic Emotion Recognition Model of Mandarin Chinese. In 2020 International Conference on Culture-oriented Science & Technology (ICCST), pages 602-606. DOI: 10.1109/ICCST50977.2020.00124
  • (2020) Psychological Personal Support System with Long Short Term Memory and Facial Expressions Recognition Approach. In Deep Learning for Medical Decision Support Systems, pages 129-144. DOI: 10.1007/978-981-15-6325-6_8
  • (2018) A Multimodal Emotion Recognition System Using Facial Landmark Analysis. Iranian Journal of Science and Technology, Transactions of Electrical Engineering, 43(S1):171-189. DOI: 10.1007/s40998-018-0142-9
  • (2016) Fusion of classifier predictions for audio-visual emotion recognition. In 2016 23rd International Conference on Pattern Recognition (ICPR), pages 61-66. DOI: 10.1109/ICPR.2016.7899608
