Improved TOPSIS method for peak frame selection in audio-video human emotion recognition

Multimedia Tools and Applications

Abstract

The selection of the peak frame with identification of its corresponding voiced segment is a challenging problem in audio-video human emotion recognition. The peak frame is the most relevant descriptor of the facial expression among the varied emotional states observed in a visual sequence. In this paper, an improved Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) is proposed to select the key frame based on the co-occurrence behavior of facial action units in the visual sequence. The proposed method utilizes expert judgments while identifying the peak frame in the video modality. It then locates the peak voiced segment in the audio modality using synchronous and asynchronous temporal relationships with the selected peak visual frame. The facial action unit features of the peak frame are fused with nine statistical characteristics of the spectral features of the voiced segment. A weighted product rule-based decision-level fusion combines the posterior probabilities of two independent (i.e., audio and video) support vector machine classification models. The performance of the proposed peak frame and voiced segment selection method is evaluated and compared with the existing Maximum-Dissimilarity (MAX-DIST), Dendrogram-Clustering (DEND-CLUSTER), and Emotion Intensity (EIFS) based peak frame selection methods on two challenging emotion datasets in two different languages, namely eNTERFACE’05 in English and BAUM-1a in Turkish. The results show that the system with the proposed method outperforms the existing techniques, achieving emotion recognition accuracies of 88.03% and 84.61% on the eNTERFACE’05 and BAUM-1a datasets, respectively.
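To make the two core steps named above concrete, the following is a minimal sketch, under stated assumptions, of (i) a standard TOPSIS ranking applied to per-frame facial action unit (AU) intensities and (ii) the weighted product rule for decision-level fusion of classifier posteriors. It is an illustration rather than the authors' implementation: the AU criteria, the expert-derived weights, and the specific improvements to TOPSIS proposed in the paper are not reproduced here.

```python
# Illustrative sketch only: standard TOPSIS ranking plus weighted product
# fusion. Assumes each frame is described by facial action unit (AU)
# intensities treated as benefit criteria; criterion weights stand in for
# the paper's expert judgments and are NOT the authors' values.
import numpy as np


def topsis_peak_frame(frame_au_matrix, weights):
    """Return the index of the frame closest to the ideal solution.

    frame_au_matrix : (n_frames, n_criteria) AU intensities per frame.
    weights         : (n_criteria,) criterion weights summing to 1.
    """
    X = np.asarray(frame_au_matrix, dtype=float)
    w = np.asarray(weights, dtype=float)

    # 1. Vector-normalise each criterion column.
    norms = np.linalg.norm(X, axis=0)
    norms[norms == 0] = 1.0                      # guard against inactive AUs
    V = (X / norms) * w                          # 2. weighted normalised matrix

    # 3. Ideal best/worst (all AU criteria treated as benefit criteria).
    ideal_best, ideal_worst = V.max(axis=0), V.min(axis=0)

    # 4. Separation measures and relative closeness to the ideal solution.
    s_plus = np.linalg.norm(V - ideal_best, axis=1)
    s_minus = np.linalg.norm(V - ideal_worst, axis=1)
    closeness = s_minus / (s_plus + s_minus + 1e-12)

    # 5. The frame with the highest closeness is taken as the peak frame.
    return int(np.argmax(closeness)), closeness


def weighted_product_fusion(p_audio, p_video, w_audio=0.5, w_video=0.5):
    """Combine per-class posteriors of the audio and video classifiers.

    The modality weights are placeholders, typically tuned on validation data.
    """
    p_a, p_v = np.asarray(p_audio, float), np.asarray(p_video, float)
    fused = (p_a ** w_audio) * (p_v ** w_video)  # weighted product rule
    fused /= fused.sum()                         # renormalise to probabilities
    return int(np.argmax(fused)), fused
```

For example, `topsis_peak_frame(au_intensities, au_weights)` could rank frames whose AU intensities are estimated by a facial behavior analysis toolkit such as OpenFace, and `weighted_product_fusion` would then merge the per-class posteriors produced by the audio and video SVMs; the function names, inputs, and weight values here are hypothetical.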


References

  1. Alonso JA, Lamata MT (2006) Consistency in the analytic hierarchy process: a new approach. Int J Uncertainty Fuzziness Knowledge Based Syst 14.4:445–459. https://doi.org/10.1142/S0218488506004114

  2. Amiriparian S, Freitag M, Cummins N, Schuller B (2017) Feature selection in multimodal continuous emotion prediction. In: 17th IEEE international conference on affective computing and intelligent interaction workshops and demos (ACIIW), pp 30–37. https://doi.org/10.1109/ACIIW.2017.8272619

  3. Atrey PK, Hossain MA, El Saddik A, Kankanhalli MS (2010) Multimodal fusion for multimedia analysis: a survey. Multimedia Systems 16.6:345–379. https://doi.org/10.1007/s00530-010-0182-0

  4. Baltrušaitis T, Ahuja C, Morency L-P (2018) Multimodal machine learning: a survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, p 99. https://doi.org/10.1109/TPAMI.2018.2798607

  5. Baltrusaitis T, Mahmoud M, Robinson P (2015) Cross-dataset learning and person-specific normalisation for automatic action unit detection. In: IEEE international conference and workshops on automatic face and gesture recognition (FG), vol 6. pp 1–6. https://doi.org/10.1109/FG.2015.7284869. https://github.com/TadasBaltrusaitis/FERA-2015. Accessed 21 Feb 2016

  6. Baltrusaitis T, Robinson P, Morency L-P (2012) 3D constrained local model for rigid and non-rigid facial tracking. In: IEEE conference on computer vision and pattern recognition (CVPR), pp 2610–2617. https://doi.org/10.1109/CVPR.2012.6247980

  7. Baltrusaitis T, Robinson P, Morency L-P (2013) Constrained local neural fields for robust facial landmark detection in the wild. In: Proceedings of the IEEE international conference on computer vision workshops, pp 354–361. https://doi.org/10.1109/ICCVW.2013.54

  8. Baltrusaitis T, Robinson P, Morency L-P (2016) Openface: an open source facial behavior analysis toolkit. In: IEEE winter conference on applications of computer vision (WACV), pp 1–10. https://doi.org/10.1109/WACV.2016.7477553

  9. Chang C-C, Lin C-J (2011) LIBSVM: A library for support vector machines. ACM transactions on Intelligent Systems and Technology (TIST) 2.3:27. https://doi.org/10.1145/1961189.1961199

  10. Du S, Tao Y, Martinez AM (2014) Compound facial expressions of emotion. Proc Natl Acad Sci 111.15:1454–62. https://doi.org/10.1073/pnas.1322355111

  11. Ekman P (1999) Basic emotions. The Handbook of Cognition and Emotion. pp 45–60

  12. Ekman P, Friesen WV, Hager JC (2002) Facial action coding system: the manual on CD ROM instructor’s guide. Network Information Research Co, Salt Lake City

  13. Escalera S, Pujol O, Radeva P (2009) Separability of ternary codes for sparse designs of error-correcting output codes. Pattern Recogn Lett 30.3:285–297. https://doi.org/10.1016/j.patrec.2008.10.002

  14. Gharavian D, Bejani M, Sheikhan M (2017) Audio-visual emotion recognition using FCBF feature selection method and particle swarm optimization for fuzzy ARTMAP neural networks. Multimedia Tools and Applications 76.2:2331–2352. https://doi.org/10.1007/s11042-015-3180-6

  15. Giannakopoulos T (2009) A method for silence removal and segmentation of speech signals, implemented in Matlab. University of Athens, Athens, p 2

  16. Grant KW, Greenberg S (2001) Speech intelligibility derived from asynchronous processing of auditory-visual information. In: International conference on auditory-visual speech processing (AVSP), pp 132–137

  17. Haq S, Jackson PJB (2009) Speaker-dependent audio-visual emotion recognition. In: Proceedings of international conference on auditory-visual speech processing, pp 53–58

  18. Hermansky H, Morgan N (1994) RASTA Processing of speech. IEEE Transactions on Speech and Audio Processing 2.4:578–589. https://doi.org/10.1109/89.326616

  19. Hwang C-L, Yoon K (1981) Methods for multiple attribute decision making. In: Multiple attribute decision making, Springer, Berlin, pp 58–191. https://doi.org/10.1007/978-3-642-48318-9_3

  20. King DE (2009) Dlib-ml: a machine learning toolkit. J Mach Learn Res 10:1755–1758

  21. Kolakowska A, Landowska A, Szwoch M, Wrobel MR (2014) Emotion recognition and its applications. Human-Computer Systems Interactions: Backgrounds and Applications. pp 51–62. https://doi.org/10.1007/978-3-319-08491-6_5

  22. Kohler CG, Turner T, Stolar NM, Bilker WB, Brensinger CM, Gur RE, Gur RC (2004) Differences in facial expressions of four universal emotions. Psychiatry Res 128.3:235–244. https://doi.org/10.1016/j.psychres.2004.07.003

  23. Martin O, Kotsia I, Macq B, Pitas I (2006) The eNTERFACE’05 audio-visual emotion database. In: 22nd IEEE international conference on data engineering workshops, pp 1–8. http://www.enterface.net/enterface05/docs/results/databases/project2_database.zip. Accessed 13 Sept 2016

  24. Kayaoglu M, Erdem CE (2015) Affect recognition using key frame selection based on minimum sparse reconstruction. In: Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, pp 519–524. https://doi.org/10.1145/2818346.2830594

  25. Mavadati SM, Mahoor MH, Bartlett K, Trinh P, Cohn JF (2013) DISFA: A spontaneous facial action intensity database. IEEE Trans Affect Comput 4.2:151–60. https://doi.org/10.1109/T-AFFC.2013.4

  26. Picard RW, Picard R (1997) Affective computing. MIT Press, Cambridge, p 252

  27. Rao RV (2013) Improved multiple attribute decision making methods. In: Decision making in manufacturing environment using graph theory and fuzzy multiple attribute decision making methods, Springer, London, pp 7–39. https://doi.org/10.1007/978-1-4471-4375-8_2

  28. Sidorov M, Sopov E, Ivanov I, Minker W (2015) Feature and decision level audio-visual data fusion in emotion recognition problem. In: 12th IEEE international conference on informatics in control automation and robotics (ICINCO), vol 2. pp 246–251

  29. Valstar MF, Almaev T, Girard JM, McKeown G, Mehu M, Yin L, Pantic M, Cohn JF (2015) FERA 2015-Second facial expression recognition and analysis challenge. In: 11th IEEE international conference and workshops on automatic face and gesture recognition (FG) vol 6. pp 1–8. https://doi.org/10.1109/FG.2015.7284874

  30. Yan C, Zhang Y, Xu J, Dai F, Zhang J, Dai Q, Wu F (2014) Efficient parallel framework for HEVC motion estimation on Many-Core processors. IEEE Trans Circuits Syst Video Technol 24.12:2077–2089. https://doi.org/10.1109/TCSVT.2014.2335852

  31. Yan C, Zhang Y, Xu J, Dai F, Li L, Dai Q, Wu F (2014) A highly parallel framework for HEVC coding unit partitioning tree decision on many-core processors. IEEE Signal Process Lett 21.5:573–576. https://doi.org/10.1109/LSP.2014.2310494

  32. Yan C, Xie H, Yang D, Yin J, Zhang Y, Dai Q (2018) Supervised hash coding with deep neural network for environment perception of intelligent vehicles. IEEE Trans Intell Transp Syst 19.1:284–295. https://doi.org/10.1109/TITS.2017.2749965

  33. Zhalehpour S, Akhtar Z, Erdem CE (2014) Multimodal emotion recognition with automatic peak frame selection. In: Proceedings of IEEE international symposium on innovations in intelligent systems and applications (INISTA), pp 116–121

  34. Zhalehpour S, Akhtar Z, Erdem CE (2016) Multimodal emotion recognition based on peak frame selection from video. Signal, Image and Video Processing 10:827–834. https://doi.org/10.1007/s11760-015-0822-0

  35. Zhalehpour S, Onder O, Akhtar Z, Erdem CE (2016) BAUM-1: A spontaneous audio-visual face database of affective and mental states. IEEE Trans Affect Comput 8.3:300–313. https://doi.org/10.1109/TAFFC.2016.2553038. http://baum1.bahcesehir.edu.tr/R/. Accessed 18 Aug 2017

  36. Zhang X, Yin L, Cohn JF, Canavan SJ, Reale MJ, Horowitz A, Liu P, Girard JM (2014) BP4D-spontaneous: a high-resolution spontaneous 3d dynamic facial expression database. Image and Vision Computing 32.10:692–706


Acknowledgements

This work was supported by the University Grants Commission (UGC), Ministry of Human Resource Development (MHRD) of India, under the Basic Scientific Research (BSR) fellowship for meritorious fellows vide UGC letter no. F.25-1/2013-14(BSR)/7-379/2012(BSR) dated 30.5.2014.

Author information

Corresponding author

Correspondence to Sarbjeet Singh.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Singh, L., Singh, S. & Aggarwal, N. Improved TOPSIS method for peak frame selection in audio-video human emotion recognition. Multimed Tools Appl 78, 6277–6308 (2019). https://doi.org/10.1007/s11042-018-6402-x
