Improved TOPSIS method for peak frame selection in audio-video human emotion recognition

Multimedia Tools and Applications

Abstract

The selection of the peak frame with identification of its corresponding voiced segment is a challenging problem in audio-video human emotion recognition. The peak frame is the most relevant descriptor of the facial expression among the varied emotional states observed in a visual sequence. In this paper, an improved Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) is proposed to select the key frame based on the co-occurrence behavior of facial action units in the visual sequence. The proposed method utilizes expert judgments while identifying the peak frame in the video modality. It then locates the peak voiced segment in the audio modality using synchronous and asynchronous temporal relationships with the selected peak visual frame. The facial action unit features of the peak frame are fused with nine statistical characteristics of the spectral features of the voiced segment. A weighted product rule-based decision-level fusion combines the posterior probabilities of two independent (i.e., audio and video) support vector machine classification models. The performance of the proposed peak frame and voiced segment selection method is evaluated and compared with the existing Maximum-Dissimilarity (MAX-DIST), Dendrogram-Clustering (DEND-CLUSTER), and Emotion Intensity (EIFS) based peak frame selection methods on two challenging emotion datasets in two different languages, namely eNTERFACE’05 in English and BAUM-1a in Turkish. The results show that the system with the proposed method outperforms the existing techniques, achieving emotion recognition accuracies of 88.03% and 84.61% on the eNTERFACE’05 and BAUM-1a datasets, respectively.
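To make the two core steps named above concrete, the following is a minimal sketch, under stated assumptions, of (i) a standard TOPSIS ranking applied to per-frame facial action unit (AU) intensities and (ii) the weighted product rule for decision-level fusion of classifier posteriors. It is an illustration rather than the authors' implementation: the AU criteria, the expert-derived weights, and the specific improvements to TOPSIS proposed in the paper are not reproduced here.

```python
# Illustrative sketch only: standard TOPSIS ranking plus weighted product
# fusion. Assumes each frame is described by facial action unit (AU)
# intensities treated as benefit criteria; criterion weights stand in for
# the paper's expert judgments and are NOT the authors' values.
import numpy as np


def topsis_peak_frame(frame_au_matrix, weights):
    """Return the index of the frame closest to the ideal solution.

    frame_au_matrix : (n_frames, n_criteria) AU intensities per frame.
    weights         : (n_criteria,) criterion weights summing to 1.
    """
    X = np.asarray(frame_au_matrix, dtype=float)
    w = np.asarray(weights, dtype=float)

    # 1. Vector-normalise each criterion column.
    norms = np.linalg.norm(X, axis=0)
    norms[norms == 0] = 1.0                      # guard against inactive AUs
    V = (X / norms) * w                          # 2. weighted normalised matrix

    # 3. Ideal best/worst (all AU criteria treated as benefit criteria).
    ideal_best, ideal_worst = V.max(axis=0), V.min(axis=0)

    # 4. Separation measures and relative closeness to the ideal solution.
    s_plus = np.linalg.norm(V - ideal_best, axis=1)
    s_minus = np.linalg.norm(V - ideal_worst, axis=1)
    closeness = s_minus / (s_plus + s_minus + 1e-12)

    # 5. The frame with the highest closeness is taken as the peak frame.
    return int(np.argmax(closeness)), closeness


def weighted_product_fusion(p_audio, p_video, w_audio=0.5, w_video=0.5):
    """Combine per-class posteriors of the audio and video classifiers.

    The modality weights are placeholders, typically tuned on validation data.
    """
    p_a, p_v = np.asarray(p_audio, float), np.asarray(p_video, float)
    fused = (p_a ** w_audio) * (p_v ** w_video)  # weighted product rule
    fused /= fused.sum()                         # renormalise to probabilities
    return int(np.argmax(fused)), fused
```

For example, `topsis_peak_frame(au_intensities, au_weights)` could rank frames whose AU intensities are estimated by a facial behavior analysis toolkit such as OpenFace, and `weighted_product_fusion` would then merge the per-class posteriors produced by the audio and video SVMs; the function names, inputs, and weight values here are hypothetical.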


References

  1. Alonso JA, Lamata MT (2006) Consistency in the analytic hierarchy process: a new approach. Int J Uncertainty Fuzziness Knowledge Based Syst 14.4:445–459. https://doi.org/10.1142/S0218488506004114

  2. Amiriparian S, Freitag M, Cummins N, Schuller B (2017) Feature selection in multimodal continuous emotion prediction. In: 17th IEEE international conference on affective computing and intelligent interaction workshops and demos (ACIIW), pp 30–37. https://doi.org/10.1109/ACIIW.2017.8272619

  3. Atrey PK, Hossain MA, El Saddik A, Kankanhalli MS (2010) Multimodal fusion for multimedia analysis: a survey. Multimedia Systems 16.6:345–379. https://doi.org/10.1007/s00530-010-0182-0

  4. Baltrušaitis T, Ahuja C, Morency L-P (2018) Multimodal machine learning: a survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, p 99. https://doi.org/10.1109/TPAMI.2018.2798607

  5. Baltrusaitis T, Mahmoud M, Robinson P (2015) Cross-dataset learning and person-specific normalisation for automatic action unit detection. In: IEEE international conference and workshops on automatic face and gesture recognition (FG), vol 6. pp 1–6. https://doi.org/10.1109/FG.2015.7284869. https://github.com/TadasBaltrusaitis/FERA-2015. Accessed 21 Feb 2016

  6. Baltrusaitis T, Robinson P, Morency L-P (2012) 3D constrained local model for rigid and non-rigid facial tracking. In: IEEE conference on computer vision and pattern recognition (CVPR), pp 2610–2617. https://doi.org/10.1109/CVPR.2012.6247980

  7. Baltrusaitis T, Robinson P, Morency L-P (2013) Constrained local neural fields for robust facial landmark detection in the wild. In: Proceedings of the IEEE international conference on computer vision workshops, pp 354–361. https://doi.org/10.1109/ICCVW.2013.54

  8. Baltrusaitis T, Robinson P, Morency L-P (2016) Openface: an open source facial behavior analysis toolkit. In: IEEE winter conference on applications of computer vision (WACV), pp 1–10. https://doi.org/10.1109/WACV.2016.7477553

  9. Chang C-C, Lin C-J (2011) LIBSVM: A library for support vector machines. ACM transactions on Intelligent Systems and Technology (TIST) 2.3:27. https://doi.org/10.1145/1961189.1961199

  10. Du S, Tao Y, Martinez AM (2014) Compound facial expressions of emotion. Proc Natl Acad Sci 111.15:1454–62. https://doi.org/10.1073/pnas.1322355111

  11. Ekman P (1999) Basic emotions. The Handbook of Cognition and Emotion. pp 45–60

  12. Ekman P, Friesen WV, Hager JC (2002) Facial action coding system: the manual on CD ROM instructor’s guide. Network Information Research Co, Salt Lake City

  13. Escalera S, Pujol O, Radeva P (2009) Separability of ternary codes for sparse designs of error-correcting output codes. Pattern Recogn Lett 30.3:285–297. https://doi.org/10.1016/j.patrec.2008.10.002

  14. Gharavian D, Bejani M, Sheikhan M (2017) Audio-visual emotion recognition using FCBF feature selection method and particle swarm optimization for fuzzy ARTMAP neural networks. Multimedia Tools and Applications 76.2:2331–2352. https://doi.org/10.1007/s11042-015-3180-6

  15. Giannakopoulos T (2009) A method for silence removal and segmentation of speech signals, implemented in Matlab. University of Athens, Athens, p 2

  16. Grant KW, Greenberg S (2001) Speech intelligibility derived from asynchronous processing of auditory-visual information. In: International conference on auditory-visual speech processing (AVSP), pp 132–137

  17. Haq S, Jackson PJB (2009) Speaker-dependent audio-visual emotion recognition. In: Proceedings of international conference on auditory-visual speech processing, pp 53–58

  18. Hermansky H, Morgan N (1994) RASTA Processing of speech. IEEE Transactions on Speech and Audio Processing 2.4:578–589. https://doi.org/10.1109/89.326616

  19. Hwang C-L, Yoon K (1981) Methods for multiple attribute decision making. In: Multiple attribute decision making, Springer, Berlin, pp 58–191. https://doi.org/10.1007/978-3-642-48318-9_3

  20. King DE (2009) Dlib-ml: a machine learning toolkit. J Mach Learn Res 10:1755–1758

  21. Kolakowska A, Landowska A, Szwoch M, Wrobel MR (2014) Emotion recognition and its applications. Human-Computer Systems Interactions: Backgrounds and Applications. pp 51–62. https://doi.org/10.1007/978-3-319-08491-6_5

  22. Kohler CG, Turner T, Stolar NM, Bilker WB, Brensinger CM, Gur RE, Gur RC (2004) Differences in facial expressions of four universal emotions. Psychiatry Res 128.3:235–244. https://doi.org/10.1016/j.psychres.2004.07.003

  23. Martin O, Kotsia I, Macq B, Pitas I (2006) The eNTERFACE’05 audio-visual emotion database. In: 22nd IEEE international conference on data engineering workshops, pp 1–8. http://www.enterface.net/enterface05/docs/results/databases/project2_database.zip. Accessed 13 Sept 2016

  24. Kayaoglu M, Erdem CE (2015) Affect recognition using key frame selection based on minimum sparse reconstruction. In: Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, pp 519–524. https://doi.org/10.1145/2818346.2830594

  25. Mavadati SM, Mahoor MH, Bartlett K, Trinh P, Cohn JF (2013) DISFA: A spontaneous facial action intensity database. IEEE Trans Affect Comput 4.2:151–60. https://doi.org/10.1109/T-AFFC.2013.4

  26. Picard RW, Picard R (1997) Affective computing. MIT Press, Cambridge, p 252

  27. Rao RV (2013) Improved multiple attribute decision making methods. In: Decision making in manufacturing environment using graph theory and fuzzy multiple attribute decision making methods, Springer, London, pp 7–39. https://doi.org/10.1007/978-1-4471-4375-8_2

  28. Sidorov M, Sopov E, Ivanov I, Minker W (2015) Feature and decision level audio-visual data fusion in emotion recognition problem. In: 12th IEEE international conference on informatics in control automation and robotics (ICINCO), vol 2. pp 246–251

  29. Valstar MF, Almaev T, Girard JM, McKeown G, Mehu M, Yin L, Pantic M, Cohn JF (2015) FERA 2015-Second facial expression recognition and analysis challenge. In: 11th IEEE international conference and workshops on automatic face and gesture recognition (FG) vol 6. pp 1–8. https://doi.org/10.1109/FG.2015.7284874

  30. Yan C, Zhang Y, Xu J, Dai F, Zhang J, Dai Q, Wu F (2014) Efficient parallel framework for HEVC motion estimation on Many-Core processors. IEEE Trans Circuits Syst Video Technol 24.12:2077–2089. https://doi.org/10.1109/TCSVT.2014.2335852

  31. Yan C, Zhang Y, Xu J, Dai F, Li L, Dai Q, Wu F (2014) A highly parallel framework for HEVC coding unit partitioning tree decision on many-core processors. IEEE Signal Process Lett 21.5:573–576. https://doi.org/10.1109/LSP.2014.2310494

  32. Yan C, Xie H, Yang D, Yin J, Zhang Y, Dai Q (2018) Supervised hash coding with deep neural network for environment perception of intelligent vehicles. IEEE Trans Intell Transp Syst 19.1:284–295. https://doi.org/10.1109/TITS.2017.2749965

  33. Zhalehpour S, Akhtar Z, Erdem CE (2014) Multimodal emotion recognition with automatic peak frame selection. In: Proceedings of IEEE international symposium on innovations in intelligent systems and applications (INISTA), pp 116–121

  34. Zhalehpour S, Akhtar Z, Erdem CE (2016) Multimodal emotion recognition based on peak frame selection from video. Signal, Image and Video Processing 10:827–834. https://doi.org/10.1007/s11760-015-0822-0

  35. Zhalehpour S, Onder O, Akhtar Z, Erdem CE (2016) BAUM-1: A spontaneous audio-visual face database of affective and mental states. IEEE Trans Affect Comput 8.3:300–313. https://doi.org/10.1109/TAFFC.2016.2553038. http://baum1.bahcesehir.edu.tr/R/. Accessed 18 Aug 2017

  36. Zhang X, Yin L, Cohn JF, Canavan SJ, Reale MJ, Horowitz A, Liu P, Girard JM (2014) BP4D-spontaneous: a high-resolution spontaneous 3d dynamic facial expression database. Image and Vision Computing 32.10:692–706


Acknowledgements

This work was supported by the University Grants Commission (UGC), Ministry of Human Resource Development (MHRD) of India, under the Basic Scientific Research (BSR) fellowship for meritorious fellows vide UGC letter no. F.25-1/2013-14(BSR)/7-379/2012(BSR) dated 30.5.2014.

Author information

Corresponding author

Correspondence to Sarbjeet Singh.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Singh, L., Singh, S. & Aggarwal, N. Improved TOPSIS method for peak frame selection in audio-video human emotion recognition. Multimed Tools Appl 78, 6277–6308 (2019). https://doi.org/10.1007/s11042-018-6402-x
