Abstract
Short video is one of the most popular forms of user-generated content, and it also serves as a carrier of people's emotions. However, research on the emotional consistency between audio and video is limited, and relevant datasets are also lacking. In this paper, we propose a multi-modal fusion system for assessing the emotional consistency between different types of action videos and audio clips with different emotions. We also build a new dataset and compare early fusion and late fusion methods on it. We use video features extracted by a pre-trained C3D network and audio features extracted by Librosa, a tool for audio analysis. In the early fusion method, we concatenate the video and audio features and train an SVM with a linear kernel on the fused features. In the late fusion method, the video and audio features are used to train separate classifiers, each producing its own decision; we then fuse these two decisions to obtain the classification result. Our best classifier attained 85.56% accuracy.
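The early- and late-fusion pipelines described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the feature dimensions, the linear-SVM choice for the per-modality classifiers, and the score-averaging fusion rule are all assumptions standing in for C3D/Librosa features and the paper's actual decision-fusion scheme.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-ins for the paper's features: C3D video descriptors
# and Librosa audio descriptors (dimensions here are illustrative).
n_train, n_test = 80, 20
video_train = rng.normal(size=(n_train, 128))
audio_train = rng.normal(size=(n_train, 32))
labels_train = rng.integers(0, 2, size=n_train)  # 1 = emotionally consistent
video_test = rng.normal(size=(n_test, 128))
audio_test = rng.normal(size=(n_test, 32))

# Early fusion: concatenate the two feature vectors, train one linear SVM.
early_clf = SVC(kernel="linear")
early_clf.fit(np.hstack([video_train, audio_train]), labels_train)
early_pred = early_clf.predict(np.hstack([video_test, audio_test]))

# Late fusion: train one classifier per modality, then combine their
# decision scores (simple averaging here; the paper's rule may differ).
video_clf = SVC(kernel="linear").fit(video_train, labels_train)
audio_clf = SVC(kernel="linear").fit(audio_train, labels_train)
late_scores = (video_clf.decision_function(video_test)
               + audio_clf.decision_function(audio_test)) / 2
late_pred = (late_scores > 0).astype(int)
```

With real features, `early_pred` and `late_pred` would each be compared against ground-truth consistency labels to produce the accuracies the paper reports.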
Acknowledgements
This work was supported by the National Natural Science Foundation of China (62101326, 62225112, 61831015, and 62271312), National Key R&D Program of China (2021YFE0206700), and China Postdoctoral Science Foundation (2022M712090).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Gui, Y., Zhu, Y., Zhai, G., Liu, N. (2023). Subjective and Objective Emotional Consistency Assessment for UGC Short Videos. In: Zhai, G., Zhou, J., Yang, H., Yang, X., An, P., Wang, J. (eds) Digital Multimedia Communications. IFTC 2022. Communications in Computer and Information Science, vol 1766. Springer, Singapore. https://doi.org/10.1007/978-981-99-0856-1_18
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-0855-4
Online ISBN: 978-981-99-0856-1