Abstract
To better understand complex human emotions, there is growing interest in utilizing heterogeneous sensory data to detect multiple co-occurring emotions. However, existing studies have focused on extracting static information from each modality while overlooking the various interactions within and between modalities. Additionally, the label-to-modality and label-to-label dependencies remain underexplored. In this paper, we propose LAbel-induced Mixed-level Blending (LAMB) to address these challenges. Mixed-level blending leverages shallow but manifold self-attention and cross-attention encoders in parallel to model unimodal context dependency and cross-modal interaction simultaneously. This contrasts with previous works, which either use only one of the two encoder types or cascade them successively, and thereby ignore the diversity of interactions in multimodal data. LAMB also employs label-induced aggregation, in which a transformer-based decoder allows different labels to attend adaptively to the most relevant blended tokens, facilitating the exploration of label-to-modality dependency. Unlike common low-order strategies in multi-label learning, correlations among multiple labels are learned by self-attention in the label embedding space before the label embeddings are used as queries. Comprehensive experiments demonstrate the effectiveness of our method for multimodal multi-label emotion detection.
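The data flow described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes single-head scaled dot-product attention with no learned projections, feed-forward layers, positional encodings, or classification head, and every name in it (`attend`, `mixed_level_blend`, `label_induced_aggregate`), along with the modality names, sequence lengths, and dimensions, is hypothetical. It only shows the self-attention and cross-attention branches running in parallel on the same inputs, followed by label queries that first attend to each other and then to the blended tokens.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Single-head scaled dot-product attention (assumption: no learned projections)."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs


def mixed_level_blend(modalities):
    """Apply self- and cross-attention in parallel to the raw inputs, then pool all tokens."""
    blended = []
    for target, x in modalities.items():
        # Unimodal branch: tokens attend within their own modality.
        blended.extend(attend(x, x, x))
        # Cross-modal branches: the same tokens query every other modality.
        for source, y in modalities.items():
            if source != target:
                blended.extend(attend(x, y, y))
    return blended


def label_induced_aggregate(label_embeddings, tokens):
    """Label-to-label self-attention first, then labels act as queries over the tokens."""
    correlated = attend(label_embeddings, label_embeddings, label_embeddings)
    return attend(correlated, tokens, tokens)


random.seed(0)
DIM = 4  # hypothetical feature size shared by all modalities


def random_tokens(count):
    return [[random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(count)]


# Hypothetical unaligned sequences of different lengths.
modalities = {
    "text": random_tokens(5),
    "audio": random_tokens(7),
    "visual": random_tokens(6),
}
label_embeddings = random_tokens(3)  # one hypothetical embedding per emotion label

blended = mixed_level_blend(modalities)
label_repr = label_induced_aggregate(label_embeddings, blended)
print(len(blended), len(label_repr), len(label_repr[0]))
```

With these toy sizes, each modality contributes one self-attention output and two cross-attention outputs per token, giving 3 × (5 + 7 + 6) = 54 blended tokens, and the aggregation step returns one vector per label. A cascaded design would instead feed the output of one encoder type into the other, which is the ordering the abstract argues against.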
Acknowledgments
This paper is supported by the National Natural Science Foundation of China (Grant Nos. 62192783 and 62376117) and the Collaborative Innovation Center of Novel Software Technology and Industrialization at Nanjing University.
Copyright information
© 2024 ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering
About this paper
Cite this paper
Qian, S., Guo, M., Fan, Z., Chen, M., Wang, C. (2024). LAMB: Label-Induced Mixed-Level Blending for Multimodal Multi-label Emotion Detection. In: Gao, H., Wang, X., Voros, N. (eds) Collaborative Computing: Networking, Applications and Worksharing. CollaborateCom 2023. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 562. Springer, Cham. https://doi.org/10.1007/978-3-031-54528-3_2
DOI: https://doi.org/10.1007/978-3-031-54528-3_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-54527-6
Online ISBN: 978-3-031-54528-3
eBook Packages: Computer Science, Computer Science (R0)