DOI: 10.1145/3536221.3557030 (short paper)

Multimodal Representation Learning For Real-World Applications

Published: 07 November 2022

ABSTRACT

Multimodal representation learning has shown tremendous improvements in recent years. A wide range of approaches for fusing multiple modalities has shown promising results on public benchmarks. However, most prominent works target unrealistic settings or toy datasets, and a considerable gap remains between these benchmark settings and the real-world applicability of the existing methods. In this work, we aim to bridge the gap between well-defined benchmark settings and real-world use cases. We explore architectures, inspired by existing promising approaches, that have the potential to be deployed in real-world instances. Moreover, we also try to move the research forward by addressing questions that can be solved using multimodal approaches and that have a considerable impact on the community. With this work, we attempt to leverage multimodal representation learning methods that apply directly to real-world settings.
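To make the fusion idea concrete, below is a minimal, illustrative sketch of one common pattern: a gated combination of two per-modality embeddings, in the spirit of gated multimodal units. It is not the architecture proposed in this paper; the module names, dimensions, and the choice of PyTorch are assumptions made purely for illustration.

```python
# Illustrative sketch only: a simple gated fusion of two modality embeddings.
# Names, dimensions, and framework are assumptions, not the paper's method.
import torch
import torch.nn as nn


class GatedFusion(nn.Module):
    """Fuse two modality embeddings with a learned, per-dimension gate."""

    def __init__(self, dim_a: int, dim_b: int, dim_out: int):
        super().__init__()
        self.proj_a = nn.Linear(dim_a, dim_out)        # project modality A (e.g., text)
        self.proj_b = nn.Linear(dim_b, dim_out)        # project modality B (e.g., audio)
        self.gate = nn.Linear(dim_a + dim_b, dim_out)  # gate decides the per-dimension mix

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor) -> torch.Tensor:
        h_a = torch.tanh(self.proj_a(x_a))
        h_b = torch.tanh(self.proj_b(x_b))
        z = torch.sigmoid(self.gate(torch.cat([x_a, x_b], dim=-1)))
        return z * h_a + (1.0 - z) * h_b  # convex combination of the two modalities


if __name__ == "__main__":
    # Toy usage: batch of 4 samples, hypothetical text dim 768, audio dim 128.
    fusion = GatedFusion(dim_a=768, dim_b=128, dim_out=256)
    text_emb = torch.randn(4, 768)
    audio_emb = torch.randn(4, 128)
    fused = fusion(text_emb, audio_emb)
    print(fused.shape)  # torch.Size([4, 256])
```

In practice, the per-modality inputs would come from pretrained encoders, and the learned gate lets the model down-weight a noisy or uninformative modality on a per-sample basis, which is one of the practical concerns real-world fusion methods need to address.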


Published in

ICMI '22: Proceedings of the 2022 International Conference on Multimodal Interaction
November 2022, 830 pages
ISBN: 9781450393904
DOI: 10.1145/3536221

Copyright © 2022 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 7 November 2022


Qualifiers

• short-paper
• Research
• Refereed limited

Acceptance Rates

Overall Acceptance Rate: 453 of 1,080 submissions, 42%
