
Cross-Modal Retrieval for Motion and Text via DropTriple Loss

Published: 01 January 2024

Abstract

Cross-modal retrieval between images and text and between video and text is a prominent research area in computer vision and natural language processing. Cross-modal retrieval between human motion and text, however, has received far less attention despite its wide-ranging applicability. To address this gap, we use a concise yet effective dual-unimodal transformer encoder to tackle the task. Recognizing that overlapping atomic actions in different human motion sequences can lead to semantic conflicts between samples, we explore a novel triplet loss function called DropTriple Loss. It discards false negative samples from the negative sample set and focuses on mining the remaining genuinely hard negative samples for triplet training, thereby reducing the margin violations that false negatives would otherwise cause. We evaluate our model and approach on the HumanML3D and KIT Motion-Language datasets. On the latest HumanML3D dataset, we achieve a recall of 62.9% for motion retrieval and 71.5% for text retrieval (both R@10). The source code for our approach is publicly available at https://github.com/eanson023/rehamot.
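For readers who want the mechanism in concrete terms, the sketch below shows one way a DropTriple-style objective could be implemented in PyTorch. It is a minimal illustration only: the function name droptriple_style_loss, the relative-similarity drop criterion, and the drop_thresh and margin values are assumptions made here for clarity, not the authors' published implementation (see the linked repository for that).

    import torch

    def droptriple_style_loss(motion_emb, text_emb, margin=0.2, drop_thresh=0.9):
        """Triplet loss with false-negative dropping (illustrative sketch).

        motion_emb, text_emb: (B, D) L2-normalized embeddings; row i of each
        tensor is assumed to come from the same motion-text pair.
        """
        sim = motion_emb @ text_emb.t()        # (B, B) cosine similarities
        pos = sim.diag().view(-1, 1)           # similarity of each true pair

        # Off-diagonal entries are the candidate negatives.
        neg_mask = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)

        # Drop suspected false negatives: candidates scoring close to the
        # positive likely share overlapping atomic actions with the anchor.
        # (The exact criterion used in the paper may differ.)
        keep_t = neg_mask & (sim < drop_thresh * pos)      # motion anchors (rows)
        keep_m = neg_mask & (sim < drop_thresh * pos.t())  # text anchors (columns)

        # Hinge cost over the surviving, genuinely hard negatives.
        cost_t = (margin + sim - pos).clamp(min=0) * keep_t
        cost_m = (margin + sim - pos.t()).clamp(min=0) * keep_m

        # Train on the hardest surviving negative per anchor (VSE++-style),
        # summing the motion-to-text and text-to-motion directions.
        return cost_t.max(dim=1)[0].mean() + cost_m.max(dim=0)[0].mean()

Dropping the near-positive candidates before mining is what distinguishes this from standard hardest-negative triplet training, where those same samples would dominate the loss precisely because they look like positives.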


Cited By

  • (2024) Modal-Enhanced Semantic Modeling for Fine-Grained 3D Human Motion Retrieval. In Proceedings of the 32nd ACM International Conference on Multimedia, 10114–10123. https://doi.org/10.1145/3664647.3681625
  • (2024) Multi-Instance Multi-Label Learning for Text-motion Retrieval. In Proceedings of the 32nd ACM International Conference on Multimedia, 5829–5837. https://doi.org/10.1145/3664647.3681444
  • (2024) BAMM: Bidirectional Autoregressive Motion Model. In Computer Vision – ECCV 2024, 172–190. https://doi.org/10.1007/978-3-031-72633-0_10


Published In

MMAsia '23: Proceedings of the 5th ACM International Conference on Multimedia in Asia
December 2023, 745 pages
ISBN: 9798400702051
DOI: 10.1145/3595916

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  • Contrastive learning
  • Cross-modal retrieval
  • Motion-text retrieval
  • Triplet loss

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

MMAsia '23: ACM Multimedia Asia
December 6–8, 2023
Tainan, Taiwan

Acceptance Rates

Overall acceptance rate: 59 of 204 submissions (29%)


