
Cross-Modal Retrieval for Motion and Text via DropTriple Loss

Published: 01 January 2024

Abstract

Cross-modal retrieval between images and text and between video and text is a prominent research area in computer vision and natural language processing. Cross-modal retrieval between human motion and text, however, has received far less attention despite its wide-ranging applicability. To address this gap, we use a concise yet effective dual-unimodal transformer encoder to tackle the task. Recognizing that overlapping atomic actions in different human motion sequences can lead to semantic conflicts between samples, we explore a novel triplet loss function called DropTriple Loss. It discards false negative samples from the negative sample set and focuses on mining the remaining genuinely hard negative samples for triplet training, thereby reducing the margin violations that false negatives would otherwise cause. We evaluate our model and approach on the HumanML3D and KIT Motion-Language datasets. On the latest HumanML3D dataset, we achieve a recall of 62.9% for motion retrieval and 71.5% for text retrieval (both R@10). The source code for our approach is publicly available at https://github.com/eanson023/rehamot.
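For readers who want the mechanism in concrete terms, the sketch below shows one way a DropTriple-style objective could be implemented in PyTorch. It is a minimal illustration only: the function name droptriple_style_loss, the relative-similarity drop criterion, and the drop_thresh and margin values are assumptions made here for clarity, not the authors' published implementation (see the linked repository for that).

    import torch

    def droptriple_style_loss(motion_emb, text_emb, margin=0.2, drop_thresh=0.9):
        """Triplet loss with false-negative dropping (illustrative sketch).

        motion_emb, text_emb: (B, D) L2-normalized embeddings; row i of each
        tensor is assumed to come from the same motion-text pair.
        """
        sim = motion_emb @ text_emb.t()        # (B, B) cosine similarities
        pos = sim.diag().view(-1, 1)           # similarity of each true pair

        # Off-diagonal entries are the candidate negatives.
        neg_mask = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)

        # Drop suspected false negatives: candidates scoring close to the
        # positive likely share overlapping atomic actions with the anchor.
        # (The exact criterion used in the paper may differ.)
        keep_t = neg_mask & (sim < drop_thresh * pos)      # motion anchors (rows)
        keep_m = neg_mask & (sim < drop_thresh * pos.t())  # text anchors (columns)

        # Hinge cost over the surviving, genuinely hard negatives.
        cost_t = (margin + sim - pos).clamp(min=0) * keep_t
        cost_m = (margin + sim - pos.t()).clamp(min=0) * keep_m

        # Train on the hardest surviving negative per anchor (VSE++-style),
        # summing the motion-to-text and text-to-motion directions.
        return cost_t.max(dim=1)[0].mean() + cost_m.max(dim=0)[0].mean()

Dropping the near-positive candidates before mining is what distinguishes this from standard hardest-negative triplet training, where those same samples would dominate the loss precisely because they look like positives.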


Cited By

  • (2024) Modal-Enhanced Semantic Modeling for Fine-Grained 3D Human Motion Retrieval. In Proceedings of the 32nd ACM International Conference on Multimedia, 10114–10123. https://doi.org/10.1145/3664647.3681625
  • (2024) Multi-Instance Multi-Label Learning for Text-motion Retrieval. In Proceedings of the 32nd ACM International Conference on Multimedia, 5829–5837. https://doi.org/10.1145/3664647.3681444
  • (2024) BAMM: Bidirectional Autoregressive Motion Model. In Computer Vision – ECCV 2024, 172–190. https://doi.org/10.1007/978-3-031-72633-0_10


Published In

MMAsia '23: Proceedings of the 5th ACM International Conference on Multimedia in Asia
December 2023, 745 pages
ISBN: 9798400702051
DOI: 10.1145/3595916

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  • Contrastive learning
  • Cross-modal retrieval
  • Motion-text retrieval
  • Triplet loss

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

MMAsia '23: ACM Multimedia Asia
December 6–8, 2023
Tainan, Taiwan

Acceptance Rates

Overall acceptance rate: 59 of 204 submissions (29%)


