MoCap-Video Data Retrieval with Deep Cross-Modal Learning

Zhang, Lu; Peng, Jingliang; Lv, Na

doi:10.1007/978-3-031-53308-2_36

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14555)

Included in the following conference series:

International Conference on Multimedia Modeling

Abstract

Cross-modal retrieval between video and motion capture (MoCap) data facilitates efficient reuse of human motion data in either skeletal or video format. To this end, we propose a deep cross-modal learning model for retrieval between MoCap data and video data. First, we use a graph convolution-based network and a 3D convolution-based network to extract features from MoCap data and video data, respectively. In addition, we propose to use a pre-defined common subspace to maximize the inter-class variation and minimize the intra-class variation. Furthermore, we employ a similarity matrix to align the two modalities and exploit their underlying correlations. For experimental evaluation, because the public HDM05 dataset contains little video data corresponding to its MoCap data, we recorded a video dataset corresponding to the HDM05 motion capture dataset and performed cross-modal retrieval on it. The experimental results demonstrate the effectiveness of the proposed scheme.
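
As a rough illustration of retrieval in a shared subspace, the following minimal sketch ranks video embeddings against a MoCap query using a cross-modal similarity matrix. It is not the model described in the abstract: the feature extractors are omitted, the use of cosine similarity is an assumption (the abstract does not say how similarity is computed), and the function names and toy vectors are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def similarity_matrix(mocap_feats, video_feats):
    """Cosine similarity between each MoCap embedding (rows) and each video embedding (columns).

    Assumption: both modalities have already been mapped into the same subspace.
    """
    m = [l2_normalize(f) for f in mocap_feats]
    v = [l2_normalize(f) for f in video_feats]
    return [[sum(a * b for a, b in zip(row, col)) for col in v] for row in m]


def retrieve(sim_row, top_k=1):
    """Return the indices of the top_k most similar gallery items for one query."""
    order = sorted(range(len(sim_row)), key=lambda j: sim_row[j], reverse=True)
    return order[:top_k]


# Toy embeddings, invented purely for illustration (not data from the paper).
mocap_feats = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.2]]
video_feats = [[0.1, 0.8, 0.1], [1.0, 0.0, 0.1], [0.0, 0.1, 1.0]]

sim = similarity_matrix(mocap_feats, video_feats)
print(retrieve(sim[0], top_k=1))  # best-matching video index for MoCap query 0
print(retrieve(sim[1], top_k=1))  # best-matching video index for MoCap query 1
```

Retrieval in the opposite direction (video query, MoCap gallery) would read the columns of the same matrix instead of its rows.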

Acknowledgement

This work is supported by the National Natural Science Foundation of China (No. 61802144) and Shandong Provincial Natural Science Foundation, China (No. ZR2022MF294).

Author information

Authors and Affiliations

Shandong Provincial Key Laboratory of Network Based Intelligent Computing, University of Jinan, Jinan, China
Lu Zhang, Jingliang Peng & Na Lv
School of Information Science and Engineering, University of Jinan, Jinan, China
Lu Zhang, Jingliang Peng & Na Lv

Corresponding author

Correspondence to Na Lv.

Editor information

Editors and Affiliations

University of Amsterdam, Amsterdam, The Netherlands
Stevan Rudinac
Delft University of Technology, Delft, The Netherlands
Alan Hanjalic
Delft University of Technology, Delft, The Netherlands
Cynthia Liem
University of Amsterdam, Amsterdam, The Netherlands
Marcel Worring
Reykjavik University, Reykjavik, Iceland
Björn Þór Jónsson
Microsoft Research Lab – Asia, Beijing, China
Bei Liu
The University of Tokyo, Tokyo, Japan
Yoko Yamakata

Cite this paper

Zhang, L., Peng, J., Lv, N. (2024). MoCap-Video Data Retrieval with Deep Cross-Modal Learning. In: Rudinac, S., et al. MultiMedia Modeling. MMM 2024. Lecture Notes in Computer Science, vol 14555. Springer, Cham. https://doi.org/10.1007/978-3-031-53308-2_36

DOI: https://doi.org/10.1007/978-3-031-53308-2_36
Published: 28 January 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-53307-5
Online ISBN: 978-3-031-53308-2
eBook Packages: Computer Science, Computer Science (R0)
