CycDA: Unsupervised Cycle Domain Adaptation to Learn from Image to Video

Lin, Wei; Kukleva, Anna; Sun, Kunyang; Possegger, Horst; Kuehne, Hilde; Bischof, Horst

doi:10.1007/978-3-031-20062-5_40

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13663)

Included in the following conference series: European Conference on Computer Vision

Abstract

Although action recognition has achieved impressive results in recent years, both the collection and the annotation of video training data remain time-consuming and costly. Image-to-video adaptation has therefore been proposed to exploit label-free web images as a source for adapting to unlabeled target videos. This poses two major challenges: (1) the spatial domain shift between web images and video frames; (2) the modality gap between image and video data. To address these challenges, we propose Cycle Domain Adaptation (CycDA), a cycle-based approach for unsupervised image-to-video domain adaptation. On the one hand, we leverage the joint spatial information in images and videos; on the other hand, we train an independent spatio-temporal model to bridge the modality gap. We alternate between spatial and spatio-temporal learning, with knowledge transfer between the two in each cycle. We evaluate our approach on benchmark datasets for image-to-video as well as mixed-source domain adaptation, achieving state-of-the-art results and demonstrating the benefits of our cyclic adaptation.
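
The alternation described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the models, losses, or how knowledge is transferred, so everything below is an assumption made for illustration. A nearest-centroid classifier stands in for both the spatial (image) model and the spatio-temporal (video) model, averaging frame features stands in for temporal modeling, and knowledge transfer is approximated by passing pseudo labels back and forth (majority vote over frames in one direction, copying the video label to its frames in the other). The names `centroid_fit`, `centroid_predict`, `mean_vector`, and `cycle_adapt` are hypothetical.

```python
from collections import Counter

def centroid_fit(features, labels):
    """Fit a nearest-centroid classifier: one mean vector per class."""
    sums, counts = {}, Counter(labels)
    for x, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def centroid_predict(model, x):
    """Return the class whose centroid is closest to x (squared Euclidean)."""
    return min(model, key=lambda y: sum((a - b) ** 2 for a, b in zip(model[y], x)))

def mean_vector(frames):
    """Average frame features (assumed stand-in for a spatio-temporal encoder)."""
    n = len(frames)
    return [sum(f[i] for f in frames) / n for i in range(len(frames[0]))]

def cycle_adapt(src_x, src_y, videos, num_cycles=2):
    """Alternate frame-level and video-level training via pseudo labels."""
    # Spatial model trained on labeled source images only.
    image_model = centroid_fit(src_x, src_y)
    video_x = [mean_vector(v) for v in videos]
    video_labels = []
    for _ in range(num_cycles):
        # Spatial -> video: majority vote over frame predictions (assumption).
        votes = [
            Counter(centroid_predict(image_model, f) for f in v).most_common(1)[0][0]
            for v in videos
        ]
        # Train the video-level model on the pseudo-labeled target videos.
        video_model = centroid_fit(video_x, votes)
        video_labels = [centroid_predict(video_model, x) for x in video_x]
        # Video -> spatial: copy each video's pseudo label to its frames,
        # then refit the spatial model on source images plus target frames.
        frame_x = [f for v in videos for f in v]
        frame_y = [y for v, y in zip(videos, video_labels) for _ in v]
        image_model = centroid_fit(src_x + frame_x, src_y + frame_y)
    return video_labels

# Toy data: two labeled source "images" per class, two unlabeled shifted "videos".
src_x = [[0.0, 0.0], [0.2, 0.1], [4.0, 4.0], [4.1, 3.8]]
src_y = ["run", "run", "jump", "jump"]
videos = [
    [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1]],
    [[5.0, 5.0], [5.2, 4.9], [4.8, 5.1]],
]
print(cycle_adapt(src_x, src_y, videos))
```

On this well-separated toy data the pseudo labels are already correct in the first cycle, so further cycles change nothing; the sketch only shows the control flow of alternating between the two models, not the benefit reported in the paper.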

Acknowledgements

This work was partially funded by the Austrian Research Promotion Agency (FFG) project 874065 and by the Christian Doppler Laboratory for Semantic 3D Computer Vision, funded in part by Qualcomm Inc.

Author information

Authors and Affiliations

Institute of Computer Graphics and Vision, Graz University of Technology, Graz, Austria
Wei Lin, Kunyang Sun, Horst Possegger & Horst Bischof
Max-Planck-Institute for Informatics, Saarbrücken, Germany
Anna Kukleva
Southeast University, Nanjing, China
Kunyang Sun
Goethe University Frankfurt, Frankfurt, Germany
Hilde Kuehne
Christian Doppler Laboratory for Semantic 3D Computer Vision, Frankfurt, Germany
Wei Lin

Corresponding author

Correspondence to Wei Lin.

Editor information

Editors and Affiliations

Tel Aviv University, Tel Aviv, Israel
Shai Avidan
University College London, London, UK
Gabriel Brostow
Google AI, Accra, Ghana
Moustapha Cissé
University of Catania, Catania, Italy
Giovanni Maria Farinella
Facebook (United States), Menlo Park, CA, USA
Tal Hassner

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 7978 KB)

About this paper

Cite this paper

Lin, W., Kukleva, A., Sun, K., Possegger, H., Kuehne, H., Bischof, H. (2022). CycDA: Unsupervised Cycle Domain Adaptation to Learn from Image to Video. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13663. Springer, Cham. https://doi.org/10.1007/978-3-031-20062-5_40

DOI: https://doi.org/10.1007/978-3-031-20062-5_40
Published: 11 November 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20061-8
Online ISBN: 978-3-031-20062-5
eBook Packages: Computer Science, Computer Science (R0)
