Abstract
Pose transfer aims to synthesize an image of a person in a target pose while preserving the person's appearance from a source image. We present a context-driven omni-dimensional dynamic pose transfer model to address the inability of regular convolutional networks to handle complex pose deformations. First, we construct a dynamic convolution module to extract rich contextual features; it adapts its convolution kernels to differences in the input data during the convolution process, enhancing the adaptability of the features. Second, we build a feature fusion block (FFBlock) that merges multiscale channel attention information from global and local channel contexts. Furthermore, the focal-frequency distance between the generated image and the original image is measured with a focal-frequency loss, which lets the model adaptively focus on frequency components that are hard to synthesize by down-weighting those that are easy to synthesize, narrowing the gap in the frequency domain and improving the quality of the generated images. The effectiveness and efficiency of the network are verified qualitatively and quantitatively on fashion datasets, and extensive experiments demonstrate the superiority of our method.
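As a rough illustration of the focal-frequency idea described above (not the paper's exact implementation), the minimal PyTorch sketch below computes a loss in the spirit of the focal frequency loss of Jiang et al. (ICCV 2021): per-frequency errors between the spectra of the generated and real images are weighted by their own magnitude, so hard-to-synthesize frequency components dominate while easy ones are down-weighted. The function name focal_frequency_loss, the alpha exponent, and the per-image weight normalization are assumptions made for this sketch.

```python
# Minimal sketch of a focal-frequency-style loss (assumes PyTorch >= 1.8).
import torch


def focal_frequency_loss(fake: torch.Tensor, real: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """Frequency-domain distance with focal weighting.

    fake, real: image batches of shape (N, C, H, W).
    alpha: exponent controlling how strongly hard frequencies are emphasized.
    """
    # 2-D FFT per channel; orthonormal scaling keeps magnitudes comparable across image sizes.
    fake_freq = torch.fft.fft2(fake, norm="ortho")
    real_freq = torch.fft.fft2(real, norm="ortho")

    # Squared distance between the two spectra at every frequency.
    freq_dist = (fake_freq - real_freq).abs() ** 2

    # Dynamic weight matrix: frequencies with large error (hard to synthesize) get larger
    # weights, easy ones are down-weighted; detached so gradients flow only through freq_dist.
    weights = freq_dist.sqrt() ** alpha
    weights = weights / weights.amax(dim=(-2, -1), keepdim=True).clamp(min=1e-8)
    weights = weights.detach()

    return (weights * freq_dist).mean()
```

In practice such a term would be added to the usual adversarial and reconstruction losses with a small weighting factor; the exact balance used in the paper is not reproduced here.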







Data availability statement
The data used to support the findings of this study are available from the corresponding author upon request. No datasets were generated or analysed during the current study.
Funding
This work was supported by the National Natural Science Foundation of China (61772179), the Hunan Provincial Natural Science Foundation of China (2022JJ50016, 2023JJ50095), the Science and Technology Plan Project of Hunan Province (2016TP1020), the Scientific Research Fund of Hunan Provincial Education Department (21B0649), the Double First-Class University Project of Hunan Province (Xiangjiaotong [2018]469, [2020]248), and the Postgraduate Scientific Research Innovation Project of Hunan Province (CX20221285).
Author information
Authors and Affiliations
Contributions
All authors contributed to the study’s conception and design. Material preparation, data collection, and analysis were performed by Yue Chen, Xiaoman Liang, Mugang Lin, Yuan Qin and Huihuang Zhao. The first draft of the manuscript was written by Yue Chen and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Conflict of interest
All the authors declare that they have no competing financial interests or personal relationships that could influence the work reported in this paper.
Ethics approval statement
This study did not involve any humans or animals.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Chen, Y., Liang, X., Lin, M. et al. CDPT: context-driven omni-dimensional dynamic pose transfer network. SIViP 19, 376 (2025). https://doi.org/10.1007/s11760-025-03969-0