Abstract
Most human synthesis schemes run on high-performance servers, which limits the interactive experience on mobile devices. Viewing human synthesis results directly on a smartphone makes the interaction more immediate and improves the user experience. This paper proposes a smart frame selection network (SFSN) that runs on mobile devices to reduce the traffic between smartphones and the cloud. We leverage an attention-based relation model that captures the relationship between each frame and the entire video, selecting important frames more accurately and thereby reducing both traffic and computation. In addition, we build a multi-task human synthesis system on top of SFSN that handles generation tasks such as background changing, pose transfer, and virtual try-on in a unified framework. Evaluation results indicate that the proposed approach reduces the number of frames to be processed by more than 42.2%.
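To make the frame-selection idea concrete, the following PyTorch sketch shows one plausible form of an attention-based selector that scores each frame by its relation to a global video embedding. This is a minimal illustration, not the authors' SFSN: the backbone features, feature dimension (1280, as in a MobileNetV2 output), the dot-product attention form, and the 0.5 keep threshold are all assumptions made here for concreteness.

import torch
import torch.nn as nn

class FrameSelector(nn.Module):
    """Score each frame by relating its features to a global video embedding.

    Hypothetical sketch: the paper's SFSN details (backbone, attention form,
    training loss) are not given in the abstract, so the modules and shapes
    here are illustrative assumptions.
    """
    def __init__(self, feat_dim: int = 1280, hidden: int = 256):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden)   # per-frame projection
        self.query = nn.Linear(hidden, hidden)    # frame-side transform
        self.key = nn.Linear(hidden, hidden)      # video-side transform

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (T, feat_dim) features from a lightweight mobile backbone
        h = self.proj(frame_feats)                          # (T, hidden)
        video = h.mean(dim=0, keepdim=True)                 # global video embedding, (1, hidden)
        q = self.query(h)                                   # (T, hidden)
        k = self.key(video)                                 # (1, hidden)
        scores = (q * k).sum(dim=-1) / h.size(-1) ** 0.5    # relation of each frame to the whole video
        return torch.sigmoid(scores)                        # per-frame importance in [0, 1]

# Usage: keep only frames whose importance exceeds a threshold, so only
# that subset is uploaded to the cloud for synthesis.
if __name__ == "__main__":
    feats = torch.randn(30, 1280)                  # features for 30 frames
    selector = FrameSelector()
    importance = selector(feats)
    keep = importance > 0.5
    print(f"selected {int(keep.sum())} of {feats.size(0)} frames")

Under this reading, the reported 42.2% reduction corresponds to the fraction of frames whose importance falls below the threshold and which are therefore never transmitted or synthesized.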
Acknowledgements
This work was partially supported by the National Science Fund for Distinguished Young Scholars (62025205) and the Natural Science Basic Research Program of Shaanxi (Program No. 2022JQ-623).
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, B., Feng, X., Qiu, C. et al. SFSN: smart frame selection network for multi-task human synthesis on mobile devices. Wireless Netw 30, 4655–4668 (2024). https://doi.org/10.1007/s11276-022-03112-8