Abstract
Image aesthetic assessment, a popular research problem in computational aesthetics, has many important applications in image editing, retrieval, and recommendation. However, mainstream CNN-based methods struggle to capture the global aesthetic attributes of images. To this end, we propose a two-stream image aesthetic assessment model that couples Transformer and CNN features. In the first stream, a conventional CNN extracts the image's local aesthetic features; in the second, a superpixel algorithm segments the image and the resulting regions are fed into a Transformer network to learn the image's global aesthetic features. Finally, the features learned by the Transformer and the CNN are fused to produce the aesthetic assessment. Experimental results on the AVA dataset show that the proposed method captures both local and global aesthetic information, which enables the model to learn richer aesthetic cues; this combination of whole and part also aligns better with how humans judge aesthetics. The proposed model achieves an accuracy of 84.5% on the classification task, the best performance among compared methods, and performs well on the other two tasks (score regression and distribution prediction).
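The two-stream pipeline described in the abstract can be sketched in miniature with NumPy. This is an illustrative toy, not the paper's implementation: a regular grid partition stands in for the SLIC superpixel segmentation, patch mean-pooling stands in for the CNN local stream, and a single random-projection self-attention pass stands in for the Transformer global stream; all function names and dimensions are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def cnn_stream(image, k=3):
    """Toy 'local' stream: mean-pool k x k patches as a stand-in for CNN features."""
    h, w, _ = image.shape
    feats = [image[i:i + k, j:j + k].mean(axis=(0, 1))
             for i in range(0, h - k + 1, k)
             for j in range(0, w - k + 1, k)]
    return np.concatenate(feats)  # flat local feature vector

def grid_superpixels(image, n=4):
    """Stand-in for superpixel segmentation: split the image into an n x n grid."""
    h, w, _ = image.shape
    return [image[i * h // n:(i + 1) * h // n, j * w // n:(j + 1) * w // n]
            for i in range(n) for j in range(n)]

def transformer_stream(regions, d=8):
    """Toy 'global' stream: one self-attention pass over region embeddings."""
    tokens = np.stack([r.mean(axis=(0, 1)) for r in regions])  # (n_regions, channels)
    W = rng.standard_normal((tokens.shape[1], d)) * 0.1        # random projection
    x = tokens @ W                                             # (n_regions, d)
    attn = np.exp(x @ x.T)
    attn /= attn.sum(axis=1, keepdims=True)                    # softmax attention weights
    return (attn @ x).mean(axis=0)                             # pooled global feature

# Fuse the two streams and map to a scalar aesthetic score in (0, 1).
image = rng.random((24, 24, 3))
fused = np.concatenate([cnn_stream(image),
                        transformer_stream(grid_superpixels(image))])
score = 1.0 / (1.0 + np.exp(-(fused @ (rng.standard_normal(fused.size) * 0.01))))
```

In the actual model, the fused feature would instead feed a learned head trained for classification, score regression, or distribution prediction on AVA; the sketch only shows how local patch features and attention-pooled region features can be combined into one representation.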







Author information
Contributions
Yongzhen Ke: Conceptualization, Methodology, Supervision, Project administration. Yin Wang: Methodology, Software, Writing - Original Draft, Writing - Review & Editing. Kai Wang: Methodology, Software, Writing - Original Draft. Fan Qin: Validation, Writing - Review & Editing. Jing Guo: Writing - Review & Editing, Formal analysis, Visualization. Shuai Yang: Resources, Validation, Data Curation.
Ethics declarations
Competing interests
The authors declare no competing interests.
Conflict of interest
The authors have no financial or proprietary interests in any material discussed in this article.
Additional information
Communicated by B. Bao.
About this article
Cite this article
Ke, Y., Wang, Y., Wang, K. et al. Image aesthetics assessment using composite features from transformer and CNN. Multimedia Systems 29, 2483–2494 (2023). https://doi.org/10.1007/s00530-023-01141-7