Abstract
An increasing number of deep learning methods are being applied to quantify the perception of urban environments, study the relationship between urban appearance and resident safety, and improve urban appearance. Most state-of-the-art methods extract feature representations from street-level images with conventional visual computing algorithms or deep convolutional neural networks and then predict perception results directly from those features. Unfortunately, these methods process color and texture information jointly, even though color and texture are primary image features that affect human perception and judgment in different ways. We argue that color and texture should be handled separately; therefore, we formulate an end-to-end learning methodology that processes each input image according to its color and texture information before feeding it into the neural network. The two processed images and the original image constitute the three input streams of the triad attention ranking convolutional neural network (AR-CNN) proposed in this study. In accordance with the color and texture aspects, we also propose an improved attention mechanism in the convolution layer. Our objective is to estimate human scores of urban appearance from the pairwise comparison results predicted by the AR-CNN model.
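The abstract outlines a three-stream ranking architecture, which the following minimal PyTorch sketch makes concrete. Everything below is an illustrative assumption rather than the authors' implementation: the color and texture streams are taken to be preprocessed 3-channel image maps (e.g., an HSV rendering and Gabor filter responses), the improved attention mechanism is stood in for by a simple squeeze-and-excitation-style channel attention, and the pairwise comparisons are trained with a margin hinge loss. All module names and hyperparameters (ChannelAttention, StreamEncoder, TriadRankingNet, out_dim=128) are invented for this sketch.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    # Squeeze-and-excitation-style reweighting of feature channels: a stand-in
    # for the paper's improved attention mechanism, whose details are not
    # specified in the abstract.
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))           # global average pool -> channel weights
        return x * w.unsqueeze(-1).unsqueeze(-1)  # broadcast weights over H x W

class StreamEncoder(nn.Module):
    # One convolutional stream; the same encoder is instantiated for the
    # original image, the color map, and the texture map (all assumed 3-channel).
    def __init__(self, out_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            ChannelAttention(64),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, out_dim)

    def forward(self, x):
        return self.proj(self.features(x).flatten(1))

class TriadRankingNet(nn.Module):
    # Fuses the three input streams and maps them to a scalar perception score.
    def __init__(self):
        super().__init__()
        self.orig, self.color, self.texture = (StreamEncoder() for _ in range(3))
        self.score = nn.Linear(128 * 3, 1)

    def forward(self, img, color_map, texture_map):
        feats = torch.cat([self.orig(img), self.color(color_map),
                           self.texture(texture_map)], dim=1)
        return self.score(feats).squeeze(1)

def pairwise_ranking_loss(score_win, score_lose, margin=1.0):
    # Hinge loss over a human pairwise comparison: the image judged better
    # should outscore the other by at least `margin`.
    return F.relu(margin - (score_win - score_lose)).mean()

In use, each human judgment "image i looks better than image j" would contribute pairwise_ranking_loss(model(img_i, color_i, texture_i), model(img_j, color_j, texture_j)) to training; per-image appearance scores can then be aggregated from the model's predicted pairwise outcomes.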
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (Grant No. 62032020), in part by the Hunan Science and Technology Planning Project (Grant No. 2019RS3019), in part by the Hunan Provincial Natural Science Foundation of China for Distinguished Young Scholars (Grant No. 2018JJ1025), and in part by the Guangzhou Research Project (Grant No. 201902010037).
Cite this article
Li, Z., Chen, Z., Zheng, WS. et al. AR-CNN: an attention ranking network for learning urban perception. Sci. China Inf. Sci. 65, 112104 (2022). https://doi.org/10.1007/s11432-019-2899-9