Abstract
Visual urban perception has recently attracted considerable research attention owing to its importance in many fields. Traditional methods for visual urban perception mostly require collecting adequate training instances for each newly added perception attribute. In this paper, we consider a novel formulation, zero-shot learning, to remove this cumbersome curation step. Based on the idea that different images containing similar objects are more likely to possess the same perceptual attribute, we learn the semantic correlation space formed by object semantic information and perceptual attributes. For newly added attributes, we attempt to synthesize their prototypes by transferring similar object vector representations between the unseen attributes and the training (seen) perceptual attributes. For this purpose, we leverage a deep semantic-aware network as the zero-shot visual urban perception model. It is a new two-step zero-shot learning architecture, which includes a supervised visual urban perception step for training attributes and a zero-shot prediction step for unseen attributes. In the first step, we highlight the important role of semantic information and introduce it into a supervised deep visual urban perception framework for training attributes. In the second step, we use visualization techniques to obtain the correlations between semantic information and visual perception attributes from the well-trained supervised model, and learn the prototypes of unseen attributes and testing images to predict perception scores on unseen attributes. Experimental results on a large-scale benchmark dataset validate the effectiveness of our method.
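To make the prototype-transfer idea concrete, the following is a minimal sketch and not the authors' network. It assumes that each seen attribute already has a prototype expressed as weights over object classes, that each attribute name has some word-vector representation, and that an unseen prototype can be approximated by a similarity-weighted average of the seen prototypes. The function names, the weighting rule, and the cosine-based score are all illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def synthesize_prototype(unseen_vec, seen_vecs, seen_prototypes):
    """Approximate an unseen-attribute prototype over object classes.

    Assumed transfer rule (hypothetical): weight each seen prototype by the
    non-negative cosine similarity between the unseen attribute's vector and
    the seen attribute's vector, then normalise the weights to sum to one.
    """
    sims = {name: max(cosine(unseen_vec, vec), 0.0) for name, vec in seen_vecs.items()}
    total = sum(sims.values())
    dim = len(next(iter(seen_prototypes.values())))
    if total == 0:
        return [0.0] * dim  # no seen attribute is similar enough to transfer from
    return [
        sum(sims[name] * seen_prototypes[name][i] for name in sims) / total
        for i in range(dim)
    ]


def perception_score(image_objects, prototype):
    """Score an image by how well its object distribution matches the prototype."""
    return cosine(image_objects, prototype)


# Toy data, invented for illustration only.
seen_vecs = {"safe": [1.0, 0.0], "lively": [0.0, 1.0]}      # attribute word vectors
seen_prototypes = {"safe": [0.8, 0.1, 0.1],                 # weights over 3 object classes
                   "lively": [0.2, 0.7, 0.1]}
unseen_vec = [1.0, 1.0]                                     # vector of an unseen attribute

prototype = synthesize_prototype(unseen_vec, seen_vecs, seen_prototypes)
score = perception_score([0.9, 0.05, 0.05], prototype)
```

In this toy setting the unseen attribute is equally similar to both seen attributes, so its prototype is simply the mean of the two seen prototypes. A real system would obtain the object distributions and prototypes from a trained model rather than from hand-written lists.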
Acknowledgements
This work is supported by the National Natural Science Foundation of China (62077033, 61876098, 61976123), the Shandong Provincial Natural Science Foundation Key Project (ZR2020KF015), the Major Scientific and Technological Innovation Projects of Shandong Province (2018CXGC1501), the Fostering Project of Dominant Discipline and Talent Team of Shandong Province Higher Education Institutions, the Taishan Young Scholars Program of Shandong Province, and the Key Development Program for Basic Research of Shandong Province (ZR2020ZD44).
About this article
Cite this article
Zhang, C., Wu, T., Zhang, Y. et al. Deep semantic-aware network for zero-shot visual urban perception. Int. J. Mach. Learn. & Cyber. 13, 1197–1211 (2022). https://doi.org/10.1007/s13042-021-01401-w
DOI: https://doi.org/10.1007/s13042-021-01401-w