
Deep semantic-aware network for zero-shot visual urban perception

  • Original Article
International Journal of Machine Learning and Cybernetics

Abstract

Visual urban perception has recently attracted considerable research attention owing to its importance in many fields. Traditional methods for visual urban perception typically require collecting sufficient training instances for each newly added perceptual attribute. In this paper, we consider a novel formulation, zero-shot learning, to avoid this cumbersome curation. Based on the idea that images containing similar objects are more likely to share the same perceptual attribute, we learn a semantic correlation space formed by object-level semantic information and perceptual attributes. For newly added attributes, we synthesize their prototypes by transferring object vector representations shared between the unseen attributes and the training (seen) perceptual attributes. To this end, we propose a deep semantic-aware network for zero-shot visual urban perception. It is a new two-step zero-shot learning architecture, comprising a supervised visual urban perception step for the training attributes and a zero-shot prediction step for the unseen attributes. In the first step, we highlight the important role of semantic information and introduce it into a supervised deep visual urban perception framework for the training attributes. In the second step, we use visualization techniques to extract the correlations between semantic information and perceptual attributes from the well-trained supervised model, and we learn prototypes for the unseen attributes and the testing images to predict perception scores on the unseen attributes. Experimental results on a large-scale benchmark dataset validate the effectiveness of our method.
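To make the zero-shot step concrete, the following is a minimal sketch under our own assumptions: object categories are represented by word vectors, and correlation weights between an attribute and each object have already been mined from the trained supervised model via visualization. Every name here (attribute_prototype, perception_score, obj_vecs, and so on) is an illustrative assumption, not the paper's actual implementation.

    import numpy as np

    # Illustrative sketch of the zero-shot prediction step (assumed names).
    # obj_vecs: word2vec-style embeddings of object categories, shape (n, d).
    # corr:     correlation weights between one attribute and each object,
    #           as mined from the trained supervised model, shape (n,).

    def attribute_prototype(corr, obj_vecs):
        """Synthesize an attribute prototype as a correlation-weighted
        combination of object word vectors."""
        w = corr / (np.abs(corr).sum() + 1e-8)  # normalize the weights
        return w @ obj_vecs                     # (d,) prototype vector

    def perception_score(image_sem, proto):
        """Score a test image on an (unseen) attribute via cosine similarity
        between its object-semantic representation and the prototype."""
        denom = np.linalg.norm(image_sem) * np.linalg.norm(proto) + 1e-8
        return float(image_sem @ proto / denom)

    # Toy usage: 5 object categories with 50-dimensional word vectors.
    rng = np.random.default_rng(0)
    obj_vecs = rng.normal(size=(5, 50))
    unseen_corr = rng.random(5)                 # mined correlation weights
    proto = attribute_prototype(unseen_corr, obj_vecs)
    image_sem = rng.normal(size=50)             # image's semantic vector
    print(perception_score(image_sem, proto))

In this reading, transferring knowledge to an unseen attribute amounts to reusing the shared object vocabulary: the prototype lives in the same word-vector space as the image's object-semantic representation, so no visual training data for the unseen attribute is needed.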



Acknowledgements

This work was supported by the National Natural Science Foundation of China (62077033, 61876098, 61976123), the Shandong Provincial Natural Science Foundation Key Project (ZR2020KF015), the Major Scientific and Technological Innovation Projects of Shandong Province (2018CXGC1501), the Fostering Project of Dominant Discipline and Talent Team of Shandong Province Higher Education Institutions, the Taishan Young Scholars Program of Shandong Province, and the Key Development Program for Basic Research of Shandong Province (ZR2020ZD44).

Author information

Corresponding author

Correspondence to Chaoran Cui.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Zhang, C., Wu, T., Zhang, Y. et al. Deep semantic-aware network for zero-shot visual urban perception. Int. J. Mach. Learn. & Cyber. 13, 1197–1211 (2022). https://doi.org/10.1007/s13042-021-01401-w

