Deep semantic-aware network for zero-shot visual urban perception

Zhang, Chunyun; Wu, Tianze; Zhang, Yunfeng; Zhao, Baolin; Wang, Tingwen; Cui, Chaoran; Yin, Yilong

doi:10.1007/s13042-021-01401-w

Original Article
Published: 17 August 2021

Volume 13, pages 1197–1211 (2022)

International Journal of Machine Learning and Cybernetics

Chunyun Zhang¹,
Tianze Wu²,
Yunfeng Zhang¹,
Baolin Zhao³,
Tingwen Wang¹,
Chaoran Cui¹ (ORCID: orcid.org/0000-0003-3332-1348) &
Yilong Yin⁴

Abstract

Visual urban perception has recently attracted considerable research attention owing to its importance in many fields. Traditional methods for visual urban perception mostly require collecting adequate training instances for each newly added perception attribute. In this paper, we consider a novel formulation, zero-shot learning, to remove the need for this cumbersome curation. Based on the idea that different images containing similar objects are more likely to possess the same perceptual attribute, we learn the semantic correlation space formed by object semantic information and perceptual attributes. For newly added attributes, we attempt to synthesize their prototypes by transferring similar object vector representations between the unseen attributes and the training (seen) perceptual attributes. For this purpose, we leverage a deep semantic-aware network model for zero-shot visual urban perception. It is a new two-step zero-shot learning architecture, comprising a supervised visual urban perception step for training attributes and a zero-shot prediction step for unseen attributes. In the first step, we highlight the important role of semantic information and introduce it into a supervised deep visual urban perception framework for the training attributes. In the second step, we use visualization techniques to obtain the correlations between semantic information and visual perception attributes from the well-trained supervised model, and learn the prototypes of unseen attributes and test images to predict perception scores on unseen attributes. Experimental results on a large-scale benchmark dataset validate the effectiveness of our method.
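
The abstract does not spell out how prototypes are synthesized, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' model. It assumes each seen attribute already has a prototype (a vector of weights over object categories) and a name embedding, and it builds an unseen attribute's prototype as a similarity-weighted average of the seen prototypes, in the spirit of convex-combination zero-shot methods. All names, vectors, and the toy object list are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def synthesize_prototype(unseen_vec, seen_vecs, seen_protos):
    """Assumed rule: weight each seen prototype by the non-negative cosine
    similarity between attribute embeddings, normalized to sum to 1."""
    sims = {name: max(cosine(unseen_vec, vec), 0.0) for name, vec in seen_vecs.items()}
    total = sum(sims.values())
    if total == 0.0:
        raise ValueError("unseen attribute is not similar to any seen attribute")
    dim = len(next(iter(seen_protos.values())))
    proto = [0.0] * dim
    for name, sim in sims.items():
        for i, weight in enumerate(seen_protos[name]):
            proto[i] += (sim / total) * weight
    return proto


def perception_score(image_objects, proto):
    """Score an image as the dot product of its object vector and a prototype."""
    return sum(a * b for a, b in zip(image_objects, proto))


# Hypothetical toy data: object categories are [tree, car, wall].
seen_vecs = {"safe": [1.0, 0.0], "boring": [0.0, 1.0]}   # attribute embeddings
seen_protos = {"safe": [0.8, 0.2, -0.4], "boring": [-0.2, 0.0, 0.9]}

unseen_vec = [1.0, 1.0]              # embedding of a made-up unseen attribute
proto = synthesize_prototype(unseen_vec, seen_vecs, seen_protos)
image = [1.0, 0.0, 1.0]              # image containing a tree and a wall
score = perception_score(image, proto)
```

Clipping negative similarities and normalizing the rest keeps the synthesized prototype inside the convex hull of the seen prototypes; that is a design choice of this sketch rather than something the abstract states.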

Acknowledgements

This work is supported by the National Natural Science Foundation of China (62077033, 61876098, 61976123), the Shandong Provincial Natural Science Foundation Key Project (ZR2020KF015), the Major Scientific and Technological Innovation Projects of Shandong Province (2018CXGC1501), the Fostering Project of Dominant Discipline and Talent Team of Shandong Province Higher Education Institutions, the Taishan Young Scholars Program of Shandong Province, and the Key Development Program for Basic Research of Shandong Province (ZR2020ZD44).

Author information

Authors and Affiliations

School of Computer Science and Technology, Shandong University of Finance and Economics, Jinan, China
Chunyun Zhang, Yunfeng Zhang, Tingwen Wang & Chaoran Cui
School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan, China
Tianze Wu
Inspur Information Technology Co., Ltd., Jinan, China
Baolin Zhao
School of Software, Shandong University, Jinan, China
Yilong Yin

Corresponding author

Correspondence to Chaoran Cui.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Zhang, C., Wu, T., Zhang, Y. et al. Deep semantic-aware network for zero-shot visual urban perception. Int. J. Mach. Learn. & Cyber. 13, 1197–1211 (2022). https://doi.org/10.1007/s13042-021-01401-w

Received: 01 December 2020
Accepted: 22 July 2021
Published: 17 August 2021
Issue Date: May 2022
DOI: https://doi.org/10.1007/s13042-021-01401-w
