Abstract
We present CityUPlaces, a new dataset of 17,771 images of campus buildings. It covers 9 major categories, which are further divided into 18 minor categories according to the internal and external scenes of each identity. The categories are imbalanced, with 344 to 1,539 images per category, and the images vary widely in angle, attractions, views, illumination, etc. Compared with existing large-scale datasets, the proposed dataset has two strengths: (1) it contains a moderate number of both indoor and outdoor images under different conditions for each identity, and its hierarchical categorization and reasonable size enable diverse real-time recognition tasks; (2) label noise is significantly reduced for each identity by dedicated annotation and filtering stages, which facilitates subsequent tasks. Together, these properties give great flexibility for performing vision-based tasks with different learning objectives in real time. Moreover, we propose a novel lightweight classification framework that outperforms state-of-the-art baselines on the dataset at relatively low computational cost, with fewer trainable parameters and floating-point operations (FLOPs), by using a coarse-to-fine learning strategy in a self-transfer manner. This result indirectly confirms the applicability of the new dataset. We also conduct experiments on the MIT Indoors and Paris datasets, where the proposed method still achieves superior performance, further validating its efficacy. The dataset and code will be made publicly available in the future.
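As a rough illustration of the two-level label hierarchy described above, the following Python sketch maps each of the 18 minor categories to one of the 9 major categories and aggregates fine-grained scores into coarse ones. The index layout (two scene types per building, stored consecutively), the scene names, and all function names are assumptions made for this example; they are not the dataset's actual annotation format or the authors' implementation.

```python
# Hypothetical label layout: each major category (building) has one indoor
# and one outdoor minor category, stored consecutively. This layout is an
# assumption for illustration, not the dataset's published format.
NUM_MAJOR = 9                         # major categories, from the abstract
SCENES = ("indoor", "outdoor")        # assumed scene split
NUM_MINOR = NUM_MAJOR * len(SCENES)   # 18 minor categories


def minor_index(major: int, scene: str) -> int:
    """Return the minor-category index for a (major, scene) pair."""
    if not 0 <= major < NUM_MAJOR:
        raise ValueError(f"major must be in [0, {NUM_MAJOR})")
    return major * len(SCENES) + SCENES.index(scene)


def major_of(minor: int) -> int:
    """Return the major category that a minor category belongs to."""
    if not 0 <= minor < NUM_MINOR:
        raise ValueError(f"minor must be in [0, {NUM_MINOR})")
    return minor // len(SCENES)


def coarse_scores(fine_scores):
    """Sum the scores of the minor categories under each major category."""
    if len(fine_scores) != NUM_MINOR:
        raise ValueError(f"expected {NUM_MINOR} fine scores")
    step = len(SCENES)
    return [sum(fine_scores[m * step:(m + 1) * step]) for m in range(NUM_MAJOR)]
```

Under this assumed layout, a model trained on the 18 minor labels can also be scored on the 9 major labels by passing its per-class scores through `coarse_scores`, which is one simple way to relate the coarse and fine levels.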

Acknowledgements
This work is supported by the National Natural Science Foundation of China under Grant no. 62001341 and the National Natural Science Foundation of Jiangsu Province under Grant no. BK20221379.
Author information
Contributions
HW and GW wrote the main text and designed the experiments. HW, JH, and SX collected the dataset, processed the images, and carried out the experiments. HW, JH, SX, and SZ prepared the figures and tables. GW and YL guided the direction of the paper. All authors reviewed the manuscript.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Wu, H., Wu, G., Hu, J. et al. CityUPlaces: a new dataset for efficient vision-based recognition. J Real-Time Image Proc 20, 109 (2023). https://doi.org/10.1007/s11554-023-01369-6
DOI: https://doi.org/10.1007/s11554-023-01369-6