Abstract
Fisheye cameras have a large field of view, so they are widely used in scene monitoring, robot navigation, intelligent systems, virtual and augmented reality panoramas, and other fields. However, person detection with an overhead fisheye camera remains challenging because of the camera's unique radial geometry and barrel distortion, and generic object detection algorithms do not work well for person detection on fisheye panoramic images. Recent approaches either use radially aligned bounding boxes to detect persons or adapt anchor-based methods to predict rotated bounding boxes. However, these methods require additional hyperparameters (e.g., anchor boxes) and generalize poorly. To address this issue, we propose a novel model called Group Equivariant Transformer (GET), which uses a Transformer to directly regress bounding boxes and rotation angles. GET requires no additional hyperparameters and generalizes well. In GET, we use a Group Equivariant Convolutional Network (GECN) and a Multi-Scale Encoder Module (MEM) to extract multi-scale rotated embedding features of the overhead fisheye image for the Transformer, and we propose an embedding optimization loss to improve the diversity of these features. Finally, a Decoder Module (DM) decodes the rotated bounding box information from the embedding features. Extensive experiments on three benchmark fisheye camera datasets demonstrate that the proposed method achieves state-of-the-art performance.
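The abstract names a Group Equivariant Convolutional Network (GECN) as the feature extractor but gives no layer details. As a rough illustration of the underlying idea from Cohen and Welling (2016), the NumPy sketch below implements only a p4 "lifting" convolution: one filter is correlated with the image at four orientations (0°, 90°, 180°, 270°), which adds an orientation axis to the output. The function names `correlate_valid` and `p4_lift` are hypothetical; this is not the GET implementation, and a real GECN layer would add learned multi-channel filters, group convolutions on the lifted features, and nonlinearities.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def correlate_valid(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Plain 2-D cross-correlation with 'valid' padding (the kernel is not flipped)."""
    windows = sliding_window_view(image, kernel.shape)  # shape (H-k+1, W-k+1, k, k)
    return np.einsum("ijkl,kl->ij", windows, kernel)


def p4_lift(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Hypothetical p4 lifting layer: correlate with the kernel rotated by r * 90 degrees.

    Returns an array of shape (4, H-k+1, W-k+1); axis 0 indexes the rotation r.
    """
    assert kernel.shape[0] == kernel.shape[1], "a square kernel is assumed"
    return np.stack([correlate_valid(image, np.rot90(kernel, r)) for r in range(4)])


rng = np.random.default_rng(0)
image = rng.standard_normal((9, 9))
kernel = rng.standard_normal((3, 3))

features = p4_lift(image, kernel)
rotated_features = p4_lift(np.rot90(image), kernel)

# Equivariance: rotating the input by 90 degrees rotates every feature map by 90 degrees
# and cyclically shifts the orientation axis by one step.
expected = np.roll(np.rot90(features, axes=(1, 2)), 1, axis=0)

# Max-pooling over the orientation axis removes the cyclic shift, so the pooled map
# simply rotates along with the input.
pooled = features.max(axis=0)
rotated_pooled = rotated_features.max(axis=0)

print(features.shape, np.allclose(rotated_features, expected), np.allclose(rotated_pooled, np.rot90(pooled)))
# prints: (4, 7, 7) True True
```

The final check shows why such layers are attractive when people can appear at any orientation around the image centre, as in overhead fisheye views: a 90° rotation of the input changes the features in a predictable way rather than an arbitrary one. The p4 group only covers multiples of 90°, so intermediate angles are not handled exactly by this sketch.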
Acknowledgements
This work was supported in part by the Hainan Provincial Natural Science Foundation of China under Grant 620RC556, the National Natural Science Foundation of China under Grant 61961014 and the National Natural Science Foundation of China under Grant 62001289.
Ethics declarations
Competing Interests
The authors have no competing interests to declare that are relevant to the content of this article.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Chen, Y., Zhu, D., Li, N. et al. GET: group equivariant transformer for person detection of overhead fisheye images. Appl Intell 53, 24551–24565 (2023). https://doi.org/10.1007/s10489-023-04747-6
DOI: https://doi.org/10.1007/s10489-023-04747-6