Abstract
As an emerging computer vision task, crowd localization has received increasing attention because it yields more spatially precise predictions. However, continuous scale variations in complex crowd scenes produce tiny individuals near the image edges, which existing methods fail to localize precisely. To alleviate this problem, we propose a novel Dilated Convolution-based Feature Refinement Network (DFRNet) that enhances representation learning capability. DFRNet is built with three branches that capture the information of each individual in a crowd scene more precisely. Specifically, we introduce a Feature Perception Module that models long-range contextual information at different scales through multiple dilated convolutions, providing sufficient feature information to perceive tiny individuals at the edges of images. A Feature Refinement Module is then deployed at multiple stages of the three branches to facilitate mutual refinement of feature information across scales, further improving the expressiveness of the multi-scale contextual information. With these modules, DFRNet locates individuals in complex scenes more precisely. Extensive experiments on multiple datasets demonstrate that the proposed method outperforms existing methods and adapts more accurately to complex crowd scenes.
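The idea of gathering context at several scales with multiple dilated convolutions can be illustrated with a minimal one-dimensional sketch in plain Python. It is not the DFRNet implementation: the function names, the shared kernel, and the dilation rates (1, 2, 3) are assumptions chosen for illustration, since the abstract does not specify them. The sketch relies only on the standard fact that a kernel of size k with dilation d spans k + (k - 1)(d - 1) input positions.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Input span covered by a dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def dilated_conv1d(signal, kernel, dilation=1):
    """'Valid' 1D dilated convolution (cross-correlation form, no padding)."""
    span = effective_kernel_size(len(kernel), dilation)
    out = []
    for start in range(len(signal) - span + 1):
        # Kernel taps are spaced `dilation` samples apart in the input.
        out.append(sum(w * signal[start + i * dilation]
                       for i, w in enumerate(kernel)))
    return out


def multi_dilation_features(signal, kernel, dilations=(1, 2, 3)):
    """Apply one kernel at several (assumed) dilation rates.

    Returns a dict mapping each dilation rate to its output sequence.
    """
    return {d: dilated_conv1d(signal, kernel, d) for d in dilations}


if __name__ == "__main__":
    x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
    for d, y in multi_dilation_features(x, [1, 1, 1]).items():
        print(d, effective_kernel_size(3, d), y)
```

Larger dilation rates widen the span seen by each output without adding kernel weights, which is why stacking or combining several rates is a common way to collect long-range context at multiple scales.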
Index Terms
- Dilated Convolution-based Feature Refinement Network for Crowd Localization