Abstract
Computer vision tasks often have side information available that helps solve the task. For example, in crowd counting, the camera perspective (e.g., camera angle and height) gives a clue about the appearance and scale of people in the scene. While side information has been shown to be useful for counting systems based on traditional hand-crafted features, it has not been fully utilized in deep-learning-based counting systems. To incorporate the available side information, we propose an adaptive convolutional neural network (ACNN), in which the convolution filter weights adapt to the current scene context via the side information. In particular, we model the filter weights as a low-dimensional manifold within the high-dimensional space of filter weights. The filter weights are generated by a learned “filter manifold” sub-network whose input is the side information. With the help of side information and adaptive weights, the ACNN can disentangle the variations related to the side information and extract discriminative features related to the current context (e.g., camera perspective, noise level, blur kernel parameters). We demonstrate the effectiveness of ACNN incorporating side information on three tasks: crowd counting, corrupted digit recognition, and image deblurring. Our experiments show that ACNN improves performance compared with a plain CNN with a similar number of parameters, and achieves performance similar to or better than the state of the art on the crowd counting task. Since existing crowd counting datasets do not contain ground-truth side information, we collect a new dataset with ground-truth camera angle and height as the side information. We also perform ablation experiments, mainly on crowd counting, to study the helpfulness of the side information and the effect of the placement of the adaptive convolutional layers, in order to gain insight into ACNNs.
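The core idea of the ACNN — a small “filter manifold” sub-network that maps the side information to convolution filter weights, which are then applied to the image — can be illustrated with a minimal NumPy sketch. This is not the paper’s implementation; the function names, layer sizes, and random initialization are illustrative assumptions, and a real ACNN would learn these parameters end-to-end within a deep network.

```python
import numpy as np

def filter_manifold(side_info, W1, b1, W2, b2):
    # Small MLP mapping side information (e.g., camera angle and height)
    # to a flattened set of convolution filter weights.
    h = np.maximum(0.0, side_info @ W1 + b1)  # ReLU hidden layer
    return h @ W2 + b2                        # flattened k*k filter weights

def adaptive_conv2d(image, side_info, params, k=3):
    # Generate the k x k filter from the side information, then apply a
    # plain "valid" 2-D convolution (no padding, stride 1).
    w = filter_manifold(side_info, *params).reshape(k, k)
    H, W = image.shape
    out = np.empty((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * w)
    return out

# Toy usage: side info is (camera angle, camera height); weights are
# randomly initialized here purely for demonstration.
rng = np.random.default_rng(0)
params = (rng.standard_normal((2, 16)), np.zeros(16),
          rng.standard_normal((16, 9)), np.zeros(9))
img = rng.standard_normal((8, 8))
out = adaptive_conv2d(img, np.array([30.0, 5.0]), params)
print(out.shape)  # → (6, 6)
```

The key difference from a standard convolutional layer is that the filter weights are not fixed parameters: they are a function of the side information, so the same layer produces different filters for different camera perspectives.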
Notes
The perspective value at a pixel location is proportional to the size of an object appearing at that location.
To reduce clutter, here we do not show the bias term for the convolution.
The mean absolute difference (MAD) between the density maps generated using the original perspective maps and our perspective maps is 0.475 on average, and [0.029, 0.818, 0.800, 0.597, 0.131] respectively on the five test scenes.
The MAD between the original density maps and those using single Gaussian kernels is 2.893 on average, and [0.582, 4.491, 1.946, 7.078, 0.368] respectively on the five test scenes (using our perspective map). This is because the ROI boundary cuts through the most crowded regions on scenes 2 and 4.
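The density maps compared in these notes are built by placing a Gaussian at each annotated head location, with the kernel bandwidth scaled by the perspective value so that each person integrates to one in the map. A minimal NumPy sketch of this standard recipe follows; the function name and the bandwidth constant `beta` are illustrative assumptions, not the paper’s exact settings.

```python
import numpy as np

def density_map(points, pmap, shape, beta=0.3):
    # points: list of (x, y) head annotations; pmap: perspective map of
    # the same spatial size as the output. Each person contributes a
    # normalized 2-D Gaussian whose width scales with perspective.
    H, W = shape
    ys, xs = np.mgrid[0:H, 0:W]
    D = np.zeros(shape)
    for (x, y) in points:
        sigma = beta * pmap[int(y), int(x)]   # perspective-scaled bandwidth
        g = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
        D += g / g.sum()                      # normalize to unit mass per person
    return D

# Toy usage: constant perspective, two annotated people.
pts = [(10.0, 12.0), (20.0, 18.0)]
pmap = np.full((32, 32), 4.0)
D = density_map(pts, pmap, (32, 32))
print(round(D.sum(), 3))  # → 2.0 (the crowd count)
```

Because each Gaussian is normalized to unit mass, integrating the density map recovers the person count, which is what makes mean absolute differences between density maps (as reported above) a meaningful comparison.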
CSRNet termed the first ten convolution layers from VGG the front-end, which is more commonly referred to as the back-end elsewhere.
On the clean MNIST dataset, the 2-conv and 4-conv CNN architectures achieve 0.81% and 0.69% error, while the current state-of-the-art is ~0.23% error (Ciresan et al. 2012).
References
Arteta, C., Lempitsky, V., Noble, J. A., & Zisserman, A. (2014). Interactive object counting. In ECCV
Burger, H. C., Schuler, C. J., & Harmeling, S. (2012). Image denoising: Can plain neural networks compete with BM3D? In CVPR
Chan, A. B., & Vasconcelos, N. (2009). Bayesian poisson regression for crowd counting. In ICCV
Chan, A. B., Liang, Z. S. J., & Vasconcelos, N. (2008). Privacy preserving crowd monitoring: Counting people without people models or tracking. In CVPR
Chan, A. B., & Vasconcelos, N. (2012). Counting people with low-level features and bayesian regression. IEEE Transactions on Image Processing, 21, 2160–2177.
Ciresan, D., Meier, U., & Schmidhuber, J. (2012). Multi-column deep neural networks for image classification. In CVPR
Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., & Wei, Y. (2017). Deformable convolutional networks. In CVPR
De Brabandere, B., Jia, X., Tuytelaars, T., & Van Gool, L. (2016). Dynamic filter networks. In NIPS
Dozat, T. (2015). Incorporating nesterov momentum into adam. Technical report, Stanford University. http://cs229.stanford.edu/proj2015/054report.pdf
Eigen, D., Krishnan, D., & Fergus, R. (2013). Restoring an image taken through a window covered with dirt or rain. In ICCV
Fiaschi, L., Nair, R., Koethe, U., & Hamprecht, F. (2012). Learning to count with regression forest and structured labels. In ICPR
Gharbi, M., Chaurasia, G., Paris, S., & Durand, F. (2016). Deep joint demosaicking and denoising. ACM Transactions on Graphics (TOG).
Ha, D., Dai, A., & Le, Q. V. (2017). HyperNetworks. In ICLR
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In CVPR
Hodosh, M., Young, P., & Hockenmaier, J. (2013). Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research, 47, 853–899.
Idrees, H., Saleemi, I., Seibert, C., & Shah, M. (2013). Multi-source multi-scale counting in extremely dense crowd images. In CVPR
Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML
Jaderberg, M., Simonyan, K., Zisserman, A., & Kavukcuoglu, K. (2015). Spatial transformer networks. In NIPS
Kang, D., & Chan, A. (2018). Crowd counting by adaptively fusing predictions from an image pyramid. In BMVC
Kang, D., Dhar, D., & Chan, A. (2017). Incorporating side information by adaptive convolution. In NIPS
Kang, D., Ma, Z., & Chan, A. B. (2018). Beyond counting: Comparisons of density maps for crowd analysis tasks–Counting, detection, and tracking. IEEE Transactions on Circuits and Systems for Video Technology, 29, 1408–1422.
Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980
Klein, B., Wolf, L., & Afek, Y. (2015). A dynamic convolutional layer for short range weather prediction. In CVPR
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In NIPS
Lempitsky, V., & Zisserman, A. (2010). Learning to count objects in images. In NIPS
Li, S., Liu, Z. Q., & Chan, A. B. (2015). Heterogeneous multi-task learning for human pose estimation with deep convolutional neural network. International Journal of Computer Vision.
Li, Y., Zhang, X., & Chen, D. (2018). CSRNet: Dilated convolutional neural networks for understanding the highly congested scenes. In CVPR
Liu, R., Li, Z., & Jia, J. (2008). Image partial blur detection and classification. In CVPR
Ma, Z., Yu, L., & Chan, A. B. (2015). Small instance detection by integer programming on object density maps. In CVPR
Maas, A. L., Hannun, A. Y., & Ng, A. Y. (2013). Rectifier nonlinearities improve neural network acoustic models. In ICML
Niu, Z., Zhou, M., Wang, L., Gao, X., & Hua, G. (2016). Ordinal regression with multiple output CNN for age estimation. In CVPR
Onoro-Rubio, D., & López-Sastre, R. J. (2016). Towards perspective-free object counting with deep learning. In ECCV
Pech-Pacheco, J. L., Cristóbal, G., Chamorro-Martinez, J., & Fernández-Valdivia, J. (2000). Diatom autofocusing in brightfield microscopy: A comparative study. In ICPR
Ren, W., Kang, D., Tang, Y., & Chan, A. (2017). Fusing crowd density maps and visual object trackers for people tracking in crowd scenes. In CVPR
Rodriguez, M., Laptev, I., Sivic, J., & Audibert, J. Y. Y. (2011). Density-aware person detection and tracking in crowds. In ICCV
Rothe, R., Timofte, R., & Van Gool, L. (2015). DEX: Deep expectation of apparent age from a single image. In ICCVW
Sam, D. B., Surya, S., & Babu, R. V. (2017). Switching convolutional neural network for crowd counting. In CVPR
Shi, J., Xu, L., & Jia, J. (2014). Discriminative blur detection features. In CVPR
Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In ICLR
Sindagi, V. A., & Patel, V. M. (2017). Generating high-quality crowd density maps using contextual pyramid CNNs. In ICCV
Sun, Y., Wang, X., & Tang, X. (2014). Deep learning face representation by joint identification-verification. In NIPS
Xu, L., Ren, J. S., Liu, C., & Jia, J. (2014). Deep convolutional neural network for image deconvolution. In NIPS
Zhang, C., Li, H., Wang, X., & Yang, X. (2015). Cross-scene crowd counting via deep convolutional neural networks. In CVPR
Zhang, Z., Luo, P., Loy, C. C., & Tang, X. (2014). Facial landmark detection by deep multi-task learning. In ECCV
Zhang, L., Shi, M., & Chen, Q. (2018). Crowd counting via scale-adaptive convolutional neural network. In WACV
Zhang, Y., Zhou, D., Chen, S., Gao, S., & Ma, Y. (2016). Single-image crowd counting via multi-column convolutional neural network. In CVPR
Acknowledgements
The work described in this paper was supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project Nos. T32-101/15-R and CityU 11212518). We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40 GPU used for this research.
Communicated by S. Soatto.
Cite this article
Kang, D., Dhar, D. & Chan, A.B. Incorporating Side Information by Adaptive Convolution. Int J Comput Vis 128, 2897–2918 (2020). https://doi.org/10.1007/s11263-020-01345-8