From Individual to Whole: Reducing Intra-class Variance by Feature Aggregation

International Journal of Computer Vision

Abstract

The recording process of an observation is influenced by many factors, such as viewpoint, illumination, and the state of the object of interest. Thus, images of the same object may vary considerably under different conditions. This leads to severe intra-class variance, which greatly challenges the discrimination ability of a vision model. However, the prevailing softmax loss for visual recognition only pursues perfect inter-class separation in the feature space. Without considering intra-class compactness, the learned model easily collapses when it encounters instances that deviate far from their class centroid. To resist intra-class variance, we start by organizing the input instances as a graph. From this viewpoint, we find that the normalized cut on the graph is a favorable surrogate metric for the intra-class variance within a training batch. Inspired by the equivalence between the normalized cut and random walks, we propose a feature aggregation scheme that uses transition probabilities as guidance. By imposing supervision on the aggregated features, we constrain the transition probabilities to form a graph partition consistent with the given labels. Thus, the normalized cut, and with it the intra-class variance, is well suppressed. To validate the effectiveness of this idea, we instantiate it in spatial, temporal, and spatial-temporal scenarios. Experimental results on the corresponding benchmarks demonstrate that the proposed feature aggregation leads to significant performance improvements. Our method is on par with, or better than, the current state of the art on these tasks.
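
To make the aggregation step concrete, the sketch below shows one way it could look in PyTorch. This is a minimal illustration of the idea described in the abstract, not the authors' released implementation: the dot-product affinity, the temperature parameter, and the function name are our own assumptions.

```python
import torch
import torch.nn.functional as F

def aggregate_features(x: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Aggregate batch features via random-walk transition probabilities.

    A hypothetical sketch: build a fully connected similarity graph over
    the batch, row-normalize it into a transition matrix P, and replace
    each feature with the expectation of one random-walk step,
    x_i' = sum_j P_ij x_j.
    """
    # Pairwise affinities from dot-product similarity (batch x batch).
    sim = x @ x.t() / temperature
    # Row-wise softmax turns similarities into transition probabilities;
    # each row of P sums to 1, so P is a stochastic matrix.
    p = F.softmax(sim, dim=1)
    # One-step aggregation: every feature becomes a probability-weighted
    # mixture of all features in the batch.
    return p @ x

# Example usage with dummy backbone features.
feats = torch.randn(32, 512)     # a batch of 32 feature vectors
agg = aggregate_features(feats)  # same shape: (32, 512)
```

Because each row of P sums to one, supervising the aggregated features with the ordinary softmax loss penalizes probability mass that leaks across class boundaries, which is the transition-probability view of the normalized cut outlined above.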

Notes

  1. Here, the softmax loss refers to the combination of the softmax activation function and the cross-entropy loss.
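
     Concretely, for a feature $x$ with ground-truth class $y$, classifier weights $w_1, \dots, w_C$, and biases $b_1, \dots, b_C$, this combination is the standard form

     $$\mathcal{L}_{\text{softmax}} = -\log \frac{\exp(w_{y}^{\top} x + b_{y})}{\sum_{j=1}^{C} \exp(w_{j}^{\top} x + b_{j})}.$$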

Acknowledgements

This work was supported in part by the Major Project for New Generation of AI (No. 2018AAA0100400), the National Natural Science Foundation of China (No. 61836014, No. U21B2042, No. 61773375, No. 62006231, No. 62072457), and the National Youth Talent Support Program.

Author information

Corresponding author

Correspondence to Zhaoxiang Zhang.

Additional information

Communicated by Jifeng Dai.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Zhang, Z., Luo, C., Wu, H. et al. From Individual to Whole: Reducing Intra-class Variance by Feature Aggregation. Int J Comput Vis 130, 800–819 (2022). https://doi.org/10.1007/s11263-021-01569-2
