RGB-IR Person Re-identification by Cross-Modality Similarity Preservation

International Journal of Computer Vision

Abstract

Person re-identification (Re-ID) is an important problem in video surveillance: matching pedestrian images across non-overlapping camera views. Most existing work focuses on RGB-based Re-ID. However, RGB images are poorly suited to dark environments, so infrared (IR) imaging becomes necessary for indoor scenes with low lighting and for 24-hour outdoor surveillance systems. In such scenarios, matching must be performed between RGB and IR images, which exhibit very different visual characteristics; this cross-modality matching problem is more challenging than RGB-based Re-ID because IR images lack visible colour information. To address this challenge, we study the RGB-IR cross-modality Re-ID (RGB-IR Re-ID) problem. Rather than applying existing cross-modality matching models, which assume identical data distributions between training and testing sets, to handle the discrepancy between the RGB and IR modalities, we cast learning shared knowledge for cross-modality matching as a problem of cross-modality similarity preservation. We exploit same-modality similarity as a constraint to guide the learning of cross-modality similarity while suppressing modality-specific information, and propose a Focal Modality-Aware Similarity-Preserving Loss. To further help the feature extractor capture shared knowledge, we design a modality-gated node as a universal representation of both modality-specific and shared structures, and use it to construct a structure-learnable feature extractor called the Modality-Gated Extractor. For validation, we construct a new multi-modality Re-ID dataset, SYSU-MM01, to enable wider study of this problem. Extensive experiments on SYSU-MM01 demonstrate the effectiveness of our method. The dataset is available at https://github.com/wuancong/SYSU-MM01.

References

  • Ahmed, E., Jones, M., & Marks, T. K. (2015). An improved deep learning architecture for person re-identification. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 3908–3916).

  • Andrew, G., Arora, R., Bilmes, J., & Livescu, K. (2013). Deep canonical correlation analysis. In International conference on machine learning (ICML) (pp. 1247–1255).

  • Bak, S., & Carr, P. (2017). One-shot metric learning for person re-identification. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1571–1580).

  • Castrejon, L., Aytar, Y., Vondrick, C., Pirsiavash, H., & Torralba, A. (2016). Learning aligned cross-modal representations from weakly aligned data. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2940–2949).

  • Chen, D., Xu, D., Li, H., Sebe, N., & Wang, X. (2018). Group consistent similarity learning via deep CRF for person re-identification. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 8649–8658).

  • Chen, D., Yuan, Z., Hua, G., Zheng, N., & Wang, J. (2015). Similarity learning on an explicit polynomial kernel feature map for person re-identification. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1565–1573).

  • Chen, J., Wang, Y., Qin, J., Liu, L., & Shao, L. (2017). Fast person re-identification via cross-camera semantic binary transformation. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 5330–5339).

  • Chen, Y. C., Zheng, W. S., & Lai, J. (2015). Mirror representation for modeling view-specific transform in person re-identification. In International joint conferences on artificial intelligence (IJCAI).

  • Chen, Y. C., Zheng, W. S., Lai, J. H., & Yuen, P. (2017). An asymmetric distance model for cross-view feature mapping in person reidentification. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 27(8), 1661–1675.

  • Chen, Y. C., Zhu, X., Zheng, W. S., & Lai, J. H. (2018). Person re-identification by camera correlation aware feature augmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 40(2), 392–408.

  • Dai, P., Ji, R., Wang, H., Wu, Q., & Huang, Y. (2018). Cross-modality person re-identification with generative adversarial training. In International joint conferences on artificial intelligence (IJCAI).

  • Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 886–893).

  • Dong, S. C., Cristani, M., Stoppa, M., Bazzani, L., & Murino, V. (2011). Custom pictorial structures for re-identification. In British machine vision conference (BMVC) (pp. 68.1–68.11).

  • Farenzena, M., Bazzani, L., Perina, A., Murino, V., & Cristani, M. (2010). Person re-identification by symmetry-driven accumulation of local features. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2360–2367).

  • Feng, F., Wang, X., & Li, R. (2014). Cross-modal retrieval with correspondence autoencoder. In ACM international conference on multimedia (ACMMM) (pp. 7–16).

  • Gray, D., Brennan, S., & Tao, H. (2007). Evaluating appearance models for recognition, reacquisition, and tracking. IEEE International Workshop on Performance Evaluation for Tracking and Surveillance, 3, 1–7.

  • Gray, D., & Tao, H. (2008). Viewpoint invariant pedestrian recognition with an ensemble of localized features. In European conference on computer vision (ECCV) (pp. 262–275).

  • Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., & Smola, A. (2012). A kernel two-sample test. Journal of Machine Learning Research (JMLR), 13, 723–773.

  • Guo, C. C., Chen, S. Z., Lai, J. H., Hu, X. J., & Shi, S. C. (2014). Multi-shot person re-identification with automatic ambiguity inference and removal. In International conference on pattern recognition (ICPR) (pp. 3540–3545).

  • Haque, A., Alahi, A., & Li, F. F. (2016). Recurrent attention models for depth-based person identification. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1229–1238).

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 770–778).

  • He, R., Wu, X., Sun, Z., & Tan, T. (2017). Learning invariant deep representation for NIR-VIS face recognition. In Association for Advancement of Artificial Intelligence (AAAI).

  • He, R., Wu, X., Sun, Z., & Tan, T. N. (2019). Wasserstein CNN: Learning invariant features for NIR-VIS face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 41(7), 1761–1773.

  • Hirzer, M., Beleznai, C., Roth, P. M., & Bischof, H. (2011). Person re-identification by descriptive and discriminative classification. In Scandinavian conference on image analysis (pp. 91–102).

  • Hu, J., Shen, L., Albanie, S., Sun, G., & Wu, E. (2018). Squeeze-and-excitation networks. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 7132–7141).

  • Jing, X. Y., Zhu, X., Wu, F., You, X., Liu, Q., Yue, D., Hu, R., & Xu, B. (2015). Super-resolution person re-identification with semi-coupled low-rank discriminant dictionary learning. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 695–704).

  • Jungling, K., & Arens, M. (2010). Local feature based person reidentification in infrared image sequences. In IEEE international conference on advanced video and signal-based surveillance (AVSS) (pp. 448–455).

  • Kan, M., Shan, S., & Chen, X. (2016). Multi-view deep network for cross-view classification. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 4847–4855).

  • Karanam, S., Li, Y., & Radke, R. J. (2015). Person re-identification with discriminatively trained viewpoint invariant dictionaries. In IEEE international conference on computer vision (ICCV) (pp. 4516–4524).

  • Kodirov, E., Xiang, T., Fu, Z., & Gong, S. (2016). Person re-identification by unsupervised \(\ell _1\) graph learning. In European conference on computer vision (ECCV) (pp. 178–195).

  • Köstinger, M., Hirzer, M., Wohlhart, P., Roth, P. M., & Bischof, H. (2012). Large scale metric learning from equivalence constraints. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2288–2295).

  • Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Annual conference on neural information processing systems (NeurIPS) (pp. 1097–1105).

  • Kviatkovsky, I., Adam, A., & Rivlin, E. (2013). Color invariants for person reidentification. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 35(7), 1622–1634.

  • Lei, Z., & Li, S. Z. (2009). Coupled spectral regression for matching heterogeneous faces. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1123–1128).

  • Li, M., Zhu, X., & Gong, S. (2018). Unsupervised person re-identification by deep learning tracklet association. In European conference on computer vision (ECCV) (pp. 737–753).

  • Li, W., Zhao, R., & Wang, X. (2012). Human reidentification with transferred metric learning. In Asian confererence on computer vision (ACCV) (pp. 31–44).

  • Li, W., Zhao, R., Xiao, T., & Wang, X. (2014). Deepreid: Deep filter pairing neural network for person re-identification. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 152–159).

  • Li, X., Zheng, W. S., Wang, X., Xiang, T., & Gong, S. (2015). Multi-scale learning for low-resolution person re-identification. In IEEE international conference on computer vision (ICCV) (pp. 3765–3773).

  • Li, Z., Chang, S., Liang, F., Huang, T. S., Cao, L., & Smith, J. R. (2013). Learning locally-adaptive decision functions for person verification. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 3610–3617).

  • Liao, S., Hu, Y., Zhu, X., & Li, S. Z. (2015). Person re-identification by local maximal occurrence representation and metric learning. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2197–2206).

  • Liao, S., & Li, S. Z. (2015). Efficient PSD constrained asymmetric metric learning for person re-identification. In IEEE international conference on computer vision (ICCV) (pp. 3685–3693).

  • Lin, D., & Tang, X. (2006). Inter-modality face recognition. In European conference on computer vision (ECCV) (pp. 13–26).

  • Lin, L., Wang, G., Zuo, W., Xiangchu, F., & Zhang, L. (2017). Cross-domain visual matching via generalized similarity measure and feature learning. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 39(6), 1089–1102.

  • Lin, Z., Ding, G., Hu, M., & Wang, J. (2015). Semantics-preserving hashing for cross-view retrieval. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 3864–3872).

  • Lisanti, G., Masi, I., Bagdanov, A. D., & Bimbo, A. D. (2015). Person re-identification by iterative re-weighted sparse ranking. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 37(8), 1629–1642.

  • Liu, C., Gong, S., Loy, C. C., & Lin, X. (2012). Person re-identification: What features are important? In European conference on computer vision workshop (ECCVW) (pp. 391–401).

  • Long, M., Cao, Y., Wang, J., & Jordan, M. (2015). Learning transferable features with deep adaptation networks. In International conference on machine learning (ICML) (pp. 97–105).

  • Maaten, L. V. D., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research (JMLR), 9, 2579–2605.

  • Matsukawa, T., Okabe, T., Suzuki, E., & Sato, Y. (2016). Hierarchical Gaussian descriptor for person re-identification. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1363–1372).

  • Nguyen, D. T., Hong, H. G., Kim, K. W., & Park, K. R. (2017). Person recognition system based on a combination of body images from visible light and thermal cameras. Sensors, 17(3), 605.

  • Paisitkriangkrai, S., Shen, C., & Hengel, A. V. D. (2015). Learning to rank in person re-identification with metric ensembles. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1846–1855).

  • Pedagadi, S., Orwell, J., Velastin, S., & Boghossian, B. (2013). Local Fisher discriminant analysis for pedestrian re-identification. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 3318–3325).

  • Prosser, B., Zheng, W. S., Gong, S., Xiang, T., & Mary, Q. (2010). Person re-identification by support vector ranking. In British machine vision conference (BMVC) (pp. 21.1–21.11).

  • Rasiwasia, N., Costa Pereira, J., Coviello, E., Doyle, G., Lanckriet, G. R., Levy, R., & Vasconcelos, N. (2010). A new approach to cross-modal multimedia retrieval. In ACM multimedia (ACMMM) (pp. 251–260).

  • Ristani, E., Solera, F., Zou, R., Cucchiara, R., & Tomasi, C. (2016). Performance measures and a data set for multi-target, multi-camera tracking. In European conference on computer vision workshop (ECCVW) (pp. 17–35).

  • Schroff, F., Kalenichenko, D., & Philbin, J. (2015). Facenet: A unified embedding for face recognition and clustering. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 815–823).

  • Sharma, A., Kumar, A., Daume, H., & Jacobs, D. W. (2012). Generalized multiview analysis: A discriminative latent space. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2160–2167).

  • Shi, Z., Hospedales, T. M., & Xiang, T. (2015). Transferring a semantic representation for person re-identification and search. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 4184–4193).

  • Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In International conference on learning representations (ICLR).

  • Song, C., Huang, Y., Ouyang, W., & Wang, L. (2018). Mask-guided contrastive attention model for person re-identification. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1179–1188).

  • Su, C., Zhang, S., Xing, J., Gao, W., & Tian, Q. (2016). Deep attributes driven multi-camera person re-identification. In European conference on computer vision (ECCV) (pp. 475–491).

  • Sun, B., & Saenko, K. (2016). Deep coral: Correlation alignment for deep domain adaptation. In European conference on computer vision workshop (ECCVW) (pp. 443–450).

  • Sun, Y., Zheng, L., Yang, Y., Tian, Q., & Wang, S. (2018). Beyond part models: Person retrieval with refined part pooling. In European conference on computer vision (ECCV) (pp. 480–496).

  • Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A. (2015). Going deeper with convolutions. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1–9).

  • Tzeng, E., Hoffman, J., Saenko, K., & Darrell, T. (2017). Adversarial discriminative domain adaptation. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 7167–7176).

  • Wang, W., Yang, X., Ooi, B. C., Zhang, D., & Zhuang, Y. (2016). Effective deep learning-based multi-modal retrieval. The International Journal on Very Large Data Bases (VLDB), 25(1), 79–101.

  • Wang, X., Zheng, W. S., Li, X., & Zhang, J. (2016). Cross-scenario transfer person reidentification. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 26(8), 1447–1460.

  • Wei, L., Zhang, S., Gao, W., & Tian, Q. (2018). Person transfer GAN to bridge domain gap for person re-identification. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 79–88).

  • Wei, Y., Zhao, Y., Lu, C., Wei, S., Liu, L., Zhu, Z., et al. (2017). Cross-modal retrieval with CNN visual features: A new baseline. IEEE Transactions on Cybernetics, 47(2), 449–460.

  • Wu, A., Zheng, W. S., & Lai, J. H. (2017). Robust depth-based person re-identification. IEEE Transactions on Image Processing (TIP), 26(6), 2588–2603.

  • Wu, A., Zheng, W. S., Yu, H. X., Gong, S., & Lai, J. (2017). RGB-infrared cross-modality person re-identification. In IEEE international conference on computer vision (ICCV) (pp. 5390–5399).

  • Wu, B., Yang, Q., Zheng, W. S., Wang, Y., & Wang, J. (2015). Quantized correlation hashing for fast cross-modal search. In International joint conferences on artificial intelligence (IJCAI).

  • Wu, S., Chen, Y. C., Li, X., Wu, A., You, J., & Zheng, W. S. (2016). An enhanced deep feature representation for person re-identification. In IEEE winter conference on applications of computer vision (WACV) (pp. 1–8).

  • Wu, Z., Li, Y., & Radke, R. J. (2015). Viewpoint invariant human re-identification in camera networks using pose priors and subject-discriminative features. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 37(5), 1095–1108.

  • Xiao, T., Li, H., Ouyang, W., & Wang, X. (2016). Learning deep feature representations with domain guided dropout for person re-identification. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1249–1258).

  • Xiao, T., Li, S., Wang, B., Lin, L., & Wang, X. (2017). Joint detection and identification feature learning for person search. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 3376–3385).

  • Xiong, F., Gou, M., Camps, O., & Sznaier, M. (2014). Person re-identification using kernel-based metric learning methods. In European conference on computer vision (ECCV) (pp. 1–16).

  • Yang, Q., Wu, A., & Zheng, W. S. (2019). Person re-identification by contour sketch under moderate clothing change. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI). https://doi.org/10.1109/TPAMI.2019.2960509.

  • Yang, Y., Yang, J., Yan, J., Liao, S., Yi, D., & Li, S. Z. (2014). Salient color names for person re-identification. In European conference on computer vision (ECCV) (pp. 536–551).

  • Ye, M., Lan, X., Li, J., & Yuen, P. C. (2018). Hierarchical discriminative learning for visible thermal person re-identification. In Association for Advancement of Artificial Intelligence (AAAI).

  • Ye, M., Wang, Z., Lan, X., & Yuen, P. C. (2018). Visible thermal person re-identification via dual-constrained top-ranking. In International joint conferences on artificial intelligence (IJCAI).

  • Yin, J., Wu, A., & Zheng, W. S. (2020). Fine-grained person re-identification. International Journal of Computer Vision (IJCV). https://doi.org/10.1007/s11263-019-01259-0.

  • You, J., Wu, A., Li, X., & Zheng, W. S. (2016). Top-push video-based person re-identification. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1345–1353).

  • Yu, H. X., Wu, A., & Zheng, W. S. (2018). Unsupervised person re-identification by deep asymmetric metric embedding. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI). https://doi.org/10.1109/TPAMI.2018.2886878.

  • Yu, H. X., Wu, A., & Zheng, W. S. (2017). Cross-view asymmetric metric learning for unsupervised person re-identification. In IEEE international conference on computer vision (ICCV) (pp. 994–1002).

  • Zhang, D., & Li, W. J. (2014). Large-scale supervised multimodal hashing with semantic correlation maximization. In Association for Advancement of Artificial Intelligence (AAAI).

  • Zhang, L., Xiang, T., & Gong, S. (2016). Learning a discriminative null space for person re-identification. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1239–1248).

  • Zhao, L., Li, X., Zhuang, Y., & Wang, J. (2017). Deeply-learned part-aligned representations for person re-identification. In IEEE international conference on computer vision (ICCV) (pp. 3239–3248).

  • Zhao, R., Oyang, W., & Wang, X. (2017). Person re-identification by saliency learning. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 39(2), 356–370.

  • Zheng, L., Bie, Z., Sun, Y., Wang, J., Su, C., Wang, S., & Tian, Q. (2016). Mars: A video benchmark for large-scale person re-identification. In European conference on computer vision (ECCV) (pp. 868–884).

  • Zheng, L., Shen, L., Tian, L., Wang, S., Wang, J., & Tian, Q. (2015). Scalable person re-identification: A benchmark. In IEEE international conference on computer vision (ICCV) (pp. 1116–1124).

  • Zheng, W. S., Gong, S., & Xiang, T. (2009). Associating groups of people. In British machine vision conference (BMVC) (pp. 23.1–23.11).

  • Zheng, W. S., Gong, S., & Xiang, T. (2013). Reidentification by relative distance comparison. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 35(3), 653–668.

  • Zheng, W. S., Gong, S., & Xiang, T. (2016). Towards open-world person re-identification by one-shot group-based verification. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 38(3), 591–606.

  • Zheng, W. S., Li, X., Xiang, T., Liao, S., Lai, J., & Gong, S. (2015). Partial person re-identification. In IEEE international conference on computer vision (ICCV) (pp. 4678–4686).

  • Zheng, Z., Zheng, L., & Yang, Y. (2017). Unlabeled samples generated by GAN improve the person re-identification baseline in vitro. In IEEE international conference on computer vision (ICCV) (pp. 3774–3782).

  • Zhong, Z., Zheng, L., Cao, D., & Li, S. (2017). Re-ranking person re-identification with k-reciprocal encoding. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 3652–3661).

  • Zhu, F., Shao, L., & Yu, M. (2014). Cross-modality submodular dictionary learning for information retrieval. In ACM international conference on information and knowledge management (CIKM) (pp. 1479–1488).

  • Zhu, J. Y., Zheng, W. S., Lai, J. H., & Li, S. Z. (2014). Matching nir face to vis face using transduction. IEEE Transactions on Information Forensics and Security (TIFS), 9(3), 501–514.

  • Zhu, X., Wu, B., Huang, D., & Zheng, W. S. (2017). Fast open-world person re-identification. IEEE Transactions on Image Processing (TIP), 27(5), 2286–2300.

Acknowledgements

This work was supported partially by the National Key Research and Development Program of China (2016YFB1001002), NSFC (U1911401, U1811461, U1611461), Guangdong Province Science and Technology Innovation Leading Talents (2016TX03X157), Guangdong Project (No. 2018B030312002), Guangzhou Research Project (201902010037), and Research Projects of Zhejiang Lab (No. 2019KD0AB03). The corresponding author and principal investigator for this paper is Wei-Shi Zheng.

Corresponding author

Correspondence to Wei-Shi Zheng.

Additional information

Communicated by Bernt Schiele.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

Properties of Modality-Specific and Shared Nodes In Sect. 4.1, we analyse the properties of modality-specific and shared nodes, which can be derived as follows.

Let \(\mathbf {x}_{(l)}\) denote the input to layer \(l+1\), let \(o_{(l+1),i}\) denote the output of the i-th node before the activation function in layer \(l+1\), and let \(\mathbf {w}_{(l+1),i}\) and \({b}_{(l+1),i}\) denote the weight and bias parameters, respectively, i.e., \(o_{(l+1),i}=(\mathbf {w}_{(l+1),i})^\top \mathbf {x}_{(l)}+b_{(l+1),i}\). Using the previously defined types of nodes, without loss of generality, \(\mathbf {x}_{(l)}^{m1}\) and \(\mathbf {x}_{(l)}^{m2}\) can be factorised into three parts as follows: \(\mathbf {x}_{(l)}^{m1}=[\mathbf {x}_{(l)}^{m1,1spe};\mathbf {x}_{(l)}^{m1,2spe};\mathbf {x}_{(l)}^{m1,sh}]\) and \(\mathbf {x}_{(l)}^{m2}=[\mathbf {x}_{(l)}^{m2,1spe};\mathbf {x}_{(l)}^{m2,2spe};\mathbf {x}_{(l)}^{m2,sh}]\), where “;” denotes vector concatenation and the three components correspond to modality-1-specific, modality-2-specific and shared nodes, respectively. We denote \(\mathbf {w}_{(l+1),i}\) as \(\mathbf {w}_{(l+1),i}=[\mathbf {w}_{(l+1),i}^{1spe};\mathbf {w}_{(l+1),i}^{2spe};\mathbf {w}_{(l+1),i}^{sh}]\).

Let L denote the loss function of the network. For modality-2-specific nodes, given a network input \(\mathbf {x}^{m1}_{(0)}\) of modality 1, their output is \(\mathbf {x}_{(l)}^{m1,2spe}=\mathbf {0}\); this follows directly from the definition of node types in Eq. (10). Therefore, in the forward propagation process, the output of layer \(l+1\) is

$$\begin{aligned} o_{(l+1),i}=(\mathbf {w}_{(l+1),i}^{1spe})^\top \mathbf {x}^{m1,1spe}_{(l)} + (\mathbf {w}_{(l+1),i}^{sh})^\top \mathbf {x}^{m1,sh}_{(l)} +{b}_{(l+1),i}. \end{aligned}$$
(15)

In the backward propagation process, the derivatives of the loss function L with respect to the weights are

$$\begin{aligned} \frac{\partial L}{\partial \mathbf {w}_{(l+1),i}^{1spe}}=\frac{\partial L}{\partial {o}_{(l+1),i}} \frac{\partial {o}_{(l+1),i}}{\partial \mathbf {w}_{(l+1),i}^{1spe}}=\frac{\partial L}{\partial {o}_{(l+1),i}}\mathbf {x}^{m1,1spe}_{(l)}, \end{aligned}$$
(16)
$$\begin{aligned} \frac{\partial L}{\partial \mathbf {w}_{(l+1),i}^{2spe}}=\frac{\partial L}{\partial {o}_{(l+1),i}} \frac{\partial {o}_{(l+1),i}}{\partial \mathbf {w}_{(l+1),i}^{2spe}}=\frac{\partial L}{\partial {o}_{(l+1),i}}\mathbf {x}^{m1,2spe}_{(l)}=\mathbf {0}, \end{aligned}$$
(17)
$$\begin{aligned} \frac{\partial L}{\partial \mathbf {w}_{(l+1),i}^{sh}}=\frac{\partial L}{\partial {o}_{(l+1),i}} \frac{\partial {o}_{(l+1),i}}{\partial \mathbf {w}_{(l+1),i}^{sh}}=\frac{\partial L}{\partial {o}_{(l+1),i}}\mathbf {x}^{m1,sh}_{(l)}. \end{aligned}$$
(18)

For a network input \(\mathbf {x}^{m2}_{(0)}\) of modality 2, analogous formulations can be derived. Since the bias parameter \({b}_{(l+1),i}\) can be absorbed into the weight parameter \(\mathbf {w}_{(l+1),i}\) by appending a constant 1 to the layer input \(\mathbf {x}_{(l)}\), the bias is not analysed separately.
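
As a concrete check of Eqs. (15)–(18), the minimal sketch below builds one node with the partitioned weight vector and verifies that, for a modality-1 input, the gradient of the modality-2-specific weights vanishes. PyTorch is our assumption here (the paper does not prescribe a framework), and the sizes are arbitrary illustrative values.

```python
# Minimal numerical check of Eqs. (15)-(18) for a single node.
import torch

d_spe, d_sh = 2, 3                 # size of each modality-specific part and of the shared part
d_in = 2 * d_spe + d_sh            # layer input layout: [m1-specific; m2-specific; shared]

w = torch.randn(d_in, requires_grad=True)   # weight vector w_{(l+1),i} of one node
b = torch.randn(1, requires_grad=True)      # bias b_{(l+1),i}

# A modality-1 input: its modality-2-specific part is zero by Eq. (10).
x_m1 = torch.cat([torch.randn(d_spe), torch.zeros(d_spe), torch.randn(d_sh)])

o = w @ x_m1 + b                   # forward pass, Eq. (15)
loss = o.pow(2).sum()              # any differentiable loss L
loss.backward()

# dL/dw = (dL/do) * x, so the m2-specific slice of the gradient is zero (Eq. 17),
# while the m1-specific and shared slices are generally non-zero (Eqs. 16 and 18).
print(w.grad[d_spe:2 * d_spe])     # tensor([0., 0.])
```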

Based on the derivations above, the properties of modality-specific and shared nodes are analysed in Sect. 4.1.

Structure Representation Ability In the last paragraph of Sect. 4.2, we analyse the connection between one-stream networks consisting of modality-gated nodes and the existing network structures. To show the structure representation ability of modality-gated nodes, we take a simple two-stream network as an example. In Fig. 9, a one-stream network equivalent to the two-stream network in Fig. 7 with respect to forward propagation is shown. Inputs \(\mathbf {x}^{m1}\) and \(\mathbf {x}^{m2}\) are fed into the same nodes in layer 0. In layer 1, there are four modality-gated nodes, of which two are modality-1-specific nodes and the others are modality-2-specific nodes. In layer 1a, there are two shared modality-gated nodes. The black solid lines denote weights of 1, and the black dotted lines denote weights of 0. The modality label \(y^{mod}\) controls the modality gate. In this way, the weights represented in blue and red correspond to modality-specific nodes, and the weights represented in green correspond to shared nodes; thus, this one-stream structure consisting of modality-gated nodes is identical to the two-stream structure in Fig. 7. Therefore, it is possible to learn a two-stream structure. More generally, modality-gated nodes can represent soft modality-specific nodes and provide a network with sufficient flexibility to evolve into any modality-specific and shared structures.
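
To make this construction concrete, the following sketch implements a modality-gated fully connected layer; the class name ModalityGatedLinear and the gate parameter are our own illustrative names (again assuming PyTorch), not taken from the paper.

```python
# An illustrative modality-gated layer: each node carries two modality selection
# weights, one per modality, indexed by the modality label y^mod.
import torch
import torch.nn as nn

class ModalityGatedLinear(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.fc = nn.Linear(d_in, d_out)
        # gate[m, j]: learnable selection weight of node j for modality m.
        self.gate = nn.Parameter(torch.ones(2, d_out))

    def forward(self, x, y_mod):
        # y_mod: LongTensor of per-sample modality labels (0 or 1).
        return torch.relu(self.fc(x)) * self.gate[y_mod]

layer = ModalityGatedLinear(8, 4)
x = torch.randn(5, 8)                      # a mixed mini-batch
y_mod = torch.tensor([0, 1, 0, 1, 0])      # modality label of each sample
out = layer(x, y_mod)                      # shape (5, 4)
```

With hard gate values, a node with gates [1, 0] is modality-1-specific, [0, 1] is modality-2-specific, and [1, 1] is shared, so a stack of such layers can realise exactly the two-stream structure of Fig. 7 inside a one-stream network; learned soft gates interpolate between these structures.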

Modality-Gated Nodes versus Channel Attention For our modality-gated nodes, each node (or channel in a convolutional network) is weighted by two modality selection weights, one for each modality. Some neural networks apply a channel attention mechanism (e.g., SENet Hu et al. 2018) and also use different weights for different channels. According to our analysis, the modality selection weights of modality-gated nodes can be adjusted to form either modality-specific or shared structures in a network for processing data from two modalities. By contrast, simply applying the channel attention mechanism does not allow modality-specific and shared structures to be learned because there is no specific modelling for different modalities.
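
To illustrate the distinction, here is a minimal SE-style attention head in the same sketch style, simplified to feature vectors (SENet proper, Hu et al. 2018, operates on convolutional feature maps):

```python
# Simplified SE-style channel attention: the reweighting is computed from the
# input content alone, with no notion of which modality a sample comes from.
import torch
import torch.nn as nn

class SEAttention(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):          # x: (batch, channels)
        return x * self.fc(x)      # content-driven channel reweighting
```

Two samples with identical features but different modality labels receive identical attention weights here, whereas a modality-gated node can still treat them differently through its modality selection weights.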

Cite this article

Wu, A., Zheng, WS., Gong, S. et al. RGB-IR Person Re-identification by Cross-Modality Similarity Preservation. Int J Comput Vis 128, 1765–1785 (2020). https://doi.org/10.1007/s11263-019-01290-1
