Elsevier

Information Fusion

Volume 91, March 2023, Pages 623-632

Full length article
Spectral-invariant matching network

https://doi.org/10.1016/j.inffus.2022.10.033

Highlights

  • The method is based on domain conversion for matching between VIS and NIR images.

  • The method uses metric learning to integrate information from the two domains.

  • This work studies the effect of the domain conversion method and loss function.

  • This work achieves noteworthy performance on three public datasets.

  • The method can be adapted to various applications such as depth estimation.

Abstract

As the need for sensor fusion systems has grown, developing methods to find correspondences between images with different spectral ranges has become increasingly important. Since most such images do not share low-level information, such as textures and edges, existing matching approaches fail even with convolutional neural networks (CNNs). In this paper, we propose an end-to-end metric learning method, called SPIMNet (SPectral-Invariant Matching Network), for robust cross- and multi-spectral image patch matching. While existing CNN-based methods learn matching features directly from cross- and multi-spectral image patches, SPIMNet translates patches across spectral bands and then discriminates their similarity in three steps. First, (1) input patches are adjusted to a common feature domain by a domain translation network; then (2) two Siamese networks learn to match the adjusted features within the same spectral domain; and (3) the matching features are fed to fully connected layers that decide, as a classification task, whether the patches correspond. By effectively incorporating each step, SPIMNet achieves competitive results on a variety of challenging datasets, including both VIS–NIR and VIS–Thermal image pairs. Our code is available at https://github.com/koyeongmin/SPIMNet.
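
The three-step structure described in the abstract can be outlined in code. The following is a minimal sketch only: it assumes PyTorch, shows a single conversion direction (NIR to VIS) rather than the paper's two conversion networks and dual-Siamese structure, and all module names, layer sizes, and hyperparameters are hypothetical rather than taken from the released implementation.

    # Minimal sketch of the three-step SPIMNet pipeline (illustrative only).
    # Assumptions: PyTorch; one conversion direction (NIR -> VIS); hypothetical
    # module names and layer sizes, not the authors' released code.
    import torch
    import torch.nn as nn

    class DomainConversionNet(nn.Module):
        """Stand-in for the domain translation step (here: NIR patch -> VIS-like patch)."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
                nn.Conv2d(32, 1, 3, padding=1))
        def forward(self, x):
            return self.net(x)

    class SiameseBranch(nn.Module):
        """Shared-weight feature extractor applied to both patches of a pair."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten())
        def forward(self, x):
            return self.net(x)

    class SPIMNetSketch(nn.Module):
        def __init__(self):
            super().__init__()
            self.convert = DomainConversionNet()   # step 1: spectral translation
            self.branch = SiameseBranch()          # step 2: shared Siamese features
            self.classifier = nn.Sequential(       # step 3: match / non-match decision
                nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2))
        def forward(self, vis_patch, nir_patch):
            nir_as_vis = self.convert(nir_patch)
            f_vis = self.branch(vis_patch)
            f_nir = self.branch(nir_as_vis)
            return self.classifier(torch.cat([f_vis, f_nir], dim=1))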

Introduction

Many researchers and industries utilize sensors across various domains to obtain more information about targets, and proper sensor fusion methods are required to manage the information from each sensor. In the field of computer vision, cross-spectral (i.e., visible–near-infrared (NIR)) and multi-spectral (i.e., visible–thermal) image matching are being actively studied because the different spectral domains provide complementary information [1]. For example, visible images offer rich color information while thermal images preserve structural detail in low-light conditions, so the two modalities compensate for each other, making these images suitable for all-day vision systems [2]. All-day vision, or fusing multi-spectral images, has become an essential and significant task for sensor fusion systems that conduct facial expression recognition [3], [4], material classification [5], [6], medical image analysis [7], pansharpening [8], and pedestrian detection [9], [10], [11].

Since cross- and multi-spectral images capture different wavelength ranges, the images appear significantly different at both the intensity and pixel levels. Even well-known local feature descriptors [12], [13] cannot account for the relationship between images across spectral domains, which results in severe performance drops in matching tasks. Recently, convolutional neural networks (CNNs) have demonstrated some ability to address this issue by leveraging semantic information along with low-level features. Siamese structures partially overcome the challenging matching problems across spectral domains [14], [15], [16], [17]. In most Siamese structures, the same deep neural network is applied to both multi-spectral image patches to extract a feature from each. Their loss functions pull the features of matching (positive) patch pairs close together and push those of non-matching pairs far apart. Encoder–decoder structures have also been utilized to extract common features between multi-spectral image patches [18], [19].
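
The behaviour described above, pulling features of matching patches together and pushing those of non-matching patches apart, is commonly implemented with a contrastive (margin) loss. The snippet below is a generic sketch of such a loss, assuming PyTorch; it is not the exact loss used in any of the cited works.

    # Generic contrastive (margin) loss for Siamese patch matching: matching
    # pairs (label = 1) are pulled together, non-matching pairs (label = 0)
    # are pushed at least `margin` apart. Illustrative, not from the paper.
    import torch.nn.functional as F

    def contrastive_loss(feat_a, feat_b, label, margin=1.0):
        dist = F.pairwise_distance(feat_a, feat_b)         # Euclidean distance per pair
        pos = label * dist.pow(2)                          # shrink distance for matches
        neg = (1 - label) * F.relu(margin - dist).pow(2)   # enforce margin for non-matches
        return 0.5 * (pos + neg).mean()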

Although these previous methods show that they can fuse cross- and multi-spectral images, they extract features from each image separately and merely aggregate the features to fuse information or to predict similarities. We have observed that such methods are ill-suited to handling both intensity- and pixel-level differences: these differences are one reason why fusing multi-spectral images is hard, and previous methods have no module to reduce them. In this paper, we present the SPectral-Invariant Matching Network (SPIMNet), an end-to-end CNN framework for robust image patch matching across different spectral domains. The primary contributions of this study are as follows:

  • In contrast to previous methods that extract features directly from input patches, SPIMNet learns spectral translations of the input patches using a domain conversion network. This allows image patches from different spectral domains to be compared using similar features.

  • SPIMNet utilizes a dual-Siamese network to extract features from each translated domain, and the extracted features are passed to a fully connected network to predict the matching label.

  • The proposed end-to-end method can be trained from scratch without any pre-trained backbone network, and we obtain competitive results on several standard datasets, including both visible–NIR and visible–thermal images.

  • Additionally, ablation studies indicate that each of these technical contributions leads to appreciable improvements in matching accuracy, and we show that the proposed method can be applied to various applications such as stereo matching and image enhancement (see the sketch after this list for the stereo case).
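
As a rough illustration of the stereo matching use case mentioned in the last item, a learned cross-spectral patch-similarity score can act as a matching cost in a winner-takes-all disparity search. The sketch below assumes a hypothetical similarity(patch_a, patch_b) function (higher means a better match) and NumPy arrays for a rectified image pair; none of these names come from the released code.

    # Winner-takes-all disparity search driven by a learned patch-similarity
    # score. `similarity` is a hypothetical callable, not the authors' API.
    import numpy as np

    def extract_patch(img, x, y, half=8):
        return img[y - half:y + half + 1, x - half:x + half + 1]

    def best_disparity(left, right, x, y, similarity, max_disp=64, half=8):
        # Assumes (x, y) is far enough from the image border for at least one
        # candidate disparity to be evaluated.
        p_left = extract_patch(left, x, y, half)
        scores = []
        for d in range(max_disp):
            if x - d - half < 0:
                break
            p_right = extract_patch(right, x - d, y, half)
            scores.append(similarity(p_left, p_right))   # learned matching score
        return int(np.argmax(scores))                    # disparity with the best score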

Section snippets

Related works

Hand-crafted Feature Descriptions. Hand-crafted features such as SIFT [12], SURF [13] and FAST [20] are based on measurements of texture similarities and have shown promise for finding correspondences between visible images, even with illumination and scale changes. A modification of the hand-crafted features was used to handle the issue of dense correspondences in [21]. However, these methods often fail in cross- and multi-spectral imagery because their different spectral characteristics result

Spectral-invariant matching network

Previous works [14], [15] on cross-spectral matching have directly extracted discriminative features from input image patches. As shown in Fig. 1, matching image patches from different spectral domains is a challenging task because the objects and materials have totally different appearances. For this reason, performance has been limited in previous works [14], [15].

In this work, instead of learning discriminative features directly from cross-spectral image patches, we solve the matching

Experimental results

To demonstrate the effectiveness of SPIMNet, we evaluate it on three publicly available datasets, the VIS–NIR patch dataset [16], the KAIST Multi-spectral pedestrian dataset [9], and the PittsStereo-RGBNIR dataset [45] as shown in Fig. 7. We compare SPIMNet with four hand-crafted feature matching methods (SIFT [12], GISIFT [55], EHD [56], LGHD [57]) and eight CNN-based methods including a Siamese network [16], Pseudo-Siamese network [16] (PSiamese), 2-channel network [16], PNNet [58], Q-Net [59]

Conclusion

We have developed an image patch matching network across cross- and multi-spectral domains, named SPIMNet. SPIMNet is formulated as an end-to-end network, using two domain conversion networks to adjust the pixel and intensity levels of input cross-spectral images. A dual-Siamese network enables the automatic selection of the better matching domain between the two converted domain features. By incorporating these schemes in a deep learning framework, competitive matching accuracy is achieved on a variety

CRediT authorship contribution statement

Yeongmin Ko: Methodology, Software, Validation, Writing – review & editing, Visualization. Yong-Jun Jang: Methodology, Software, Writing – review & editing. Vinh Quang Dinh: Conceptualization, Software, Writing – original draft. Hae-Gon Jeon: Formal analysis, Investigation, Writing – original draft. Moongu Jeon: Formal analysis, Investigation, Resources, Supervision, Project administration.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2014-3-00077, Development of Global Multi-target Tracking and Event Prediction Techniques Based on Real-time Large-Scale Video Analysis).


References (62)

  • C.A. Corneanu, et al., Survey on RGB, 3D, thermal, and multimodal approaches for facial expression recognition: History, trends, and affect-related applications, IEEE Trans. Pattern Anal. Mach. Intell. (2016).
  • T. Zhi, B.R. Pires, M. Hebert, S.G. Narasimhan, Multispectral Imaging for Fine-Grained Recognition of Powders on...
  • P. Saponaro, S. Sorensen, A. Kolagunda, C. Kambhamettu, Material classification with thermal imagery, in: Proceedings...
  • S. Hwang, J. Park, N. Kim, Y. Choi, I. So Kweon, Multispectral pedestrian detection: Benchmark dataset and baseline,...
  • L. Zhang, X. Zhu, X. Chen, X. Yang, Z. Lei, Z. Liu, Weakly aligned cross-modal learning for multispectral pedestrian...
  • D. Xu, W. Ouyang, E. Ricci, X. Wang, N. Sebe, Learning cross-modal deep representations for robust pedestrian...
  • D.G. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vis. (2004).
  • D. Quan, X. Liang, S. Wang, S. Wei, Y. Li, N. Huyan, L. Jiao, AFD-Net: Aggregated Feature Difference Learning for...
  • D. Quan, S. Fang, X. Liang, S. Wang, L. Jiao, Cross-spectral image patch matching by learning features of the spatially...
  • C.A. Aguilera, F.J. Aguilera, A.D. Sappa, C. Aguilera, R. Toledo, Learning cross-spectral similarity measures with deep...
  • L. Zhou, et al., Robust matching for SAR and optical images using multiscale convolutional gradient features, IEEE Geosci. Remote Sens. Lett. (2022).
  • Y. Wu, et al., Commonality autoencoder: Learning common features for change detection from heterogeneous images, IEEE Trans. Neural Netw. Learn. Syst. (2022).
  • E. Rosten, T. Drummond, Machine learning for high-speed corner detection, in: Proceedings of European Conference on...
  • C. Liu, et al., SIFT flow: Dense correspondence across scenes and its applications, IEEE Trans. Pattern Anal. Mach. Intell. (2011).
  • M. Brown, S. Susstrunk, Multi-spectral SIFT for scene category recognition, in: Proceedings of IEEE Conference on...
  • T. Mouats, et al., Multispectral stereo odometry, IEEE Trans. Intell. Transp. Syst. (2015).
  • X. Shen, L. Xu, Q. Zhang, J. Jia, Multi-modal and multi-spectral registration for natural images, in: Proceedings of...
  • Y.S. Heo, et al., Robust stereo matching using adaptive normalized cross-correlation, IEEE Trans. Pattern Anal. Mach. Intell. (2010).
  • Y.S. Heo, et al., Joint depth map and color consistency estimation for stereo images with different illuminations and cameras, IEEE Trans. Pattern Anal. Mach. Intell. (2012).
  • P. Pinggera, T. Breckon, H. Bischof, On cross-spectral stereo matching using dense gradient features, in: Proceedings...
  • N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: Proceedings of IEEE Conference on...

Yeongmin Ko received the B.S. degree in School of Electrical Engineering from Gwangju Institute of Science and Technology (GIST), Gwangju, South Korea, in 2017. He is currently pursuing the Ph.D. degree with the School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology. His current research interests include computer vision, self-driving, and deep learning.

Yong-Jun Chang received the B.S. degree in electronic engineering and avionics from Korea Aerospace University, Gyeonggi-do, Korea, in 2014, and the M.S. degree in information and communications and the Ph.D. degree in electrical engineering and computer science from Gwangju Institute of Science and Technology (GIST), Gwangju, Korea, in 2016 and 2021, respectively. In 2021, he was a researcher at the Korea Culture Technology Institute in GIST. He is currently a research engineer at Hyundai Rotem. His research interests are in computer vision and deep learning.

Vinh Quang Dinh received the B.S. degree in computer science from Nong Lam University, Ho Chi Minh City, Vietnam, in 2008, and the M.S. and Ph.D. degrees in electrical and computer engineering from Sungkyunkwan University, Suwon, South Korea, in 2013 and 2016, respectively. From 2016 to 2017, he was a Postgraduate Researcher with Sungkyunkwan University. From 2017 to 2020, he was a Postgraduate Researcher with the Gwangju Institute of Science and Technology. In 2020, he joined Vietnamese-German University, where he is currently a Lecturer with the School of Electrical Engineering and Computer Science. His current research interests include computer vision and deep learning.

Hae-Gon Jeon received the B.S. degree in the School of Electrical and Electronic Engineering from Yonsei University in 2011, and the M.S. and Ph.D. degrees in the School of Electrical Engineering from KAIST in 2013 and 2018, respectively. He was a postdoctoral researcher at the Robotics Institute at Carnegie Mellon University. He is currently affiliated with both the AI Graduate School and the School of Electrical Engineering and Computer Science at GIST as an assistant professor. He received the Best Ph.D. Thesis Award 2018 at KAIST. His research interests include computational imaging, 3D reconstruction, and machine learning.

Moongu Jeon received the B.S. degree in architectural engineering from Korea University, Seoul, South Korea, in 1988, and the M.S. and Ph.D. degrees in computer science and scientific computation from the University of Minnesota, Minneapolis, MN, USA, in 1999 and 2001, respectively. As a master's degree researcher, he was involved in optimal control problems at the University of California, Santa Barbara, CA, USA, from 2001 to 2003, and then moved to the National Research Council of Canada, where he worked on the sparse representation of high-dimensional data and image processing until July 2005. In 2005, he joined the Gwangju Institute of Science and Technology, Gwangju, South Korea, where he is currently a Full Professor with the School of Electrical Engineering and Computer Science. His current research interests include machine learning, computer vision, and artificial intelligence.
