In this paper, we propose a Three-stream Inception Former-based Action Recognition Network, called TIFAR-Net, to recognize actions in Infrared (IR) videos. It consists of two major stages: first, fine-tuning and feature extraction using an Inception Transformer (IFormer) network, and second, feature fusion and classification using a Multi-Head Self-Attention (MHSA) network. Specifically, input IR videos are converted into compact yet effective representations referred to as Optical Flow Motion Images, Optical Flow Dynamic Images and Infrared Motion Images. The first two types of images are constructed from optical flow fields, while the third is computed directly from raw IR frames. These image-based representations enable us to fine-tune a three-stream IFormer network, which is a hybrid Convolutional Neural Network and Vision Transformer model. Features extracted from each stream are concatenated and passed through an MHSA module to prune unimportant information and capture key global information from the input images. Finally, classification is performed using fully connected and Softmax layers. Through extensive ablation experiments, we verify that the proposed TIFAR-Net improves the performance of IR action recognition and achieves state-of-the-art results on the InfAR dataset (88.5%), the IITR-IAR dataset (77.93%) and the UNISV dataset (97.08%).
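To make the second stage more concrete, the following is a minimal NumPy sketch of fusing three per-stream feature vectors with multi-head self-attention and classifying the result with a fully connected layer and Softmax. It is an illustration under stated assumptions, not the implementation used in this work: the abstract does not specify how the stream features are concatenated, so the sketch assumes they are stacked as three tokens; the feature dimension, number of heads, number of classes, random weights, and the names `mhsa` and `fuse_and_classify` are all hypothetical, and the random vectors merely stand in for IFormer features.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def mhsa(tokens, wq, wk, wv, wo, num_heads):
    """Scaled dot-product multi-head self-attention over `tokens` of shape (n, d)."""
    n, d = tokens.shape
    dh = d // num_heads  # per-head dimension; d must be divisible by num_heads

    def split(x):
        # (n, d) -> (num_heads, n, dh)
        return x.reshape(n, num_heads, dh).transpose(1, 0, 2)

    q, k, v = split(tokens @ wq), split(tokens @ wk), split(tokens @ wv)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(dh)  # (num_heads, n, n)
    attn = softmax(scores, axis=-1)                  # each row sums to 1
    out = (attn @ v).transpose(1, 0, 2).reshape(n, d)  # merge heads back to (n, d)
    return out @ wo, attn


def fuse_and_classify(stream_feats, params, num_heads):
    # Assumption: the three stream features are stacked as three tokens.
    tokens = np.stack(stream_feats, axis=0)          # (3, d)
    fused, attn = mhsa(tokens, params["wq"], params["wk"],
                       params["wv"], params["wo"], num_heads)
    # Flatten the attended tokens, then apply a fully connected layer and Softmax.
    logits = fused.reshape(-1) @ params["w_fc"] + params["b_fc"]
    return softmax(logits), attn


# Hypothetical sizes, chosen only for illustration.
rng = np.random.default_rng(0)
d, num_heads, num_classes = 8, 2, 5
params = {name: rng.normal(scale=0.1, size=(d, d)) for name in ("wq", "wk", "wv", "wo")}
params["w_fc"] = rng.normal(scale=0.1, size=(3 * d, num_classes))
params["b_fc"] = np.zeros(num_classes)

# Random stand-ins for the features of the three image streams.
stream_feats = [rng.normal(size=d) for _ in range(3)]
probs, attn = fuse_and_classify(stream_feats, params, num_heads)
print(probs.shape, attn.shape)
```

In this sketch, each head's attention matrix indicates how strongly one stream attends to the others, which is one way to picture how the fusion step can down-weight less informative streams before classification.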
Data Availability
No datasets were generated or analysed during the current study.
Author Contributions
All authors made substantial contributions to the concept and design of the paper. Methodology was carried out by JI and ASR, software development by JI and RV, and project administration and supervision by JI and ASR. All authors read and approved the manuscript.
Imran, J., Rajput, A.S. & Vashisht, R. Tifar-net: three-stream inception former-based action recognition network for infrared videos. SIViP 19, 192 (2025). https://doi.org/10.1007/s11760-024-03796-9
