Abstract:
In the last decade, Convolutional Neural Networks (CNNs) have become a dominant approach in domains such as computer vision, self-driving cars, medical imaging, and natural language processing. The core operation of a CNN is the convolution layer, which aggregates input features within local windows in a short-range manner and learns relative positions inside each window. For long-range modeling, common CNNs stack many convolutional layers to enlarge the receptive field, which results in high computational cost. Recently, Vision Transformers (ViTs) and their improvements have outperformed CNNs in the rankings of language, vision, and audio research. The main goal of ViTs is to extract short-range and long-range features in one layer. With this strategy, the network structure of ViTs is simpler than that of CNNs. However, ViTs have quadratic complexity with respect to the spatial length of the input feature. In the last year, many methods have been proposed to reduce the cost of ViTs and to bring the complicated designs of CNNs into ViT-based models. Inspired by the properties of ViTs and CNNs, this paper introduces a Local and Global Fourier Network (LGFNet) that jointly learns local and global receptive fields in the frequency domain rather than in the spatial or time domain used by conventional CNNs and ViTs. The input features and the local and global kernels are transformed to the frequency domain through the Fast Fourier Transform. The local features are learned by a convolution between the input feature and the local kernels. Concurrently, matrix multiplication between the input feature and the global kernels is performed to extract low frequencies from the input Fourier feature. Since local and global Fourier features are complementary, LGFNet efficiently fuses this information by a summation operation based on the similarity degrees of the input signals. Therefore, LGFNet produces a unified representation from the input feature.
To evaluate the effectiveness ...
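The frequency-domain steps named in the abstract (Fast Fourier Transform of the input and kernels, a local branch, a global branch, and fusion by summation) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation, and several details are assumptions made only for illustration: the function name `lgf_block_sketch` is hypothetical, the local branch is realized through the convolution theorem (a pointwise product of spectra, equivalent to a circular convolution in the spatial domain), the global branch is a per-frequency multiplication with a full-size weight, and the fusion is a plain sum, since the abstract does not specify how the similarity degrees enter.

```python
import numpy as np


def lgf_block_sketch(x, local_kernel, global_weight):
    """Hypothetical sketch of a local + global Fourier block.

    x             : (H, W) real-valued feature map
    local_kernel  : (k, k) small real-valued spatial kernel (assumed form)
    global_weight : (H, W // 2 + 1) per-frequency weights (assumed form)
    """
    H, W = x.shape

    # Transform the input feature to the frequency domain.
    X = np.fft.rfft2(x)

    # Transform the local kernel, zero-padded to the input size.
    K_local = np.fft.rfft2(local_kernel, s=(H, W))

    # Local branch (assumption): a pointwise product of spectra, which by
    # the convolution theorem equals a circular convolution in space.
    local_feat = X * K_local

    # Global branch (assumption): one weight per frequency, so every
    # output position depends on the whole input.
    global_feat = X * global_weight

    # Fusion (assumption): plain summation of the two branches.
    fused = local_feat + global_feat

    # Return to the spatial domain.
    return np.fft.irfft2(fused, s=(H, W))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((8, 8))
    k = rng.standard_normal((3, 3))
    g = np.ones((8, 5), dtype=complex)  # identity global filter
    y = lgf_block_sketch(x, k, g)
    print(y.shape)
```

With the global weight set to zero, the output reduces to the circular convolution of `x` with the kernel; with the kernel set to zero and the global weight set to one, the block returns `x` unchanged. Both cases are easy sanity checks for the sketch.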
Date of Conference: 19-21 June 2023
Date Added to IEEE Xplore: 31 August 2023