MixTConv: Mixed Temporal Convolutional Kernels for Efficient Action Recognition | IEEE Conference Publication | IEEE Xplore

Abstract:

To efficiently extract spatiotemporal features of video for action recognition, most state-of-the-art methods integrate 1D temporal convolutional filters into 2D CNN backbones. However, they all exploit 1D temporal convolutional filters of fixed kernel size (i.e., 3) in their network building block, thus have suboptimal temporal modeling capability to handle both long-term and short-term actions. To address this problem, we first investigate the impacts of different kernel sizes for the 1D temporal convolutional filters. Then, we propose a simple yet efficient operation called Mixed Temporal Convolution (MixTConv), which consists of multiple depthwise 1D convolutional filters with different kernel sizes. By plugging MixTConv into the conventional 2D CNN backbone ResNet-50, we further propose an efficient and effective network architecture named MSTNet for action recognition, and achieve state-of-the-art results on multiple large-scale benchmarks.
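
The operation described in the abstract, several depthwise 1D temporal filters with different kernel sizes, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The kernel sizes (1, 3, 5, 7), the equal split of channels into one group per kernel size, the zero padding, and the averaging weights are all assumptions made for illustration, and the names `depthwise_temporal_conv`, `make_mixed_kernels`, and `mixed_temporal_conv` are hypothetical.

```python
def depthwise_temporal_conv(series, kernel):
    """Filter one channel's time series with one kernel (cross-correlation).

    Zero padding keeps the output as long as the input.
    An odd kernel length is assumed so the kernel can be centred on each frame.
    """
    pad = len(kernel) // 2
    num_frames = len(series)
    out = []
    for t in range(num_frames):
        acc = 0.0
        for j, weight in enumerate(kernel):
            idx = t + j - pad
            if 0 <= idx < num_frames:  # frames outside the clip count as zero
                acc += weight * series[idx]
        out.append(acc)
    return out


def make_mixed_kernels(num_channels, kernel_sizes=(1, 3, 5, 7)):
    """Build one kernel per channel, with kernel size varying by channel group.

    Assumption: channels are split evenly, one group per kernel size.
    Averaging weights stand in for weights that would normally be learned.
    """
    if num_channels % len(kernel_sizes) != 0:
        raise ValueError("num_channels must be divisible by the number of kernel sizes")
    group_size = num_channels // len(kernel_sizes)
    kernels = []
    for k in kernel_sizes:
        if k % 2 == 0:
            raise ValueError("kernel sizes must be odd")
        kernels.extend([[1.0 / k] * k for _ in range(group_size)])
    return kernels


def mixed_temporal_conv(x, kernels):
    """Apply a separate temporal filter to every channel (depthwise).

    x is a list of channels, each a list of per-frame values.
    Each channel is filtered only by its own kernel, so channels never mix.
    """
    if len(x) != len(kernels):
        raise ValueError("need exactly one kernel per channel")
    return [depthwise_temporal_conv(ch, k) for ch, k in zip(x, kernels)]


# Usage: 4 channels, 5 frames, one channel per kernel size.
clip = [[1.0, 2.0, 3.0, 4.0, 5.0] for _ in range(4)]
mixed = mixed_temporal_conv(clip, make_mixed_kernels(num_channels=4))
```

In this sketch the channel with kernel size 1 passes through unchanged, while channels with larger kernels aggregate over progressively longer temporal windows. That is the intuition behind mixing kernel sizes to cover both short-term and long-term actions.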
Date of Conference: 10-15 January 2021
Date Added to IEEE Xplore: 05 May 2021
ISBN Information:
Print on Demand (PoD) ISSN: 1051-4651
Conference Location: Milan, Italy
