ABSTRACT
In this talk, I will first discuss approaches that reduce the GFLOPs required during inference by 3D convolutional neural networks (CNNs) and vision transformers. While state-of-the-art 3D CNNs and vision transformers achieve strong results on action recognition datasets, they are computationally expensive and require many GFLOPs. Although the GFLOPs of a 3D CNN or vision transformer can be decreased by reducing the temporal feature resolution or the number of tokens, no single setting is optimal for all input clips. I will therefore discuss two differentiable sampling approaches that can be plugged into any existing 3D CNN or vision transformer architecture. These approaches adapt the computational resources to the input video, so that as many resources as needed, but no more than necessary, are used to classify it. They substantially reduce the computational cost (GFLOPs) of state-of-the-art networks while preserving accuracy. In the second part, I will discuss an approach that generates annotated training samples for very rare classes. It is based on a generative adversarial network (GAN) that jointly synthesizes images and the corresponding segmentation mask for each image. The generated data can then be used for one-shot video object segmentation.
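The talk does not spell out the sampling mechanism, but differentiable selection of frames or tokens is commonly built on a Gumbel-softmax relaxation, which keeps the selection step end-to-end trainable. The following is a minimal NumPy sketch of that general idea, not the specific method of the talk; the frame features, the scoring head, and the choice of K=3 selected frames are all hypothetical stand-ins.

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Differentiable relaxation of categorical sampling:
    adding Gumbel(0, 1) noise to the logits and taking a
    temperature-scaled softmax approximates a hard sample."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(size=logits.shape)
    g = -np.log(-np.log(u + 1e-20) + 1e-20)   # Gumbel(0, 1) noise
    y = (logits + g) / tau                    # lower tau -> closer to one-hot
    e = np.exp(y - y.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Toy example: score T=8 frames of a clip and softly select K=3 of them.
T, K, D = 8, 3, 4
rng = np.random.default_rng(0)
frames = rng.normal(size=(T, D))              # stand-in for frame features
logits = frames @ rng.normal(size=D)          # hypothetical scoring head
weights = np.stack([gumbel_softmax(logits, tau=0.5, rng=rng)
                    for _ in range(K)])       # (K, T) soft selection weights
selected = weights @ frames                   # (K, D) softly selected frames
```

Because each row of `weights` sums to 1 and concentrates on one frame as `tau` shrinks, gradients can flow through the selection, which is what allows the compute budget (here, the number of kept frames or tokens) to be learned per input clip.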
Index Terms
- Efficient CNNs and Transformers for Video Understanding and Image Synthesis