ABSTRACT
In this talk, I will first discuss approaches that reduce the GFLOPs required during inference by 3D convolutional neural networks (CNNs) and vision transformers. While state-of-the-art 3D CNNs and vision transformers achieve strong results on action recognition datasets, they are computationally expensive and require many GFLOPs. Although the GFLOPs of a 3D CNN or vision transformer can be decreased by reducing the temporal feature resolution or the number of tokens, no single setting is optimal for all input clips. I will therefore discuss two differentiable sampling approaches that can be plugged into any existing 3D CNN or vision transformer architecture. These approaches adapt the computational resources to the input video, so that as many resources as needed, but no more than necessary, are used to classify it. They substantially reduce the computational cost (GFLOPs) of state-of-the-art networks while preserving accuracy. In the second part, I will discuss an approach that generates annotated training samples for very rare classes. It is based on a generative adversarial network (GAN) that jointly synthesizes images and the corresponding segmentation mask for each image. The generated data can then be used for one-shot video object segmentation.
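The talk does not spell out the sampling mechanism, but differentiable selection of frames or tokens is commonly built on a Gumbel-softmax relaxation, which keeps the selection step end-to-end trainable. The following is a minimal NumPy sketch of that general idea, not the specific method of the talk; the frame features, the scoring head, and the choice of K=3 selected frames are all hypothetical stand-ins.

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Differentiable relaxation of categorical sampling:
    adding Gumbel(0, 1) noise to the logits and taking a
    temperature-scaled softmax approximates a hard sample."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(size=logits.shape)
    g = -np.log(-np.log(u + 1e-20) + 1e-20)   # Gumbel(0, 1) noise
    y = (logits + g) / tau                    # lower tau -> closer to one-hot
    e = np.exp(y - y.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Toy example: score T=8 frames of a clip and softly select K=3 of them.
T, K, D = 8, 3, 4
rng = np.random.default_rng(0)
frames = rng.normal(size=(T, D))              # stand-in for frame features
logits = frames @ rng.normal(size=D)          # hypothetical scoring head
weights = np.stack([gumbel_softmax(logits, tau=0.5, rng=rng)
                    for _ in range(K)])       # (K, T) soft selection weights
selected = weights @ frames                   # (K, D) softly selected frames
```

Because each row of `weights` sums to 1 and concentrates on one frame as `tau` shrinks, gradients can flow through the selection, which is what allows the compute budget (here, the number of kept frames or tokens) to be learned per input clip.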
Index Terms
- Efficient CNNs and Transformers for Video Understanding and Image Synthesis