ABSTRACT
Spatio-temporal structural details of targets in video (e.g., edges and textures varying over time) are essential to accurate Unsupervised Video Object Segmentation (UVOS). The vanilla multi-head self-attention in Transformer-based UVOS methods tends to concentrate on general low-frequency information (e.g., illumination, color) while neglecting high-frequency texture details, leading to unsatisfactory segmentation results. To address this issue, this paper presents a Temporally efficient Gabor Transformer (TGFormer) for UVOS. The TGFormer jointly models spatial dependencies and temporal coherence within and across frames, fully capturing the rich structural details needed for accurate UVOS. Concretely, we first propose an effective learnable Gabor filtering Transformer that mines the structural texture details of the object. Then, to avoid storing redundant neighboring historical information, we present an efficient dynamic neighboring-frame selection module that automatically chooses useful temporal information, simultaneously mitigating the influence of blurry frames and reducing the computational burden. Finally, we build the UVOS model as a fully Transformer architecture that aggregates information from the spatial, Gabor, and temporal domains, yielding a strong representation with rich structural details. Extensive experiments on five mainstream UVOS benchmarks (DAVIS2016, FBMS, DAVSOD, ViSal, and MCL) demonstrate the superiority of the presented solution over state-of-the-art methods.
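The learnable Gabor filtering in TGFormer builds on classical Gabor filters, which respond to oriented high-frequency structure (edges, textures) that plain self-attention tends to smooth away. As a point of reference, the sketch below builds a fixed (non-learned) Gabor filter bank in NumPy; all function names, parameter names, and default values are illustrative assumptions, not the paper's implementation, which learns the filter parameters end-to-end.

```python
import numpy as np

def gabor_kernel(size=15, sigma=3.0, theta=0.0, lambd=6.0, gamma=0.5, psi=0.0):
    """Real-valued 2D Gabor kernel: a Gaussian envelope modulating an
    oriented cosine carrier (illustrative parameters, not the paper's)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(np.float64)
    # Rotate the coordinate frame by theta so the carrier is oriented.
    x_t = x * np.cos(theta) + y * np.sin(theta)
    y_t = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_t**2 + (gamma * y_t) ** 2) / (2 * sigma**2))
    carrier = np.cos(2 * np.pi * x_t / lambd + psi)  # spatial frequency 1/lambd
    return envelope * carrier

def gabor_filter_bank(n_orient=4, **kwargs):
    """A small bank of kernels at evenly spaced orientations, the classical
    way to extract multi-orientation texture responses."""
    return [gabor_kernel(theta=k * np.pi / n_orient, **kwargs)
            for k in range(n_orient)]
```

Convolving a feature map with each kernel in the bank yields orientation-selective, high-frequency responses; TGFormer's contribution is to make such filtering learnable and to fuse it with spatial and temporal attention inside the Transformer.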