DOI: 10.1145/3595916.3626390
research-article

Class-aware Convolution and Attentive Aggregation for Image Classification

Published: 01 January 2024

Abstract

Deep learning has proven effective for image classification. However, existing methods can struggle to distinguish complex images because diverse image content distracts the model. To address this, we propose a class-aware convolution and attentive aggregation framework (CA-Net) that improves representation learning and reduces the influence of irrelevant background. CA-Net comprises three main modules: the discrete representation learning (DRL) module, which uses a group learning method to learn discriminative representations; the class-aware score of discrete representation (CSDR) module, which infers class-aware scores to generate weights for the representation learners; and the class-aware representation fusion (CRF) module, which aggregates class-aware representations using the class-aware scores as a guide. Experimental results on three benchmark datasets show that CA-Net improves the performance of state-of-the-art backbones and makes feature extraction more robust.
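The score-guided aggregation described in the abstract can be illustrated with a small sketch. The abstract does not give the actual formulas, so everything here is an assumption made for illustration: the function names `softmax` and `attentive_aggregate` are hypothetical, each group's class-aware score is taken to be its top softmax class probability, and fusion is a softmax-weighted sum over groups. It should be read as one plausible instance of attentive aggregation, not as the CA-Net implementation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_aggregate(group_feats, group_class_logits):
    """Fuse per-group features with weights derived from class-aware scores.

    group_feats: list of G feature vectors, each a list of D floats.
    group_class_logits: list of G class-logit vectors, each a list of C floats.

    ASSUMPTION: a group's score is its top class probability, and the
    fusion weights are a softmax over those scores. The paper may define
    both steps differently.
    """
    # One confidence value per group: how strongly it favours a single class.
    confidences = [max(softmax(logits)) for logits in group_class_logits]
    # Normalise the confidences into weights that sum to 1 across groups.
    weights = softmax(confidences)
    dim = len(group_feats[0])
    # Weighted sum of the group features, one output value per dimension.
    fused = [
        sum(w * feat[d] for w, feat in zip(weights, group_feats))
        for d in range(dim)
    ]
    return fused, weights


if __name__ == "__main__":
    feats = [[1.0, 0.0], [0.0, 1.0]]
    logits = [[5.0, 0.0], [0.0, 0.0]]  # group 0 is far more class-confident
    fused, weights = attentive_aggregate(feats, logits)
    print(weights, fused)
```

In this sketch, a group whose logits favour one class receives a larger weight, so its features dominate the fused representation; groups with flat, uninformative logits contribute less.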


Cited By

  • (2024) Attentive Modeling and Distillation for Out-of-Distribution Generalization of Federated Learning. 2024 IEEE International Conference on Multimedia and Expo (ICME), pp. 1-6. DOI: 10.1109/ICME57554.2024.10687423. Online publication date: 15-Jul-2024
  • (2024) Image Classification Based on Low-Level Feature Enhancement and Attention Mechanism. Neural Processing Letters 56:4. DOI: 10.1007/s11063-024-11680-3. Online publication date: 13-Aug-2024

Information & Contributors

Information

Published In

MMAsia '23: Proceedings of the 5th ACM International Conference on Multimedia in Asia
December 2023
745 pages
ISBN:9798400702051
DOI:10.1145/3595916
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Attentive aggregation
  2. Class-aware
  3. Discrete representation learning
  4. Group learning

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • TaiShan Scholars Program
  • Shandong Province Excellent Young Scientists Fund Program (Overseas)
  • the 20 Regulations for New Universities funding program of Jinan

Conference

MMAsia '23
MMAsia '23: ACM Multimedia Asia
December 6 - 8, 2023
Tainan, Taiwan

Acceptance Rates

Overall Acceptance Rate 59 of 204 submissions, 29%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 67
  • Downloads (Last 6 weeks): 4
Reflects downloads up to 28 Feb 2025
