MHA-WoML: Multi-head attention and Wasserstein-OT for few-shot learning

  • Regular Paper
International Journal of Multimedia Information Retrieval

Abstract

Few-shot learning aims to classify novel classes with extremely few labeled samples. Existing metric-learning-based approaches tend to employ off-the-shelf CNN models for feature extraction and conventional clustering algorithms for feature matching. These methods neglect the importance of image regions and are prone to over-fitting during feature clustering. In this work, we propose a novel MHA-WoML framework for few-shot learning, which adaptively focuses on semantically dominant regions and substantially relieves the over-fitting problem. Specifically, we first design a hierarchical multi-head attention (MHA) module, which consists of three functional heads (i.e., a rare head, a syntactic head, and a positional head) with masks, to extract comprehensive image features and screen out invalid ones. The MHA module performs better than current transformers in few-shot recognition. Then, we incorporate optimal transport theory into the Wasserstein distance and propose a Wasserstein-OT metric learning (WoML) module for category clustering. The WoML module focuses on computing an appropriately approximate barycenter, avoiding the overly precise sub-stage fitting that may threaten the global fit and thus alleviating over-fitting during training. Experimental results show that our approach achieves remarkably better performance than current state-of-the-art methods, scoring about 3% higher accuracy across four benchmark datasets: MiniImageNet, TieredImageNet, CIFAR-FS, and CUB200.
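The abstract does not spell out how the Wasserstein-OT metric is computed, but entropic regularization solved with Sinkhorn iterations is the standard way to approximate optimal-transport distances in practice. The PyTorch sketch below illustrates that general machinery for matching query embeddings to class prototypes in a toy 5-way episode; the uniform marginals, squared Euclidean cost, regularization strength, and all names are our own illustrative assumptions, not the authors' WoML implementation.

```python
import torch

def sinkhorn_plan(cost, eps=0.1, n_iters=50):
    # Entropy-regularized optimal transport (Cuturi-style Sinkhorn).
    # cost: (n, m) pairwise cost matrix between n query embeddings and
    # m class prototypes. Returns an (n, m) transport plan whose rows and
    # columns approximately sum to uniform marginals.
    n, m = cost.shape
    K = torch.exp(-cost / eps)              # Gibbs kernel
    mu = torch.full((n,), 1.0 / n)          # uniform source marginal
    nu = torch.full((m,), 1.0 / m)          # uniform target marginal
    a, b = torch.ones(n), torch.ones(m)
    for _ in range(n_iters):                # alternating marginal scalings
        a = mu / (K @ b)
        b = nu / (K.t() @ a)
    return a[:, None] * K * b[None, :]      # plan P = diag(a) K diag(b)

# Toy 5-way episode: assign 10 queries to 5 prototypes by transported mass.
queries = torch.randn(10, 64)
prototypes = torch.randn(5, 64)
cost = torch.cdist(queries, prototypes) ** 2
cost = cost / cost.max()                    # normalize so exp(-cost/eps) stays stable
plan = sinkhorn_plan(cost)
predictions = plan.argmax(dim=1)            # hard class assignment per query
```

Under this entropic relaxation, a larger eps yields a smoother, more diffuse transport plan, which matches the abstract's intuition that an appropriately approximate fit, rather than an exact one, can reduce over-fitting during episodic training.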


Data Availability Statement

All data included in this study are available from the corresponding author upon reasonable request.

Author information

Corresponding author

Correspondence to Yanming Guo.

Ethics declarations

Funding

This research was supported in part by the Ministry of Science and Technology of China under grant No. 2020AAA0108800. The authors have no financial or proprietary interests in any material discussed in this article.

Conflicts of Interest

The authors declare that they have no conflict of interest. This paper does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Yang, J., Jiang, J. & Guo, Y. MHA-WoML: Multi-head attention and Wasserstein-OT for few-shot learning. Int J Multimed Info Retr 11, 681–694 (2022). https://doi.org/10.1007/s13735-022-00254-5
