MHA-WoML: Multi-head attention and Wasserstein-OT for few-shot learning

  • Regular Paper
International Journal of Multimedia Information Retrieval

Abstract

Few-shot learning aims to classify novel classes with extremely few labeled samples. Existing metric-learning-based approaches tend to employ off-the-shelf CNN models for feature extraction and conventional clustering algorithms for feature matching. These methods neglect the importance of image regions and are prone to over-fitting during feature clustering. In this work, we propose a novel MHA-WoML framework for few-shot learning, which adaptively focuses on semantically dominant regions and substantially relieves the over-fitting problem. Specifically, we first design a hierarchical multi-head attention (MHA) module, which consists of three functional heads (i.e., a rare head, a syntactic head, and a positional head) with masks, to extract comprehensive image features and screen out invalid ones. The MHA module performs better than current transformers in few-shot recognition. Then, we incorporate optimal transport theory into the Wasserstein distance and propose a Wasserstein-OT metric learning (WoML) module for category clustering. The WoML module focuses on computing an appropriately approximate barycenter, avoiding the overly precise sub-stage fitting that may threaten the global fit and thus alleviating over-fitting during training. Experimental results show that our approach achieves remarkably better performance than current state-of-the-art methods, scoring about 3% higher accuracy across four benchmark datasets: MiniImageNet, TieredImageNet, CIFAR-FS, and CUB200.
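The abstract does not spell out how the Wasserstein-OT metric is computed, but entropic regularization solved with Sinkhorn iterations is the standard way to approximate optimal-transport distances in practice. The PyTorch sketch below illustrates that general machinery for matching query embeddings to class prototypes in a toy 5-way episode; the uniform marginals, squared Euclidean cost, regularization strength, and all names are our own illustrative assumptions, not the authors' WoML implementation.

```python
import torch

def sinkhorn_plan(cost, eps=0.1, n_iters=50):
    # Entropy-regularized optimal transport (Cuturi-style Sinkhorn).
    # cost: (n, m) pairwise cost matrix between n query embeddings and
    # m class prototypes. Returns an (n, m) transport plan whose rows and
    # columns approximately sum to uniform marginals.
    n, m = cost.shape
    K = torch.exp(-cost / eps)              # Gibbs kernel
    mu = torch.full((n,), 1.0 / n)          # uniform source marginal
    nu = torch.full((m,), 1.0 / m)          # uniform target marginal
    a, b = torch.ones(n), torch.ones(m)
    for _ in range(n_iters):                # alternating marginal scalings
        a = mu / (K @ b)
        b = nu / (K.t() @ a)
    return a[:, None] * K * b[None, :]      # plan P = diag(a) K diag(b)

# Toy 5-way episode: assign 10 queries to 5 prototypes by transported mass.
queries = torch.randn(10, 64)
prototypes = torch.randn(5, 64)
cost = torch.cdist(queries, prototypes) ** 2
cost = cost / cost.max()                    # normalize so exp(-cost/eps) stays stable
plan = sinkhorn_plan(cost)
predictions = plan.argmax(dim=1)            # hard class assignment per query
```

Under this entropic relaxation, a larger eps yields a smoother, more diffuse transport plan, which matches the abstract's intuition that an appropriately approximate fit, rather than an exact one, can reduce over-fitting during episodic training.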


Data Availability Statement

All data included in this study are available from the corresponding author upon reasonable request.

Author information

Corresponding author

Correspondence to Yanming Guo.

Ethics declarations

Funding

This research was supported in part by the Ministry of Science and Technology of China under grant No. 2020AAA0108800. The authors have no financial or proprietary interests in any material discussed in this article.

Conflicts of Interest

The authors declare that they have no conflict of interest. This paper does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Yang, J., Jiang, J. & Guo, Y. MHA-WoML: Multi-head attention and Wasserstein-OT for few-shot learning. Int J Multimed Info Retr 11, 681–694 (2022). https://doi.org/10.1007/s13735-022-00254-5
