Principal characteristic networks for few-shot learning

https://doi.org/10.1016/j.jvcir.2019.02.006

Highlights

  • A principal characteristic is proposed to represent a class based on the different importance of its embedded support vectors.

  • A relative error loss term is appended to the basic loss function to enlarge the inter-class distance for accurate classification.

  • An eResNet embedding network is used to extract high-level abstract embedded vectors.

Abstract

Few-shot learning aims to build a classifier that recognizes unseen new classes given only a few samples of them. Previous studies, such as prototypical networks, represented each class by a prototype computed as the mean of its embedded support vectors and yielded satisfactory results. However, the differing importance of these embedded support vectors has not yet been studied, and it is a valuable factor that could push the limits of few-shot learning. We propose a principal characteristic network that exploits a principal characteristic, computed by weighting the embedded support vectors according to their importance, to better express the prototype. The high-level abstract embedded vectors are extracted by our eResNet embedding network. In addition, we propose a mixture loss function that enlarges the inter-class distance in the embedding space for accurate classification. Extensive experiments demonstrate that our network achieves state-of-the-art results on the Omniglot, miniImageNet and Cifar100 datasets.

Introduction

Recently, deep learning has achieved significant progress on various tasks with large datasets, such as image classification [32], [31], [17], [44], object detection [3], [35], [36], [41] and machine translation [2], [33], [34]. In some areas, classification, recognition and detection abilities have even exceeded human performance. However, these achievements rely on deep models trained with a huge number of labeled samples. With small training sets, deep networks suffer from overfitting. To address this problem, few-shot learning [4], [26], [18] has flourished in recent years. In few-shot learning, a classifier must correctly recognize new classes that are not available during training, given only a few labeled samples of these new classes. This task may be difficult for existing deep models, but it is quite easy for humans. Even children can outline the concept of “tiger” from a few pictures of a tiger they have never seen before, generalize this concept, and then recognize tigers correctly in other pictures. Beyond this, the task is motivated by many applications, such as recognizing and classifying rare cases in medical imaging for auxiliary diagnosis, or searching for and recognizing suspects in massive surveillance video for assisted reconnaissance. The major advantage is that only a few labeled samples, rather than millions, are needed to achieve reasonable results, which greatly alleviates the annotation cost. For this type of task, transfer learning [5], [37], [38] first comes to mind: a model is trained on large datasets in advance and then fine-tuned on the small target dataset to obtain better results on the target categories. However, studies have shown that when the target categories diverge from the training categories, the performance of pre-trained networks decreases greatly [6]. In this case, concepts need to be abstracted at the class level rather than the sample level. In addition, since the small target dataset has only a few, or even one, labeled samples per class, direct fine-tuning cannot learn its class concepts well.

Solutions for few-shot learning tasks include the following methods: data augmentation, meta-learning and metric learning. For data-starved classes, data augmentation can be used to alleviate overfitting; the corresponding solution is to augment data at the feature level, such as by hallucinating features [27], [28]. These approaches yield a certain accuracy improvement in few-shot classification, but because of the extremely small data regime, the possible transformations are very limited and the overfitting problem is not solved. Meta-learning methods [10], [14], [15], [23], [12] are widely favored for few-shot learning because they are built on tasks and learn high-level strategies across similar tasks. By learning good initial conditions [10], [14] or task-level update strategies [10], or by constructing external memory storage [15], [23] via RNNs to remember large amounts of information for comparison during testing, these approaches achieve good results. However, due to the use of RNNs, their network architectures are complex and less efficient, whereas metric learning methods are simpler and more efficient. These methods first learn the embedded vector of a sample from an embedding network and then compute nearest neighbors directly in the embedding space for prediction and classification [4], [8], [9], [16], [29], [30]. By using episodic training [8], improved embedding spaces [9], [16] and learnable distance metrics [29], [30], few-shot classification performance has been further improved.

One of the representative achievements of metric learning methods is the matching network proposed by Vinyals et al. [8], in which an attention mechanism is used to predict the categories of query images based on an embedding network that learns from support sets. It uses sampled mini-batches called episodes during training to simulate test tasks, which makes the training environment more similar to the test environment and improves generalization performance during the testing phase. Another major contribution is the proposed miniImageNet dataset for few-shot learning tasks, which is also widely used as a benchmark. Snell et al. [9] further explore the relationships among class embedded vectors in the embedding space: each category has a representation called a “prototype”, the embedded vectors of a class cluster around its prototype, and the prototype is the mean of the embedded support vectors. Based on this, the prototypical network is proposed. Once the class prototypes are obtained, classification reduces to finding the nearest prototype to an embedded query vector in the embedding space. This work achieves good performance.
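
For reference, the prototype in [9] is simply the mean of the embedded support vectors of a class, and a query is assigned to the class with the nearest prototype. The notation below is ours; f_phi denotes the embedding network, S_k the support set of class k, and d the distance used for the nearest-neighbor step (squared Euclidean in [9]):

```latex
c_k = \frac{1}{|S_k|} \sum_{(x_i,\, y_i) \in S_k} f_\phi(x_i),
\qquad
\hat{y}_q = \operatorname*{arg\,min}_{k} \; d\big(f_\phi(x_q),\, c_k\big)
```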

Our paper builds on the model training method of matching networks [8] and is inspired by the prototype idea [9] to propose principal characteristic networks. To address the problem that [9] simply uses the mean of the embeddings to express the class prototype, and therefore cannot distinguish the different contributions of the embedded support vectors to the prototype, we propose the concept of a principal characteristic. Depending on how obvious the target is in each support image, different weights are assigned to the embedded support vectors, and a weighted summation yields the principal characteristic vector, which expresses the prototype better and achieves better results. Specifically, we assign weights based on the sum of absolute differences (SAD) between each embedded support vector and the rest of the vectors in its class, as sketched below. In addition, we put forward a mixture loss function to handle the situation in which similar classes in the embedding space cause misclassification: the basic loss is combined with a relative error loss to increase the inter-class distances among embeddings, which reduces misclassification of similar classes while maintaining correct classification of normal classes. Moreover, we propose the eResNet structure, based on residual networks [17], for the embedding network. Compared with the common CNNs used by prior approaches, it can effectively increase the network depth to improve general feature extraction without degradation. Compared to the smallest ResNet-18, it reduces the number of parameters by 88.86% and achieves better results. In experiments, after fine-tuning with the support sets during testing, we achieve the currently known state-of-the-art results on the Omniglot, miniImageNet and Cifar100 datasets.
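
The exact weighting function is not reproduced in this excerpt, so the following is only a minimal sketch of the SAD-based weighting described above. It assumes that embedded support vectors with a smaller SAD (i.e., closer to the rest of their class) receive larger weights via a softmax; the function name and the normalization are ours:

```python
import numpy as np

def principal_characteristic(support_embeddings):
    """SAD-weighted class representation (illustrative sketch, not the paper's exact formula).

    support_embeddings: array of shape (k, d) holding the embedded
    support vectors of a single class.
    """
    # Sum of absolute differences (SAD) of each vector versus all the others.
    diffs = np.abs(support_embeddings[:, None, :] - support_embeddings[None, :, :])
    sad = diffs.sum(axis=(1, 2))                      # shape (k,)
    # Assumption: smaller SAD -> larger weight, via a numerically stable softmax.
    logits = -sad - (-sad).max()
    weights = np.exp(logits) / np.exp(logits).sum()
    # Weighted summation replaces the plain mean used for the prototype in [9].
    return (weights[:, None] * support_embeddings).sum(axis=0)
```

With uniform weights this reduces to the prototype of [9]; the weighting is what lets less representative support vectors contribute less to the class representation.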

The rest of this paper is organized as follows. In Section 2 we describe related work in the few-shot learning area. In Section 3 we define and explain our methods: first, we introduce the general architecture of our principal characteristic networks and how it handles the few-shot classification task, and then introduce each optimization in detail. In Section 4 we describe our experimental settings and results, which demonstrate the excellent performance of our approach on various datasets. In Section 5 we present our conclusions and outline future work.

Section snippets

Related work

After Yip et al. [1] started the exploration of one- and few-shot learning using machine learning, the related work can be roughly divided into the following types of methods. Starting from the data space, the related methods are generative models and data augmentation. Generative models based on strokes [25] or parts [24] perform well in specific fields such as hand-written characters [25]. The generative model is a joint probability distribution over sub-parts, parts, and relationships among

Methods

The basic procedure for processing few-shot classification tasks in our paper is as follows: we first input the support and query images of an episode into our eResNet embedding network and obtain the corresponding embedded vectors. In the embedding space, we use the proposed principal characteristic method to obtain the principal characteristic of each class. We then calculate the cosine distances from the embedded query vectors to each class principal characteristic, and use the proposed mixture loss function
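
To make the prediction step of this procedure concrete, the sketch below classifies an embedded query vector by its cosine distance to each class principal characteristic; the eResNet embedding and the mixture loss are not reproduced here, and the helper names are ours:

```python
import numpy as np

def classify_query(query_embedding, class_characteristics):
    """Nearest principal characteristic under cosine distance (illustrative sketch).

    query_embedding:       array of shape (d,), an embedded query vector.
    class_characteristics: array of shape (N, d), one principal characteristic per class.
    """
    q = query_embedding / np.linalg.norm(query_embedding)
    c = class_characteristics / np.linalg.norm(class_characteristics, axis=1, keepdims=True)
    cosine_distances = 1.0 - c @ q        # shape (N,); smaller means more similar
    return int(np.argmin(cosine_distances))
```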

Experiments

This section describes our experimental results in detail and compares them against various baselines. The tasks are N-way, k-shot classification tasks: k labeled samples are provided for each of N classes, which are new classes not seen in the training set. Based on this support set, the network then predicts which of these N classes a query image belongs to. We use the same number of classes during training as during testing. For example, we
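
As an illustration of how such N-way, k-shot episodes can be assembled, the sampler below is our own illustrative helper, not the paper's code; the number of query images per class is an assumption:

```python
import random

def sample_episode(class_to_images, n_way=5, k_shot=1, n_query=15):
    """Sample one N-way, k-shot episode.

    class_to_images: dict mapping a class label to a list of image ids.
    Returns (support, query) lists of (image_id, episode_label) pairs.
    """
    classes = random.sample(list(class_to_images), n_way)
    support, query = [], []
    for episode_label, cls in enumerate(classes):
        picks = random.sample(class_to_images[cls], k_shot + n_query)
        support += [(img, episode_label) for img in picks[:k_shot]]
        query += [(img, episode_label) for img in picks[k_shot:]]
    return support, query
```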

Conclusion

We have proposed principal characteristic networks based on the training method of matching networks [8] and inspired by the prototype idea [9]. To address the problem that common embedding networks extract only low-level abstractions of samples, we propose an improved embedding network, eResNet, to better extract high-level feature representations of samples. To handle the situation in which the target in a sample is not obvious, we propose a weighted principal characteristic method based on the

Conflict of interest

We declare that we have no conflicts of interest related to this work.

Acknowledgment

We express our sincere thanks to the anonymous reviewers for their helpful comments and suggestions to raise the standard of our paper. We would also like to thank Yiqing Hu, Xuchen Yao, Tingting Yu for helpful discussions on the presentation. This work is partly supported by the National Natural Science Foundation of China under Grant No. 61672202.

References (44)

  • K. Yip et al., Sparse representations for fast, one-shot learning.
  • Y. Wu, M. Schuster, Z. Chen, Q.V. Le, M. Norouzi, W. Macherey, et al., Google’s Neural Machine Translation System:...
  • S. Ren, K. He, R.B. Girshick, J. Sun, Faster R-CNN: towards real-time object detection with region proposal networks,...
  • G. Koch, R. Zemel, R. Salakhutdinov, Siamese neural networks for one-shot image recognition, in: ICML Deep Learning...
  • Y. Bengio, Deep learning of representations for unsupervised and transfer learning.
  • J. Yosinski et al., How transferable are features in deep neural networks, Neural Inform. Process. Syst. (2014).
  • A. Graves, G. Wayne, I. Danihelka, Neural Turing Machines, 2014. ArXiv Preprint...
  • O. Vinyals et al., Matching networks for one shot learning, Neural Inform. Process. Syst. (2016).
  • J. Snell et al., Prototypical networks for few-shot learning, Neural Inform. Process. Syst. (2017).
  • S. Ravi, H. Larochelle, Optimization as a model for few-shot learning, in: ICLR 2017: International Conference on...
  • C.G. Atkeson et al., Locally weighted learning for control, Artif. Intell. Rev. (1997).
  • N. Mishra, M. Rohaninejad, X. Chen, P. Abbeel, Meta-Learning with Temporal Convolutions, 2017. ArXiv Preprint...
  • A. Banerjee et al., Clustering with Bregman divergences, J. Mach. Learn. Res. (2005).
  • C. Finn et al., Model-agnostic meta-learning for fast adaptation of deep networks.
  • A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, T.P. Lillicrap, Meta-learning with memory-augmented neural...
  • S. Fort, Gaussian Prototypical Networks for Few-Shot Learning on Omniglot, 2018. ArXiv Preprint...
  • K. He et al., Deep residual learning for image recognition.
  • B.M. Lake et al., One shot learning of simple visual concepts, Cognit. Sci. (2011).
  • A. Krizhevsky, V. Nair, G. Hinton, Cifar-10 (Canadian Institute for Advanced Research), 2010....
  • O. Russakovsky et al., Imagenet large scale visual recognition challenge, Int. J. Comput. Vision (2015).
  • D.P. Kingma, J. Ba, Adam: A Method for Stochastic Optimization, 2014. arXiv preprint...
  • S. Hochreiter et al., Long short-term memory, Neural Comput. (1997).

This paper has been recommended for acceptance by Zicheng Liu.